linux-raid.vger.kernel.org archive mirror
* Best way (only?) to setup SSD's for using TRIM
@ 2012-10-28 18:59 Curtis J Blank
       [not found] ` <CAH3kUhHX28yNXggLuA+D_cH0STY-Rn_BjxVt_bh1sMeYLnM0cw@mail.gmail.com>
  2012-10-30  9:49 ` David Brown
  0 siblings, 2 replies; 23+ messages in thread
From: Curtis J Blank @ 2012-10-28 18:59 UTC (permalink / raw)
  To: linux-raid

I've got two new SSDs that I want to set up as RAID1 and use strictly 
for the OS and MySQL databases, partitioned accordingly.

I'll be using the 3.4.6 kernel for now in openSUSE 12.2 with ext4. After 
a lot of Googling and reading, my understanding is that discard is not 
passed down to the devices by the RAID drivers. I am aware of Shaohua 
Li's patches to make it work, but I'm not inclined to use them because 
openSUSE's Online Update replaces the kernel. I'm not against patching 
and building a kernel - that used to be SOP - but I just don't want to 
deal with that overhead, unless I really need to.

I've also read, if I understand things correctly, that I can use LVM 
for the RAID1 and the discard commands will then be sent to the 
devices. Is that correct, and is it currently the only way, or are 
there other ways?
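
For concreteness, the kind of stack I have in mind would look something 
like this - device names, sizes, and mount points are just examples, 
and whether discard actually propagates down through every layer is 
exactly what I'm asking:

    pvcreate /dev/sda2 /dev/sdb2
    vgcreate vg_ssd /dev/sda2 /dev/sdb2
    # mirrored LV; --mirrorlog core just keeps the example simple
    lvcreate -m 1 --mirrorlog core -L 40G -n lv_mysql vg_ssd
    mkfs.ext4 /dev/vg_ssd/lv_mysql
    mount -o discard /dev/vg_ssd/lv_mysql /mysql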

I've also read that a lot of people are saying TRIM isn't needed 
because SSD garbage collection is so good now. But I don't see how that 
could work, because the SSDs don't have access to the file system, so 
they don't know which pages in their blocks are marked unused and could 
be consolidated and erased. And using TRIM is recommended in a document 
I read from OCZ, whose drives these are. Unless the SSD, when it has to 
change a page, moves the whole block and then erases the old block? But 
without TRIM it could be moving invalid data too, because it doesn't 
know any better, and that sure doesn't sound efficient to me - that 
operation would be the perfect time to get rid of the invalid data if 
the drive knew about it.

And because of the limited program/erase cycles (PEC) I'm wondering if 
this is even a good idea. Granted, the OS files can be considered 
somewhat static, with the exception of /var/log, so maybe that 
shouldn't go on the SSD, and maybe MySQL shouldn't either, because for 
things like ZoneMinder its DB is pretty dynamic. But all the logging 
going on - and there is a lot - and the dynamic nature of the MySQL 
data are the exact reasons I want to put them on SSDs: for the speed. 
See my quandary?

This is my best understanding of things right now, so I came here to 
ask the experts for help in clarifying it and picking the best 
direction to go. I've had the SSDs for a couple of weeks now but am 
holding off on using them until I can determine the best way to use 
them. Oh, and the SSDs are OCZ Vertex 4 VTX4-25SAT3-256G.

Thanks.


* Re: Best way (only?) to setup SSD's for using TRIM
       [not found] ` <CAH3kUhHX28yNXggLuA+D_cH0STY-Rn_BjxVt_bh1sMeYLnM0cw@mail.gmail.com>
@ 2012-10-29 14:35   ` Curtis J Blank
       [not found]   ` <508E9289.5070904@curtronics.com>
  1 sibling, 0 replies; 23+ messages in thread
From: Curtis J Blank @ 2012-10-29 14:35 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: linux-raid

Could you explain this a little more, please? I don't understand why 
applications come into play. I thought the file system - ext4 in this 
case - handles space allocation on the device, not any application.

And here again I'm told I don't need TRIM, but everything I've been 
reading, even from the SSD manufacturer, OCZ, says you do need TRIM to 
reduce write amplification.

If I really don't need TRIM, that makes things easy, but my concern is 
write amplification over time.

[Geez. Had to resend, twice now; evidently Roberto's reply was in HTML 
- "reason: 550 5.7.1 Content-Policy reject msg: The message contains 
HTML subpart, therefore we consider it SPAM or Outlook Virus." - and 
Thunderbird keeps it that way even when set not to use HTML.]

On 10/29/12 07:20, Roberto Spadim wrote:
> If you don't have DELETE and DROP in your application, you don't need
> TRIM...


* Re: Best way (only?) to setup SSD's for using TRIM
       [not found]     ` <CAH3kUhEdOO+GXKK6ALFUYJdYeTw2Mx-PF9M=0vQvkzzidihxSg@mail.gmail.com>
@ 2012-10-29 17:08       ` Curt Blank
  2012-10-29 18:06         ` Roberto Spadim
  0 siblings, 1 reply; 23+ messages in thread
From: Curt Blank @ 2012-10-29 17:08 UTC (permalink / raw)
  To: Roberto Spadim; +Cc: linux-raid

Thanks. That made me think. Since, as I said, they're mainly going to 
contain the Linux OS, that lessens my concern, because of the static 
nature of those files - with the exception of /var/log and the MySQL 
DBs. The /var/log files are rotated every 24 hours, kept, gzip'd after 
14 days, and then kept pretty much forever as long as space is 
available. The MySQL DBs are mostly inserts with some updates, except 
for the ZoneMinder DB, which is continuous inserts and deletes; the 
majority of that data seldom lives past a month or two.

I can maybe see the need for TRIM on /var/log, maybe /mysql, and 
possibly /usr/local (which is its own mount point), where I do all my 
code development.
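
If per-mount-point TRIM is the way to go, I assume it would just be the 
discard mount option on those file systems in /etc/fstab, something 
like this (volume names are examples):

    /dev/vg_ssd/lv_varlog  /var/log    ext4  defaults,noatime,discard  1 2
    /dev/vg_ssd/lv_mysql   /mysql      ext4  defaults,noatime,discard  1 2
    /dev/vg_ssd/lv_local   /usr/local  ext4  defaults,noatime,discard  1 2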

Like I said, I'm not against compiling a kernel; I just really don't 
want to have to do that every time the distribution updates and 
installs a new one.

So I'm still looking for a way to use TRIM with RAID1, which gets me 
back to the LVM option I heard might work?

I'm also trying to find out if using my RAID card might be the ticket, 
but if not, then it has to be the kernel if I want TRIM.

On Mon, 29 Oct 2012, Roberto Spadim wrote:

> TRIM is used by the file system to free space so new files can be
> written; in other words, if you never need new space, you won't need
> TRIM. Is your application log-like? If so, TRIM is fairly useless; the
> only use is when you delete old log files, and that is normally rare -
> maybe once a month, maybe once a week.
> 
> If you constantly delete files and write new ones, then yes, TRIM can
> help, but it won't save your life; it just gives the SSD's garbage
> collector more 'clean' space. An SSD doesn't reuse pages that still
> hold information - it writes to fresh ones - so TRIM helps the SSD
> write faster and avoid read-merge-write cycles.
> 
> The Vertex 4 has good firmware. I used a Vertex 2 some years ago
> without TRIM, and it ran MySQL nicely in a generic application with
> deletes, updates, and selects; it managed >300 IOPS very easily, and
> 300 MB/s too, and I never noticed a slowdown after using it for some
> time, so I don't see why TRIM would speed my app up much. Try it and
> report what you find; it's not unsafe, and it sounds good. I don't
> know about the SUSE kernel support, but it's not hard to compile a new
> kernel if you need it.
> 
> I used the Vertex 2 in a RAID1 configuration (which is better for
> many-threaded apps like MySQL, but worse for streaming applications;
> for streaming, RAID10 or RAID0 is better).


* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-29 17:08       ` Curt Blank
@ 2012-10-29 18:06         ` Roberto Spadim
  0 siblings, 0 replies; 23+ messages in thread
From: Roberto Spadim @ 2012-10-29 18:06 UTC (permalink / raw)
  To: Curt Blank; +Cc: linux-raid

Maybe you want a 'cache': there are bcache and other kernel patches 
that make hard disks 'hybrid' (SSD+HDD). It's nice and has a good 
future in my opinion. Google it, and tell me what you think.
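
I have not set it up myself on SUSE, but as I understand it, once the 
patches are in the kernel the setup is roughly like this (device names 
are examples; /dev/sdb is the big HDD, /dev/sdc the SSD used as cache):

    # format backing device and cache device in one go
    make-bcache -B /dev/sdb -C /dev/sdc
    # register both with the kernel; /dev/bcache0 then appears
    echo /dev/sdb > /sys/fs/bcache/register
    echo /dev/sdc > /sys/fs/bcache/register
    mkfs.ext4 /dev/bcache0

This is only a sketch - check the bcache documentation before trying it.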

--
Roberto Spadim


* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-28 18:59 Best way (only?) to setup SSD's for using TRIM Curtis J Blank
       [not found] ` <CAH3kUhHX28yNXggLuA+D_cH0STY-Rn_BjxVt_bh1sMeYLnM0cw@mail.gmail.com>
@ 2012-10-30  9:49 ` David Brown
  2012-10-30 14:29   ` Curtis J Blank
  1 sibling, 1 reply; 23+ messages in thread
From: David Brown @ 2012-10-30  9:49 UTC (permalink / raw)
  To: Curtis J Blank; +Cc: linux-raid

On 28/10/2012 19:59, Curtis J Blank wrote:
> [...]

TRIM is not necessary.

In some situations, TRIM can improve speed - in other cases, it can make 
the system significantly slower.  And it is only ever a help until the 
disk is getting fairly full.

Before deciding about TRIM, it is important to understand what it does, 
and how it works.  TRIM lets the filesystem tell the SSD that a 
particular logical disk block is no longer in use.  The SSD can then 
find the physical flash block associated with that logical block, and 
mark it for garbage collection.
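
(You can check from userspace whether a drive and kernel will accept 
discards at all - for example:

    hdparm -I /dev/sda | grep -i trim
    cat /sys/block/sda/queue/discard_granularity

hdparm should report something like "Data Set Management TRIM 
supported", and a non-zero discard_granularity suggests the block layer 
will pass discards down to that device.  /dev/sda is just an example.)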

If TRIM had been specified /properly/ for SATA (as it is for SCSI/SAS), 
then it would have been quite useful.  But it has two huge failings - 
there is no specification as to what the host will get if it tries to 
read the trimmed logical block (this is what makes it terrible for RAID 
systems), and it causes a pipeline flush and stall (which is what makes 
TRIM so slow).  The pipeline flushing and stalling will cause particular 
problems if you have a lot of metadata changes or small reads and writes 
in parallel - the sort of accesses you get with database servers.  So 
enabling TRIM will make databases significantly slower.

And what do you lose if you /don't/ enable TRIM?  When a filesystem 
deletes a file, it knows the logical blocks are free, but the SSD keeps 
them around.  When the filesystem re-uses them for new data, the SSD 
then knows that the old physical blocks can be garbage-collected and 
re-used.  So all you are really doing by not using TRIM is delaying the 
collection of unneeded blocks.  As long as the SSD has plenty of spare 
blocks (and this is one of the reasons why any half-decent SSD has 
over-provisioning), TRIM gains you nothing at all here.  (If you have a 
very old SSD, or a very small one, or a very cheap one, then you will 
have poor over-provisioning and poor garbage collection - TRIM might 
then improve the SSD speed as long as the disk is mostly empty.)
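
(As a rough worked example: a drive sold as 256 GB with 15% spare area 
has about 0.15 x 256 GB = 38 GB of flash the host can never address.  
At most 256 GB of pages can ever hold live data, so at least 38 GB 
worth of pages are known to be unneeded at all times, with or without 
TRIM.  The actual spare fraction varies between models.)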

It is possible that blocks that could have been TRIMMED will get 
unnecessarily copied as part of a wear-levelling pass - but the effect 
of this is going to be completely negligible on the SSD's lifetime.


So TRIM complicates RAID, limits your flexibility for how to set up your 
disks and arrays, and slows down your metadata transactions and small 
accesses.


TRIM /did/ have a useful role for early SSDs - in particular, it 
improved the artificial benchmarks used by testers and reviewers.  So it 
has ended up being seen as a "must have" feature for both the SSD 
itself, and the software and filesystems accessing them.





* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-30  9:49 ` David Brown
@ 2012-10-30 14:29   ` Curtis J Blank
  2012-10-30 14:33     ` Roberto Spadim
  2012-10-30 15:55     ` David Brown
  0 siblings, 2 replies; 23+ messages in thread
From: Curtis J Blank @ 2012-10-30 14:29 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On 10/30/12 04:49, David Brown wrote:
> [...]

Thanks for the explanation; it makes a lot of sense and has me leaning 
towards not using TRIM.

But your explanation focused on blocks, leaving out pages. Does the 
TRIM information sent to the device work only at the block level, or 
does it work at the page level? I was thinking that if it worked at the 
page level, the SSD's garbage collection could consolidate blocks by 
removing unused pages (akin to defragmenting) and then erase the 
emptied blocks, making those pages ready to be written.



* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-30 14:29   ` Curtis J Blank
@ 2012-10-30 14:33     ` Roberto Spadim
  2012-10-30 15:55     ` David Brown
  1 sibling, 0 replies; 23+ messages in thread
From: Roberto Spadim @ 2012-10-30 14:33 UTC (permalink / raw)
  To: Curtis J Blank; +Cc: David Brown, linux-raid

A point you should understand (or may already understand): any block is 
writeable; the garbage collector only tracks whether a given block is 
empty or not. An empty block just needs to be written once; a non-empty 
block must be read, merged, and written back, which takes more time.

So for speed an empty block is faster, though a non-empty block is 
still fast. The garbage collector's other function is tracking how many 
times each block has been written, to decide whether to use that block 
or a newer one; a newer block has less risk of losing information than 
an old one, because SSDs have limited write cycles.

Those are the two main things TRIM helps with: garbage collection and 
block allocation.
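
For example, an already-erased block can be programmed directly, but a 
block still holding, say, 100 valid pages out of 128 must have those 
100 pages copied elsewhere before the block can be erased and reused 
(the numbers are just an illustration).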


* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-30 14:29   ` Curtis J Blank
  2012-10-30 14:33     ` Roberto Spadim
@ 2012-10-30 15:55     ` David Brown
  2012-10-30 18:30       ` Curt Blank
  1 sibling, 1 reply; 23+ messages in thread
From: David Brown @ 2012-10-30 15:55 UTC (permalink / raw)
  To: Curtis J Blank; +Cc: linux-raid

On 30/10/2012 15:29, Curtis J Blank wrote:
> [...]
>
> Thanks for the explanation; it makes a lot of sense and has me leaning
> towards not using TRIM.
>
> But your explanation focused on blocks, leaving out pages. Does the
> TRIM information sent to the device work only at the block level, or
> does it work at the page level? I was thinking that if it worked at
> the page level, the SSD's garbage collection could consolidate blocks
> by removing unused pages (akin to defragmenting) and then erase the
> emptied blocks, making those pages ready to be written.
>

I was not using "block" in a particularly strict or formal way.  There 
are a number of different levels of structure involved here, including 
"logical blocks", "sectors", "allocation units", "erase blocks", "write 
pages", etc.  I am simply talking about "lumps of data", rather than any 
specific structure.

As far as the computer is concerned, it deals with "sector numbers" of 
512 byte or 4K sectors.  It is up to the SSD to map these logical 
numbers to physical pages within flash erase blocks.  The PC has no way 
of knowing whether a given set of logical sectors are mapped to pages 
within the same erase block or different ones.

You are right that the SSD's garbage collection routines will sometimes 
collect together the used pages of an erase block, and copy them over to 
another erase block, so that the first erase block can be recycled.  But 
this is done independently of the TRIM, and is part of the normal 
garbage collection function.

mvh.,

David




* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-30 15:55     ` David Brown
@ 2012-10-30 18:30       ` Curt Blank
  2012-10-30 18:43         ` Roberto Spadim
  2012-10-30 19:59         ` Chris Murphy
  0 siblings, 2 replies; 23+ messages in thread
From: Curt Blank @ 2012-10-30 18:30 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid



On Tue, 30 Oct 2012, David Brown wrote:

> [...]
> 
> I was not using "block" in a particularly strict or formal way.  There are a
> number of different levels of structure involved here, including "logical
> blocks", "sectors", "allocation units", "erase blocks", "write pages", etc.  I
> am simply talking about "lumps of data", rather than any specific structure.
> 
> As far as the computer is concerned, it deals with "sector numbers" of 512
> byte or 4K sectors.  It is up to the SSD to map these logical numbers to
> physical pages within flash erase blocks.  The PC has no way of knowing
> whether a given set of logical sectors are mapped to pages within the same
> erase block or different ones.
> 
> You are right that the SSD's garbage collection routines will sometimes
> collect together the used pages of an erase block, and copy them over to
> another erase block, so that the first erase block can be recycled.  But this
> is done independently of TRIM, and is part of the normal garbage
> collection function.

Right, and without TRIM to tell the SSD which page(s) are invalid, 
garbage collection will never be able to do that, so it will be 
carrying around and preserving invalid page(s) whenever it does do 
something - assuming there are invalid pages in the blocks it is acting 
on. That seems inefficient to me, and for that reason suggests TRIM 
should be used?

And it makes me wonder: if not, what good is garbage collection if it 
isn't consolidating blocks to contain only valid pages and then erasing 
the invalidated blocks so the pages can be reused when needed? In that 
scenario it appears the only good garbage collection can do is wear 
leveling.

As far as I understand TRIM, among other things it allows the SSD to 
gather the invalid pages into a block so the block can be erased, 
making its pages ready to be written individually and avoiding the 
read-erase-modify-write of a whole block when one page changes, i.e. 
write amplification. Even if it does a read-modify-write to a new 
block, then acks the write and does the erase later in the background, 
there is still the overhead of the read-modify-write - read a whole 
block, modify a page, write a whole block - instead of just being able 
to write a page.
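
(To put rough numbers on it, assuming a typical geometry of 4 KB pages 
and 512 KB erase blocks: changing one page the slow way means reading 
512 KB, modifying 4 KB, and writing 512 KB back - 128 pages written for 
one page of new data - versus a single 4 KB page write into an 
already-erased block. Actual geometries vary by drive.)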

Am I on the right page? :-)



* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-30 18:30       ` Curt Blank
@ 2012-10-30 18:43         ` Roberto Spadim
  2012-10-30 19:59         ` Chris Murphy
  1 sibling, 0 replies; 23+ messages in thread
From: Roberto Spadim @ 2012-10-30 18:43 UTC (permalink / raw)
  To: Curt Blank; +Cc: David Brown, linux-raid

Right, but I have never seen a big speed improvement just because 
someone got the TRIM command working. Try it - maybe it works better 
now with the latest kernel changes.

-- 
Roberto Spadim


* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-30 18:30       ` Curt Blank
  2012-10-30 18:43         ` Roberto Spadim
@ 2012-10-30 19:59         ` Chris Murphy
  2012-10-31  8:32           ` David Brown
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2012-10-30 19:59 UTC (permalink / raw)
  To: linux-raid


On Oct 30, 2012, at 12:30 PM, Curt Blank <curt@curtronics.com> wrote:
> 
> Right, and without TRIM to tell the SSD which page(s) are invalid,
> garbage collection will never be able to do that, so it will be
> carrying around and preserving invalid page(s) whenever it does do
> something - assuming there are invalid pages in the blocks it is
> acting on. That seems inefficient to me, and for that reason suggests
> TRIM should be used?

My understanding is that a modern consumer SSD works by copy-on-write for new or changed blocks, so TRIM is not strictly needed for this. The SSD only writes data to "empty" (previously erased) cells. The correlation between logical sectors and physical sectors is constantly adjusted, unlike on HDDs, where remapping tends to occur only after persistent write failures to a sector.

Case 1: A file is being overwritten, or modified in some way. The file system knows this file consumes, e.g., LBAs 5000 to 6000, so it sends a write command to the SSD - in effect, "write this data stream starting at LBA 5000, for 1000 (contiguous) sectors". Obviously a file system might break the file up into multiple fragments, so this is simplistic.

The SSD doesn't actually do what it's told. It doesn't literally overwrite those LBAs; what it does is dereference them in its lookup table, remap those LBAs to new empty cells, and write your data there. Later, it can go back and do garbage collection on the dereferenced cells once enough of them have accumulated.
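
Schematically, with made-up numbers: before the write, the lookup table might say LBA 5000 -> flash page 8812; after it, LBA 5000 -> flash page 9407, and page 8812 is merely marked stale until garbage collection eventually erases its block.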

Case 2: A file is being newly written. The same basic thing happens. It's possible the file system requests LBAs never before provisioned, or it requests LBAs from previously deleted files. 

Either way, the SSD writes to empty cells. The case where it needs to write to occupied cells is when it runs out of empty ones - i.e., as David Brown said, when the disk is getting full and is poorly provisioned.

It might also occur in use cases where large files are created, modified, and destroyed very frequently, such that the disk can't keep up with garbage collection. Heavy VM usage on consumer SSDs might be an example of this - why someone would do that I don't know, but perhaps it happens.


> As far as I understand TRIM, among other things, it allows the SSD to 
> combine the invalid pages into a block so the block can be erased thus 
> making the pages ready to be written individually and avoiding the 
> read-erase-modify-write of the block when a page changes, i.e. write 
> amplification.

It will do this with or without TRIM. TRIM is simply a mechanism for the file system to inform the SSD of this in advance, in the case of file deletions; otherwise it may be some time before the SSD learns those blocks are "free", i.e. not until the file system decides to reuse those sectors.


> Even if it does a read-modify-write to a new block then 
> acks the write and does the erase after in the background it's still 
> overhead in the read-modify-write i.e. read a whole block, modify a page, 
> write a whole block, instead of just being able to write a page.


a.) Negligible.
b.) The file system does RMW at a block/cluster level anyway (typically this is 4KB).


Chris Murphy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-30 19:59         ` Chris Murphy
@ 2012-10-31  8:32           ` David Brown
  2012-10-31 13:44             ` Roberto Spadim
                               ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: David Brown @ 2012-10-31  8:32 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-raid

On 30/10/2012 20:59, Chris Murphy wrote:
>
> On Oct 30, 2012, at 12:30 PM, Curt Blank <curt@curtronics.com>
> wrote:
>>
>> Right, and without TRIM to tell the SSD which page(s) are invalid
>> the garbage collection will never be able to do that so the
>> garbage collection will be carrying around and preserving invalid
>> page(s) when ever it does do something. Assuming there are invalid
>> pages in the blocks it is acting on. That to me seems inefficient
>> and for that reason says TRIM should be used?

That is correct - there will be unneeded data carried around that stops 
erase blocks from being garbage collected, and this unneeded data will 
occasionally be copied as part of compaction routines or wear-levelling 
functions.

There are a few things to note, however - there will /always/ be some 
unneeded data carried around, no matter how enthusiastic the filesystem 
is about issuing TRIMs (and filesystems /don't/ always issue a TRIM, 
especially in cases where the logical block will be re-used).

Also, the whole point of garbage collection (of TRIM'ed blocks or blocks 
whose logical sector has been overwritten) is so that when the host 
wants to write something, there are free blocks on the SSD already 
erased and waiting.  As long as the SSD has more than enough such free 
blocks at any given time, then it does not need any more - extra free 
blocks cannot improve the speed of the SSD.

Modern SSD's have over-provisioning - the disk claims to have "x" GB of 
space, and provides logical block numbers for "x" GB, but in fact it has 
something like "x + 15%" GB of actual flash space.  This extra 15% 
(actual values vary) provides two things - a safety margin for bad 
blocks, and a guarantee that there are enough pages that are known to be 
unneeded (even in the absence of TRIM), so that there can always be 
plenty of free erase blocks.  Since the host can only see "x" GB, then 
at most "x" GB of pages can be in use - at least "15% of x" GB pages are 
known to be free.  The SSD may need to re-arrange pages and blocks a bit 
("defragmenting"), but it can always do it.

There are pathological cases where TRIM could make a difference.  If you 
fill your disk with random data, then erase everything, then fill it 
again using very random writes, then your writes will be slowed as 
garbage collection has to put together new free erase blocks - while 
TRIM could have let the SSD erase blocks earlier.

>
> My understanding is that a modern consumer SSD works by copy-on-write
> for new or changed blocks, so TRIM is not needed for this. The
> SSD is only writing data to "empty" or previously erased cells. The
> correlation between logical sectors and physical sectors is
> constantly adjusted, unlike on HDDs where this remapping tends to
> only occur with persistent write failures to a sector.

Correct.

>
> Case 1: A file is being overwritten, or modified in some way. The
> file system knows this file consumes, e.g., LBAs 5000 to 6000, and so
> it sends a write command to the SSD, in effect "write data to LBA
> 5000, 1000" ergo write a data stream starting at LBA 5000, for 1000
> (contiguous) sectors. Obviously a file system might break up this
> file into multiple fragments, so this is simplistic.
>
> The SSD doesn't actually do what it's told. It doesn't literally
> overwrite those LBA's, what it does is dereference them in its
> lookup. And remaps those LBA's to new empty cells, and writes your
> data there. Later, it can go back and do garbage collection on those
> dereferenced cells when there are enough of them accumulated.

Exactly.  The SSD knows that the old physical blocks that used to be 
associated with LBA's 5000 to 6000 are now free, and can be garbage 
collected.  So for re-writing, TRIM is unnecessary.

>
> Case 2: A file is being newly written. The same basic thing happens. It's
> possible the file system requests LBA's never before provisioned, or
> it requests LBA's from previously deleted files.

Yes.

>
> Either way, the SSD writes to empty cells. The case where it needs to
> write to occupied cells is if it runs out of empty ones, i.e. like
> David Brown said, in a case where the disk is getting full and poorly
> provisioned this could occur.
>
> It might also occur in some use cases where large files are being
> created/modified, destroyed, very frequently, such that the disk
> can't keep up with garbage collection. Maybe an example of this would
> be heavy VM usage with consumer SSDs. Why someone would do this I
> don't know but perhaps that's an example.

There will always be pathological cases like this where TRIM could be a 
win.  But on the other hand, there are pathological cases where TRIM 
causes great slowdowns - such as deleting a lot of files (as sending 
TRIM commands is very slow).

If you actually want to use your SSD in such a way, with lots of big, 
fast deletions and writings, then you can help it out by 
"short-stroking" it.  You take your new SSD (or newly "secure erased" 
SSD) and partition it to only use part of the space - leave some extra 
at the end.  This extra space increases the over-provisioning of the 
disk, and therefore increases the amount of free blocks you have at any 
given time.
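
Just to illustrate (a sketch only - the device name and sizes are
made-up examples, and the secure-erase step destroys all data on the
drive):

  # secure erase, so the controller knows every block is free
  # (the drive must not be in the "frozen" state for this to work)
  hdparm --user-master u --security-set-pass dummy /dev/sdX
  hdparm --user-master u --security-erase dummy /dev/sdX

  # then partition only part of the disk, leaving the tail untouched -
  # here one 200 GiB partition on a 256 GB disk, so roughly 20% of the
  # flash is never written and acts as extra over-provisioning
  parted -s /dev/sdX mklabel gpt
  parted -s /dev/sdX mkpart primary 1MiB 200GiB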


I'd add a case 3 to your list:

Case 3: A file is erased.  If you have TRIM, the data blocks used by the 
file can be marked as "unneeded" by the SSD.  Without TRIM, the SSD 
thinks they are still important.  But the OS/filesystem knows the LBAs 
are free, and will re-use them sooner or later.  As soon as they are 
re-used, the SSD will mark the old physical blocks as unneeded and can 
garbage-collect them.  Without TRIM, this collection is delayed - but it 
still happens, and as long as the SSD has other free blocks, the delay 
has no impact on performance.


>
>
>> As far as I understand TRIM, among other things, it allows the SSD
>> to combine the invalid pages into a block so the block can be
>> erased thus making the pages ready to be written individually and
>> avoiding the read-erase-modify-write of the block when a page
>> changes, i.e. write amplification.
>
> It will do this with or without TRIM. TRIM is simply a mechanism for
> the file system to inform the SSD of this in advance, in the case of
> file deletions; otherwise it may be some time before the SSD learns
> those blocks are "free", i.e. not until the file system decides to
> reuse those sectors.
>
>
>> Even if it does a read-modify-write to a new block then acks the
>> write and does the erase after in the background it's still
>> overhead in the read-modify-write i.e. read a whole block, modify a
>> page, write a whole block, instead of just being able to write a
>> page.

The SSD doesn't do that.  If you make a change to data that is in a page in 
the middle of an erase block, it is only that page that is copied (for 
RMW) to another free page in the same or a different erase block.  The 
original page is marked "unneeded".  TRIM makes no difference to this 
process.  All it does is make it more likely that the other pages in the 
same block are marked "unneeded" at an earlier stage, so the whole old 
block can be recycled earlier.  But as I said above, doing this earlier 
or later makes no difference to performance.

>
>
> a.) Negligible.
> b.) The file system does RMW at a block/cluster level
> anyway (typically this is 4KB).
>
>
> Chris Murphy
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-31  8:32           ` David Brown
@ 2012-10-31 13:44             ` Roberto Spadim
       [not found]             ` <CAJEsFnkM9w0kNbNd51ShP0uExvsZE6V9h3WKKs3nxWfncUCYJA@mail.gmail.com>
  2012-10-31 17:34             ` Curtis J Blank
  2 siblings, 0 replies; 23+ messages in thread
From: Roberto Spadim @ 2012-10-31 13:44 UTC (permalink / raw)
  To: David Brown; +Cc: Chris Murphy, linux-raid

Just a point... you are using a Vertex, right? If you will be using it
in an enterprise setting, you should use the Deneva 2 or Deneva (from
OCZ). They are the enterprise edition of the SSD (yes, you will be safe
with your Vertex SSD, but enterprise SSDs are made for enterprise use =
very high write/read loads).

I read an explanation some time ago of the garbage collection and
firmware logic of OCZ drives; if I find it I will post it here.

2012/10/31 David Brown <david.brown@hesbynett.no>:
<snip>



-- 
Roberto Spadim

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Best way (only?) to setup SSD's for using TRIM
       [not found]             ` <CAJEsFnkM9w0kNbNd51ShP0uExvsZE6V9h3WKKs3nxWfncUCYJA@mail.gmail.com>
@ 2012-10-31 14:11               ` David Brown
  2012-11-13 13:39                 ` Ric Wheeler
  0 siblings, 1 reply; 23+ messages in thread
From: David Brown @ 2012-10-31 14:11 UTC (permalink / raw)
  To: Alexander Haase; +Cc: Chris Murphy, linux-raid

On 31/10/2012 14:12, Alexander Haase wrote:
> Has anyone considered handling TRIM via an idle IO queue? You'd have to
> purge queue items that conflicted with incoming writes, but it does get
> around the performance complaint. If the idle period never comes, old
> TRIMs can be silently dropped to lessen queue bloat.
>

I am sure it has been considered - but is it worth the effort and the 
complications?  TRIM has been implemented in several filesystems (ext4 
and, I believe, btrfs) - but is disabled by default because it typically 
slows down the system.  You are certainly correct that putting TRIM at 
the back of the queue will avoid the delays it causes - but it still 
will not give any significant benefit (except for old SSDs with limited 
garbage collection and small over-provisioning), and you have a lot of 
extra complexity to ensure that a TRIM is never pushed back until after 
a new write to the same logical sectors.
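
For reference, in ext4 this is the "discard" mount option, which has to
be enabled explicitly - a sketch, with the device and mount point as
assumptions:

  # online TRIM for ext4, enabled per mount:
  mount -o discard /dev/sda2 /data

  # or permanently, via /etc/fstab:
  # /dev/sda2  /data  ext4  defaults,discard  0 2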

It would be much easier and safer, and give much better effect, to make 
sure the block allocation procedure for filesystems emphasised 
re-writing old blocks as soon as possible (when on an SSD).  Then there 
is no need for TRIM at all.  This would have the added benefit of 
working well for compressed (or sparse) hard disk image files used by 
virtual machines - such image files only take up real disk space for 
blocks that are written, so re-writes would save real-world disk space.

> As far as parity consistency, bitmaps could track which stripes( and
> blocks within those stripes) are expected to be out of parity( also
> useful for lazy device init ). Maybe a bit-per-stripe map at the logical
> device level and a bit-per-LBA bitmap at the stripe level?

Tracking "no-sync" areas of a raid array is already high on the md raid 
things-to-do list (perhaps it is already implemented - I lose track of 
which features are planned and which are implemented).  And yes, such 
no-sync tracking would be useful here.  But it is complicated, 
especially for raid5/6 (raid1 is not too bad) - should TRIMs that cover 
part of a stripe be dropped?  Should the md layer remember them and 
coalesce them when it can TRIM a whole stripe?  Should it try to track 
partial synchronisation within a stripe?

Or should the md developers simply say that since supporting TRIM is not 
going to have any measurable benefits (certainly not with the sort of 
SSD's people use in raid arrays), and since TRIM slows down some 
operations, it is better to keep things simple and ignore TRIM entirely?
Even if there are occasional benefits to having TRIM, is it worth it 
in the face of added complication in the code and the risk of errors?

There /have/ been developers working on TRIM support on raid5.  It seems 
to have been a complicated process.  But some people like a challenge!

>
> On the other hand, does it hurt if empty blocks are out of parity( due
> to TRIM or lazy device init)? The parity recovery of garbage is still
> garbage, which is what any sane FS expects from unused blocks. If and
> when you do a parity scrub, you will spend a lot of time recovering
> garbage and undo any good TRIM might have done, but usual drive
> operation should quickly balance that out in a write-intensive
> environment where idle TRIM might help.
>

Yes, it "hurts" if empty blocks are out of sync.  On obvious issue is 
that you will get errors when scrubbing - the md layer has no way of 
knowing that these are unimportant (assuming there is no no-sync 
tracking), so any real problems will be hidden by the unimportant ones.

Another issue is for RMW cycles on raid5.  Small writes are done by 
reading the old data, reading the old parity, writing the new data and 
the new parity - but that only works if the parity was correct across 
the whole stripe.  Even if raid5 TRIM is restricted to whole stripes, a 
later small write to that stripe will be a disaster if it is not in sync.



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-31  8:32           ` David Brown
  2012-10-31 13:44             ` Roberto Spadim
       [not found]             ` <CAJEsFnkM9w0kNbNd51ShP0uExvsZE6V9h3WKKs3nxWfncUCYJA@mail.gmail.com>
@ 2012-10-31 17:34             ` Curtis J Blank
  2012-10-31 20:04               ` David Brown
  2 siblings, 1 reply; 23+ messages in thread
From: Curtis J Blank @ 2012-10-31 17:34 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On 10/31/12 03:32, David Brown wrote:
>
> There will always be pathological cases like this where TRIM could be a
> win.  But on the other hand, there are pathological cases where TRIM
> causes great slowdowns - such as deleting a lot of files (as sending
> TRIM commands is very slow).
>
> If you actually want to use your SSD in such a way, with lots of big,
> fast deletions and writings, then you can help it out by
> "short-stroking" it.  You take your new SSD (or newly "secure erased"
> SSD) and partition it to only use part of the space - leave some extra
> at the end.  This extra space increases the over-provisioning of the
> disk, and therefore increases the amount of free blocks you have at any
> given time.
>

I was planning that all the partitions, i.e. mount points, will be below
50% used, most way below that, and I don't see them filling up. That is
on purpose; these SSD's are for the OS to gain performance and not a lot
of data storage, with the exception of mysql.

So, if I have unused space at the end of the SSD - say 60G out of the
256G that I don't use and don't partition - the SSD will use it for
whatever? It will know that it can use it when in a RAID1 set? Or should
I make the raidset only using cylinders up to 196G and partition that,
leaving the rest unused?

Ok, the only area that will have a lot of writes is /var/log; logs are
moved to a dated directory every 24 hours, then gzip'd and tarballed
after 14 days, the tarball kept and the logs erased. Sounds like the
normal filesystem reuse of blocks will negate the need for TRIM. I do
want /var/log on the SSD's because a lot of logging is done and I want
the performance there so as to keep iowait as low as possible.

/home with user accounts - mine only, really - getting email will cause
a lot of activity, so maybe /home doesn't need to be on the SSD. I don't
really need SSD performance there. Same for /usr/local, which is a mount
point, and /usr/local/src, which is where I do all my code development.

/mysql is where all my DB's are; they are very active and I want them on
the SSD's for the performance. Is this a good idea or not? Two DB's are
very active, one doing mostly inserts and updates, so not too bad there,
another doing a real lot of inserts and deletes. If you're familiar with
ZoneMinder and how events are saved then later deleted, there is a real
lot of activity there.

>
> I'd add a case 3 to your list:
>
> Case 3: A file is erased.  If you have TRIM, the data blocks used by the
> file can be marked as "unneeded" by the SSD.  Without TRIM, the SSD
> thinks they are still important.  But the OS/filesystem knows the LBAs
> are free, and will re-use them sooner or later.  As soon as they are
> re-used, the SSD will mark the old physical blocks as unneeded and can
> garbage-collect them.  Without TRIM, this collection is delayed - but it
> still happens, and as long as the SSD has other free blocks, the delay
> has no impact on performance.
>
>>
>>> As far as I understand TRIM, among other things, it allows the SSD
>>> to combine the invalid pages into a block so the block can be
>>> erased thus making the pages ready to be written individually and
>>> avoiding the read-erase-modify-write of the block when a page
>>> changes, i.e. write amplification.
>>
>> It will do this with or without TRIM. TRIM simply is a mechanism for
>> the file system to inform the SSD of this in advance, in the case of
>> file deletions, where it may be some time before the SSD is informed
>> those blocks are "free" when the file system decides to reuse those
>> sectors.
>>
>>> Even if it does a read-modify-write to a new block then acks the
>>> write and does the erase after in the background it's still
>>> overhead in the read-modify-write i.e. read a whole block, modify a
>>> page, write a whole block, instead of just being able to write a
>>> page.
>
> The SSD doesn't do that.  If you make a change to data that is in a page in
> the middle of an erase block, it is only that page that is copied (for
> RMW) to another free page in the same or a different erase block.  The
> original page is marked "unneeded".  TRIM makes no difference to this
> process.  All it does is make it more likely that the other pages in the
> same block are marked "unneeded" at an earlier stage, so the whole old
> block can be recycled earlier.  But as I said above, doing this earlier
> or later makes no difference to performance.
>

Ok but what about making a change to a page in a block whose other pages 
are valid? The whole block gets moved then the old block is later 
erased? That's what I'm understanding which sounds ok.

I think I was overthinking this. If a page changes, the only way to do
that is a read-modify-write of the block to wherever. So it might as
well be to an already erased block. I was getting hung up on having
erased pages in the blocks that can be immediately and just written.
Period. But that only occurs when appending data to a file. Let the
filesystem and SSD's do their thing...

I'm really thinking I don't need TRIM now. And when it is finally in the
kernel I can maybe try it. I was worried that if I didn't do it from the
start it would be too late later, after the SSD's had been used for a
while, to get the full benefit of it.




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-31 17:34             ` Curtis J Blank
@ 2012-10-31 20:04               ` David Brown
  2012-11-01  1:54                 ` Curtis J Blank
  0 siblings, 1 reply; 23+ messages in thread
From: David Brown @ 2012-10-31 20:04 UTC (permalink / raw)
  To: Curtis J Blank; +Cc: linux-raid

On 31/10/12 18:34, Curtis J Blank wrote:
> On 10/31/12 03:32, David Brown wrote:
>>
>> There will always be pathological cases like this where TRIM could be a
>> win.  But on the other hand, there are pathological cases where TRIM
>> causes great slowdowns - such as deleting a lot of files (as sending
>> TRIM commands is very slow).
>>
>> If you actually want to use your SSD in such a way, with lots of big,
>> fast deletions and writings, then you can help it out by
>> "short-stroking" it.  You take your new SSD (or newly "secure erased"
>> SSD) and partition it to only use part of the space - leave some extra
>> at the end.  This extra space increases the over-provisioning of the
>> disk, and therefore increases the amount of free blocks you have at any
>> given time.
>>
>
> I was planning, all the partitions i.e. mount points will be below 50%
> used, most way below that and I don't see them filling up. That is on
> purpose, these SSD's are for the OS to gain performance and not a lot
> of data storage with the exception of mysql.
>
> So, if I have unused space at the end of the SSD, say 60G out of the
> 256G don't use it, don't partition it, the SSD will use it for whatever?
> It will know that it can use it when in a RAID1 set? Or make the raidset
> only using cylinders to 196G and partition that leaving the rest unused?
>

If you want to leave extra space to improve the over-provisioning (it is 
typically not necessary with more high-end SSDs, but you might want to 
do it anyway), then it is important that the extra space is never 
written.  The easiest way to ensure that is to leave extra space during 
partitioning.  But be careful with raid - you have to use the 
partition(s) for your raid devices, not the disk, or else you will write 
to the entire SSD during the initial raid1 sync.

A typical arrangement would be to make a 1 GB partition at the start of 
each SSD, then perhaps a 4 GB partition, then a big partition of about 
200 GB in this case.  Make a raid1 with metadata 1.0 from the first 
partition of each disk for /boot, to make life easier for the 
bootloader.  Use the second partition of each disk for swap (no need for 
raid here unless you are really concerned about uptime in the face of 
disk failure and you actually expect to use swap significantly - in 
which case go for raid1 or raid10 if you have more than 2 disks).  Use 
the third partition for your main raid (such as raid1, or perhaps 
something else if you have more than two disks).
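
In mdadm terms, that arrangement would look something like this (a
sketch only - device names and exact sizes are assumptions, adjust to
suit):

  # identical partitions on each disk; the tail beyond 205GiB is unused
  for d in /dev/sda /dev/sdb ; do
      parted -s $d mklabel gpt
      parted -s $d mkpart boot 1MiB 1GiB
      parted -s $d mkpart swap 1GiB 5GiB
      parted -s $d mkpart main 5GiB 205GiB
  done

  # /boot as raid1 with 1.0 metadata (superblock at the end, so the
  # bootloader sees what looks like a plain filesystem)
  mdadm --create /dev/md0 --level=1 --metadata=1.0 \
        --raid-devices=2 /dev/sda1 /dev/sdb1

  # the main array, with the current default metadata
  mdadm --create /dev/md1 --level=1 --metadata=1.2 \
        --raid-devices=2 /dev/sda3 /dev/sdb3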

> Ok, the only areas that will have a lot of writes are /var/log, logs are
> moved to a dated directory every 24 hours then gzip'd tarballed after 14
> days and the tarball kept and the logs erased. Sounds like the normal
> filesystem reuse of blocks will negate the need for TRIM. I do want
> /var/log on the SSD's because a lot of logging is done and want the
> performance there so as to keep iowait as low as possible.
>

That sounds fine.

However, note that writing files like logs should not normally cause 
delays - no matter how slow the disks.  The writes will simply buffer up 
in ram and be written out when there is the opportunity - processes 
don't have to wait for the writes to complete.  Speed (and latency) is 
only really important for reads (since processes will typically have to 
wait for the read to complete), and synchronised writes (where the 
application waits until it is sure the data hits the platter).  Even 
reads are not an issue if they are re-reads of data in the cache, and 
you have plenty of memory.

Still, there is no harm in putting /var/log on an SSD.

> /home with user accounts, mine only really, getting email will cause a
> lot of activity so maybe /home doesn't need to be on the SSD. Don't
> really need SSD performance there. Same for /usr/local which is a MP and
> /usr/local/src is where I do all my code development.
>

Unless you have huge amounts of data, put it on the SSD anyway.

> /mysql where all my DB's are and are very active and I want on the SSD's
> for the performance. Is this a good idea or not? Two DB's are very active
> one doing mostly inserts and updates so not too bad there, another doing
> a real lot of inserts and deletes. If you're familiar with ZoneMinder
> and how events are saved then later deleted a real lot of activity there.

Put the DB's on the SSD.

As with all database applications, if you can get enough memory to have 
most work done without reading from disks, it will go faster.

With decent SSD's (and since you have quite big ones, I assume they are 
good quality), there is no harm in writing lots.  You can probably write 
at 30 MB/s continuously for years before causing any wearout on the disk.

>
>>
>> I'd add a case 3 to your list:
>>
>> Case 3: A file is erased.  If you have TRIM, the data blocks used by the
>> file can be marked as "unneeded" by the SSD.  Without TRIM, the SSD
>> thinks they are still important.  But the OS/filesystem knows the LBAs
>> are free, and will re-use them sooner or later.  As soon as they are
>> re-used, the SSD will mark the old physical blocks as unneeded and can
>> garbage-collect them.  Without TRIM, this collection is delayed - but it
>> still happens, and as long as the SSD has other free blocks, the delay
>> has no impact on performance.
>>
>>>
>>>> As far as I understand TRIM, among other things, it allows the SSD
>>>> to combine the invalid pages into a block so the block can be
>>>> erased thus making the pages ready to be written individually and
>>>> avoiding the read-erase-modify-write of the block when a page
>>>> changes, i.e. write amplification.
>>>
>>> It will do this with or without TRIM. TRIM is simply a mechanism for
>>> the file system to inform the SSD of this in advance, in the case of
>>> file deletions; otherwise it may be some time before the SSD learns
>>> those blocks are "free", i.e. not until the file system decides to
>>> reuse those sectors.
>>>
>>>> Even if it does a read-modify-write to a new block then acks the
>>>> write and does the erase after in the background it's still
>>>> overhead in the read-modify-write i.e. read a whole block, modify a
>>>> page, write a whole block, instead of just being able to write a
>>>> page.
>>
>> The SSD doesn't do that.  If you make a change to data that is in a page in
>> the middle of an erase block, it is only that page that is copied (for
>> RMW) to another free page in the same or a different erase block.  The
>> original page is marked "unneeded".  TRIM makes no difference to this
>> process.  All it does is make it more likely that the other pages in the
>> same block are marked "unneeded" at an earlier stage, so the whole old
>> block can be recycled earlier.  But as I said above, doing this earlier
>> or later makes no difference to performance.
>>
>
> Ok but what about making a change to a page in a block whose other pages
> are valid? The whole block gets moved then the old block is later
> erased? That's what I'm understanding which sounds ok.

No, the changed page will get re-mapped to a different page somewhere 
else - the unchanged data will remain where it was.  That data will only 
get moved if it makes sense for "defragmenting" to free up erase blocks, 
or as part of wear-levelling routines.

>
> I think I was overthinking this. If a page changes, the only way to do
> that is a read-modify-write of the block to wherever. So it might as
> well be to an already erased block. I was getting hung up on having
> erased pages in the blocks that can be immediately and just written.
> Period. But that only occurs when appending data to a file. Let the
> filesystem and SSD's do their thing...
>
> I'm really thinking I don't need TRIM now. And when it is finally in the
> kernel I can maybe try it. I was worried that if I didn't do it from the
> start it would be too late later, after the SSD's had been used for a
> while, to get the full benefit of it.
>


I think what you really want to use is "fstrim" - this walks through a 
filesystem's metadata, identifies free blocks, and sends TRIM commands 
for each of them.  Obviously this can take a bit of time, and will slow 
down the disks while it is working, but you typically do it with a cron 
job in the middle of the night.

<http://www.vdmeulen.net/cgi-bin/man/man2html?fstrim+8>
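
Something like this in cron would do it (a sketch - the mount points
are examples, and fstrim needs a filesystem and kernel that support the
FITRIM ioctl):

  # /etc/cron.d/fstrim - trim free space in the small hours
  0 3 * * *   root  /sbin/fstrim -v /     >> /var/log/fstrim.log 2>&1
  15 3 * * *  root  /sbin/fstrim -v /home >> /var/log/fstrim.log 2>&1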


I don't think the patches for passing TRIM through the md layer have yet 
made it to mainstream distro kernels, but once they do you can run fstrim.



Incidentally, have a look at the figures in this:

<https://patrick-nagel.net/blog/archives/337>

A sample size of 1 web page is not great statistical evidence, but the 
difference in the times for "sync" are quite large...




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-31 20:04               ` David Brown
@ 2012-11-01  1:54                 ` Curtis J Blank
  2012-11-01  8:15                   ` David Brown
  0 siblings, 1 reply; 23+ messages in thread
From: Curtis J Blank @ 2012-11-01  1:54 UTC (permalink / raw)
  To: David Brown; +Cc: linux-raid

On 10/31/12 15:04, David Brown wrote:
> On 31/10/12 18:34, Curtis J Blank wrote:
>> On 10/31/12 03:32, David Brown wrote:
>>
>> I was planning, all the partitions i.e. mount points will be below 50%
>> used, most way below that and I don't see them filling up. That is on
>> purpose, these SSD's are for the OS to gain performance and not a lot
>> of data storage with the exception of mysql.
>>
>> So, if I have unused space at the end of the SSD, say 60G out of the
>> 256G don't use it, don't partition it, the SSD will use it for whatever?
>> It will know that it can use it when in a RAID1 set? Or make the raidset
>> only using cylinders to 196G and partition that leaving the rest unused?
>>
>
> If you want to leave extra space to improve the over-provisioning (it is
> typically not necessary with more high-end SSDs, but you might want to
> do it anyway), then it is important that the extra space is never
> written.  The easiest way to ensure that is to leave extra space during
> partitioning.  But be careful with raid - you have to use the
> partition(s) for your raid devices, not the disk, or else you will write
> to the entire SSD during the initial raid1 sync.
>
> A typical arrangement would be to make a 1 GB partition at the start of
> each SSD, then perhaps a 4 GB partition, then a big partition of about
> 200 GB in this case.  Make a raid1 with metadata 1.0 from the first
> partition of each disk for /boot, to make life easier for the
> bootloader.  Use the second partition of each disk for swap (no need for
> raid here unless you are really concerned about uptime in the face of
> disk failure and you actually expect to use swap significantly - in
> which case go for raid1 or raid10 if you have more than 2 disks).  Use
> the third partition for your main raid (such as raid1, or perhaps
> something else if you have more than two disks).

David, first off I want to say thanks for all the advice and your time. 
This was what I was looking for to make informed decisions and I see I 
came to the right place.

Yep, that's the way I do it: partition the disk, then use the partitions
in the raid, not the whole disk. Although I do make more partitions and
more mount points, only so that one thing can't use up all the space and
break other things. But still, no single one will be over 50% utilization.

Oh, and I do raid swap - not because it's used a lot, it's not, but
because raiding everything else and leaving a single point of failure
kind of defeats the purpose, unless the goal is only to protect the
data. Mine is that and uptime.

>
>> Ok, the only areas that will have a lot of writes are /var/log, logs are
>> moved to a dated directory every 24 hours then gzip'd tarballed after 14
>> days and the tarball kept and the logs erased. Sounds like the normal
>> filesystem reuse of blocks will negate the need for TRIM. I do want
>> /var/log on the SSD's because a lot of logging is done and want the
>> performance there so as to keep iowait as low as possible.
>>
>
> That sounds fine.
>
> However, note that writing files like logs should not normally cause
> delays - no matter how slow the disks.  The writes will simply buffer up
> in ram and be written out when there is the opportunity - processes
> don't have to wait for the writes to complete.  Speed (and latency) is
> only really important for reads (since processes will typically have to
> wait for the read to complete), and synchronised writes (where the
> application waits until it is sure the data hits the platter).  Even
> reads are not an issue if they are re-reads of data in the cache, and
> you have plenty of memory.
>
> Still, there is no harm in putting /var/log on an SSD.
>
>> /home with user accounts, mine only really, getting email will cause a
>> lot of activity so maybe /home doesn't need to be on the SSD. Don't
>> really need SSD performance there. Same for /usr/local which is a MP and
>> /usr/local/src is where I do all my code development.
>>
>
> Unless you have huge amounts of data, put it on the SSD anyway.
>
>> /mysql where all my DB's are and are very active and I want on the SSD's
>> for the performance. Is this a good idea or not? Two DB's are very active
>> one doing mostly inserts and updates so not too bad there, another doing
>> a real lot of inserts and deletes. If you're familiar with ZoneMinder
>> and how events are saved then later deleted a real lot of activity there.
>
> Put the DB's on the SSD.
>
> As with all database applications, if you can get enough memory to have
> most work done without reading from disks, it will go faster.
>
> With decent SSD's (and since you have quite big ones, I assume they are
> good quality), there is no harm in writing lots.  You can probably write
> at 30 MB/s continuously for years before causing any wearout on the disk.
>

Memory is currently at 16G; when I get around to it, which won't be in
the too distant future, it will be 32G. I'm fully aware of that and try
to have everything running in memory.

The SSD's are OCZ Vertex 4 VTX4-25SAT3-256G. I hope they're good ones. 
I'm trying to get their PEC just because I want to know. I'm also going 
to try and get the over-provisioning number, again just so I know.

I still haven't decided whether to connect the SSD's to the motherboard 
which is SATA III and use Linux raid or connect them to my Areca 1882i 
battery backed up caching raid controller which is also SATA III. Kind 
of hinges on whether or not the controller passes discard. It's their
second generation card, PCIe 2.0, not the new third generation PCIe 3.0
card. I'm trying to find that out too.

I'd like to hear your thoughts on this. My thinking is the performance
would really scream on the 1882i. And it just dawned on me that if I use
the motherboard I might not be able to use the noop scheduler, which is
what I currently use with my ARC-1220 because it has all the disks.
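
For reference, this is how I set it today (the device name here is just
an example):

  # show the available schedulers - the active one is in brackets
  cat /sys/block/sda/queue/scheduler
  # noop deadline [cfq]

  # switch this one device to noop
  echo noop > /sys/block/sda/queue/scheduler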

>>
>> Ok but what about making a change to a page in a block whose other pages
>> are valid? The whole block gets moved then the old block is later
>> erased? That's what I'm understanding which sounds ok.
>
> No, the changed page will get re-mapped to a different page somewhere
> else - the unchanged data will remain where it was.  That data will only
> get moved if it makes sense for "defragmenting" to free up erase blocks,
> or as part of wear-levelling routines.

Got it.

>
>>
>> I think I was overthinking this. If a page changes, the only way to do
>> that is a read-modify-write of the block to wherever. So it might as
>> well be to an already erased block. I was getting hung up on having
>> erased pages in the blocks that can be immediately and just written.
>> Period. But that only occurs when appending data to a file. Let the
>> filesystem and SSD's do their thing...
>>
>> I'm really thinking I don't need TRIM now. And when it is finally in the
>> kernel I can maybe try it. I was worried that if I didn't do it from the
>> start it would be too late later, after the SSD's had been used for a
>> while, to get the full benefit of it.
>>
>
>
> I think what you really want to use is "fstrim" - this walks through a
> filesystem's metadata, identifies free blocks, and sends TRIM commands for
> each of them.  Obviously this can take a bit of time, and will slow down
> the disks while working, but you typically do it with a cron job in the
> middle of the night.
>
> <http://www.vdmeulen.net/cgi-bin/man/man2html?fstrim+8>
>

Yep, this sounds like the ticket. I was aware of it but didn't pursue it.

>
> I don't think the patches for passing TRIM through the md layer have yet
> made it to mainstream distro kernels, but once they do you can run fstrim.
>

Neil Brown told me probably 3.7, so we'll see I guess. It's becoming
less important to me, though it will be nice when they do. I haven't
totally ruled out building a kernel with the patches, but I'm leaning
towards not doing it.

>
>
> Incidentally, have a look at the figures in this:
>
> <https://patrick-nagel.net/blog/archives/337>
>
> A sample size of 1 web page is not great statistically evidence, but the
> difference in the times for "sync" are quite large...

That says pretty much what I've learned so far, and the numbers are
interesting. It sort of says not to use TRIM continuously in real time.

>
>


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Best way (only?) to setup SSD's for using TRIM
  2012-11-01  1:54                 ` Curtis J Blank
@ 2012-11-01  8:15                   ` David Brown
  2012-11-01 15:01                     ` Wolfgang Denk
  0 siblings, 1 reply; 23+ messages in thread
From: David Brown @ 2012-11-01  8:15 UTC (permalink / raw)
  To: Curtis J Blank; +Cc: linux-raid

On 01/11/2012 02:54, Curtis J Blank wrote:
> On 10/31/12 15:04, David Brown wrote:
>> On 31/10/12 18:34, Curtis J Blank wrote:
>>> On 10/31/12 03:32, David Brown wrote:
>>>
>>> I was planning, all the partitions i.e. mount points will be below 50%
>>> used, most way below that and I don't see them filling up. That is on
>>> purpose, these SSD's are for the OS to gain performance and not a lot
>>> of data storage with the exception of mysql.
>>>
>>> So, if I have unused space at the end of the SSD, say 60G out of the
>>> 256G don't use it, don't partition it, the SSD will use it for whatever?
>>> It will know that it can use it when in a RAID1 set? Or make the raidset
>>> only using cylinders to 196G and partition that leaving the rest unused?
>>>
>>
>> If you want to leave extra space to improve the over-provisioning (it is
>> typically not necessary with more high-end SSDs, but you might want to
>> do it anyway), then it is important that the extra space is never
>> written.  The easiest way to ensure that is to leave extra space during
>> partitioning.  But be careful with raid - you have to use the
>> partition(s) for your raid devices, not the disk, or else you will write
>> to the entire SSD during the initial raid1 sync.
>>
>> A typical arrangement would be to make a 1 GB partition at the start of
>> each SSD, then perhaps a 4 GB partition, then a big partition of about
>> 200 GB in this case.  Make a raid1 with metadata 1.0 from the first
>> partition of each disk for /boot, to make life easier for the
>> bootloader.  Use the second partition of each disk for swap (no need for
>> raid here unless you are really concerned about uptime in the face of
>> disk failure and you actually expect to use swap significantly - in
>> which case go for raid1 or raid10 if you have more than 2 disks).  Use
>> the third partition for your main raid (such as raid1, or perhaps
>> something else if you have more than two disks).
>
> David, first off I want to say thanks for all the advice and your time.
> This was what I was looking for to make informed decisions and I see I
> came to the right place.
>

No problem.  I learn a lot by making suggestions here, and having other 
people correct me!  So if my advice had been badly wrong, I expect 
someone else would have said so by now.

> Yep, that's the way I do it, partition the disk then use the partitions
> in the raid, not the whole disk. Although I do make more partitions and
> more mount points only so that one thing can't use up all the space and
> break other things. But still any one won't be over 50% utilization.

If you make your big raid1 pair an LVM physical volume, you can split it 
into logical volumes as and when you want, and re-size them whenever 
necessary.  Note, however, that the unpartitioned space within the LVM 
physical volume is still "used" as far as the SSD is concerned, since 
the initial raid1 synchronisation has written to it.  So only space 
outside the raid1 partition acts as extra over-provisioning.  (Not that 
you will need much extra, if any.)

Of course, you can always start with a 50% size partition for your raid1 
pair, leaving (almost) 50% of the SSD completely unused.  And if you 
want more space, you can just add another partition of say 30% on each 
disk, match them up as a raid1 pair, put a new LVM physical volume onto 
it, then add that physical volume to the volume group.  You end up with 
the same data in the same place, with only a tiny overhead for the LVM 
indirection.
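
Roughly like this (a sketch - device names and the volume group name
are assumptions):

  # initial setup: a raid1 pair from one partition on each disk,
  # with LVM on top
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
  pvcreate /dev/md1
  vgcreate vg0 /dev/md1

  # later, when you need more space: a second raid1 pair from new
  # partitions, added to the same volume group
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda4 /dev/sdb4
  pvcreate /dev/md2
  vgextend vg0 /dev/md2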

Once no-sync tracking is in place for md raid, it will be easier, as 
there is no initial sync for raid1 (everything is marked no-sync).  In 
that case, space that is not partitioned within the LVM physical volume 
will not be written to at all, and will therefore act as extra 
over-provisioning until you actually need it.

If your SSDs do transparent compression, then another trick is to write 
blocks of zero to unused space (you can do this across the whole disk 
before partitioning).  Blocks of zero compress rather well, so take tiny 
amounts of physical space on the disk - and the freed space is then 
extra recyclable blocks.
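
On such a drive, that is just (a sketch - and obviously only on a disk,
or region, holding nothing you want to keep):

  # fill the device with zeros; a compressing controller stores almost
  # nothing, and the logical space becomes recyclable.  dd stops with a
  # "no space left" error when it reaches the end - that is expected.
  dd if=/dev/zero of=/dev/sdX bs=1M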

>
> Oh, and I do raid swap - not because it's used a lot, it's not, but
> because raiding everything else and leaving a single point of failure
> kind of defeats the purpose, unless the goal is only to protect the
> data. Mine is that and uptime.

That makes lots of sense.

I find swap useful even on machines with lots of ram - I put /tmp and 
/var/tmp on tmpfs mounts, and sometimes use tmpfs mounts in other 
places.  tmpfs is always the fastest filesystem, as it has no overheads 
for safety or to match sector layouts on disk.  And with plenty of swap, 
you don't have to worry about the space it takes - anything beyond 
memory will automatically spill out to disk (making it slower, but still 
faster than putting those same files on a disk filesystem).
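
In /etc/fstab that is simply (sizes are a matter of taste):

  # RAM-backed /tmp and /var/tmp, spilling out to swap under pressure
  tmpfs  /tmp      tmpfs  defaults,size=4G  0 0
  tmpfs  /var/tmp  tmpfs  defaults,size=2G  0 0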

<snip>

>> Put the DB's on the SSD.
>>
>> As with all database applications, if you can get enough memory to have
>> most work done without reading from disks, it will go faster.
>>
>> With decent SSD's (and since you have quite big ones, I assume they are
>> good quality), there is no harm in writing lots.  You can probably write
>> at 30 MB/s continuously for years before causing any wearout on the disk.
>>
>
> Memory is currently at 16G, when I get around to it which won't be in
> the too distant future it will be 32G. I'm fully aware and try to have
> everything running in memory
>
> The SSD's are OCZ Vertex 4 VTX4-25SAT3-256G. I hope they're good ones.
> I'm trying to get their PEC just because I want to know. I'm also going
> to try and get the over provisioned number, again just so I know.
>
> I still haven't decided whether to connect the SSD's to the motherboard
> which is SATA III and use Linux raid or connect them to my Areca 1882i
> battery backed up caching raid controller which is also SATA III. Kind
> of hinges on whether or not the controller passes discard. It's their
> second generation card PCIe 2.0 not the new third generation PCIe 3.0
> card. Trying to find that out too.

One thing to be very careful about with raid cards is that they can add 
a lot of latency to SSDs.  You can end up dropping your IOPS by a factor 
of 20 or more.  So check that the card works well with SSDs before using it.
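
One quick way to check is a small random-read test against the raw
device through each path (a sketch using fio, which you may need to
install; the device name is an example):

  # 4k random reads at queue depth 1, to expose per-command latency;
  # --readonly guarantees fio never writes to the device
  fio --name=latency --filename=/dev/sdX --readonly \
      --rw=randread --bs=4k --iodepth=1 --direct=1 \
      --ioengine=libaio --runtime=30 --time_based

Run it once with the SSD on the motherboard ports and once behind the
raid card, and compare the completion latencies.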

For two disks, I'd connect them directly to the motherboard SATA (and 
use an external UPS).  But that depends on how much you value the 
battery on the raid card, and how likely you see the risk of a system 
crash (there is slightly lower chance of data loss via a raid card with 
battery cache in such circumstances).

>
> Like to hear your thoughts on this. My thinking is the performance would
> really scream on the 1882i. And it just dawned on me if I use the
> motherboard I might not be able to use the noop scheduler which is what
> I currently use with my ARC-1220 because it has all the disks.
>

I would be very surprised if it ran faster on the raid card than 
connected directly to the motherboard SATA.  Raid cards can, sometimes, 
give you higher speeds for raid5/6 compared to direct connections.  In 
particular, they help if you have a large number of disks (though with 
the latest md raid multithreading for raid5/6, that will probably 
change).  But generally speaking, a raid card is not for speed - 
especially not for SSDs where the extra layer will add noticeable 
latency.  Your CPU, motherboard and memory are more than capable of 
saturating two fast SSDs - how could a raid card go any faster?


>>>
>>> Ok but what about making a change to a page in a block whose other pages
>>> are valid? The whole block gets moved then the old block is later
>>> erased? That's what I'm understanding which sounds ok.
>>
>> No, the changed page will get re-mapped to a different page somewhere
>> else - the unchanged data will remain where it was.  That data will only
>> get moved if it makes sense for "defragmenting" to free up erase blocks,
>> or as part of wear-levelling routines.
>
> Got it.
>
>>
>>>
>>> I think I was overthinking this. If a page changes, the only way to do
>>> that is a read-modify-write of the block to wherever. So it might as
>>> well be to an already erased block. I was getting hung up on having
>>> erased pages in the blocks that can be immediately and just written.
>>> Period. But that only occurs when appending data to a file. Let the
>>> filesystem and SSD's do their thing...
>>>
>>> I'm really thinking I don't need TRIM now. And when it is finally in the
>>> kernel I can maybe try it. I was worried that if I didn't do it from the
>>> start it would be too late later, after the SSD's had been used for a
>>> while, to get the full benefit of it.
>>>
>>
>>
>> I think what you really want to use is "fstrim" - this walks through a
>> filesystem's metadata, identifies free blocks, and sends TRIM commands for
>> each of them.  Obviously this can take a bit of time, and will slow down
>> the disks while working, but you typically do it with a cron job in the
>> middle of the night.
>>
>> <http://www.vdmeulen.net/cgi-bin/man/man2html?fstrim+8>
>>
>
> Yep, this sounds like the ticket. I was aware of it but didn't pursue it.
>

I haven't tried fstrim myself.  Some day I must upgrade my ageing Fedora 
14 system so that I can play with these new toys instead of just reading 
about them...

>>
>> I don't think the patches for passing TRIM through the md layer have yet
>> made it to mainstream distro kernels, but once they do you can run
>> fstrim.
>>
>
> Neil Brown told me probably 3.7, so we'll see I guess. It's becoming
> less important to me though, but maybe nice when they do. I haven't
> totally ruled out building a kernel with the patches but leaning towards
> not doing it.
>
>>
>>
>> Incidentally, have a look at the figures in this:
>>
>> <https://patrick-nagel.net/blog/archives/337>
>>
>> A sample size of 1 web page is not great statistical evidence, but the
>> difference in the times for "sync" are quite large...
>
> That says pretty much what I've learned so far, and the numbers are
> interesting. It sort of says not to use TRIM continuously in real time.
>



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Best way (only?) to setup SSD's for using TRIM
  2012-11-01  8:15                   ` David Brown
@ 2012-11-01 15:01                     ` Wolfgang Denk
  2012-11-01 16:41                       ` David Brown
  0 siblings, 1 reply; 23+ messages in thread
From: Wolfgang Denk @ 2012-11-01 15:01 UTC (permalink / raw)
  To: David Brown; +Cc: Curtis J Blank, linux-raid

Dear David,

In message <50922FA4.7070702@hesbynett.no> you wrote:
>
> If you make your big raid1 pair an LVM physical volume, you can split it 
> into logical volumes as and when you want, and re-size them whenever 
> necessary.  Note, however, that the unpartitioned space within the LVM 
> physical volume is still "used" as far as the SSD is concerned, since 
> the initial raid1 synchronisation has written to it.  So only space 

What if the creation of the array was done with "--assume-clean" ?

Best regards,

Wolfgang Denk

-- 
DENX Software Engineering GmbH,     MD: Wolfgang Denk & Detlev Zundel
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-10 Fax: (+49)-8142-66989-80 Email: wd@denx.de
Common sense and a sense of humor  are  the  same  thing,  moving  at
different speeds.  A sense of humor is just common sense, dancing.
                                                        - Clive James

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Best way (only?) to setup SSD's for using TRIM
  2012-11-01 15:01                     ` Wolfgang Denk
@ 2012-11-01 16:41                       ` David Brown
  0 siblings, 0 replies; 23+ messages in thread
From: David Brown @ 2012-11-01 16:41 UTC (permalink / raw)
  To: Wolfgang Denk; +Cc: Curtis J Blank, linux-raid

On 01/11/12 16:01, Wolfgang Denk wrote:
> Dear David,
>
> In message <50922FA4.7070702@hesbynett.no> you wrote:
>>
>> If you make your big raid1 pair an LVM physical volume, you can split it
>> into logical volumes as and when you want, and re-size them whenever
>> necessary.  Note, however, that the unpartitioned space within the LVM
>> physical volume is still "used" as far as the SSD is concerned, since
>> the initial raid1 synchronisation has written to it.  So only space
>
> What if the creation of the array was done with "--assume-clean" ?
>

I think in that case you will avoid writing to the disks - but you will 
have trouble if you try to check or scrub the disk, or if it runs a 
resync due to an unclean shutdown.

I suppose new SSDs may consistently return all zeros or all ones when 
read - so if they are consistent then you should be fine to use 
"--assume-clean" (for raid1).

However, I think that resyncs currently write data to the second drive 
without checking if it is out of sync, which would put you back where 
you started.  (I believe there are patches on their way to change that 
behaviour to save writes on SSDs - but re-writing everything on the 
second disk is the fastest resync method for normal hard disk setups.)



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Best way (only?) to setup SSD's for using TRIM
  2012-10-31 14:11               ` David Brown
@ 2012-11-13 13:39                 ` Ric Wheeler
  2012-11-13 15:13                   ` David Brown
  0 siblings, 1 reply; 23+ messages in thread
From: Ric Wheeler @ 2012-11-13 13:39 UTC (permalink / raw)
  To: David Brown; +Cc: Alexander Haase, Chris Murphy, linux-raid

On 10/31/2012 10:11 AM, David Brown wrote:
> On 31/10/2012 14:12, Alexander Haase wrote:
>> Has anyone considered handling TRIM via an idle IO queue? You'd have to
>> purge queue items that conflicted with incoming writes, but it does get
>> around the performance complaint. If the idle period never comes, old
>> TRIMs can be silently dropped to lessen queue bloat.
>>
>
> I am sure it has been considered - but is it worth the effort and the 
> complications?  TRIM has been implemented in several filesystems (ext4 and, I 
> believe, btrfs) - but is disabled by default because it typically slows down 
> the system.  You are certainly correct that putting TRIM at the back of the 
> queue will avoid the delays it causes - but it still will not give any 
> significant benefit (except for old SSDs with limited garbage collection and 
> small over-provisioning), and you have a lot of extra complexity to ensure 
> that a TRIM is never pushed back until after a new write to the same logical 
> sectors.

I think that you are vastly understating the need for discard support, 
whatever your first-hand experience is, so let me inject some facts into 
this thread from working on this for several years (with vendors) :)

Overview:

* In Linux, we have "discard" support which vectors down into the device 
appropriate method (TRIM for S-ATA, UNMAP/WRITE_SAME+UNMAP for SCSI, just 
discard for various SW only block devices)
* There is support for inline discard in many file systems (ext4, xfs, btrfs, 
gfs2, ...)
* There is support for "batched" discard (still online) via tools like fstrim

Every SSD device benefits from TRIM and the SSD companies test this code with 
the upstream community.

In our testing with various devices, the inline (mount -o discard) can have a 
performance impact so typically using the batched method is better.

For SCSI arrays (less an issue here on this list), the discard allows for 
over-provisioning of LUN's.

Device mapper has support (newly added) for dm-thinp targets which can do the 
same without hardware support.

>
> It would be much easier and safer, and give much better effect, to make sure 
> the block allocation procedure for filesystems emphasised re-writing old 
> blocks as soon as possible (when on an SSD).  Then there is no need for TRIM 
> at all.  This would have the added benefit of working well for compressed (or 
> sparse) hard disk image files used by virtual machines - such image files only 
> take up real disk space for blocks that are written, so re-writes would save 
> real-world disk space.

Above you are mixing the need for TRIM (which allows devices like SSD's to do 
wear levelling and performance tuning on physical blocks) with the virtual block 
layout of SSD devices. Please keep in mind that the block space advertised out 
to a file system is contiguous, but SSDs internally remap the physical 
blocks aggressively. Think of physical DRAM and your virtual memory layout.

Doing a naive always-allocate-and-reuse-the-lowest-block scheme would have a 
horrendous performance impact on certain devices. Even on SSD's, where seek 
time is negligible, having to do lots of small IO's instead of larger, 
contiguous IO's is much slower.

Regards,

Ric


>
>> As far as parity consistency, bitmaps could track which stripes (and
>> blocks within those stripes) are expected to be out of parity (also
>> useful for lazy device init). Maybe a bit-per-stripe map at the logical
>> device level and a bit-per-LBA bitmap at the stripe level?
>
> Tracking "no-sync" areas of a raid array is already high on the md raid 
> things-to-do list (perhaps it is already implemented - I lose track of which 
> features are planned and which are implemented). And yes, such no-sync 
> tracking would be useful here.  But it is complicated, especially for raid5/6 
> (raid1 is not too bad) - should TRIMs that cover part of a stripe be dropped?  
> Should the md layer remember them and coalesce them when it can TRIM a whole 
> stripe?  Should it try to track partial synchronisation within a stripe?
>
> Or should the md developers simply say that since supporting TRIM is not going 
> to have any measurable benefits (certainly not with the sort of SSD's people 
> use in raid arrays), and since TRIM slows down some operations, it is better 
> to keep things simple and ignore TRIM entirely?  Even if there are occasional 
> benefits to having TRIM, is it worth it in the face of added complication in 
> the code and the risk of errors?
>
> There /have/ been developers working on TRIM support on raid5.  It seems to 
> have been a complicated process.  But some people like a challenge!
>
>>
>>> On the other hand, does it hurt if empty blocks are out of parity (due
>> to TRIM or lazy device init)? The parity recovery of garbage is still
>> garbage, which is what any sane FS expects from unused blocks. If and
>> when you do a parity scrub, you will spend a lot of time recovering
>> garbage and undo any good TRIM might have done, but usual drive
>> operation should quickly balance that out in a write-intensive
>> environment where idle TRIM might help.
>>
>
> Yes, it "hurts" if empty blocks are out of sync.  One obvious issue is that you 
> will get errors when scrubbing - the md layer has no way of knowing that these 
> are unimportant (assuming there is no no-sync tracking), so any real problems 
> will be hidden by the unimportant ones.
>
> Another issue is for RMW cycles on raid5.  Small writes are done by reading 
> the old data, reading the old parity, writing the new data and the new parity 
> - but that only works if the parity was correct across the whole stripe.  Even 
> if raid5 TRIM is restricted to whole stripes, a later small write to that 
> stripe will be a disaster if it is not in sync.
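As a concrete illustration of the RMW hazard in the quoted text above - 
a minimal sketch assuming plain single-XOR parity as in raid5, with toy 
byte strings rather than md's actual code:

    # Read-modify-write of one data chunk in a raid5 stripe: the new
    # parity is derived from the OLD parity, so if the on-disk parity
    # was never consistent (e.g. the stripe was TRIMmed or never
    # synced), the result is garbage even for this small write.
    def rmw_update(old_data, new_data, old_parity):
        assert len(old_data) == len(new_data) == len(old_parity)
        return bytes(p ^ o ^ n
                     for p, o, n in zip(old_parity, old_data, new_data))

    d0, d1 = b'\x10' * 4, b'\x22' * 4
    parity = bytes(a ^ b for a, b in zip(d0, d1))  # consistent stripe
    new_d0 = b'\x55' * 4
    parity = rmw_update(d0, new_d0, parity)
    assert parity == bytes(a ^ b for a, b in zip(new_d0, d1))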


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Best way (only?) to setup SSD's for using TRIM
  2012-11-13 13:39                 ` Ric Wheeler
@ 2012-11-13 15:13                   ` David Brown
  2012-11-13 15:39                     ` Ric Wheeler
  0 siblings, 1 reply; 23+ messages in thread
From: David Brown @ 2012-11-13 15:13 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Alexander Haase, Chris Murphy, linux-raid

On 13/11/2012 14:39, Ric Wheeler wrote:
> On 10/31/2012 10:11 AM, David Brown wrote:
>> On 31/10/2012 14:12, Alexander Haase wrote:
>>> Has anyone considered handling TRIM via an idle IO queue? You'd have to
>>> purge queue items that conflicted with incoming writes, but it does get
>>> around the performance complaint. If the idle period never comes, old
>>> TRIMs can be silently dropped to lessen queue bloat.
>>>
>>
>> I am sure it has been considered - but is it worth the effort and the
>> complications?  TRIM has been implemented in several filesystems (ext4
>> and, I believe, btrfs) - but is disabled by default because it
>> typically slows down the system.  You are certainly correct that
>> putting TRIM at the back of the queue will avoid the delays it causes
>> - but it still will not give any significant benefit (except for old
>> SSDs with limited garbage collection and small over-provisioning),
>> and you have a lot of extra complexity to ensure that a TRIM is never
>> pushed back until after a new write to the same logical sectors.
>
> I think that you are vastly understating the need for discard support,
> whatever your first-hand experience is, so let me inject some facts into
> this thread from working on this for several years (with vendors) :)
>

That is quite possible - my experience is limited.  My aim in this 
discussion is not to say that TRIM should be ignored completely, but to 
ask if it really is necessary, and if its benefits outweigh its 
disadvantages and the added complexity.  I am trying to dispel the 
widely held myths that TRIM is essential, that SSDs are painfully slow 
without it, that SSDs do not work with RAID because RAID does not 
support TRIM, and that you must always enable TRIM (and "discard" mount 
options) to get the best from your SSDs.

Nothing makes me happier here than seeing someone with strong experience 
from multiple vendors bringing in some facts - so thank you for your 
comments and help here.

> Overview:
>
> * In Linux, we have "discard" support which vectors down into the device
> appropriate method (TRIM for S-ATA, UNMAP/WRITE_SAME+UNMAP for SCSI,
> just discard for various SW only block devices)
> * There is support for inline discard in many file systems (ext4, xfs,
> btrfs, gfs2, ...)
> * There is support for "batched" discard (still online) via tools like
> fstrim
>

OK.

> Every SSD device benefits from TRIM and the SSD companies test this code
> with the upstream community.
>
> In our testing with various devices, the inline (mount -o discard) can
> have a performance impact so typically using the batched method is better.
>

I am happy to see you confirm this.  I think fstrim is a much more 
practical choice than inline trim for many uses (with SATA SSD's at 
least - SCSI/SAS SSD's have better "trim" equivalents with less 
performance impact, since they can be queued).  I also think fstrim will 
work better along with RAID and other layered systems, since it will 
have fewer, larger TRIMs and allow the RAID system to trim whole stripes 
at a time (and just drop any leftovers).
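
Roughly what I mean by trimming whole stripes - a sketch only, with 
made-up chunk and disk counts, and certainly not what the md patches 
actually do:

    def whole_stripe_trim(start, length, chunk_sectors, n_data_disks):
        """Clip a discard request [start, start+length), in array
        sectors, to the largest subrange covering only complete
        stripes; the partial-stripe leftovers at each end are simply
        dropped."""
        stripe = chunk_sectors * n_data_disks
        first = -(-start // stripe) * stripe          # round start up
        last = ((start + length) // stripe) * stripe  # round end down
        if last <= first:
            return None   # the request covers no complete stripe
        return first, last - first

    # e.g. 64k chunks (128 sectors) over 3 data disks:
    print(whole_stripe_trim(100, 2000, 128, 3))   # -> (384, 1536)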

> For SCSI arrays (less an issue here on this list), the discard allows
> for over-provisioning of LUN's.
>
> Device mapper has support (newly added) for dm-thinp targets which can
> do the same without hardware support.
>
>>
>> It would be much easier and safer, and give much better effect, to
>> make sure the block allocation procedure for filesystems emphasised
>> re-writing old blocks as soon as possible (when on an SSD).  Then
>> there is no need for TRIM at all.  This would have the added benefit
>> of working well for compressed (or sparse) hard disk image files used
>> by virtual machines - such image files only take up real disk space
>> for blocks that are written, so re-writes would save real-world disk
>> space.
>
> Above you are mixing the need for TRIM (which allows devices like SSD's
> to do wear levelling and performance tuning on physical blocks) with the
> virtual block layout of SSD devices. Please keep in mind that the block
> space advertised out to a file system is contiguous, but SSDs
> internally remap the physical blocks aggressively. Think of physical
> DRAM and your virtual memory layout.

I don't think I am mixing these concepts - but I might well be 
expressing myself badly.

Suppose the disk has logical blocks log000 to log499, and physical 
blocks phy000 to phy599.  The filesystem sees 500 blocks, which the 
SSD's firmware maps onto the 600 physical blocks as needed (20% 
overprovisioning).  We start off with a blank SSD.

The filesystem writes out a file to blocks log000 through log009.  The 
SSD has to map these to physical blocks, and picks phy000 through phy009.

Then the filesystem deletes that file.  Logical blocks log000 to log009 
are now free for re-use by the filesystem.  But without TRIM, the SSD 
does not know that - so it must preserve phy000 to phy009.

Then the filesystem writes a new 10-block file.  If it picks log010 to 
log019 for the logical blocks, then the SSD will write them to phy010 
through phy019.  Everything works fine, but the SSD is carrying around 
these extra physical blocks that it believes are important, because they 
are still mapped to logical blocks log000 to log009, and the SSD does 
not know they are now unused.

But if instead the filesystem wrote the new file to log000 to log009, we 
would have a different case.  The SSD would again allocate phy010 to 
phy019, since it needs to use blank blocks.  But now the SSD has changed 
the mapping for log000 to phy010 instead of phy000, and knows that 
physical blocks phy000 to phy009 are not needed - without a logical 
block mapping, they cannot be accessed by the file system.  So these 
physical blocks can be re-cycled in exactly the same manner as if they 
were TRIM'ed.

In this way, if the filesystem is careful about re-using free logical 
blocks (rather than aiming for low fragmentation and contiguous block 
allocation, as done for hard disk speed), there is no need for TRIM. 
The only benefit of TRIM is to move the recycling process to a slightly 
earlier stage - but I believe that effect would be negligible with 
appropriate overprovisioning.

That's my theory, anyway.
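
To make that concrete, here is a toy Python model of the mapping 
described above - purely illustrative, and certainly not how any real 
SSD firmware works:

    # 500 logical blocks over 600 physical blocks (20% over-provisioning).
    class ToyFTL:
        def __init__(self, physical=600):
            self.map = {}                      # logical -> physical block
            self.free = list(range(physical))  # erased physical blocks
            self.stale = set()                 # dead data, recyclable

        def write(self, lblock):
            old = self.map.get(lblock)
            if old is not None:
                # Overwrite: the old physical block is implicitly dead,
                # so the FTL can recycle it even without TRIM.
                self.stale.add(old)
            self.map[lblock] = self.free.pop(0)

    ftl = ToyFTL()
    for b in range(10):
        ftl.write(b)                # file written to log000..log009
    # file deleted; without TRIM the SSD still pins phy000..phy009
    for b in range(10, 20):
        ftl.write(b)                # new file placed at log010..log019
    assert not ftl.stale            # nothing reclaimable
    for b in range(10):
        ftl.write(b)                # ...but re-using log000..log009
    assert len(ftl.stale) == 10     # frees phy000..phy009, as if TRIMmed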

>
> Doing a naive always-allocate-and-reuse-the-lowest-block scheme would have
> a horrendous performance impact on certain devices. Even on SSD's, where
> seek time is negligible, having to do lots of small IO's instead of larger,
> contiguous IO's is much slower.

Clearly the allocation algorithms would have to be different for SSDs 
and hard disks (and I realise this complicates matters - an aim with the 
block device system is to keep things device independent when possible. 
  There is always someone who wants to make a three-way raid1 mirror 
from an SSD, a hard disk partition, and a block of memory exported by 
iSCSI from a remote server - and it is great that they can do so).  And 
clearly having lots of small IOs will increase overheads and reduce any 
performance benefits.  But somewhere here is the possibility to bias the 
filesystems' allocation schemes towards reuse, giving most of the 
benefits of TRIM "for free".

It may also be the case that filesystems already do this, and I am 
recommending a re-invention of a wheel that is already optimised - 
obviously you will know that far better than me.  I am just trying to 
come up with helpful ideas.

mvh.,

David


>
> Regards,
>
> Ric
>
>
>>
>>> As far as parity consistency, bitmaps could track which stripes (and
>>> blocks within those stripes) are expected to be out of parity (also
>>> useful for lazy device init). Maybe a bit-per-stripe map at the logical
>>> device level and a bit-per-LBA bitmap at the stripe level?
>>
>> Tracking "no-sync" areas of a raid array is already high on the md
>> raid things-to-do list (perhaps it is already implemented - I lose
>> track of which features are planned and which are implemented). And
>> yes, such no-sync tracking would be useful here.  But it is
>> complicated, especially for raid5/6 (raid1 is not too bad) - should
>> TRIMs that cover part of a stripe be dropped? Should the md layer
>> remember them and coalesce them when it can TRIM a whole stripe?
>> Should it try to track partial synchronisation within a stripe?
>>
>> Or should the md developers simply say that since supporting TRIM is
>> not going to have any measurable benefits (certainly not with the sort
>> of SSD's people use in raid arrays), and since TRIM slows down some
>> operations, it is better to keep things simple and ignore TRIM
>> entirely?  Even if there are occasional benefits to having TRIM, is it
>> worth it in the face of added complication in the code and the risk of
>> errors?
>>
>> There /have/ been developers working on TRIM support on raid5.  It
>> seems to have been a complicated process.  But some people like a
>> challenge!
>>
>>>
>>> On the other hand, does it hurt if empty blocks are out of parity (due
>>> to TRIM or lazy device init)? The parity recovery of garbage is still
>>> garbage, which is what any sane FS expects from unused blocks. If and
>>> when you do a parity scrub, you will spend a lot of time recovering
>>> garbage and undo any good TRIM might have done, but usual drive
>>> operation should quickly balance that out in a write-intensive
>>> environment where idle TRIM might help.
>>>
>>
>> Yes, it "hurts" if empty blocks are out of sync.  One obvious issue is
>> that you will get errors when scrubbing - the md layer has no way of
>> knowing that these are unimportant (assuming there is no no-sync
>> tracking), so any real problems will be hidden by the unimportant ones.
>>
>> Another issue is for RMW cycles on raid5.  Small writes are done by
>> reading the old data, reading the old parity, writing the new data and
>> the new parity - but that only works if the parity was correct across
>> the whole stripe.  Even if raid5 TRIM is restricted to whole stripes,
>> a later small write to that stripe will be a disaster if it is not in
>> sync.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Best way (only?) to setup SSD's for using TRIM
  2012-11-13 15:13                   ` David Brown
@ 2012-11-13 15:39                     ` Ric Wheeler
  0 siblings, 0 replies; 23+ messages in thread
From: Ric Wheeler @ 2012-11-13 15:39 UTC (permalink / raw)
  To: David Brown; +Cc: Alexander Haase, Chris Murphy, linux-raid

On 11/13/2012 10:13 AM, David Brown wrote:
> On 13/11/2012 14:39, Ric Wheeler wrote:
>> On 10/31/2012 10:11 AM, David Brown wrote:
>>> On 31/10/2012 14:12, Alexander Haase wrote:
>>>> Has anyone considered handling TRIM via an idle IO queue? You'd have to
>>>> purge queue items that conflicted with incoming writes, but it does get
>>>> around the performance complaint. If the idle period never comes, old
>>>> TRIMs can be silently dropped to lessen queue bloat.
>>>>
>>>
>>> I am sure it has been considered - but is it worth the effort and the
>>> complications?  TRIM has been implemented in several filesystems (ext4
>>> and, I believe, btrfs) - but is disabled by default because it
>>> typically slows down the system.  You are certainly correct that
>>> putting TRIM at the back of the queue will avoid the delays it causes
>>> - but it still will not give any significant benefit (except for old
>>> SSDs with limited garbage collection and small over-provisioning),
>>> and you have a lot of extra complexity to ensure that a TRIM is never
>>> pushed back until after a new write to the same logical sectors.
>>
>> I think that you are vastly understating the need for discard support,
>> whatever your first-hand experience is, so let me inject some facts into
>> this thread from working on this for several years (with vendors) :)
>>
>
> That is quite possible - my experience is limited.  My aim in this discussion 
> is not to say that TRIM should be ignored completely, but to ask if it really 
> is necessary, and if its benefits outweigh its disadvantages and the added 
> complexity.  I am trying to dispel the widely held myths that TRIM is 
> essential, that SSDs are painfully slow without it, that SSDs do not work with 
> RAID because RAID does not support TRIM, and that you must always enable TRIM 
> (and "discard" mount options) to get the best from your SSDs.

It really is required; the question and the challenge is how to use it 
correctly and how to apply the right technique to the right device.

If you have an extremely light workload on any device (an SSD in a laptop used 
for web browsing?), this probably won't matter for a long time but also would 
not impact your performance much since you are not pushing a lot of IO :)

>
> Nothing makes me happier here than seeing someone with strong experience from 
> multiple vendors bringing in some facts - so thank you for your comments and 
> help here.
>
>> Overview:
>>
>> * In Linux, we have "discard" support which vectors down into the device
>> appropriate method (TRIM for S-ATA, UNMAP/WRITE_SAME+UNMAP for SCSI,
>> just discard for various SW only block devices)
>> * There is support for inline discard in many file systems (ext4, xfs,
>> btrfs, gfs2, ...)
>> * There is support for "batched" discard (still online) via tools like
>> fstrim
>>
>
> OK.
>
>> Every SSD device benefits from TRIM and the SSD companies test this code
>> with the upstream community.
>>
>> In our testing with various devices, the inline (mount -o discard) can
>> have a performance impact so typically using the batched method is better.
>>
>
> I am happy to see you confirm this.  I think fstrim is a much more practical 
> choice than inline trim for many uses (with SATA SSD's at least - SCSI/SAS 
> SSD's have better "trim" equivalents with less performance impact, since they 
> can be queued).  I also think fstrim will work better along with RAID and 
> other layered systems, since it will have fewer, larger TRIMs and allow the 
> RAID system to trim whole stripes at a time (and just drop any leftovers).

The basic observation - again important to note that this is for S-ATA devices, 
not all discard-enabled devices - is that an ATA_TRIM command takes about the 
same time regardless of the size being trimmed. It is also currently a 
non-queued command, so we shut down NCQ (draining the queue for a S-ATA device 
on each command).

Basically, it is a good idea for S-ATA to use fewer commands to minimize 
that impact.
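
In other words, batch adjacent free extents into as few TRIM ranges as 
possible before issuing them. A sketch of the idea only - not the 
kernel's actual code, and the 64-range cap is just ATA_TRIM's usual 
one-sector payload limit (64 eight-byte range entries) as I recall it:

    def coalesce_extents(extents, max_ranges=64):
        """Merge adjacent or overlapping (start, length) free extents
        so a batched discard needs as few commands as possible."""
        merged = []
        for start, length in sorted(extents):
            if merged and start <= merged[-1][0] + merged[-1][1]:
                prev_start, prev_len = merged[-1]
                end = max(prev_start + prev_len, start + length)
                merged[-1] = (prev_start, end - prev_start)
            else:
                merged.append((start, length))
        return merged[:max_ranges]

    print(coalesce_extents([(0, 8), (8, 8), (32, 4), (30, 2)]))
    # -> [(0, 16), (30, 6)]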

The standards body T13 is thinking about fixing the non-queueable issue so this 
might improve.

Note again, there are loads of other device types where this is not such 
an impact.

As a footnote, if you want to see the various bits of capability we scrape out 
of devices, we put a lot of information into /sys/block/sda/queue (discard 
support, etc).
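
For example, a quick way to dump those limits - the file names here 
(discard_granularity, discard_max_bytes, discard_zeroes_data) are the 
sysfs entries as I know them from current kernels; a missing file just 
means the kernel predates it:

    from pathlib import Path

    def discard_capabilities(dev='sda'):
        """Report the discard-related limits the kernel exposes for a
        block device under /sys/block/<dev>/queue."""
        q = Path('/sys/block') / dev / 'queue'
        names = ('discard_granularity', 'discard_max_bytes',
                 'discard_zeroes_data')
        return {n: (q / n).read_text().strip()
                       if (q / n).exists() else None
                for n in names}

    # discard_granularity of '0' generally means no discard support:
    print(discard_capabilities('sda'))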

>
>> For SCSI arrays (less an issue here on this list), the discard allows
>> for over-provisioning of LUN's.
>>
>> Device mapper has support (newly added) for dm-thinp targets which can
>> do the same without hardware support.
>>
>>>
>>> It would be much easier and safer, and give much better effect, to
>>> make sure the block allocation procedure for filesystems emphasised
>>> re-writing old blocks as soon as possible (when on an SSD). Then
>>> there is no need for TRIM at all.  This would have the added benefit
>>> of working well for compressed (or sparse) hard disk image files used
>>> by virtual machines - such image files only take up real disk space
>>> for blocks that are written, so re-writes would save real-world disk
>>> space.
>>
>> Above you are mixing the need for TRIM (which allows devices like SSD's
>> to do wear levelling and performance tuning on physical blocks) with the
>> virtual block layout of SSD devices. Please keep in mind that the block
>> space advertised out to a file system is contiguous, but SSDs
>> internally remap the physical blocks aggressively. Think of physical
>> DRAM and your virtual memory layout.
>
> I don't think I am mixing these concepts - but I might well be expressing 
> myself badly.
>
> Suppose the disk has logical blocks log000 to log499, and physical blocks 
> phy000 to phy599.  The filesystem sees 500 blocks, which the SSD's firmware 
> maps onto the 600 physical blocks as needed (20% overprovisioning).  We start 
> off with a blank SSD.
>
> The filesystem writes out a file to blocks log000 through log009. The SSD has 
> to map these to physical blocks, and picks phy000 through phy009.
>
> Then the filesystem deletes that file.  Logical blocks log000 to log009 are 
> now free for re-use by the filesystem.  But without TRIM, the SSD does not 
> know that - so it must preserve phy000 to phy009.
>
> Then the filesystem writes a new 10-block file.  If it picks log010 to log019 
> for the logical blocks, then the SSD will write them to phy010 through 
> phy019.  Everything works fine, but the SSD is carrying around these extra 
> physical blocks that it believes are important, because they are still mapped 
> to logical blocks log000 to log009, and the SSD does not know they are now 
> unused.
>
> But if instead the filesystem wrote the new file to log000 to log009, we would 
> have a different case.  The SSD would again allocate phy010 to phy019, since 
> it needs to use blank blocks. But now the SSD has changed the mapping for 
> log000 to phy010 instead of phy000, and knows that physical blocks phy000 to 
> phy009 are not needed - without a logical block mapping, they cannot be 
> accessed by the file system.  So these physical blocks can be re-cycled in 
> exactly the same manner as if they were TRIM'ed.
>
> In this way, if the filesystem is careful about re-using free logical blocks 
> (rather than aiming for low fragmentation and contiguous block allocation, as 
> done for hard disk speed), there is no need for TRIM. The only benefit of TRIM 
> is to move the recycling process to a slightly earlier stage - but I believe 
> that effect would be negligible with appropriate overprovisioning.
>
> That's my theory, anyway.

I think that any assumption that an SSD's logical layout maps onto the same 
physical layout is optimistic.

Still not clear to me why you are trying to combine the two concepts.

Putting things together (contiguous allocation) in your virtual address space 
(block space) is good since you can issue larger IO's to get your file read 
from the device into DRAM.

Letting the target storage device know what is used/unused (discard) is totally 
unrelated. It allows the device to optimize/garbage collect/wear level/etc.

Not that simple allocation schemes are a bad idea for SSD's (why work harder 
than you need to, avoid wasting CPU cycles, etc.), but that is not tied to 
whether or not you discard.

If you want to see a lot of hard data on SSD's, there is a fairly solid body of 
work published at USENIX FAST conferences (www.usenix.org) including work on 
various firmware ideas, testing, etc.

>
>>
>> Doing a naive always-allocate-and-reuse-the-lowest-block scheme would have
>> a horrendous performance impact on certain devices. Even on SSD's, where
>> seek time is negligible, having to do lots of small IO's instead of larger,
>> contiguous IO's is much slower.
>
> Clearly the allocation algorithms would have to be different for SSDs and hard 
> disks (and I realise this complicates matters - an aim with the block device 
> system is to keep things device independent when possible.  There is always 
> someone who wants to make a three-way raid1 mirror from an SSD, a hard disk 
> partition, and a block of memory exported by iSCSI from a remote server - and 
> it is great that they can do so).  And clearly having lots of small IOs will 
> increase overheads and reduce any performance benefits.  But somewhere here is 
> the possibility to bias the filesystems' allocation schemes towards reuse, 
> giving most of the benefits of TRIM "for free".
>
> It may also be the case that filesystems already do this, and I am 
> recommending a re-invention of a wheel that is already optimised - obviously 
> you will know that far better than me.  I am just trying to come up with 
> helpful ideas.
>
> mvh.,
>
> David
>
>

It is unfortunately not just SSD's versus S-ATA spindles. We have SAS SSD's, 
PCI-e SSD's, enterprise arrays (SCSI LUNs), consumer S-ATA SSD's and 
software-only discard-enabled devices.

We do work hard to deduce the generic type of the device (again, see the 
/sys/block information) but we need to be careful not to spin into a ton of 
device-specific algorithms :)

Ric



^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2012-11-13 15:39 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-10-28 18:59 Best way (only?) to setup SSD's for using TRIM Curtis J Blank
     [not found] ` <CAH3kUhHX28yNXggLuA+D_cH0STY-Rn_BjxVt_bh1sMeYLnM0cw@mail.gmail.com>
2012-10-29 14:35   ` Curtis J Blank
     [not found]   ` <508E9289.5070904@curtronics.com>
     [not found]     ` <CAH3kUhEdOO+GXKK6ALFUYJdYeTw2Mx-PF9M=0vQvkzzidihxSg@mail.gmail.com>
2012-10-29 17:08       ` Curt Blank
2012-10-29 18:06         ` Roberto Spadim
2012-10-30  9:49 ` David Brown
2012-10-30 14:29   ` Curtis J Blank
2012-10-30 14:33     ` Roberto Spadim
2012-10-30 15:55     ` David Brown
2012-10-30 18:30       ` Curt Blank
2012-10-30 18:43         ` Roberto Spadim
2012-10-30 19:59         ` Chris Murphy
2012-10-31  8:32           ` David Brown
2012-10-31 13:44             ` Roberto Spadim
     [not found]             ` <CAJEsFnkM9w0kNbNd51ShP0uExvsZE6V9h3WKKs3nxWfncUCYJA@mail.gmail.com>
2012-10-31 14:11               ` David Brown
2012-11-13 13:39                 ` Ric Wheeler
2012-11-13 15:13                   ` David Brown
2012-11-13 15:39                     ` Ric Wheeler
2012-10-31 17:34             ` Curtis J Blank
2012-10-31 20:04               ` David Brown
2012-11-01  1:54                 ` Curtis J Blank
2012-11-01  8:15                   ` David Brown
2012-11-01 15:01                     ` Wolfgang Denk
2012-11-01 16:41                       ` David Brown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).