* Btrfs/SSD
@ 2017-04-14 11:02 Imran Geriskovan
  2017-04-17 11:53 ` Btrfs/SSD Austin S. Hemmelgarn
  0 siblings, 1 reply; 49+ messages in thread
From: Imran Geriskovan @ 2017-04-14 11:02 UTC (permalink / raw)
  To: linux-btrfs

Hi,
Some time ago we had a discussion about SSDs.
Within the limits of unknown/undocumented device information,
we loosely covered data retention capability/disk age/lifetime
interrelations, the (in?)effectiveness of btrfs dup on SSDs, etc.

Now that time has passed and some experience with SSDs has accumulated,
I think we can have another status check/update on them if you
share your experiences and best practices.

So if you have something to share about SSDs (it may or may not be
directly related to btrfs) I'm sure everybody here will be happy to
hear it.

Regards,
Imran


* Re: Btrfs/SSD
  2017-04-14 11:02 Btrfs/SSD Imran Geriskovan
@ 2017-04-17 11:53 ` Austin S. Hemmelgarn
  2017-04-17 16:58   ` Btrfs/SSD Chris Murphy
                     ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-17 11:53 UTC (permalink / raw)
  To: Imran Geriskovan, linux-btrfs

On 2017-04-14 07:02, Imran Geriskovan wrote:
> Hi,
> Some time ago we had a discussion about SSDs.
> Within the limits of unknown/undocumented device information,
> we loosely covered data retention capability/disk age/lifetime
> interrelations, the (in?)effectiveness of btrfs dup on SSDs, etc.
>
> Now that time has passed and some experience with SSDs has accumulated,
> I think we can have another status check/update on them if you
> share your experiences and best practices.
>
> So if you have something to share about SSDs (it may or may not be
> directly related to btrfs) I'm sure everybody here will be happy to
> hear it.

General info (not BTRFS specific):
* Based on SMART attributes and other factors, current life expectancy 
for light usage (normal desktop usage) appears to be somewhere around 
8-12 years depending on specifics of usage (assuming the same workload, 
F2FS is at the very top of the range, BTRFS and NILFS2 are on the upper 
end, XFS is roughly in the middle, ext4 and NTFS are on the low end 
(tested using Windows 7's NTFS driver), and FAT32 is an outlier at the 
bottom of the barrel).
* Queued DISCARD support is still missing in most consumer SATA SSD's, 
which in turn makes the trade-off on those between performance and 
lifetime much sharper.
* Modern (2015 and newer) SSD's seem to have better handling in the FTL 
for the journaling behavior of filesystems like ext4 and XFS.  I'm not 
sure if this is actually a result of the FTL being better, or some 
change in the hardware.
* In my personal experience, Intel, Samsung, and Crucial appear to be 
the best name brands (in relative order of quality).  I have personally 
had bad experiences with SanDisk and Kingston SSD's, but I don't have 
anything beyond circumstantial evidence indicating that it was anything 
but bad luck on both counts.

Regarding BTRFS specifically:
* Given my recently newfound understanding of what the 'ssd' mount 
option actually does, I'm inclined to recommend that people who are 
using high-end SSD's _NOT_ use it as it will heavily increase 
fragmentation and will likely have near zero impact on actual device 
lifetime (but may _hurt_ performance).  It will still probably help with 
mid and low-end SSD's.
* Files with NOCOW and filesystems with 'nodatacow' set will both hurt 
performance for BTRFS on SSD's, and appear to reduce the lifetime of the 
SSD.
* Compression should help performance and device lifetime most of the 
time, unless your CPU is fully utilized on a regular basis (in which 
case it will hurt performance, but still improve device lifetimes).
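
For anyone wanting to experiment with the options above, a rough sketch of
how they are typically set (the UUID, mountpoint, and choice of lzo are
placeholders rather than recommendations for any particular drive):

  # Illustrative /etc/fstab entry; adjust UUID and mountpoint to your system.
  # 'nossd' overrides autodetection on a drive where you don't want the ssd
  # allocator behaviour; swap in 'ssd' to compare.
  UUID=0123abcd-0000-0000-0000-000000000000  /data  btrfs  noatime,compress=lzo,nossd  0 0

  # Compression can also be toggled at runtime without editing fstab:
  mount -o remount,compress=lzo /data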


* Re: Btrfs/SSD
  2017-04-17 11:53 ` Btrfs/SSD Austin S. Hemmelgarn
@ 2017-04-17 16:58   ` Chris Murphy
  2017-04-17 17:13     ` Btrfs/SSD Austin S. Hemmelgarn
  2017-04-18 13:02   ` Btrfs/SSD Imran Geriskovan
  2017-05-12  4:51   ` Btrfs/SSD Duncan
  2 siblings, 1 reply; 49+ messages in thread
From: Chris Murphy @ 2017-04-17 16:58 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Imran Geriskovan, Btrfs BTRFS

On Mon, Apr 17, 2017 at 5:53 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

> Regarding BTRFS specifically:
> * Given my recently newfound understanding of what the 'ssd' mount option
> actually does, I'm inclined to recommend that people who are using high-end
> SSD's _NOT_ use it as it will heavily increase fragmentation and will likely
> have near zero impact on actual device lifetime (but may _hurt_
> performance).  It will still probably help with mid and low-end SSD's.

What is a high end SSD these days? Built-in NVMe?



> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt
> performance for BTRFS on SSD's, and appear to reduce the lifetime of the
> SSD.

Can you elaborate? It's an interesting problem. On a small scale, the
systemd folks have journald set +C on /var/log/journal so that any new
journals are nocow. There is an initial fallocate, but the write
behavior is writing in the same place at the head and tail. But at the
tail, the writes get pushed toward the middle. So the file is growing
into its fallocated space from the tail. The header changes in the
same location; it's an overwrite.

So long as this file is not reflinked or snapshotted, filefrag shows a
pile of mostly 4096-byte blocks, thousands of them. But as they're pretty
much all contiguous, the file fragmentation (extent count) is usually never
higher than 12. It meanders between 1 and 12 extents for its life.

Except on the system using the ssd_spread mount option. That one has a
journal file that is +C and is not being snapshotted, but it has over 3000
extents according to both filefrag and btrfs-progs/debugfs. Really weird.
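
For anyone who wants to reproduce this kind of check, roughly (assuming the
stock journal location; the glob stands in for the machine-id directory):

  lsattr -d /var/log/journal                    # 'C' here means new journals inherit nocow
  filefrag /var/log/journal/*/system.journal    # extent count per journal file
  filefrag -v /var/log/journal/*/system.journal # per-extent detail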

Now, systemd aside, there are databases that behave this same way,
where there's a small section constantly being overwritten, and one or
more sections that grow the database file from within and at the end.
If this is made cow, the file will absolutely fragment a ton, and
especially so if the changes are mostly 4KiB block sizes that are then
fsync'd.

It's almost like we need these things to not fsync at all, and just
rely on the filesystem commit time...





-- 
Chris Murphy


* Re: Btrfs/SSD
  2017-04-17 16:58   ` Btrfs/SSD Chris Murphy
@ 2017-04-17 17:13     ` Austin S. Hemmelgarn
  2017-04-17 18:24       ` Btrfs/SSD Roman Mamedov
  2017-04-17 18:34       ` Btrfs/SSD Chris Murphy
  0 siblings, 2 replies; 49+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-17 17:13 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Imran Geriskovan, Btrfs BTRFS

On 2017-04-17 12:58, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 5:53 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>> Regarding BTRFS specifically:
>> * Given my recently newfound understanding of what the 'ssd' mount option
>> actually does, I'm inclined to recommend that people who are using high-end
>> SSD's _NOT_ use it as it will heavily increase fragmentation and will likely
>> have near zero impact on actual device lifetime (but may _hurt_
>> performance).  It will still probably help with mid and low-end SSD's.
>
> What is a high end SSD these days? Built-in NVMe?
One with a good FTL in the firmware.  At minimum, the good Samsung EVO 
drives, the high quality Intel ones, and the Crucial MX series, but 
probably some others.  My choice of words here probably wasn't the best 
though.
>
>
>
>> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt
>> performance for BTRFS on SSD's, and appear to reduce the lifetime of the
>> SSD.
>
> Can you elaborate? It's an interesting problem. On a small scale, the
> systemd folks have journald set +C on /var/log/journal so that any new
> journals are nocow. There is an initial fallocate, but the write
> behavior is writing in the same place at the head and tail. But at the
> tail, the writes get pushed toward the middle. So the file is growing
> into its fallocated space from the tail. The header changes in the
> same location; it's an overwrite.
For a normal filesystem or BTRFS with nodatacow or NOCOW, the block gets 
rewritten in-place.  This means that cheap FTL's will rewrite that erase 
block in-place (which won't hurt performance but will impact device 
lifetime), and good ones will rewrite into a free block somewhere else 
but may not free that original block for quite some time (which is bad 
for performance but slightly better for device lifetime).

When BTRFS does a COW operation on a block, however, it will guarantee 
that that block moves.  Because of this, the old location will either:
1. Be discarded by the FS itself if the 'discard' mount option is set.
2. Be caught by a scheduled call to 'fstrim'.
3. Lie dormant for at least a while.

The first case is ideal for most FTL's, because it lets them know 
immediately that that data isn't needed and the space can be reused. 
The second is close to ideal, but defers telling the FTL that the block 
is unused, which can be better on some SSD's (some have firmware that 
handles wear-leveling better in batches).  The third is not ideal, but 
is still better than what happens with NOCOW or nodatacow set.

Overall, this boils down to the fact that most FTL's get slower if they 
can't wear-level the device properly, and in-place rewrites make it 
harder for them to do proper wear-leveling.
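
To make the first two cases concrete, a sketch with a hypothetical device
and mountpoint:

  # Case 1: the filesystem issues discards itself as blocks are freed
  mount -o ssd,discard /dev/sdb1 /mnt

  # Case 2: mount without 'discard' and batch the work periodically instead
  fstrim -v /mnt
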
>
> So long as this file is not reflinked or snapshotted, filefrag shows a
> pile of mostly 4096-byte blocks, thousands of them. But as they're pretty
> much all contiguous, the file fragmentation (extent count) is usually never
> higher than 12. It meanders between 1 and 12 extents for its life.
>
> Except on the system using the ssd_spread mount option. That one has a
> journal file that is +C and is not being snapshotted, but it has over 3000
> extents according to both filefrag and btrfs-progs/debugfs. Really weird.
Given how the 'ssd' mount option behaves and the frequency that most 
systemd instances write to their journals, that's actually reasonably 
expected.  We look for big chunks of free space to write into and then 
align to 2M regardless of the actual size of the write, which in turn 
means that files like the systemd journal which see lots of small 
(relatively speaking) writes will have way more extents than they should 
until you defragment them.
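
For illustration, that defragmentation step would look something like this
(the machine-id in the path is made up; -v just lists the files processed):

  btrfs filesystem defragment -v \
      /var/log/journal/0123456789abcdef0123456789abcdef/system.journal
  filefrag /var/log/journal/0123456789abcdef0123456789abcdef/system.journal
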
>
> Now, systemd aside, there are databases that behave this same way,
> where there's a small section constantly being overwritten, and one or
> more sections that grow the database file from within and at the end.
> If this is made cow, the file will absolutely fragment a ton, and
> especially so if the changes are mostly 4KiB block sizes that are then
> fsync'd.
>
> It's almost like we need these things to not fsync at all, and just
> rely on the filesystem commit time...
Essentially yes, but that causes all kinds of other problems.


* Re: Btrfs/SSD
  2017-04-17 17:13     ` Btrfs/SSD Austin S. Hemmelgarn
@ 2017-04-17 18:24       ` Roman Mamedov
  2017-04-17 19:22         ` Btrfs/SSD Imran Geriskovan
  2017-04-18  3:23         ` Btrfs/SSD Duncan
  2017-04-17 18:34       ` Btrfs/SSD Chris Murphy
  1 sibling, 2 replies; 49+ messages in thread
From: Roman Mamedov @ 2017-04-17 18:24 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Imran Geriskovan, Btrfs BTRFS

On Mon, 17 Apr 2017 07:53:04 -0400
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> General info (not BTRFS specific):
> * Based on SMART attributes and other factors, current life expectancy 
> for light usage (normal desktop usage) appears to be somewhere around 
> 8-12 years depending on specifics of usage (assuming the same workload, 
> F2FS is at the very top of the range, BTRFS and NILFS2 are on the upper 
> end, XFS is roughly in the middle, ext4 and NTFS are on the low end 
> (tested using Windows 7's NTFS driver), and FAT32 is an outlier at the 
> bottom of the barrel).

Life expectancy for an SSD is defined not in years, but in TBW (terabytes
written), and AFAICT that's not "from host", but "to flash" (some SSDs will
show you both values in two separate SMART attributes out of the box; on some
it can be unlocked). Filesystems come into play only through the amount of
write amplification they cause (how much "to flash" exceeds "from host").
Do you have any test data showing that FSes are ranked in that order by the WA
they cause, or is it all about "general feel" and how they are branded (F2FS
says so, so it must be the best)?
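
For reference, these counters are normally read via SMART; a rough sketch
(the device name is assumed, and attribute names vary by vendor, so the grep
patterns are only examples):

  smartctl -A /dev/sda                               # full attribute table
  smartctl -A /dev/sda | grep -Ei 'Total_LBAs_Written|NAND|Wear'
  # write amplification ~ (writes to flash) / (writes from host),
  # where a drive exposes both counters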

> * Queued DISCARD support is still missing in most consumer SATA SSD's, 
> which in turn makes the trade-off on those between performance and 
> lifetime much sharper.

My choice was to make a script to run from crontab, using "fstrim" on all
mounted SSDs nightly, and aside from that all FSes are mounted with
"nodiscard". Best of both worlds, and no interference with actual IO
operations.
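
A minimal sketch of that kind of nightly job (the script name and mountpoint
list are placeholders; the real list would come from your own fstab):

  #!/bin/sh
  # e.g. /etc/cron.daily/fstrim-all
  for mnt in / /home /data; do
      fstrim -v "$mnt"
  done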

> * Modern (2015 and newer) SSD's seem to have better handling in the FTL 
> for the journaling behavior of filesystems like ext4 and XFS.  I'm not 
> sure if this is actually a result of the FTL being better, or some 
> change in the hardware.

Again, what makes you think this? Did you observe the write amplification
readings, and are those now demonstrably lower than on "2014 and older" SSDs?
If so, by how much, and which models did you compare?

> * In my personal experience, Intel, Samsung, and Crucial appear to be 
> the best name brands (in relative order of quality).  I have personally 
> had bad experiences with SanDisk and Kingston SSD's, but I don't have 
> anything beyond circumstantial evidence indicating that it was anything 
> but bad luck on both counts.

Why not think in terms of platforms rather than "name brands", i.e. a controller
model + flash combination? For instance, Intel has been using some other
companies' controllers in their SSDs. Kingston uses tons of various
controllers (SandForce/Phison/Marvell/more?) depending on the model and range.

> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt 
> performance for BTRFS on SSD's, and appear to reduce the lifetime of the 
> SSD.

"Appear to"? Just... what. So how many SSDs did you have fail under nocow?

Or maybe we can get serious in a technical discussion? Did you by any chance
mean that it causes more writes to the SSD and more "to flash" writes (resulting
in a higher WA)? If so, then by how much, and what was your test scenario
comparing the same usage with and without nocow?

> * Compression should help performance and device lifetime most of the 
> time, unless your CPU is fully utilized on a regular basis (in which 
> case it will hurt performance, but still improve device lifetimes).

The days are long gone when the end user ever had to think about device lifetimes
with SSDs. Refer to endurance studies such as
http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead
http://ssdendurancetest.com/
https://3dnews.ru/938764/
It has been demonstrated that all SSDs on the market tend to overshoot even
their rated TBW by several times; as a result, it will take any user literally
dozens of years to wear out the flash no matter which filesystem or
settings are used. And it most certainly is not worth changing anything
significant in your workflow (such as enabling compression if it's otherwise
inconvenient or not needed) just to save SSD lifetime.

On Mon, 17 Apr 2017 13:13:39 -0400
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> > What is a high end SSD these days? Built-in NVMe?
> One with a good FTL in the firmware.  At minimum, the good Samsung EVO 
> drives, the high quality Intel ones

As opposed to bad Samsung EVO drives and low-quality Intel ones?

> and the Crucial MX series, but 
> probably some others.  My choice of words here probably wasn't the best 
> though.

Again, which controller? Crucial does not manufacture SSD controllers on their
own; they just package and brand stuff manufactured by someone else. So if you
meant Marvell-based SSDs, then that's many brands, not just Crucial.

> For a normal filesystem or BTRFS with nodatacow or NOCOW, the block gets 
> rewritten in-place.  This means that cheap FTL's will rewrite that erase 
> block in-place (which won't hurt performance but will impact device 
> lifetime), and good ones will rewrite into a free block somewhere else 
> but may not free that original block for quite some time (which is bad 
> for performance but slightly better for device lifetime).

> Overall, this boils down to the fact that most FTL's get slower if they 
> can't wear-level the device properly, and in-place rewrites make it 
> harder for them to do proper wear-leveling.

"Cheap FTL" in your description would be an USB flash stick, or maybe an old
CompactFlash card. Even the 2010 era mass market SandForce controllers will do
none of the stupid shit you describe.

Sorry, but with your superficial understanding of the topic (and your
confident-sounding, verbose posts), all you do is spread common SSD myths and
misconceptions, i.e. be downright harmful to whoever reads this
discussion or discovers it later.


-- 
With respect,
Roman


* Re: Btrfs/SSD
  2017-04-17 17:13     ` Btrfs/SSD Austin S. Hemmelgarn
  2017-04-17 18:24       ` Btrfs/SSD Roman Mamedov
@ 2017-04-17 18:34       ` Chris Murphy
  2017-04-17 19:26         ` Btrfs/SSD Austin S. Hemmelgarn
  1 sibling, 1 reply; 49+ messages in thread
From: Chris Murphy @ 2017-04-17 18:34 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Imran Geriskovan, Btrfs BTRFS

On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

>> What is a high end SSD these days? Built-in NVMe?
>
> One with a good FTL in the firmware.  At minimum, the good Samsung EVO
> drives, the high quality Intel ones, and the Crucial MX series, but probably
> some others.  My choice of words here probably wasn't the best though.

It's a confusing market that sorta defies figuring out what we've got.

I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung
EVO+ SD Card in an Intel NUC. They use that same EVO branding on an
$11 SD Card.

And then there's the Samsung Electronics Co Ltd NVMe SSD Controller
SM951/PM951 in another laptop.


>> So long as this file is not reflinked or snapshotted, filefrag shows a
>> pile of mostly 4096-byte blocks, thousands of them. But as they're pretty
>> much all contiguous, the file fragmentation (extent count) is usually never
>> higher than 12. It meanders between 1 and 12 extents for its life.
>>
>> Except on the system using the ssd_spread mount option. That one has a
>> journal file that is +C and is not being snapshotted, but it has over 3000
>> extents according to both filefrag and btrfs-progs/debugfs. Really weird.
>
> Given how the 'ssd' mount option behaves and the frequency that most systemd
> instances write to their journals, that's actually reasonably expected.  We
> look for big chunks of free space to write into and then align to 2M
> regardless of the actual size of the write, which in turn means that files
> like the systemd journal which see lots of small (relatively speaking)
> writes will have way more extents than they should until you defragment
> them.

Nope. The first paragraph applies to NVMe machine with ssd mount
option. Few fragments.

The second paragraph applies to SD Card machine with ssd_spread mount
option. Many fragments.

These are different versions of systemd-journald so I can't completely
rule out a difference in write behavior.


>> Now, systemd aside, there are databases that behave this same way,
>> where there's a small section constantly being overwritten, and one or
>> more sections that grow the database file from within and at the end.
>> If this is made cow, the file will absolutely fragment a ton, and
>> especially so if the changes are mostly 4KiB block sizes that are then
>> fsync'd.
>>
>> It's almost like we need these things to not fsync at all, and just
>> rely on the filesystem commit time...
>
> Essentially yes, but that causes all kinds of other problems.

Drat.

-- 
Chris Murphy


* Re: Btrfs/SSD
  2017-04-17 18:24       ` Btrfs/SSD Roman Mamedov
@ 2017-04-17 19:22         ` Imran Geriskovan
  2017-04-17 22:55           ` Btrfs/SSD Hans van Kranenburg
  2017-04-18 12:26           ` Btrfs/SSD Austin S. Hemmelgarn
  2017-04-18  3:23         ` Btrfs/SSD Duncan
  1 sibling, 2 replies; 49+ messages in thread
From: Imran Geriskovan @ 2017-04-17 19:22 UTC (permalink / raw)
  To: linux-btrfs

On 4/17/17, Roman Mamedov <rm@romanrm.net> wrote:
> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

>> * Compression should help performance and device lifetime most of the
>> time, unless your CPU is fully utilized on a regular basis (in which
>> case it will hurt performance, but still improve device lifetimes).

> The days are long gone when the end user ever had to think about device lifetimes
> with SSDs. Refer to endurance studies such as
> It has been demonstrated that all SSDs on the market tend to overshoot even
> their rated TBW by several times; as a result, it will take any user literally
> dozens of years to wear out the flash no matter which filesystem or
> settings are used. And it most certainly is not worth changing anything
> significant in your workflow (such as enabling compression if it's
> otherwise inconvenient or not needed) just to save SSD lifetime.

Going over the thread, the following questions come to mind:

- What exactly does the btrfs ssd option do relative to plain mode?

- Most (all?) SSDs employ wear leveling, don't they? That is, they are
constantly remapping their blocks under the hood. So isn't it
meaningless to speak of some kind of block forging/fragmentation/etc.
effect of any particular writing pattern?

- If so, doesn't it mean that there is no better ssd usage strategy
than minimizing the total bytes written? That is, whatever we do,
if it contributes to this it is good, otherwise bad. Is everything else
beyond user control? Is there a recommended setting?

- How about "data retention" experiences? It is known that
new ssds can hold data safely for a longer period. As they age,
that margin gets shorter. As an extreme case, if I write to a new
ssd and shelve it, can I get my data back after 5 years?
How about a file written 5 years ago and never touched again, although the
rest of the ssd was in active use during that period?

- Yes, maybe lifetimes are becoming irrelevant. However, TBW still has
a direct relation to data retention capability.
Knowing that writing more data to an ssd can reduce the
"lifetime of your data" is something strange.

- But someone could come along and say: Hey, don't worry about
"data retention years", because your ssd will already be dead
before data retention becomes a problem for you... which is
relieving.. :)) Anyway, what are your opinions?


* Re: Btrfs/SSD
  2017-04-17 18:34       ` Btrfs/SSD Chris Murphy
@ 2017-04-17 19:26         ` Austin S. Hemmelgarn
  2017-04-17 19:39           ` Btrfs/SSD Chris Murphy
  0 siblings, 1 reply; 49+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-17 19:26 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Imran Geriskovan, Btrfs BTRFS

On 2017-04-17 14:34, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 11:13 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>>> What is a high end SSD these days? Built-in NVMe?
>>
>> One with a good FTL in the firmware.  At minimum, the good Samsung EVO
>> drives, the high quality Intel ones, and the Crucial MX series, but probably
>> some others.  My choice of words here probably wasn't the best though.
>
> It's a confusing market that sorta defies figuring out what we've got.
>
> I have a Samsung EVO SATA SSD in one laptop, but then I have a Samsung
> EVO+ SD Card in an Intel NUC. They use that same EVO branding on an
> $11 SD Card.
>
> And then there's the Samsung Electronics Co Ltd NVMe SSD Controller
> SM951/PM951 in another laptop.
What makes it even more confusing is that, other than Samsung (who _only_ 
use their own flash and controllers), manufacturer does not map 
consistently to controller choice, and even two drives with the same 
controller may have different firmware (and thus different degrees of 
reliability; those OCZ drives that were such crap at data retention were 
the result of a firmware option that the controller manufacturer pretty 
much told them not to use on production devices).
>
>
>>> So long as this file is not reflinked or snapshotted, filefrag shows a
>>> pile of mostly 4096-byte blocks, thousands of them. But as they're pretty
>>> much all contiguous, the file fragmentation (extent count) is usually never
>>> higher than 12. It meanders between 1 and 12 extents for its life.
>>>
>>> Except on the system using the ssd_spread mount option. That one has a
>>> journal file that is +C and is not being snapshotted, but it has over 3000
>>> extents according to both filefrag and btrfs-progs/debugfs. Really weird.
>>
>> Given how the 'ssd' mount option behaves and the frequency that most systemd
>> instances write to their journals, that's actually reasonably expected.  We
>> look for big chunks of free space to write into and then align to 2M
>> regardless of the actual size of the write, which in turn means that files
>> like the systemd journal which see lots of small (relatively speaking)
>> writes will have way more extents than they should until you defragment
>> them.
>
> Nope. The first paragraph applies to NVMe machine with ssd mount
> option. Few fragments.
>
> The second paragraph applies to SD Card machine with ssd_spread mount
> option. Many fragments.
Ah, apologies for my misunderstanding.
>
> These are different versions of systemd-journald so I can't completely
> rule out a difference in write behavior.
There have only been a couple of changes in the write patterns that I 
know of, but I would double check that the values for Seal and Compress 
in the journald.conf file are the same, as I know for a fact that 
changing those does change the write patterns (not much, but they do 
change).
>
>
>>> Now, systemd aside, there are databases that behave this same way,
>>> where there's a small section constantly being overwritten, and one or
>>> more sections that grow the database file from within and at the end.
>>> If this is made cow, the file will absolutely fragment a ton, and
>>> especially so if the changes are mostly 4KiB block sizes that are then
>>> fsync'd.
>>>
>>> It's almost like we need these things to not fsync at all, and just
>>> rely on the filesystem commit time...
>>
>> Essentially yes, but that causes all kinds of other problems.
>
> Drat.
>
Admittedly, most of the problems are use-case specific (you can't afford 
to lose transactions in a financial database, for example, so it 
functionally has to call fsync after each transaction), but most of it 
stems from the fact that BTRFS is internally doing a lot of the same stuff 
that much of the 'problem' software is doing itself.



* Re: Btrfs/SSD
  2017-04-17 19:26         ` Btrfs/SSD Austin S. Hemmelgarn
@ 2017-04-17 19:39           ` Chris Murphy
  2017-04-18 11:31             ` Btrfs/SSD Austin S. Hemmelgarn
  0 siblings, 1 reply; 49+ messages in thread
From: Chris Murphy @ 2017-04-17 19:39 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Imran Geriskovan, Btrfs BTRFS

On Mon, Apr 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2017-04-17 14:34, Chris Murphy wrote:

>> Nope. The first paragraph applies to NVMe machine with ssd mount
>> option. Few fragments.
>>
>> The second paragraph applies to SD Card machine with ssd_spread mount
>> option. Many fragments.
>
> Ah, apologies for my misunderstanding.
>>
>>
>> These are different versions of systemd-journald so I can't completely
>> rule out a difference in write behavior.
>
> There have only been a couple of changes in the write patterns that I know
> of, but I would double check that the values for Seal and Compress in the
> journald.conf file are the same, as I know for a fact that changing those
> does change the write patterns (not much, but they do change).

Same, unchanged defaults on both systems.

#Storage=auto
#Compress=yes
#Seal=yes
#SplitMode=uid
#SyncIntervalSec=5m
#RateLimitIntervalSec=30s
#RateLimitBurst=1000


The sync interval is curious. 5 minutes? Umm, I'm seeing nearly
constant hits on the journal file every 2-5 seconds, using filefrag.
I'm sure there's a better way to trace a single file being
read/written than this, but...
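
One lighter-weight option, assuming inotify-tools is installed and the usual
journal location:

  # Prints an event line each time the journal file is written or closed:
  inotifywait -m -e modify -e close_write /var/log/journal/*/system.journal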


>>>> It's almost like we need these things to not fsync at all, and just
>>>> rely on the filesystem commit time...
>>>
>>>
>>> Essentially yes, but that causes all kinds of other problems.
>>
>>
>> Drat.
>>
> Admittedly most of the problems are use-case specific (you can't afford to
> lose transactions in a financial database  for example, so it functionally
> has to call fsync after each transaction), but most of it stems from the
> fact that BTRFS is doing a lot of the same stuff that much of the 'problem'
> software is doing itself internally.
>

Seems like the old way of doing things, and the staleness of the
internet, have colluded to create a lot of nervousness and misuse of
fsync. The very fact Btrfs needs a log tree to deal with fsync's in a
semi-sane way...


-- 
Chris Murphy


* Re: Btrfs/SSD
  2017-04-17 19:22         ` Btrfs/SSD Imran Geriskovan
@ 2017-04-17 22:55           ` Hans van Kranenburg
  2017-04-19 18:10             ` Btrfs/SSD Chris Murphy
  2017-04-18 12:26           ` Btrfs/SSD Austin S. Hemmelgarn
  1 sibling, 1 reply; 49+ messages in thread
From: Hans van Kranenburg @ 2017-04-17 22:55 UTC (permalink / raw)
  To: Imran Geriskovan, linux-btrfs

On 04/17/2017 09:22 PM, Imran Geriskovan wrote:
> [...]
> 
> Going over the thread, the following questions come to mind:
> 
> - What exactly does the btrfs ssd option do relative to plain mode?

There's quite an amount of information in the very recent threads:
- "About free space fragmentation, metadata write amplification and (no)ssd"
- "BTRFS as a GlusterFS storage back-end, and what I've learned from
using it as such."
- "btrfs filesystem keeps allocating new chunks for no apparent reason"
- ... and a few more

I suspect there will be some "summary" mails at some point, but for now,
I'd recommend crawling through these threads first.

And now for your instant satisfaction, a short visual guide to the
difference, which shows actual btrfs behaviour instead of our guesswork
around it (taken from the second mail thread just mentioned):

-o ssd:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

-o nossd:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

-- 
Hans van Kranenburg


* Re: Btrfs/SSD
  2017-04-17 18:24       ` Btrfs/SSD Roman Mamedov
  2017-04-17 19:22         ` Btrfs/SSD Imran Geriskovan
@ 2017-04-18  3:23         ` Duncan
  2017-04-18  4:58           ` Btrfs/SSD Roman Mamedov
  1 sibling, 1 reply; 49+ messages in thread
From: Duncan @ 2017-04-18  3:23 UTC (permalink / raw)
  To: linux-btrfs

Roman Mamedov posted on Mon, 17 Apr 2017 23:24:19 +0500 as excerpted:

> The days are long gone when the end user ever had to think about device
> lifetimes with SSDs. Refer to endurance studies such as
> http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead
> http://ssdendurancetest.com/
> https://3dnews.ru/938764/
> It has been demonstrated that all SSDs on the market tend to overshoot
> even their rated TBW by several times; as a result, it will take any user
> literally dozens of years to wear out the flash no matter which
> filesystem or settings are used

Without reading the links...

Are you /sure/ it's /all/ ssds currently on the market?  Or are you 
thinking narrowly of those actually sold as ssds?

Because all I've read (and I admit I may not actually be current, but...) 
on, for instance, sd cards, certainly ssds by definition, says they're 
still very write-cycle sensitive -- a very simple FTL with little 
wear-leveling.

And AFAIK, USB thumb drives tend to be in the middle: a moderately complex 
FTL with some, somewhat simplistic, wear-leveling.

The stuff actually marketed as SSDs, generally SATA or direct PCIe/
NVMe connected, may indeed match your argument -- no real end-user concern 
is necessary any more, as the FTLs are advanced enough that user- or 
filesystem-level write-cycle worries simply don't apply these days.


So does that claim that write-cycle concerns simply don't apply to modern 
ssds, also apply to common thumb drives and sd cards?  Because these are 
certainly ssds both technically and by btrfs standards.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Btrfs/SSD
  2017-04-18  3:23         ` Btrfs/SSD Duncan
@ 2017-04-18  4:58           ` Roman Mamedov
  0 siblings, 0 replies; 49+ messages in thread
From: Roman Mamedov @ 2017-04-18  4:58 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Tue, 18 Apr 2017 03:23:13 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:

> Without reading the links...
> 
> Are you /sure/ it's /all/ ssds currently on the market?  Or are you 
> thinking narrowly, those actually sold as ssds?
> 
> Because all I've read (and I admit I may not actually be current, but...) 
> on for instance sd cards, certainly ssds by definition, says they're 
> still very write-cycle sensitive -- very simple FTL with little FTL wear-
> leveling.
> 
> And AFAIK, USB thumb drives tend to be in the middle, moderately complex 
> FTL with some, somewhat simplistic, wear-leveling.
> 

If I have to clarify, yes, it's all about SATA and NVMe SSDs. SD cards may be
SSDs "by definition", but nobody will think of an SD card when you say "I
bought an SSD for my computer". And yes, SD card and USB flash sticks are
commonly understood to be much simpler and more brittle devices than full
blown desktop (not to mention server) SSDs.

> While the stuff actually marketed as SSDs, generally SATA or direct PCIE/
> NVME connected, may indeed match your argument, no real end-user concern 
> necessary any more as the FTLs are advanced enough that user or 
> filesystem level write-cycle concerns simply aren't necessary these days.
> 
> 
> So does that claim that write-cycle concerns simply don't apply to modern 
> ssds, also apply to common thumb drives and sd cards?  Because these are 
> certainly ssds both technically and by btrfs standards.
> 


-- 
With respect,
Roman


* Re: Btrfs/SSD
  2017-04-17 19:39           ` Btrfs/SSD Chris Murphy
@ 2017-04-18 11:31             ` Austin S. Hemmelgarn
  2017-04-18 12:20               ` Btrfs/SSD Hugo Mills
  0 siblings, 1 reply; 49+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-18 11:31 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Imran Geriskovan, Btrfs BTRFS

On 2017-04-17 15:39, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2017-04-17 14:34, Chris Murphy wrote:
>
>>> Nope. The first paragraph applies to NVMe machine with ssd mount
>>> option. Few fragments.
>>>
>>> The second paragraph applies to SD Card machine with ssd_spread mount
>>> option. Many fragments.
>>
>> Ah, apologies for my misunderstanding.
>>>
>>>
>>> These are different versions of systemd-journald so I can't completely
>>> rule out a difference in write behavior.
>>
>> There have only been a couple of changes in the write patterns that I know
>> of, but I would double check that the values for Seal and Compress in the
>> journald.conf file are the same, as I know for a fact that changing those
>> does change the write patterns (not much, but they do change).
>
> Same, unchanged defaults on both systems.
>
> #Storage=auto
> #Compress=yes
> #Seal=yes
> #SplitMode=uid
> #SyncIntervalSec=5m
> #RateLimitIntervalSec=30s
> #RateLimitBurst=1000
>
>
> The sync interval is curious. 5 minutes? Umm, I'm seeing nearly
> constant hits on the journal file every 2-5 seconds, using filefrag.
> I'm sure there's a better way to trace a single file being
> read/written than this, but...
AIUI, the sync interval is like BTRFS's commit interval: the journal 
file is guaranteed to be 100% consistent at least once every 
<SyncIntervalSec> seconds.
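
That btrfs-side commit interval can be set explicitly if anyone wants to
experiment; a sketch (the mountpoint is assumed, and 30 seconds is also the
btrfs default):

  mount -o remount,commit=30 /
  grep ' / btrfs ' /proc/mounts   # confirm the active options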

As far as tracing, I think it's possible to do some kind of filtering 
with btrace so you just see a specific file, but I'm not certain.
>
>
>>>>> It's almost like we need these things to not fsync at all, and just
>>>>> rely on the filesystem commit time...
>>>>
>>>>
>>>> Essentially yes, but that causes all kinds of other problems.
>>>
>>>
>>> Drat.
>>>
>> Admittedly most of the problems are use-case specific (you can't afford to
>> lose transactions in a financial database  for example, so it functionally
>> has to call fsync after each transaction), but most of it stems from the
>> fact that BTRFS is doing a lot of the same stuff that much of the 'problem'
>> software is doing itself internally.
>>
>
> Seems like the old way of doing things, and the staleness of the
> internet, have colluded to create a lot of nervousness and misuse of
> fsync. The very fact Btrfs needs a log tree to deal with fsync's in a
> semi-sane way...
Except that BTRFS is somewhat unusual.  Prior to this, the only 
'mainstream' filesystem that provided most of these features was ZFS, 
and that does a good enough job that this doesn't matter.

For something like a database though, where you need ACID guarantees, 
you pretty much have to have COW semantics internally, and you have to 
force things to stable storage after each transaction that actually 
modifies data.  Looking at it another way, most database storage formats 
are essentially record-oriented filesystems (as opposed to 
block-oriented filesystems that most people think of).  This is part of 
why you see such similar access patterns in databases and VM disk images 
(even if the VM isn't running database software), they are essentially 
doing the same things at a low level.


* Re: Btrfs/SSD
  2017-04-18 11:31             ` Btrfs/SSD Austin S. Hemmelgarn
@ 2017-04-18 12:20               ` Hugo Mills
  0 siblings, 0 replies; 49+ messages in thread
From: Hugo Mills @ 2017-04-18 12:20 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Imran Geriskovan, Btrfs BTRFS


On Tue, Apr 18, 2017 at 07:31:34AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-04-17 15:39, Chris Murphy wrote:
> >On Mon, Apr 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
> ><ahferroin7@gmail.com> wrote:
> >>On 2017-04-17 14:34, Chris Murphy wrote:
[...]
> >>>>>It's almost like we need these things to not fsync at all, and just
> >>>>>rely on the filesystem commit time...
> >>>>
> >>>>
> >>>>Essentially yes, but that causes all kinds of other problems.
> >>>
> >>>
> >>>Drat.
> >>>
> >>Admittedly most of the problems are use-case specific (you can't afford to
> >>lose transactions in a financial database  for example, so it functionally
> >>has to call fsync after each transaction), but most of it stems from the
> >>fact that BTRFS is doing a lot of the same stuff that much of the 'problem'
> >>software is doing itself internally.
> >>
> >
> >Seems like the old way of doing things, and the staleness of the
> >internet, have colluded to create a lot of nervousness and misuse of
> >fsync. The very fact Btrfs needs a log tree to deal with fsync's in a
> >semi-sane way...
> Except that BTRFS is somewhat unusual.  Prior to this, the only
> 'mainstream' filesystem that provided most of these features was
> ZFS, and that does a good enough job that this doesn't matter.
> 
> For something like a database though, where you need ACID
> guarantees, you pretty much have to have COW semantics internally,
> and you have to force things to stable storage after each
> transaction that actually modifies data.  Looking at it another way,
> most database storage formats are essentially record-oriented
> filesystems (as opposed to block-oriented filesystems that most
> people think of).  This is part of why you see such similar access
> patterns in databases and VM disk images (even if the VM isn't
> running database software), they are essentially doing the same
> things at a low level.

   I remember thinking, when I was learning about the internals of
btrfs, that it looked an awful lot like the high-level description of
the internals of Oracle which I'd just been learning about. Most of
the same pieces, doing mostly the same kinds of operations to achieve the
same effective results.

   Hugo.

-- 
Hugo Mills             | Don't worry, he's not drunk. He's like that all the
hugo@... carfax.org.uk | time.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                           A.H. Deakin



* Re: Btrfs/SSD
  2017-04-17 19:22         ` Btrfs/SSD Imran Geriskovan
  2017-04-17 22:55           ` Btrfs/SSD Hans van Kranenburg
@ 2017-04-18 12:26           ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 49+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-18 12:26 UTC (permalink / raw)
  To: Imran Geriskovan, linux-btrfs

On 2017-04-17 15:22, Imran Geriskovan wrote:
> On 4/17/17, Roman Mamedov <rm@romanrm.net> wrote:
>> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
>
>>> * Compression should help performance and device lifetime most of the
>>> time, unless your CPU is fully utilized on a regular basis (in which
>>> case it will hurt performance, but still improve device lifetimes).
>
>> The days are long gone when the end user ever had to think about device lifetimes
>> with SSDs. Refer to endurance studies such as
>> It has been demonstrated that all SSDs on the market tend to overshoot even
>> their rated TBW by several times; as a result, it will take any user literally
>> dozens of years to wear out the flash no matter which filesystem or
>> settings are used. And it most certainly is not worth changing anything
>> significant in your workflow (such as enabling compression if it's
>> otherwise inconvenient or not needed) just to save SSD lifetime.
>
> Going over the thread, the following questions come to mind:
>
> - What exactly does the btrfs ssd option do relative to plain mode?
Assuming I understand what it does correctly, it prioritizes writing 
into larger, 2MB-aligned chunks of free space, whereas normal mode goes 
for 64k alignment.
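
As a side note, which mode actually ended up in effect is easy to check at
runtime (device and mountpoint assumed):

  grep btrfs /proc/mounts                 # shows ssd / ssd_spread / nossd among the options
  cat /sys/block/sda/queue/rotational     # 0 = non-rotational, which triggers 'ssd' autodetection
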
>
> - Most (all?) SSDs employ wear leveling, don't they? That is, they are
> constantly remapping their blocks under the hood. So isn't it
> meaningless to speak of some kind of block forging/fragmentation/etc.
> effect of any particular writing pattern?
Because making one big I/O request to fetch a file is faster than a 
bunch of small ones.  If your file is all in one extent in the 
filesystem, it takes less work to copy to memory than if you're pulling 
from a dozen places on the device.  This doesn't have much impact on 
light workloads, but when you're looking at heavy server workloads, it's 
big.
>
> - If so, doesn't it mean that there is no better ssd usage strategy
> than minimizing the total bytes written? That is, whatever we do,
> if it contributes to this it is good, otherwise bad. Is everything else
> beyond user control? Is there a recommended setting?
As a general strategy, yes, that appears to be the case.  On a specific 
SSD, it may not be.  For example, on the Crucial MX300's I have in most 
of my systems, the 'ssd' mount option actually makes things slower by 
anywhere from 2-10%.
>
> - How about "data retention" experiences? It is known that
> new ssds can hold data safely for a longer period. As they age,
> that margin gets shorter. As an extreme case, if I write to a new
> ssd and shelve it, can I get my data back after 5 years?
> How about a file written 5 years ago and never touched again, although the
> rest of the ssd was in active use during that period?
>
> - Yes, maybe lifetimes are becoming irrelevant. However, TBW still has
> a direct relation to data retention capability.
> Knowing that writing more data to an ssd can reduce the
> "lifetime of your data" is something strange.
Explaining this and your comment above requires a bit of understanding 
of how flash memory actually works.  The general structure of a single 
cell is that of a field-effect transistor (almost always a MOSFET) with 
a floating gate which consists of a bit of material electrically 
isolated from the rest of the transistor.  Data is stored by trapping 
electrons on this floating gate, but getting them there requires a 
strong enough current to break through the insulating layer that keeps 
it isolated from the rest of the transistor.  This process breaks down 
the insulating layer over time, making it easier for the electrons 
trapped in the floating gate to leak back into the rest of the 
transistor, thus losing data.

Aside from the write-based degradation of the insulating layer, there 
are other things that can cause it to break down or for the electrons to 
leak out, including very high temperatures (we're talking industrial 
temperatures here, not the type you're likely to see in most consumer 
electronics), strong electromagnetic fields (again, we're talking 
_really_ strong here, not stuff you're likely to see in most consumer 
electronics), cosmic background radiation, and even noise from other 
nearby cells being rewritten (known as a read disturb error, only an 
issue in NAND flash (but that's what all SSD's are these days)).
>
> - But someone could come along and say: Hey, don't worry about
> "data retention years", because your ssd will already be dead
> before data retention becomes a problem for you... which is
> relieving.. :)) Anyway, what are your opinions?
On this in particular, my opinion is that that claim is bogus unless you 
have an SSD designed to brick itself after a fixed period of time.  That 
statement is about the same as saying that you don't need to worry about 
uncorrectable errors in ECC RAM because you'll lose entire chips before 
they ever happen.  In both cases, you should indeed be worrying more 
about catastrophic failure, but that's because it will have a bigger 
impact and is absolutely unavoidable (it will eventually happen, and 
there's not really anything you can do to prevent it from happening), 
but that does not mean you shouldn't worry about other failure modes, 
especially ones that still have a significant impact (and losing data in 
a persistent storage device generally qualifies as a significant impact).

This study by CMU [1] may be of particular interest, especially since it 
focuses on data retention rates, not device lifetime, 
and seems to indicate that the opposite of the above statement is in 
fact true if you don't do prophylactic rewrites.  Note that this has 
little to no bearing on my argument above (I stand by that argument 
irrespective of this study).

[1] 
https://users.ece.cmu.edu/~omutlu/pub/flash-memory-data-retention_hpca15.pdf


* Re: Btrfs/SSD
  2017-04-17 11:53 ` Btrfs/SSD Austin S. Hemmelgarn
  2017-04-17 16:58   ` Btrfs/SSD Chris Murphy
@ 2017-04-18 13:02   ` Imran Geriskovan
  2017-04-18 13:39     ` Btrfs/SSD Austin S. Hemmelgarn
  2017-05-12 18:27     ` Btrfs/SSD Kai Krakow
  2017-05-12  4:51   ` Btrfs/SSD Duncan
  2 siblings, 2 replies; 49+ messages in thread
From: Imran Geriskovan @ 2017-04-18 13:02 UTC (permalink / raw)
  To: linux-btrfs

On 4/17/17, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
> Regarding BTRFS specifically:
> * Given my recently newfound understanding of what the 'ssd' mount
> option actually does, I'm inclined to recommend that people who are
> using high-end SSD's _NOT_ use it as it will heavily increase
> fragmentation and will likely have near zero impact on actual device
> lifetime (but may _hurt_ performance).  It will still probably help with
> mid and low-end SSD's.

I'm trying to have a proper understanding of what "fragmentation" really
means for an ssd and its interrelation with wear-leveling.

Before continuing, let's remember:
Pages cannot be erased individually; only whole blocks can be erased.
The size of a NAND-flash page can vary, and most drives have pages
of 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256
pages, which means that the size of a block can vary between 256 KB
and 4 MB.
codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/

Let's continue:
Since block sizes are between 256k and 4MB, data smaller than this will
"probably" not be fragmented on a reasonably empty and trimmed
drive. And for a brand new ssd we may speak of contiguous series
of blocks.

However, as the drive is used more and more and wear leveling kicks in
(i.e. blocks are remapped), the meaning of "contiguous blocks" will erode.
So any file bigger than a block will be written to blocks physically apart,
no matter what their block addresses say. But my guess is that accessing
device blocks - contiguous or not - is a constant-time operation, so it
would not contribute to performance issues. Right? Comments?

So your feeling about fragmentation/performance is probably related
to whether the file is spread over fewer or more blocks. If the number of
blocks used is higher than necessary (i.e. no empty blocks can be found,
so lots of partially empty blocks have to be used, increasing the total
number of blocks involved), then we will notice performance loss.

Additionally, if the filesystem is going to try to reduce
fragmentation of those blocks, it needs to know precisely where
they are located. Then what about ssd block information?
Is it available, and do filesystems use it?

Anyway, if you can provide some more details about your experiences
with this, we can probably get a better view of the issue.


> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt
> performance for BTRFS on SSD's, and appear to reduce the lifetime of the
> SSD.

This and other experiences tell us it is still possible to "forge some
blocks of the ssd". How could this be possible if there is wear-leveling?

Two alternatives come to mind:

- If there are no empty (trimmed) blocks left on the ssd, it will have no
choice other than forging the block. What about its reserve blocks?
Are they exhausted too, or are they only used as bad block replacements?

- No proper wear-levelling is actually done by the drive.

Comments?


* Re: Btrfs/SSD
  2017-04-18 13:02   ` Btrfs/SSD Imran Geriskovan
@ 2017-04-18 13:39     ` Austin S. Hemmelgarn
  2017-05-12 18:27     ` Btrfs/SSD Kai Krakow
  1 sibling, 0 replies; 49+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-18 13:39 UTC (permalink / raw)
  To: Imran Geriskovan, linux-btrfs

On 2017-04-18 09:02, Imran Geriskovan wrote:
> On 4/17/17, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>> Regarding BTRFS specifically:
>> * Given my recently newfound understanding of what the 'ssd' mount
>> option actually does, I'm inclined to recommend that people who are
>> using high-end SSD's _NOT_ use it as it will heavily increase
>> fragmentation and will likely have near zero impact on actual device
>> lifetime (but may _hurt_ performance).  It will still probably help with
>> mid and low-end SSD's.
>
> I'm trying to have a proper understanding of what "fragmentation" really
> means for an ssd and its interrelation with wear-leveling.
>
> Before continuing, let's remember:
> Pages cannot be erased individually; only whole blocks can be erased.
> The size of a NAND-flash page can vary, and most drives have pages
> of 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256
> pages, which means that the size of a block can vary between 256 KB
> and 4 MB.
> codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/
>
> Let's continue:
> Since block sizes are between 256k and 4MB, data smaller than this will
> "probably" not be fragmented on a reasonably empty and trimmed
> drive. And for a brand new ssd we may speak of contiguous series
> of blocks.
We're slightly talking past each other here.  I'm referring to 
fragmentation on the filesystem level.  This impacts performance on 
SSD's because it necessitates a larger number of IO operations to read 
the data off of the device (which is also the case on traditional HDD's, 
but it has near zero impact there compared to the seek latency).  You 
appear to be referring to fragmentation at the level of the 
flash-translation layer (FTL), which is present in almost any SSD, and 
should have near zero impact on performance if the device has good 
firmware and a decent controller.
>
> However, as the drive is used more and more and wear leveling kicks in
> (i.e. blocks are remapped), the meaning of "contiguous blocks" will erode.
> So any file bigger than a block will be written to blocks physically apart,
> no matter what their block addresses say. But my guess is that accessing
> device blocks - contiguous or not - is a constant-time operation, so it
> would not contribute to performance issues. Right? Comments?
Correct.
>
> So your feeling about fragmentation/performance is probably related
> to whether the file is spread over fewer or more blocks. If the number of
> blocks used is higher than necessary (i.e. no empty blocks can be found,
> so lots of partially empty blocks have to be used, increasing the total
> number of blocks involved), then we will notice performance loss.
Kind of.

As an example, consider a 16MB file on a device that can read up to 16MB 
of data in a single read operation (arbitrary numbers chosen to make the 
math easier).

If you copy that file onto the device while it's idle and it has a block of 
free space 16MB in size, it will end up as one extent (in BTRFS at 
least, and probably also in most other extent-based filesystems).  In 
that case, it will take 1 read operation to read the whole file into memory.

If instead that file gets created with multiple extents that aren't 
right next to each other on disk, you will need a number of read 
operations equal to the number of extents to read the file into memory.

The performance loss I'm referring to when talking about fragmentation 
is the result of the increased number of read operations required to 
read a file with a larger number of extents into memory.  It actually 
has nothing to do with whether the device is an SSD, a HDD, a 
DVD, NVRAM, SPI NOR flash, an SD card, or any other storage device; it 
just has more impact on storage devices that have zero seek latency 
because, on the others, the seek latency usually far exceeds the overhead 
of the extra read operations.
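
A quick way to see the difference on a real file, with hypothetical paths
and sizes:

  # Written in one go, this usually lands in one (or very few) extents:
  dd if=/dev/zero of=/mnt/onego bs=1M count=16 conv=fsync
  filefrag /mnt/onego

  # Grown by many small fsync'd appends, it tends to collect more extents:
  for i in $(seq 1 1024); do
      dd if=/dev/zero of=/mnt/appended bs=4k count=1 \
         oflag=append conv=notrunc,fsync 2>/dev/null
  done
  filefrag /mnt/appended
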
>
> Additionally, if the filesystem is going to try to reduce
> fragmentation of those blocks, it needs to know precisely where
> they are located. Then what about ssd block information?
> Is it available, and do filesystems use it?
>
> Anyway, if you can provide some more details about your experiences
> with this, we can probably get a better view of the issue.
>
>
>> * Files with NOCOW and filesystems with 'nodatacow' set will both hurt
>> performance for BTRFS on SSD's, and appear to reduce the lifetime of the
>> SSD.
>
> This and other experiences tell us it is still possible to "forge some
> blocks of the ssd". How could this be possible if there is wear-leveling?
>
> Two alternatives come to mind:
>
> - If there are no empty (trimmed) blocks left on the ssd, it will have no
> choice other than forging the block. What about its reserve blocks?
> Are they exhausted too, or are they only used as bad block replacements?
>
> - No proper wear-levelling is actually done by the drive.


* Re: Btrfs/SSD
  2017-04-17 22:55           ` Btrfs/SSD Hans van Kranenburg
@ 2017-04-19 18:10             ` Chris Murphy
  0 siblings, 0 replies; 49+ messages in thread
From: Chris Murphy @ 2017-04-19 18:10 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Imran Geriskovan, Btrfs BTRFS

On Mon, Apr 17, 2017 at 4:55 PM, Hans van Kranenburg
<hans.van.kranenburg@mendix.com> wrote:
> On 04/17/2017 09:22 PM, Imran Geriskovan wrote:
>> [...]
>>
>> Going over the thread, the following questions come to my mind:
>>
>> - What exactly does the btrfs ssd option do relative to plain mode?
>
> There's quite an amount of information in the very recent threads:
> - "About free space fragmentation, metadata write amplification and (no)ssd"
> - "BTRFS as a GlusterFS storage back-end, and what I've learned from
> using it as such."
> - "btrfs filesystem keeps allocating new chunks for no apparent reason"
> - ... and a few more
>
> I suspect there will be some "summary" mails at some point, but for now,
> I'd recommend crawling through these threads first.
>
> And now for your instant satisfaction, a short visual guide to the
> difference, which shows actual btrfs behaviour instead of our guesswork
> around it (taken from the second mail thread just mentioned):
>
> -o ssd:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4
>
> -o nossd:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

I'm uncertain from these whether the option affects both metadata and data
writes, or just data. The latter makes some sense if you assume a
given data write event contains related files, which increases the
chance of having a mostly freed-up erase block when those files are
deleted. That way wear leveling is doing less work. For metadata writes
it makes less sense to me, and is inconsistent with what I've seen
from metadata chunk allocation. Pretty much anything means dozens or
more 16K nodes are being COWd. e.g. a 2KiB write to systemd journal,
even preallocated, means adding an EXTENT DATA item, one of maybe 200
per node, which means that whole node must be COWd, and whatever its
parent is must be written (ROOT ITEM I think) and then tree root, and
then super block. I see generally 30 16K nodes modified in about 4
minutes with average logging. Even if it's 1 change per 4 minutes, and
all 30 nodes get written to one 2MB block, and then that block isn't
ever written to again, the metadata chunk would be growing and I don't
see that. For weeks or months I see a 512MB metadata chunk and it
doesn't ever get bigger than this.
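
To make the hand-waving concrete, the back-of-envelope arithmetic looks like
this (the numbers are just my rough observations above, nothing carefully
measured):

node_size = 16 * 1024           # 16KiB btrfs metadata node
nodes_per_commit = 30           # ~30 nodes COWd per logging burst (observed)
interval_min = 4                # roughly one such burst every 4 minutes
write_area = 2 * 1024 * 1024    # hypothetical 2MB area consumed per burst

bursts_per_day = 24 * 60 // interval_min
print("metadata written per day: %.0f MiB"
      % (nodes_per_commit * node_size * bursts_per_day / 2**20))   # ~169 MiB
print("fresh space consumed per day if nothing were reused: %.0f MiB"
      % (write_area * bursts_per_day / 2**20))                     # 720 MiB
# A 512MB metadata chunk would be exhausted in under a day at that rate,
# so the allocator has to be reusing freed metadata space.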

Anyway, I think the ssd mount option still sounds plausibly useful. What
I'm skeptical of on SSD is defragmenting without compression, and also
nocow.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-04-17 11:53 ` Btrfs/SSD Austin S. Hemmelgarn
  2017-04-17 16:58   ` Btrfs/SSD Chris Murphy
  2017-04-18 13:02   ` Btrfs/SSD Imran Geriskovan
@ 2017-05-12  4:51   ` Duncan
  2017-05-12 13:02     ` Btrfs/SSD Imran Geriskovan
  2 siblings, 1 reply; 49+ messages in thread
From: Duncan @ 2017-05-12  4:51 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Mon, 17 Apr 2017 07:53:04 -0400 as
excerpted:

> * In my personal experience, Intel, Samsung, and Crucial appear to be
> the best name brands (in relative order of quality).  I have personally
> had bad experiences with SanDisk and Kingston SSD's, but I don't have
> anything beyond circumstantial evidence indicating that it was anything
> but bad luck on both counts.

FWIW, I'm in the market for SSDs ATM, and remembered this from a couple 
weeks ago so went back to find it.  Thanks. =:^)

(I'm currently still on quarter-TB generation ssds, plus spinning rust 
for the larger media partition and backups, and want to be rid of the 
spinning rust, so am looking at half-TB to TB, which seems to be the 
pricing sweet spot these days anyway.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-12  4:51   ` Btrfs/SSD Duncan
@ 2017-05-12 13:02     ` Imran Geriskovan
  2017-05-12 18:36       ` Btrfs/SSD Kai Krakow
  2017-05-14  8:46       ` Btrfs/SSD Duncan
  0 siblings, 2 replies; 49+ messages in thread
From: Imran Geriskovan @ 2017-05-12 13:02 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On 5/12/17, Duncan <1i5t5.duncan@cox.net> wrote:
> FWIW, I'm in the market for SSDs ATM, and remembered this from a couple
> weeks ago so went back to find it.  Thanks. =:^)
>
> (I'm currently still on quarter-TB generation ssds, plus spinning rust
> for the larger media partition and backups, and want to be rid of the
> spinning rust, so am looking at half-TB to TB, which seems to be the
> pricing sweet spot these days anyway.)

Since you are taking ssds mainstream based on your experience,
I guess your perception is that their data retention/reliability is better than that
of spinning rust. Right? Can you elaborate?

Another criterion might be the physical constraints of spinning rust
in notebooks, which dictate that you should handle the device
with care while it is running.

What was your primary motivation other than performance?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-04-18 13:02   ` Btrfs/SSD Imran Geriskovan
  2017-04-18 13:39     ` Btrfs/SSD Austin S. Hemmelgarn
@ 2017-05-12 18:27     ` Kai Krakow
  2017-05-12 20:31       ` Btrfs/SSD Imran Geriskovan
                         ` (2 more replies)
  1 sibling, 3 replies; 49+ messages in thread
From: Kai Krakow @ 2017-05-12 18:27 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 18 Apr 2017 15:02:42 +0200,
Imran Geriskovan <imran.geriskovan@gmail.com> wrote:

> On 4/17/17, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
> > Regarding BTRFS specifically:
> > * Given my recently newfound understanding of what the 'ssd' mount
> > option actually does, I'm inclined to recommend that people who are
> > using high-end SSD's _NOT_ use it as it will heavily increase
> > fragmentation and will likely have near zero impact on actual device
> > lifetime (but may _hurt_ performance).  It will still probably help
> > with mid and low-end SSD's.  
> 
> I'm trying to have a proper understanding of what "fragmentation"
> really means for an ssd and interrelation with wear-leveling.
> 
> Before continuing, let's remember:
> Pages cannot be erased individually; only whole blocks can be erased.
> The size of a NAND-flash page can vary, and most drives have pages
> of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256
> pages, which means that the size of a block can vary between 256 KB
> and 4 MB.
> codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/
> 
> Let's continue:
> Since block sizes are between 256k-4MB, data smaller than this
> "probably" will not be fragmented on a reasonably empty and trimmed
> drive. And for a brand new ssd we may speak of a contiguous series
> of blocks.
> 
> However, as the drive is used more and more and as wear leveling kicks
> in (i.e. blocks are remapped), the meaning of "contiguous blocks" will
> erode. So any file bigger than a block size will be written to blocks
> physically apart, no matter what their block addresses say. But my
> guess is that accessing device blocks - contiguous or not - is a
> constant-time operation, so it would not contribute to performance
> issues. Right? Comments?
> 
> So your feeling about fragmentation/performance is probably
> related to whether the file is spread over fewer or more blocks. If the
> # of blocks used is higher than necessary (i.e. no empty blocks can be
> found and instead lots of partially empty blocks have to be used,
> increasing the total # of blocks involved), then we will notice
> performance loss.
> 
> Additionally, if the filesystem is going to try something to reduce
> the fragmentation of the blocks, it needs to know precisely where
> those blocks are located. So what about SSD block information?
> Is it available, and do filesystems use it?
> 
> Anyway, if you can provide some more details about your experiences
> on this, we can probably get a better view of the issue.

What you really want for an SSD is not defragmented files but defragmented
free space. That increases lifetime.

So, defragmentation on an SSD makes sense if it cares about free
space rather than the file data itself.

But of course, over time, fragmentation of file data (be it metadata
or content data) may introduce overhead - and in btrfs it probably
really makes a difference, judging from some of the past posts.

I don't think it is important for the file system to know where the SSD
FTL located a data block. It's just important to keep everything nicely
aligned with erase block sizes, reduce rewrite patterns, and free up
complete erase blocks as well as possible.

Maybe such a process should be called "compaction" and not
"defragmentation". In the end, the more continuous blocks of free space
there are, the better the chance for proper wear leveling.
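
A toy way to see why packing matters (pure illustration; I'm assuming a fixed
2MB erase block size here, which real drives neither have uniformly nor
expose):

ERASE_BLOCK = 2 * 1024 * 1024   # assumed erase block size

def fully_free_erase_blocks(device_size, used_extents):
    """used_extents: list of (offset, length) in bytes as the FS sees them."""
    total = device_size // ERASE_BLOCK
    touched = set()
    for offset, length in used_extents:
        first = offset // ERASE_BLOCK
        last = (offset + length - 1) // ERASE_BLOCK
        touched.update(range(first, last + 1))
    return total - len(touched)

size = 64 * 1024 * 1024                                   # 32 erase blocks
scattered = [(i * 8 * 1024 * 1024, 1024 * 1024) for i in range(8)]  # 8x 1MB spread out
packed = [(0, 8 * 1024 * 1024)]                           # same 8MB packed together
print(fully_free_erase_blocks(size, scattered))           # 24 blocks free for wear leveling
print(fully_free_erase_blocks(size, packed))              # 28 blocks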


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-12 13:02     ` Btrfs/SSD Imran Geriskovan
@ 2017-05-12 18:36       ` Kai Krakow
  2017-05-13  9:52         ` Btrfs/SSD Roman Mamedov
  2017-05-15 12:03         ` Btrfs/SSD Austin S. Hemmelgarn
  2017-05-14  8:46       ` Btrfs/SSD Duncan
  1 sibling, 2 replies; 49+ messages in thread
From: Kai Krakow @ 2017-05-12 18:36 UTC (permalink / raw)
  To: linux-btrfs

On Fri, 12 May 2017 15:02:20 +0200,
Imran Geriskovan <imran.geriskovan@gmail.com> wrote:

> On 5/12/17, Duncan <1i5t5.duncan@cox.net> wrote:
> > FWIW, I'm in the market for SSDs ATM, and remembered this from a
> > couple weeks ago so went back to find it.  Thanks. =:^)
> >
> > (I'm currently still on quarter-TB generation ssds, plus spinning
> > rust for the larger media partition and backups, and want to be rid
> > of the spinning rust, so am looking at half-TB to TB, which seems
> > to be the pricing sweet spot these days anyway.)  
> 
> Since you are taking ssds mainstream based on your experience,
> I guess your perception is that their data retention/reliability is
> better than that of spinning rust. Right? Can you elaborate?
> 
> Another criterion might be the physical constraints of spinning rust
> in notebooks, which dictate that you should handle the device
> with care while it is running.
> 
> What was your primary motivation other than performance?

Personally, I don't really trust SSDs so much. They are much more
robust when it comes to physical damage because there are no moving
parts. That's absolutely not my concern. In that regard, I trust SSDs
more than HDDs.

My concern is with the failure scenarios of some SSDs which die unexpectedly and
horribly. I found some reports of older Samsung SSDs which failed
suddenly and unexpectedly, and in a way that the drive completely died:
no more data access, everything gone. HDDs start with bad sectors, and
there's a good chance I can recover most of the data except a few
sectors.

When SSD blocks die, they are probably huge compared to a sector (usually 256kB
to 4MB, because those are typical erase block sizes). If this happens, the
firmware may decide to either allow read-only access or completely deny
access. There's another situation where dying storage chips may
completely mess up the firmware, leaving no access to the data
at all.

That's why I don't trust any of my data to them. But I still want the
benefit of their speed. So I use SSDs mostly as frontend caches for
HDDs. This gives me big storage with fast access. Indeed, I'm using
bcache successfully for this. A warm cache is almost as fast as a native
SSD (at least it feels almost that fast; it will be slower if you throw
benchmarks at it).


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-12 18:27     ` Btrfs/SSD Kai Krakow
@ 2017-05-12 20:31       ` Imran Geriskovan
  2017-05-13  9:39       ` Btrfs/SSD Duncan
  2017-05-15 11:46       ` Btrfs/SSD Austin S. Hemmelgarn
  2 siblings, 0 replies; 49+ messages in thread
From: Imran Geriskovan @ 2017-05-12 20:31 UTC (permalink / raw)
  To: Kai Krakow; +Cc: linux-btrfs

On 5/12/17, Kai Krakow <hurikhan77@gmail.com> wrote:
> I don't think it is important for the file system to know where the SSD
> FTL located a data block. It's just important to keep everything nicely
> aligned with erase block sizes, reduce rewrite patterns, and free up
> complete erase blocks as well as possible.

Yeah. "Tight packing" of data into erase blocks will reduce fragmentation
at flash level, but not necessarily the fragmentation at fs level. And
unless we are writing in continuous journaling style (as f2fs ?),
we still need to have some info about the erase blocks.

Of course while these are going on, there is also something like roundrobin
mapping or some kind of journaling would be going on at the low level flash
as wear leveling/bad block replacements which is totally invisible to us.


> Maybe such a process should be called "compaction" and not
> "defragmentation". In the end, the more continuous blocks of free space
> there are, the better the chance for proper wear leveling.


Tight packing into erase blocks seems to be the dominant factor for ssd welfare.

However, fs fragmentation may still be a thing to consider, because
increased fs fragmentation will probably increase the # of erase
blocks involved, affecting both read/write performance and wear.

Keeping an eye on both is a tough job. Worse, there are "two" uncoordinated
eyes, one watching the "fs" and the other watching the "flash", making the
whole process suboptimal.

I think the ultimate utopian combination would be an "absolutely dumb flash
controller" providing direct access to the physical bytes and the ultimate
"Flash FS" making use of every possible performance and wear-leveling trick.

Clearly, we are far from it.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-12 18:27     ` Btrfs/SSD Kai Krakow
  2017-05-12 20:31       ` Btrfs/SSD Imran Geriskovan
@ 2017-05-13  9:39       ` Duncan
  2017-05-13 11:15         ` Btrfs/SSD Janos Toth F.
                           ` (2 more replies)
  2017-05-15 11:46       ` Btrfs/SSD Austin S. Hemmelgarn
  2 siblings, 3 replies; 49+ messages in thread
From: Duncan @ 2017-05-13  9:39 UTC (permalink / raw)
  To: linux-btrfs

Kai Krakow posted on Fri, 12 May 2017 20:27:56 +0200 as excerpted:

> In the end, the more continuous blocks of free space there are, the
> better the chance for proper wear leveling.

Talking about which...

When I was doing my ssd research the first time around, the going 
recommendation was to keep 20-33% of the total space on the ssd entirely 
unallocated, allowing it to use that space as an FTL erase-block 
management pool.

At the time, I added up all my "performance matters" data dirs and 
allowing for reasonable in-filesystem free-space, decided I could fit it 
in 64 GB if I had to, tho 80 GB would be a more comfortable fit, so 
allowing for the above entirely unpartitioned/unused slackspace 
recommendations, had a target of 120-128 GB, with a reasonable range 
depending on actual availability of 100-160 GB.
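
(For anyone wanting to redo that arithmetic with their own numbers, it's just
this, purely illustrative:)

def drive_size_needed(usable_gb, unpartitioned_fraction):
    # leave unpartitioned_fraction of the raw device completely unallocated
    return usable_gb / (1 - unpartitioned_fraction)

for usable in (64, 80):
    for frac in (0.20, 0.33):
        print("%d GB usable with %d%% slack -> %.0f GB device"
              % (usable, frac * 100, drive_size_needed(usable, frac)))
# 64 GB usable -> an 80-96 GB device; 80 GB usable -> 100-119 GB,
# hence the 100-160 GB range with a 120-128 GB target.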

It turned out, due to pricing and availability, I ended up spending 
somewhat more and getting 256 GB (238.5 GiB).  Of course that allowed me 
much more flexibility than I had expected and I ended up with basically 
everything but the media partition on the ssds, PLUS I still left them at 
only just over 50% partitioned, (using the gdisk figures, 51%- 
partitioned, 49%+ free).

Given that, I've not enabled btrfs trim/discard (which saved me from the 
bugs with it a few kernel cycles ago), and while I do have a weekly fstrim 
systemd timer setup, I've not had to be too concerned about btrfs bugs 
(also now fixed, I believe) when fstrim on btrfs was known not to be 
trimming everything it really should have been.


Anyway, that 20-33% left entirely unallocated/unpartitioned 
recommendation still holds, right?  Am I correct in asserting that if one 
is following that, the FTL already has plenty of erase-blocks available 
for management and the discussion about filesystem level trim and free 
space management becomes much less urgent, tho of course it's still worth 
considering if it's convenient to do so?

And am I also correct in believing that while it's not really worth 
spending more to over-provision to the near 50% as I ended up doing, if 
things work out that way as they did with me because the difference in 
price between 30% overprovisioning and 50% overprovisioning ends up being 
trivial, there's really not much need to worry about active filesystem 
trim at all, because the FTL has effectively half the device left to play 
erase-block musical chairs with as it decides it needs to?


Of course the higher per-GiB cost of ssd as compared to spinning rust 
does mean that the above overprovisioning recommendation really does 
hurt, most of the time, driving per-usable-GB costs even higher, and as I 
recall that was definitely the case back then between 80 GiB and 160 GiB, 
and it was basically an accident of timing, that I was buying just as the 
manufactures flooded the market with newly cost-effective 256 GB devices, 
that meant they were only trivially more expensive than the 128 or 160 
GB, AND unlike the smaller devices, actually /available/ in the 500-ish 
MB/sec performance range that (for SATA-based SSDs) is actually capped by 
SATA-600 bus speeds more than the chips themselves.  (There were lower 
cost 128 GB devices, but they were lower speed than I wanted, too.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-12 18:36       ` Btrfs/SSD Kai Krakow
@ 2017-05-13  9:52         ` Roman Mamedov
  2017-05-13 10:47           ` Btrfs/SSD Kai Krakow
  2017-05-15 12:03         ` Btrfs/SSD Austin S. Hemmelgarn
  1 sibling, 1 reply; 49+ messages in thread
From: Roman Mamedov @ 2017-05-13  9:52 UTC (permalink / raw)
  To: Kai Krakow; +Cc: linux-btrfs

On Fri, 12 May 2017 20:36:44 +0200
Kai Krakow <hurikhan77@gmail.com> wrote:

> My concern is with fail scenarios of some SSDs which die unexpected and
> horribly. I found some reports of older Samsung SSDs which failed
> suddenly and unexpected, and in a way that the drive completely died:
> No more data access, everything gone. HDDs start with bad sectors and
> there's a good chance I can recover most of the data except a few
> sectors.

Just keep your backups up-to-date; it doesn't matter whether it's an SSD, HDD or any sort
of RAID.

In a way it's even better that SSDs [are said to] fail abruptly and entirely.
You can then just restore from backups and go on, whereas a failing HDD can
leave you puzzled about e.g. whether it's a cable or controller problem instead,
and possibly can even cause some data corruption which you won't notice until
it's too late.

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-13  9:52         ` Btrfs/SSD Roman Mamedov
@ 2017-05-13 10:47           ` Kai Krakow
  0 siblings, 0 replies; 49+ messages in thread
From: Kai Krakow @ 2017-05-13 10:47 UTC (permalink / raw)
  To: linux-btrfs

On Sat, 13 May 2017 14:52:47 +0500,
Roman Mamedov <rm@romanrm.net> wrote:

> On Fri, 12 May 2017 20:36:44 +0200
> Kai Krakow <hurikhan77@gmail.com> wrote:
> 
> > My concern is with the failure scenarios of some SSDs which die unexpectedly
> > and horribly. I found some reports of older Samsung SSDs which
> > failed suddenly and unexpectedly, and in a way that the drive
> > completely died: no more data access, everything gone. HDDs start
> > with bad sectors and there's a good chance I can recover most of
> > the data except a few sectors.  
> 
> Just keep your backups up-to-date; it doesn't matter whether it's an SSD, HDD or
> any sort of RAID.
> 
> In a way it's even better that SSDs [are said to] fail abruptly and
> entirely. You can then just restore from backups and go on, whereas a
> failing HDD can leave you puzzled about e.g. whether it's a cable or
> controller problem instead, and possibly can even cause some data
> corruption which you won't notice until it's too late.

My current backup strategy can handle this. I never back up a file from
the source again if it hasn't changed by timestamp. That way, silent data
corruption won't creep into the backup. Additionally, I keep a backlog
of 5 years of file history. Even if a corrupted file creeps into the
backup, there is enough time to get a good copy back. If it's older, it
probably doesn't hurt so much anyway.
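
A minimal sketch of that rule, assuming a plain src -> dst mirror layout
(the paths are just examples; real backup tools obviously do this better):

import os, shutil

def backup(src, dst):
    for root, _dirs, files in os.walk(src):
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(dst, os.path.relpath(s, src))
            os.makedirs(os.path.dirname(d), exist_ok=True)
            # copy only if missing or newer by mtime; a silently corrupted
            # source file with an unchanged timestamp is never read again,
            # so it cannot replace the good copy already in the backup
            if not os.path.exists(d) or os.path.getmtime(s) > os.path.getmtime(d):
                shutil.copy2(s, d)   # copy2 preserves the mtime for next time

backup("/data", "/backup/data")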


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-13  9:39       ` Btrfs/SSD Duncan
@ 2017-05-13 11:15         ` Janos Toth F.
  2017-05-13 11:34         ` [OT] SSD performance patterns (was: Btrfs/SSD) Kai Krakow
  2017-05-14 16:21         ` Btrfs/SSD Chris Murphy
  2 siblings, 0 replies; 49+ messages in thread
From: Janos Toth F. @ 2017-05-13 11:15 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

> Anyway, that 20-33% left entirely unallocated/unpartitioned
> recommendation still holds, right?

I never liked that idea. And I really disliked how people considered
it to be (and even passed it down as) some magical, absolute
stupid-proof fail-safe thing (because it's not).

1: Unless you reliably trim the whole LBA space (and/or run
ata_secure_erase on the whole drive) before you (re-)partition the LBA
space, you have zero guarantee that the drive's controller/firmware
will treat the unallocated space as empty rather than keeping its content
around as useful data (even if it's full of zeros, because zero could
be very useful data unless it's specifically marked as "throwaway" by
trim/erase). On the other hand, a trim-compatible filesystem should
properly mark (trim) all (or at least most of) the free space as free
(= free to erase internally at the controller's discretion). And even
if trim isn't fail-proof either, those bugs should be temporary (and
it's not like a sane SSD will die in a few weeks due to these kinds of
issues under sane usage, and crazy drives will often fail under crazy
usage regardless of trim and spare space).

2: It's not some daemon-summoning, world-ending catastrophe if you
occasionally happen to fill your SSD to ~100%. It probably won't like
it (it will probably get slow by the end of the writes and the
internal write amplification might skyrocket at its peak), but nothing
extraordinary will happen and normal operation (high write speed,
normal internal write amplification, etc) should resume soon after you
make some room (for example, you delete your temporary files or move
some old content to an archive storage and you properly trim that
space). That space is there to be used, just don't leave it close to
100% all the time and try never leaving it close to 100% when you plan
to keep it busy with many small random writes.

3: Some drives have plenty of hidden internal spare space (especially
the expensive kinds offered for datacenters or "enthusiast" consumers
by big companies like Intel and such). Even some cheap drives might
have plenty of erased space at 100% LBA allocation if they use
compression internally (and you don't fill it up to 100% with
incompressible content).

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [OT] SSD performance patterns (was: Btrfs/SSD)
  2017-05-13  9:39       ` Btrfs/SSD Duncan
  2017-05-13 11:15         ` Btrfs/SSD Janos Toth F.
@ 2017-05-13 11:34         ` Kai Krakow
  2017-05-14 16:21         ` Btrfs/SSD Chris Murphy
  2 siblings, 0 replies; 49+ messages in thread
From: Kai Krakow @ 2017-05-13 11:34 UTC (permalink / raw)
  To: linux-btrfs

On Sat, 13 May 2017 09:39:39 +0000 (UTC),
Duncan <1i5t5.duncan@cox.net> wrote:

> Kai Krakow posted on Fri, 12 May 2017 20:27:56 +0200 as excerpted:
> 
> > In the end, the more continuous blocks of free space there are, the
> > better the chance for proper wear leveling.  
> 
> Talking about which...
> 
> When I was doing my ssd research the first time around, the going 
> recommendation was to keep 20-33% of the total space on the ssd
> entirely unallocated, allowing it to use that space as an FTL
> erase-block management pool.
> 
> At the time, I added up all my "performance matters" data dirs and 
> allowing for reasonable in-filesystem free-space, decided I could fit
> it in 64 GB if I had to, tho 80 GB would be a more comfortable fit,
> so allowing for the above entirely unpartitioned/unused slackspace 
> recommendations, had a target of 120-128 GB, with a reasonable range 
> depending on actual availability of 100-160 GB.
> 
> It turned out, due to pricing and availability, I ended up spending 
> somewhat more and getting 256 GB (238.5 GiB).  Of course that allowed
> me much more flexibility than I had expected and I ended up with
> basically everything but the media partition on the ssds, PLUS I
> still left them at only just over 50% partitioned, (using the gdisk
> figures, 51%- partitioned, 49%+ free).

I put my ESP (for UEFI) onto the SSD and also played with putting swap
onto it, dedicated to hibernation. But I discarded the hibernation idea
and removed the swap because it didn't work well: it wasn't much faster
than waking from HDD, and hibernation is not that reliable anyway.
Also, hybrid hibernation is not yet integrated into KDE, so I stick to
sleep mode currently.

The rest of my SSD (also 500GB) is dedicated to bcache. This fits the
complete working set of my daily work, with hit ratios going up to 90% and
beyond. My system boots and feels like SSD, the HDDs are almost
silent, and still my file system is 3TB on 3x 1TB HDD.


> Given that, I've not enabled btrfs trim/discard (which saved me from
> the bugs with it a few kernel cycles ago), and while I do have a
> weekly fstrim systemd timer setup, I've not had to be too concerned
> about btrfs bugs (also now fixed, I believe) when fstrim on btrfs was
> known not to be trimming everything it really should have been.

This is a good recommendation as TRIM is still a slow operation because
Queued TRIM is not used for most drives due to buggy firmware. So you
not only circumvent kernel and firmware bugs, but also get better
performance that way.


> Anyway, that 20-33% left entirely unallocated/unpartitioned 
> recommendation still holds, right?  Am I correct in asserting that if
> one is following that, the FTL already has plenty of erase-blocks
> available for management and the discussion about filesystem level
> trim and free space management becomes much less urgent, tho of
> course it's still worth considering if it's convenient to do so?
> 
> And am I also correct in believing that while it's not really worth 
> spending more to over-provision to the near 50% as I ended up doing,
> if things work out that way as they did with me because the
> difference in price between 30% overprovisioning and 50%
> overprovisioning ends up being trivial, there's really not much need
> to worry about active filesystem trim at all, because the FTL has
> effectively half the device left to play erase-block musical chairs
> with as it decides it needs to?

I think things may have changed since back then. See below. But it
certainly depends on which drive manufacturer you choose, I guess.

I can at least confirm that bigger drives wear through their write cycles much
more slowly, even when filled up. My old 128GB Crucial drive was worn out after
only 1 year (I swapped it early; I kept an eye on the SMART numbers). My
500GB Samsung drive is around 1 year old now, and I write a lot more
data to it, but according to SMART it should work for at least 5 to 7
more years. By that time, I will probably already have swapped it for a bigger
drive.

So I guess you should maybe look at your SMART numbers and calculate
the expected life time:

Power_on_Hours(RAW) * WLC(VALUE) / (100-WLC(VALUE))
with WLC = Wear_Leveling_Count

should get you the expected remaining power-on hours. My drive is
powered on 24/7 most of the time, but if you power your drive only 8
hours per day, you can easily get about three times as many days of
lifetime out of it compared to me. ;-)
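
Or as a tiny calculator for anyone who wants to plug in their own numbers
from "smartctl -A" (the values below are made-up examples):

def remaining_hours(power_on_hours_raw, wlc_value):
    # wlc_value = normalized Wear_Leveling_Count (100 = new, 0 = worn out)
    return power_on_hours_raw * wlc_value / (100 - wlc_value)

poh = 8760   # example: about one year powered on 24/7
wlc = 85     # example: 15% of the rated wear used up
hours = remaining_hours(poh, wlc)
print("expected remaining power-on hours: %.0f (~%.1f years at 24/7)"
      % (hours, hours / 8760))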

There is also Total_LBAs_Written but that, at least for me, usually
gives much higher lifetime values so I'd stick with the pessimistic
ones.

Even when WLC goes to zero, the drive should still have reserved blocks
available. My drive sets the threshold to 0 for WLC which makes me
think that it is not fatal when it hits 0 because the drive still has
reserved blocks. And for reserved blocks, the threshold is 10%.

Now combine that with your planning of getting a new drive, and you can
optimize space efficiency vs. lifetime better.


> Of course the higher per-GiB cost of ssd as compared to spinning rust 
> does mean that the above overprovisioning recommendation really does 
> hurt, most of the time, driving per-usable-GB costs even higher, and
> as I recall that was definitely the case back then between 80 GiB and
> 160 GiB, and it was basically an accident of timing, that I was
> buying just as the manufactures flooded the market with newly
> cost-effective 256 GB devices, that meant they were only trivially
> more expensive than the 128 or 160 GB, AND unlike the smaller
> devices, actually /available/ in the 500-ish MB/sec performance range
> that (for SATA-based SSDs) is actually capped by SATA-600 bus speeds
> more than the chips themselves.  (There were lower cost 128 GB
> devices, but they were lower speed than I wanted, too.)

Well, I think that most modern drives have a huge fast write cache in
front of the FTL to combine writes and reduce rewrite patterns to the
flash storage. This should already help a lot. Otherwise I cannot
explain how endurance tests show multi-petabyte write endurance even
for TLC drives that are specified with only a few hundred terabytes of
write endurance.

So, depending on your write patterns, overprovisioning may not be that
important these days. E.g., Samsung even removed the overprovisioning
feature from their most recent major update to the Magician software; I
believe that is for this reason. Plus, modern Windows versions
do proper trimming (it is built into the system defrag tool, which is
now enabled by default on SSDs but only "optimizes" by trimming free
space; Windows Server versions even allow thinning host disk images
that way when used in virtualization environments).

But overprovisioning should still get you a faster drive while you fill
up your FS because it can handle slow erase cycles in the background
easily.

But prices drop and technology improves, so we can optimize from both
sides and can lower overprovisioning while still staying with an optimal
lifetime and big storage size.

Regarding performance, it seems that only drives at and beyond the
500GB mark can saturate the SATA-600 bus both for reading and writing.
From this I conclude I should get a PCIe SSD if I buy a drive beyond
500GB.


-- 
Regards,
Kai

Replies to list-only preferred.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-12 13:02     ` Btrfs/SSD Imran Geriskovan
  2017-05-12 18:36       ` Btrfs/SSD Kai Krakow
@ 2017-05-14  8:46       ` Duncan
  1 sibling, 0 replies; 49+ messages in thread
From: Duncan @ 2017-05-14  8:46 UTC (permalink / raw)
  To: linux-btrfs

Imran Geriskovan posted on Fri, 12 May 2017 15:02:20 +0200 as excerpted:

> On 5/12/17, Duncan <1i5t5.duncan@cox.net> wrote:
>> FWIW, I'm in the market for SSDs ATM, and remembered this from a couple
>> weeks ago so went back to find it.  Thanks. =:^)
>>
>> (I'm currently still on quarter-TB generation ssds, plus spinning rust
>> for the larger media partition and backups, and want to be rid of the
>> spinning rust, so am looking at half-TB to TB, which seems to be the
>> pricing sweet spot these days anyway.)
> 
> Since you are taking ssds mainstream based on your experience,
> I guess your perception is that their data retention/reliability is
> better than that of spinning rust. Right? Can you elaborate?
> 
> Another criterion might be the physical constraints of spinning rust in
> notebooks, which dictate that you should handle the device with care
> while it is running.
> 
> What was your primary motivation other than performance?

Well, the /immediate/ motivation is that the spinning rust is starting to 
hint that it's time to start thinking about rotating it out of service...

It's my main workstation so wall powered, but because it's the media and 
secondary backups partitions, I don't have anything from it mounted most 
of the time and because it /is/ spinning rust, I allow it to spin down.  
It spins right back up if I mount it, and reads seem to be fine, but if I 
let it sit a bit after mount, possibly due to it spinning down again, 
sometimes I get write errors, SATA resets, etc.  Sometimes the write will 
then eventually appear to go thru, sometimes not, but once this happens, 
unmounting often times out, and upon a remount (which may or may not work 
until a clean reboot), the last writes may or may not still be there.

And the smart info, while not bad, does indicate it's starting to age, 
tho not extremely so.

Now even a year ago I'd have likely played with it, adjusting timeouts, 
spindowns, etc, attempting to get it working normally again.

But they say that ssd performance spoils you and you don't want to go 
back, and while it's a media drive and performance isn't normally an 
issue, those secondary backups to it as spinning rust sure take a lot 
longer than the primary backups to other partitions on the same pair of 
ssds that the working copies (of everything but media) are on.

Which means I don't like to do them... which means sometimes I put them 
off longer than I should.  Basically, it's another application of my 
"don't make it so big it takes so long to maintain you don't do it as you 
should" rule, only here, it's not the size but rather because I've been 
spoiled by the performance of the ssds.


So couple the aging spinning rust with the fact that I've really wanted 
to put media and the backups on ssd all along, only it couldn't be cost-
justified a few years ago when I bought the original ssds, and I now have 
my excuse to get the now cheaper ssds I really wanted all along. =:^)


As for reliability...  For archival usage I still think spinning rust is 
more reliable, and certainly more cost effective.

However, for me at least, with some real-world ssd experience under my 
belt now, including an early slow failure (more and more blocks going 
bad, I deliberately kept running it in btrfs raid1 mode with scrubs 
handling the bad blocks for quite some time, just to get the experience 
both with ssds and with btrfs) and replacement of one of the ssds with 
one I had originally bought for a different machine (my netbook, which 
went missing shortly thereafter), I now find ssds reliable enough for 
normal usage, certainly so if the data is valuable enough to have backups 
of it anyway, and if it's not valuable enough to be worth doing backups, 
then losing it is obviously not a big deal, because it's self-evidently 
worth less than the time, trouble and resources of doing that backup.

Particularly so if the speed of ssds helpfully encourages you to keep the 
backups more current than you would otherwise. =:^)

But spinning rust remains appropriate for long-term archival usage, like 
that third-level last-resort backup I like to make, then keep on the 
shelf, or store with a friend, or in a safe deposit box, or whatever, and 
basically never use, but like to have just in case.  IOW, that almost 
certainly write once, read-never, seldom update, last resort backup.  If 
three years down the line there's a fire/flood/whatever, and all I can 
find in the ashes/mud or retrieve from that friend is that three year old 
backup, I'll be glad to still have it.

Of course those who have multi-TB scale data needs may still find 
spinning rust useful as well, because while 4-TB ssds are available now, 
they're /horribly/ expensive.  But with 3D-NAND, even that use-case looks 
like it may go ssd in the next five years or so, leaving multi-year to 
decade-plus archiving, and perhaps say 50-TB-plus (which takes long enough 
to actually write or otherwise do anything with that it's effectively 
archiving as well), as about the only remaining spinning rust holdouts.

Meanwhile, it'll be interesting to see whether, once ssds are used for 
everything else and there's no other legacy hdd territory to expand into, 
they come up with a reasonable archiving solution for them as well.
Considering that if you pick up an old pre-2010 thumb-drive (or the MP3 player 
you found in the back of the drawer) and a burnt CD/DVDROM from the same 
period, the flash-based thumb drive or mp3 player is far more likely to 
have safely held onto its data, one might reasonably believe archival 
flash-style ssds are well within reason.  Basically, we already have 
them, we just have to adjust the physical format a bit, and build and 
market them for that purpose, plus scale down the cost, of course, but 
that could easily come if it were addressed at the same scale at which 
they've been addressing the ssds-as-main-storage problem.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-13  9:39       ` Btrfs/SSD Duncan
  2017-05-13 11:15         ` Btrfs/SSD Janos Toth F.
  2017-05-13 11:34         ` [OT] SSD performance patterns (was: Btrfs/SSD) Kai Krakow
@ 2017-05-14 16:21         ` Chris Murphy
  2017-05-14 18:01           ` Btrfs/SSD Tomasz Kusmierz
  2 siblings, 1 reply; 49+ messages in thread
From: Chris Murphy @ 2017-05-14 16:21 UTC (permalink / raw)
  To: Duncan; +Cc: Btrfs BTRFS

On Sat, May 13, 2017 at 3:39 AM, Duncan <1i5t5.duncan@cox.net> wrote:

> When I was doing my ssd research the first time around, the going
> recommendation was to keep 20-33% of the total space on the ssd entirely
> unallocated, allowing it to use that space as an FTL erase-block
> management pool.

Any brand name SSD has its own reserve above its specified size to
ensure that there's decent performance even when there is no trim
hinting supplied by the OS, and the SSD can therefore only depend on LBA
"overwrites" to know which blocks are to be freed up.


> Anyway, that 20-33% left entirely unallocated/unpartitioned
> recommendation still holds, right?

Not that I'm aware of. I've never done this by literally walling off
space that I won't use. A fairly large percentage of my partitions
have free space, so it does effectively happen as far as the SSD is
concerned. And I use the fstrim timer. Most of the file systems support
trim.

Anyway, I've stuffed a Samsung 840 EVO to 98% full with an OS/file
system that would not issue trim commands on this drive, and it was
doing full performance writes through that point. Then I deleted maybe
5% of the files, and then refilled the drive to 98% again, and it was
the same performance.  So it must have had enough in reserve to permit
full performance "overwrites", which were in effect directed to reserve
blocks as the freed up blocks were being erased. Thus the erasure
happening on the fly was not inhibiting performance on this SSD. Now
had I gone to 99.9% full, and then deleted say 1GiB, and then started
doing a bunch of heavy small file writes rather than sequential ones? I
don't know what would happen; it might have choked, because dealing with
heavy IOPS and erasure at the same time is a lot more work for the SSD.

It will invariably be something that's very model and even firmware
version specific.



>  Am I correct in asserting that if one
> is following that, the FTL already has plenty of erase-blocks available
> for management and the discussion about filesystem level trim and free
> space management becomes much less urgent, tho of course it's still worth
> considering if it's convenient to do so?

Most file systems don't direct writes to new areas; they're fairly
prone to overwriting. So the firmware is going to get notified fairly
quickly, via either trim or an overwrite, of which LBAs are stale. It's
probably more important with Btrfs which has more variable behavior,
it can continue to direct new writes to recently allocated chunks
before it'll do overwrites in older chunks that have free space.


> And am I also correct in believing that while it's not really worth
> spending more to over-provision to the near 50% as I ended up doing, if
> things work out that way as they did with me because the difference in
> price between 30% overprovisioning and 50% overprovisioning ends up being
> trivial, there's really not much need to worry about active filesystem
> trim at all, because the FTL has effectively half the device left to play
> erase-block musical chairs with as it decides it needs to?


I don't think it's ever worth overprovisioning by default. Use all of
that space until you have a problem. If you have a 256G drive, you
paid to get the spec performance for 100% of those 256G. You did not
pay that company to second guess things and cut it slack by
overprovisioning from the outset.

I don't know how long it takes for erasure to happen though, so I have
no idea how much overprovisioning is really needed at the write rate
of the drive, so that it can erase at the same rate as writes, in
order to avoid a slow down.

I guess an even worse test would be one that intentionally fragments
across erase block boundaries, forcing the firmware to be unable to do
erasures without first migrating partially full blocks in order to
make them empty, so they can then be erased, and now be used for new
writes. That sort of shuffling is what will separate the good from
average drives, and why the drives have multicore CPUs on them, as
well as most now having on the fly always on encryption.

Even completely empty, some of these drives have a short term higher
speed write which falls back to a lower speed as the fast flash gets
full. After some pause that fast write capability is restored for
future writes. I have no idea if this is a separate kind of flash on the
drive, or if it's just a difference in encoding data onto the flash
that's faster. Samsung has a drive that can "simulate" SLC NAND on 3D
VNAND. That sounds like an encoding method; it's fast but inefficient
and probably needs reencoding.

But that's the thing, the firmware is really complicated now.

I kinda wonder if f2fs could be chopped down to become a modular
allocator for the existing file systems; activate that allocation
method with "ssd" mount option rather than whatever overly smart thing
it does today that's based on assumptions that are now likely
outdated.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-14 16:21         ` Btrfs/SSD Chris Murphy
@ 2017-05-14 18:01           ` Tomasz Kusmierz
  2017-05-14 20:47             ` Btrfs/SSD (my -o ssd "summary") Hans van Kranenburg
  2017-05-14 23:01             ` Btrfs/SSD Imran Geriskovan
  0 siblings, 2 replies; 49+ messages in thread
From: Tomasz Kusmierz @ 2017-05-14 18:01 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Duncan, Btrfs BTRFS

All the stuff Chris wrote holds true; I just wanted to add some flash-specific information (from my experience writing low-level code for driving flash)

So with flash, to erase you have to erase a whole large allocation block. It usually used to be 128kB (plus some CRC data and such that makes it more than 128kB, but we are talking functional data storage space); on newer designs it can be megabytes … really device dependent.
To erase a block you need to drive the whole block (all of those 128k x 8 bits) with a voltage higher than is usually used for IO (it can be as much as 15V), so it requires an external supply or a built-in charge pump to provide that voltage to the block erasure circuitry. This process generates a lot of heat and requires a lot of energy, so the consensus back in the day was that you could erase only one block at a time, and this could take up to 200ms (0.2 seconds). After an erase you need to check whether all bits are set to 1 (the charged state), and then the sector is marked as ready for storage.

Of course, flash memories are moving forward, and in more demanding environments there are solutions where blocks are grouped, with each group having a separate eraser circuit, allowing erasure to be performed in parallel in multiple parts of the flash module; still, you are bound to one erase at a time per group.
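
Rough arithmetic with the numbers above (purely illustrative; real figures are very device dependent):

erase_block = 128 * 1024    # legacy 128kB erase block
erase_time = 0.2            # up to 200ms per block erase

# a single erase circuit, strictly one block at a time:
print("worst-case erase rate: %.0f kB/s" % (erase_block / erase_time / 1024))
# -> 640 kB/s, which is why a drive with no pre-erased blocks left can
# drop to low single-digit MB/s on sustained writes

# with N independently erasable groups working in parallel:
for groups in (4, 16):
    print("%2d groups: %.1f MB/s" % (groups, groups * erase_block / erase_time / 1e6))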

Another problem is that the erasure procedure locally increases the temperature. On flat flash it's not that much of a problem, but on emerging solutions like 3D flash we might locally experience undesired temperature increases that would either degrade the life span of the flash or simply erase neighbouring blocks. 

In terms of over provisioning of an SSD, it’s a give and take relationship … on a good drive there is enough over provisioning to allow normal operation on systems without TRIM … now, if you used a 1TB drive daily without TRIM and had only 30GB stored on it, you would have fantastic performance, but if you then wanted to store 500GB, at roughly 200GB you would hit a brick wall and your writes would slow down to megabytes/s … this is a symptom of the drive running out of over provisioning space … if you ran an OS that issues trim, this problem would not exist, since the drive would know that the whole 970GB of space is free and it would have been pre-emptively erased days before. 

And the last part - the drive is not aware of filesystems and partitions … so you could have 400GB of this 1TB drive left unpartitioned and still you would be cooked. Technically speaking, giving as much space as possible on an SSD to a FS and OS that support trim will give you the best performance, because the drive will be notified about as much as possible of the disk space that is actually free …..
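
To illustrate the TRIM point with a toy model (this is nowhere near how a real FTL works, the block counts are arbitrary, and in reality overwrites of the same LBA also free up old copies):

class ToyFTL:
    def __init__(self, exported_blocks, spare_blocks):
        self.total = exported_blocks + spare_blocks
        self.live = 0
        self.stale = 0                 # freed, but the drive doesn't know it
        self.erased = self.total       # pre-erased, ready-to-write pool
        self.slow_writes = 0

    def write_block(self):
        if self.erased == 0:
            # no pre-erased block left: erase a stale one on the write path
            # (garbage collection while the host waits -> the "brick wall")
            self.stale -= 1
            self.erased += 1
            self.slow_writes += 1
        self.erased -= 1
        self.live += 1

    def delete_block(self, trimmed):
        self.live -= 1
        if trimmed:
            self.erased += 1           # drive can erase it ahead of time
        else:
            self.stale += 1            # drive only learns much later

for trim in (True, False):
    ftl = ToyFTL(exported_blocks=1000, spare_blocks=70)   # ~7% hidden spare
    for _ in range(5):                     # fill to 80% and free it again, 5x
        for _ in range(800):
            ftl.write_block()
        for _ in range(800):
            ftl.delete_block(trim)
    # with trim every write stays fast; without it, most writes end up
    # waiting for an erase
    print("trim=%s -> writes that had to wait for an erase: %d"
          % (trim, ftl.slow_writes))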

So, to summarize: 
- don’t try to outsmart the built-in mechanics of the SSD (people who suggest that are just morons who want their 5 minutes of fame).
- don’t buy a crap SSD and expect it to behave like a good one just because you use less than a certain % of it … it’s stupid; buy a more reasonable but smaller SSD and store the slow data on spinning rust.
- read more books and Wikipedia; I'm not jumping on you, but the internet is filled with people who provide false information, sometimes unknowingly, and swear by it (Dunning–Kruger effect :D), and some of them are very good at making all their theories sound sexy and such … you simply have to get used to it… 
- if something is too good to be true, then it’s not
- the promise of future performance gains is the domain of the “sleazy salesman"





^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD (my -o ssd "summary")
  2017-05-14 18:01           ` Btrfs/SSD Tomasz Kusmierz
@ 2017-05-14 20:47             ` Hans van Kranenburg
  2017-05-14 23:01             ` Btrfs/SSD Imran Geriskovan
  1 sibling, 0 replies; 49+ messages in thread
From: Hans van Kranenburg @ 2017-05-14 20:47 UTC (permalink / raw)
  To: Tomasz Kusmierz; +Cc: Btrfs BTRFS

On 05/14/2017 08:01 PM, Tomasz Kusmierz wrote:
> All the stuff Chris wrote holds true; I just wanted to add some
> flash-specific information (from my experience writing low-level code
> for driving flash)

Thanks!

> [... erase ...]

> In terms of over provisioning of an SSD, it’s a give and take
> relationship … on a good drive there is enough over provisioning to
> allow normal operation on systems without TRIM … now, if you used
> a 1TB drive daily without TRIM and had only 30GB stored on it,
> you would have fantastic performance, but if you then wanted to store
> 500GB, at roughly 200GB you would hit a brick wall and your writes would
> slow down to megabytes/s … this is a symptom of the drive running out of
> over provisioning space … if you ran an OS that issues trim, this
> problem would not exist, since the drive would know that the whole 970GB of
> space is free and it would have been pre-emptively erased days before.

== ssd_spread ==

The worst case behaviour is the btrfs ssd_spread mount option in
combination with not having discard enabled. It has a side effect of
minimizing the reuse of free space previously written to.

== ssd ==

[And, since I didn't write a "summary post" about this issue yet, here
is my version of it:]

The default mount options you get for an ssd ('ssd' mode enabled,
'discard' not enabled), in combination with writing and deleting many
files that are not too big, also cause this pattern, ending up with the
physical address space fully allocated and written to.

My favourite videos about this: *)

ssd (write pattern is small increments in /var/log/mail.log, a mail
spool on /var/spool/postfix (lots of file adds and deletes), and mailman
archives with a lot of little files):

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

*) The picture uses Hilbert Curve ordering (see link below) and shows
the four last created DATA block groups appended together. (so a new
chunk allocation pushes the others back in the picture).
https://github.com/knorrie/btrfs-heatmap/blob/master/doc/curves.md

 * What the ssd mode does is simply set a lower boundary on the
size of free space fragments that are reused.
 * In combination with always trying to walk forward inside a block
group, not looking back at freed up space, it fills up with a shotgun
blast pattern when you do writes and deletes all the time.
 * When a write comes in that is bigger than any free space part left
behind, a new chunk gets allocated, and the bad pattern continues in there.
 * Because it keeps allocating more and more new chunks, and keeps
circling around in the latest one, until a big write is done, it leaves
mostly empty ones behind.
 * Without 'discard', the SSD will never learn that all the free space
left behind is actually free.
 * Eventually all raw disk space is allocated, and users run into
problems with ENOSPC and balance etc.

So, enabling this ssd mode actually means it starts choking itself to
death here.
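
If you want to play with the idea, here is a deliberately crude toy allocator
that isolates just the "minimum reusable free fragment size" effect (it is not
the real btrfs allocator; chunk size, threshold and workload are made up):

import random

def simulate(min_free, writes=2000, chunk_units=64, live_target=100, seed=1):
    random.seed(seed)
    chunks = []          # each chunk is a list of bools, True = in use
    live = []            # (chunk_index, unit_index) of data still alive

    def allocate_unit():
        # reuse the first free run of at least min_free units, else new chunk
        for ci, chunk in enumerate(chunks):
            run = 0
            for ui in range(chunk_units):
                run = 0 if chunk[ui] else run + 1
                if run >= min_free:
                    pos = ui - run + 1
                    chunk[pos] = True
                    return ci, pos
        chunks.append([False] * chunk_units)
        chunks[-1][0] = True
        return len(chunks) - 1, 0

    for _ in range(writes):
        live.append(allocate_unit())
        if len(live) > live_target:    # steady state: delete something
            ci, ui = live.pop(random.randrange(len(live)))
            chunks[ci][ui] = False

    return len(chunks)

print("reuse any free unit        -> chunks ever allocated:", simulate(min_free=1))
print("only reuse runs >= 4 units -> chunks ever allocated:", simulate(min_free=4))
# the second variant ends up allocating more chunks for the same ~100 units
# of live data, leaving them partially empty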

When users see this effect, they start scheduling balance operations, to
compact free space to bring the amount of allocated but unused space
down a bit.
 * But, doing that is causing just more and more writes to the ssd.
 * Also, since balance takes a "usage" argument and not a "how badly
fragmented" argument, it's causing lots of unnecessary rewriting of data.
 * And, with a decent amount (like a few thousand) subvolumes, all
having a few snapshots of their own, the ratio data:metadata written
during balance is skyrocketing, causing not only the data to be
rewritten, but also causing pushing out lots of metadata to the ssd.
(example: on my backup server rewriting 1GiB of data causes writing of
>40GiB of metadata, where probably 99.99% of those writes are some kind
of intermediary writes which are immediately invalidated during the next
btrfs transaction that is done).

All in all, this reminds me of the series "Breaking Bad", where every
step taken to try to fix things only made things worse. The same thing
happens at every bullet point above.

== nossd ==

nossd mode (even while still not using discard) allows a write pattern that
overwrites much more previously used space, causing many more implicit
discards to happen because of the overwrite information the ssd gets.

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

> And last part - hard drive is not aware of filesystem and partitions
> … so you could have 400GB on this 1TB drive left unpartitioned and
> still you would be cooked. Technically speaking using as much as
> possible space on a SSD to a FS and OS that supports trim will give
> you best performance because drive will be notified of as much as
> possible disk space that is actually free …..
> 
> So, to summaries:

> - don’t try to outsmart built in mechanics of SSD (people that
> suggest that are just morons that want to have 5 minutes of fame).

This is exactly what the btrfs ssd options are trying to do.

Still, I don't think it's very nice to call Chris Mason "just a moron". ;-]

However, from the information we found out, and from the various
discussions and real-life behavioural measurements (for me, the ones
above), I think it's pretty clear now that the assumptions made 10 years
ago are not valid, or no longer valid, if they ever were.

I think the ssd options are actually worse for ssds. D:

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-14 18:01           ` Btrfs/SSD Tomasz Kusmierz
  2017-05-14 20:47             ` Btrfs/SSD (my -o ssd "summary") Hans van Kranenburg
@ 2017-05-14 23:01             ` Imran Geriskovan
  2017-05-15  0:23               ` Btrfs/SSD Tomasz Kusmierz
  2017-05-15  0:24               ` Btrfs/SSD Tomasz Kusmierz
  1 sibling, 2 replies; 49+ messages in thread
From: Imran Geriskovan @ 2017-05-14 23:01 UTC (permalink / raw)
  To: Tomasz Kusmierz; +Cc: Chris Murphy, Duncan, Btrfs BTRFS

On 5/14/17, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote:
> In terms of over provisioning of SSD it’s a give and take relationship … on
> good drive there is enough over provisioning to allow a normal operation on
> systems without TRIM … now if you would use a 1TB drive daily without TRIM
> and have only 30GB stored on it you will have fantastic performance but if
> you will want to store 500GB at roughly 200GB you will hit a brick wall and
> you writes will slow dow to megabytes / s … this is symptom of drive running
> out of over provisioning space …

What exactly happens on a non-trimmed drive?
Does it begin to forge certain erase-blocks? If so
which are those? What happens when you never
trim and continue dumping data on it?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-14 23:01             ` Btrfs/SSD Imran Geriskovan
@ 2017-05-15  0:23               ` Tomasz Kusmierz
  2017-05-15  0:24               ` Btrfs/SSD Tomasz Kusmierz
  1 sibling, 0 replies; 49+ messages in thread
From: Tomasz Kusmierz @ 2017-05-15  0:23 UTC (permalink / raw)
  To: Imran Geriskovan; +Cc: Chris Murphy, Duncan, Btrfs BTRFS

Theoretically all sectors in the over-provisioning area are erased - practically they are either erased, waiting to be erased, or broken.

What you have to understand is that sectors on an SSD are not where you think they are - they can swap places with sectors in the over-provisioning area, they can swap places with each other, etc. … the stuff you see as a disk from 0 to MAX does not have to be arranged in sequence on the SSD (and mostly never is).

If you never trim - when your device is 100% full - you need to start overwriting data to keep writing - this is where over-provisioning shines: the ssd pretends that you write to a sector while you really write to a sector in the over-provisioning area, and those magically swap places without you knowing -> the sector that was occupied ends up in the over-provisioning pool and the SSD hardware performs a slow erase on it to make it free for the future. This mechanism is simple and transparent for users -> you don't know that it happens and the SSD does all the heavy lifting.

The over-provisioned area has more uses than that. For example, if you have a 1TB drive where you store 500GB of data that you never modify -> the SSD will copy part of that data to the over-provisioned area -> free up sectors that were not written to for a while -> free up sectors that were continuously hammered by writes and write the static data there. This mechanism is wear levelling - it means that the SSD internals make sure that sectors on the SSD get equal use over time. Despite some thinking that it's pointless, imagine a situation where you've got a 1TB drive with 1GB free and you keep writing and modifying data in this 1GB of free space … those sectors would quickly die due to short flash life expectancy (some as short as 1k erases!!!!!).

So again, buy good quality drives (not hardcore enterprise drives, just good consumer ones), leave that stuff to the drive, use an OS that gives you trim, and you should be golden !!!!
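If it helps to picture that invisible swap, here is a very rough toy model in
Python of the mechanism described above. The block counts, the FIFO garbage
collection and the "one erase per write" cost are made-up simplifications,
not how any particular controller works:

from collections import deque

class ToyFTL:
    def __init__(self, logical_blocks, spare_blocks):
        total = logical_blocks + spare_blocks
        self.mapping = {l: l for l in range(logical_blocks)}  # logical -> physical
        self.erased = deque(range(logical_blocks, total))     # pre-erased over-provisioning pool
        self.dirty = deque()                                   # invalidated copies waiting for erase
        self.slow_writes = 0

    def write(self, logical):
        if not self.erased:                 # pool exhausted: erase in the write path (slow!)
            self.erased.append(self.dirty.popleft())
            self.slow_writes += 1
        new = self.erased.popleft()
        old = self.mapping[logical]
        self.mapping[logical] = new         # the invisible "swap" described above
        self.dirty.append(old)              # the old copy is now garbage

ftl = ToyFTL(logical_blocks=90, spare_blocks=10)   # a "full" drive, ~10% over-provisioning
for i in range(1000):
    ftl.write(i % 90)                              # keep overwriting already-used space
print(f"{ftl.slow_writes} of 1000 writes had to wait for an erase")

On a full, never-trimmed drive almost every write ends up waiting for an
erase in this toy model, which is the slowdown described earlier in the
thread; with TRIM the erased pool would effectively include all the free
space the filesystem knows about.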


> On 15 May 2017, at 00:01, Imran Geriskovan <imran.geriskovan@gmail.com> wrote:
> 
> On 5/14/17, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote:
>> In terms of over provisioning of SSD it’s a give and take relationship … on
>> good drive there is enough over provisioning to allow a normal operation on
>> systems without TRIM … now if you would use a 1TB drive daily without TRIM
>> and have only 30GB stored on it you will have fantastic performance but if
>> you will want to store 500GB at roughly 200GB you will hit a brick wall and
>> you writes will slow dow to megabytes / s … this is symptom of drive running
>> out of over provisioning space …
> 
> What exactly happens on a non-trimmed drive?
> Does it begin to forge certain erase-blocks? If so
> which are those? What happens when you never
> trim and continue dumping data on it?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-14 23:01             ` Btrfs/SSD Imran Geriskovan
  2017-05-15  0:23               ` Btrfs/SSD Tomasz Kusmierz
@ 2017-05-15  0:24               ` Tomasz Kusmierz
  2017-05-15 11:25                 ` Btrfs/SSD Imran Geriskovan
  1 sibling, 1 reply; 49+ messages in thread
From: Tomasz Kusmierz @ 2017-05-15  0:24 UTC (permalink / raw)
  To: Imran Geriskovan; +Cc: Chris Murphy, Duncan, Btrfs BTRFS

Theoretically all sectors in the over-provisioning area are erased - practically they are either erased, waiting to be erased, or broken.

What you have to understand is that sectors on an SSD are not where you think they are - they can swap places with sectors in the over-provisioning area, they can swap places with each other, etc. … the stuff you see as a disk from 0 to MAX does not have to be arranged in sequence on the SSD (and mostly never is).

If you never trim - when your device is 100% full - you need to start overwriting data to keep writing - this is where over-provisioning shines: the ssd pretends that you write to a sector while you really write to a sector in the over-provisioning area, and those magically swap places without you knowing -> the sector that was occupied ends up in the over-provisioning pool and the SSD hardware performs a slow erase on it to make it free for the future. This mechanism is simple and transparent for users -> you don't know that it happens and the SSD does all the heavy lifting.

The over-provisioned area has more uses than that. For example, if you have a 1TB drive where you store 500GB of data that you never modify -> the SSD will copy part of that data to the over-provisioned area -> free up sectors that were not written to for a while -> free up sectors that were continuously hammered by writes and write the static data there. This mechanism is wear levelling - it means that the SSD internals make sure that sectors on the SSD get equal use over time. Despite some thinking that it's pointless, imagine a situation where you've got a 1TB drive with 1GB free and you keep writing and modifying data in this 1GB of free space … those sectors would quickly die due to short flash life expectancy (some as short as 1k erases!!!!!).

So again, buy good quality drives (not hardcore enterprise drives, just good consumer ones), leave that stuff to the drive, use an OS that gives you trim, and you should be golden !!!!

> On 15 May 2017, at 00:01, Imran Geriskovan <imran.geriskovan@gmail.com> wrote:
> 
> On 5/14/17, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote:
>> In terms of over provisioning of SSD it’s a give and take relationship … on
>> good drive there is enough over provisioning to allow a normal operation on
>> systems without TRIM … now if you would use a 1TB drive daily without TRIM
>> and have only 30GB stored on it you will have fantastic performance but if
>> you will want to store 500GB at roughly 200GB you will hit a brick wall and
>> you writes will slow dow to megabytes / s … this is symptom of drive running
>> out of over provisioning space …
> 
> What exactly happens on a non-trimmed drive?
> Does it begin to forge certain erase-blocks? If so
> which are those? What happens when you never
> trim and continue dumping data on it?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-15  0:24               ` Btrfs/SSD Tomasz Kusmierz
@ 2017-05-15 11:25                 ` Imran Geriskovan
  0 siblings, 0 replies; 49+ messages in thread
From: Imran Geriskovan @ 2017-05-15 11:25 UTC (permalink / raw)
  To: Tomasz Kusmierz; +Cc: Chris Murphy, Duncan, Btrfs BTRFS

On 5/15/17, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote:
> Theoretically all sectors in over provision are erased - practically they
> are either erased or waiting to be erased or broken.
> Over provisioned area does have more uses than that. For example if you have
> a 1TB drive where you store 500GB of data that you never modify -> SSD will
> copy part of that data to over provisioned area -> free sectors that were
> unwritten for a while -> free sectors that were continuously hammered by
> writes and write a static data there. This mechanism is wear levelling - it
> means that SSD internals make sure that sectors on SSD have an equal use
> over time. Despite of some thinking that it’s pointless imagine situation
> where you’ve got a 1TB drive with 1GB free and you keep writing and
> modifying data in this 1GB free … those sectors will quickly die due to
> short flash life expectancy ( some as short as 1k erases !!!!! ).

Thanks for the info. My understanding is that the drive
has a pool of erase blocks, of which some portion (say 90-95%)
is exposed as usable. Trimmed blocks are candidates
for new allocations. If the drive is not trimmed, the allocatable
pool becomes smaller than it could be, and new allocations
under the wear-levelling logic are made from that smaller group.
This will probably increase data traffic on that "small group"
of blocks, eating into their erase cycles.

However, this logic only holds if the drive does NOT move
data from occupied blocks to trimmed/available ones.

Under some advanced wear-levelling operations, the drive may
decide to swap two blocks (one occupied, one vacant) if the
cumulative erase count of the former is much lower than
that of the latter, to provide some balancing effect.

Theoretically, swapping may even occur when the flash tends
to lose charge (and thus data), based on the age of the
data and/or block health.

But in any case I understand that trimming gives the drive
an important degree of freedom and helps its health.
Without trimming, the drive will continue to deal with worthless
blocks simply because it doesn't know they are worthless...
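To put rough numbers on that "smaller group" effect, a back-of-the-envelope
sketch in Python. All figures are invented and write amplification is
ignored; it only shows the proportionality:

over_prov_gb       = 70     # hidden spare area the drive always has (assumed)
fs_free_gb         = 500    # space the filesystem is not actually using (assumed)
host_writes_gb_day = 50     # daily write volume (assumed)

def erase_cycles_per_block_per_day(pool_gb):
    # every GB written eventually costs about a GB of erasing, spread
    # evenly (ideal wear levelling) over the pool the drive knows is free
    return host_writes_gb_day / pool_gb

print("no TRIM  :", erase_cycles_per_block_per_day(over_prov_gb))
print("with TRIM:", erase_cycles_per_block_per_day(over_prov_gb + fs_free_gb))

The same write volume costs each block in the pool roughly 8x fewer erase
cycles per day in this example once the drive also knows about the untouched
free space.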

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-12 18:27     ` Btrfs/SSD Kai Krakow
  2017-05-12 20:31       ` Btrfs/SSD Imran Geriskovan
  2017-05-13  9:39       ` Btrfs/SSD Duncan
@ 2017-05-15 11:46       ` Austin S. Hemmelgarn
  2017-05-15 19:22         ` Btrfs/SSD Kai Krakow
  2 siblings, 1 reply; 49+ messages in thread
From: Austin S. Hemmelgarn @ 2017-05-15 11:46 UTC (permalink / raw)
  To: linux-btrfs

On 2017-05-12 14:27, Kai Krakow wrote:
> Am Tue, 18 Apr 2017 15:02:42 +0200
> schrieb Imran Geriskovan <imran.geriskovan@gmail.com>:
>
>> On 4/17/17, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:
>>> Regarding BTRFS specifically:
>>> * Given my recently newfound understanding of what the 'ssd' mount
>>> option actually does, I'm inclined to recommend that people who are
>>> using high-end SSD's _NOT_ use it as it will heavily increase
>>> fragmentation and will likely have near zero impact on actual device
>>> lifetime (but may _hurt_ performance).  It will still probably help
>>> with mid and low-end SSD's.
>>
>> I'm trying to have a proper understanding of what "fragmentation"
>> really means for an ssd and interrelation with wear-leveling.
>>
>> Before continuing lets remember:
>> Pages cannot be erased individually, only whole blocks can be erased.
>> The size of a NAND-flash page size can vary, and most drive have pages
>> of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256
>> pages, which means that the size of a block can vary between 256 KB
>> and 4 MB.
>> codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/
>>
>> Lets continue:
>> Since block sizes are between 256k-4MB, data smaller than this will
>> "probably" will not be fragmented in a reasonably empty and trimmed
>> drive. And for a brand new ssd we may speak of contiguous series
>> of blocks.
>>
>> However, as drive is used more and more and as wear leveling kicking
>> in (ie. blocks are remapped) the meaning of "contiguous blocks" will
>> erode. So any file bigger than a block size will be written to blocks
>> physically apart no matter what their block addresses says. But my
>> guess is that accessing device blocks -contiguous or not- are
>> constant time operations. So it would not contribute performance
>> issues. Right? Comments?
>>
>> So your the feeling about fragmentation/performance is probably
>> related with if the file is spread into less or more blocks. If # of
>> blocks used is higher than necessary (ie. no empty blocks can be
>> found. Instead lots of partially empty blocks have to be used
>> increasing the total # of blocks involved) then we will notice
>> performance loss.
>>
>> Additionally if the filesystem will gonna try something to reduce
>> the fragmentation for the blocks, it should precisely know where
>> those blocks are located. Then how about ssd block informations?
>> Are they available and do filesystems use it?
>>
>> Anyway if you can provide some more details about your experiences
>> on this we can probably have better view on the issue.
>
> What you really want for SSD is not defragmented files but defragmented
> free space. That increases life time.
>
> So, defragmentation on SSD makes sense if it cares more about free
> space but not file data itself.
>
> But of course, over time, fragmentation of file data (be it meta data
> or content data) may introduce overhead - and in btrfs it probably
> really makes a difference if I scan through some of the past posts.
>
> I don't think it is important for the file system to know where the SSD
> FTL located a data block. It's just important to keep everything nicely
> aligned with erase block sizes, reduce rewrite patterns, and free up
> complete erase blocks as good as possible.
>
> Maybe such a process should be called "compaction" and not
> "defragmentation". In the end, the more continuous blocks of free space
> there are, the better the chance for proper wear leveling.

There is one other thing to consider though.  From a practical 
perspective, performance on an SSD is a function of the number of 
requests and what else is happening in the background.  The second 
aspect isn't easy to eliminate on most systems, but the first is pretty 
easy to mitigate by defragmenting data.

Reiterating the example I made elsewhere in the thread:
Assume you have an SSD and storage controller that can use DMA to 
transfer up to 16MB of data off of the disk in a single operation.  If 
you need to load a 16MB file off of this disk and it's properly aligned 
(it usually will be with most modern filesystems if the partition is 
properly aligned) and defragmented, it will take exactly one operation 
(assuming that doesn't get interrupted).  By contrast, if you have 16 
fragments of 1MB each, that will take at minimum 2 operations, and more 
likely 15-16 (depends on where everything is on-disk, and how smart the 
driver is about minimizing the number of required operations).  Each 
request has some amount of overhead to set up and complete, so the first 
case (one single extent) will take less total time to transfer the data 
than the second one.
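A quick sketch of that request-counting argument, under the same assumptions
as the example (a 16 MiB maximum transfer per operation, nothing else
competing for the device, and a block layer that merges physically adjacent
extents):

MAX_XFER = 16 * 1024 * 1024            # 16 MiB per operation, as in the example above

def requests(extents):                 # extents: list of (disk_offset, length) in bytes
    merged = []
    for off, length in sorted(extents):
        if merged and merged[-1][0] + merged[-1][1] == off:
            merged[-1][1] += length    # physically adjacent: the block layer can merge
        else:
            merged.append([off, length])
    # each contiguous run needs ceil(length / MAX_XFER) operations
    return sum((length + MAX_XFER - 1) // MAX_XFER for _, length in merged)

MB = 1024 * 1024
print(requests([(0, 16 * MB)]))                               # one 16 MiB extent -> 1
print(requests([(i * 32 * MB, MB) for i in range(16)]))       # 16 scattered 1 MiB extents -> 16

Each of those operations carries its own setup/completion overhead, which is
the whole point of the comparison.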

This particular effect actually impacts almost any data transfer, not 
just pulling data off of an SSD (this is why jumbo frames are important 
for high-performance networking, and why a higher latency timer on the 
PCI bus will improve performance (but conversely increase latency)), 
even when fetching data from a traditional hard drive (but it's not very 
noticeable there unless your fragments are tightly grouped, because seek 
latency dominates performance).

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-12 18:36       ` Btrfs/SSD Kai Krakow
  2017-05-13  9:52         ` Btrfs/SSD Roman Mamedov
@ 2017-05-15 12:03         ` Austin S. Hemmelgarn
  2017-05-15 13:09           ` Btrfs/SSD Tomasz Kusmierz
  2017-05-15 19:49           ` Btrfs/SSD Kai Krakow
  1 sibling, 2 replies; 49+ messages in thread
From: Austin S. Hemmelgarn @ 2017-05-15 12:03 UTC (permalink / raw)
  To: linux-btrfs

On 2017-05-12 14:36, Kai Krakow wrote:
> Am Fri, 12 May 2017 15:02:20 +0200
> schrieb Imran Geriskovan <imran.geriskovan@gmail.com>:
>
>> On 5/12/17, Duncan <1i5t5.duncan@cox.net> wrote:
>>> FWIW, I'm in the market for SSDs ATM, and remembered this from a
>>> couple weeks ago so went back to find it.  Thanks. =:^)
>>>
>>> (I'm currently still on quarter-TB generation ssds, plus spinning
>>> rust for the larger media partition and backups, and want to be rid
>>> of the spinning rust, so am looking at half-TB to TB, which seems
>>> to be the pricing sweet spot these days anyway.)
>>
>> Since you are taking ssds to mainstream based on your experience,
>> I guess your perception of data retension/reliability is better than
>> that of spinning rust. Right? Can you eloborate?
>>
>> Or an other criteria might be physical constraints of spinning rust
>> on notebooks which dictates that you should handle the device
>> with care when running.
>>
>> What was your primary motivation other than performance?
>
Personally, I don't really trust SSDs so much. They are much more
robust when it comes to physical damage because there are no moving
parts. That's absolutely not my concern. In that regard, I trust SSDs
more than HDDs.
>
My concern is with the failure scenarios of some SSDs which die unexpectedly
and horribly. I found some reports of older Samsung SSDs which failed
suddenly and unexpectedly, and in a way that the drive completely died:
no more data access, everything gone. HDD failures usually start with bad
sectors, and there's a good chance I can recover most of the data except
a few sectors.
Older is the key here.  Some early SSD's did indeed behave like that, 
but most modern ones do generally show signs that they will fail in the 
near future.  There's also the fact that traditional hard drives _do_ 
fail like that sometimes, even without rough treatment.
>
> When SSD blocks die, they are probably huge compared to a sector (256kB
> to 4MB usually because that's erase block sizes). If this happens, the
> firmware may decide to either allow read-only access or completely deny
> access. There's another situation where dying storage chips may
> completely mess up the firmware and there's no longer any access to
> data.
I've yet to see an SSD that blocks user access to an erase block. 
Almost every one I've seen will instead rewrite the block (possibly with 
the corrupted data intact (that is, without mangling it further)) to one 
of the reserve blocks, and then just update its internal mapping so 
that the old block doesn't get used and the new one points to the 
right place.  Some of the really good SSD's even use erasure coding in 
the FTL for data verification instead of CRC's, so they can actually 
reconstruct the missing bits when they do this.
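A minimal sketch of that "rewrite elsewhere and update the mapping" step,
purely for illustration (no real controller works on a Python dict, and the
data-recovery part is only hinted at in a comment):

class ToyBlockMap:
    def __init__(self, user_blocks, reserve_blocks):
        self.map = {lba: lba for lba in range(user_blocks)}              # LBA -> physical block
        self.reserve = list(range(user_blocks, user_blocks + reserve_blocks))
        self.retired = set()

    def retire(self, lba, recovered_data):
        bad_phys = self.map[lba]
        new_phys = self.reserve.pop()       # take a block from the reserve pool
        # a real drive would now program recovered_data (possibly reconstructed
        # via CRCs or erasure coding) into new_phys; here we only track the mapping
        self.map[lba] = new_phys            # the host keeps using the same LBA
        self.retired.add(bad_phys)
        return new_phys

m = ToyBlockMap(user_blocks=100, reserve_blocks=4)
print("LBA 42 mapped to", m.map[42], "-> retired, now at", m.retire(42, b"..."))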

Traditional hard drives usually do this too these days (they've been 
under-provisioned since before SSD's existed), which is part of why 
older disks tend to be noisier and slower (the reserved space is usually 
at the far inside or outside of the platter, so using sectors from there 
to replace stuff leads to long seeks).
>
> That's why I don't trust any of my data to them. But I still want the
> benefit of their speed. So I use SSDs mostly as frontend caches to
> HDDs. This gives me big storage with fast access. Indeed, I'm using
> bcache successfully for this. A warm cache is almost as fast as native
> SSD (at least it feels almost that fast, it will be slower if you threw
> benchmarks at it).
That's to be expected though, most benchmarks don't replicate actual 
usage patterns for client systems, and using SSD's for caching with 
bcache or dm-cache for most server workloads except a file server will 
usually get you a performance hit.

It's worth noting also that on average, COW filesystems like BTRFS (or 
log-structured filesystems) will not benefit as much as traditional 
filesystems from SSD caching unless the caching is built into the 
filesystem itself, since they don't do in-place rewrites (so any new 
write by definition has to drop other data from the cache).

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-15 12:03         ` Btrfs/SSD Austin S. Hemmelgarn
@ 2017-05-15 13:09           ` Tomasz Kusmierz
  2017-05-15 19:12             ` Btrfs/SSD Kai Krakow
  2017-05-15 19:49           ` Btrfs/SSD Kai Krakow
  1 sibling, 1 reply; 49+ messages in thread
From: Tomasz Kusmierz @ 2017-05-15 13:09 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: linux-btrfs


> Traditional hard drives usually do this too these days (they've been under-provisioned since before SSD's existed), which is part of why older disks tend to be noisier and slower (the reserved space is usually at the far inside or outside of the platter, so using sectors from there to replace stuff leads to long seeks).

Not true. When an HDD uses 10% of its space as spares (10% is just an easy example), the alignment on disk is (US - used sector, SS - spare sector, BS - bad sector):

US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS
US US US US US US US US US SS

if a failure occurs, the drive actually shifts sectors up:

US US US US US US US US US SS
US US US BS BS BS US US US US
US US US US US US US US US US
US US US US US US US US US US
US US US US US US US US US SS
US US US BS US US US US US US
US US US US US US US US US SS
US US US US US US US US US SS

That strategy is in place precisely to mitigate the problem that you’ve described, and it has actually been in place since drives were using PATA :) So if your drive gets noisier over time it’s either a broken bearing or a demagnetised arm magnet causing it not to aim properly - so the drive has to readjust its position multiple times before hitting the right track.
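The practical difference is really just remap distance; a toy comparison in
Python (sector counts and the "one spare per 10 sectors" layout are made up,
mirroring the diagram above):

SPARE_GAP = 10             # one spare per 10 sectors, as in the diagram above (illustrative)
USER_AREA = 1_000_000      # sectors in the user-visible area (illustrative)

def interleaved_spare(bad_sector):
    # in this toy layout every 10th sector is a spare, so the nearest one
    # is at most SPARE_GAP sectors away
    return (bad_sector // SPARE_GAP) * SPARE_GAP + (SPARE_GAP - 1)

def end_of_disk_spare(bad_sector):
    return USER_AREA       # first spare sector if all spares were parked at the end

bad = 123_456
print("interleaved layout :", abs(interleaved_spare(bad) - bad), "sectors away")
print("spares at the end  :", abs(end_of_disk_spare(bad) - bad), "sectors away")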

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-15 13:09           ` Btrfs/SSD Tomasz Kusmierz
@ 2017-05-15 19:12             ` Kai Krakow
  2017-05-16  4:48               ` Btrfs/SSD Duncan
  0 siblings, 1 reply; 49+ messages in thread
From: Kai Krakow @ 2017-05-15 19:12 UTC (permalink / raw)
  To: linux-btrfs

Am Mon, 15 May 2017 14:09:20 +0100
schrieb Tomasz Kusmierz <tom.kusmierz@gmail.com>:

> > Traditional hard drives usually do this too these days (they've
> > been under-provisioned since before SSD's existed), which is part
> > of why older disks tend to be noisier and slower (the reserved
> > space is usually at the far inside or outside of the platter, so
> > using sectors from there to replace stuff leads to long seeks).  
> 
> Not true. When HDD uses 10% (10% is just for easy example) of space
> as spare than aligment on disk is (US - used sector, SS - spare
> sector, BS - bad sector)
> 
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> 
> if failure occurs - drive actually shifts sectors up:
> 
> US US US US US US US US US SS
> US US US BS BS BS US US US US
> US US US US US US US US US US
> US US US US US US US US US US
> US US US US US US US US US SS
> US US US BS US US US US US US
> US US US US US US US US US SS
> US US US US US US US US US SS

This makes sense... Reserve area somehow implies it is continuous and
as such located at one far end of the platter. But your image totally
makes sense.


> that strategy is in place to actually mitigate the problem that
> you’ve described, actually it was in place since drives were using
> PATA :) so if your drive get’s nosier over time it’s either a broken
> bearing or demagnetised arm magnet causing it to not aim propperly -
> so drive have to readjust position multiple times before hitting a
> right track

I can confirm that such drives usually do not get noisier unless there's
something broken other than just a few sectors. And a faulty bearing in
notebook drives is the most common scenario I see. I always recommend
replacing such drives early because they will usually fail completely.
Such notebooks are good candidates for SSD replacements btw. ;-)

The demagnetised arm magnet is an interesting error scenario - didn't
think of it. Thanks for the pointer.

But still, there's one noise you can easily identify as bad sectors:
when the drive starts clicking for 30 or more seconds while trying to
read data, and usually also freezes the OS during that time. Such
drives can be "repaired" by rewriting the offending sectors (because they
will be moved to the reserve area then). But I guess it's best to have
already replaced such a drive by that time.

Earlier, back in PATA times, I often had hard disks exposing seemingly bad
sectors when power was cut while the drive was writing data. I usually
used dd to rewrite such sectors and the drive was as good as new again -
except that I maybe lost some file data. Luckily, modern drives don't show
such behavior. And SSDs have also learned to handle this...


-- 
Regards,
Kai

Replies to list-only preferred.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-15 11:46       ` Btrfs/SSD Austin S. Hemmelgarn
@ 2017-05-15 19:22         ` Kai Krakow
  0 siblings, 0 replies; 49+ messages in thread
From: Kai Krakow @ 2017-05-15 19:22 UTC (permalink / raw)
  To: linux-btrfs

Am Mon, 15 May 2017 07:46:01 -0400
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> On 2017-05-12 14:27, Kai Krakow wrote:
> > Am Tue, 18 Apr 2017 15:02:42 +0200
> > schrieb Imran Geriskovan <imran.geriskovan@gmail.com>:
> >  
> >> On 4/17/17, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote:  
>  [...]  
> >>
> >> I'm trying to have a proper understanding of what "fragmentation"
> >> really means for an ssd and interrelation with wear-leveling.
> >>
> >> Before continuing lets remember:
> >> Pages cannot be erased individually, only whole blocks can be
> >> erased. The size of a NAND-flash page size can vary, and most
> >> drive have pages of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have
> >> blocks of 128 or 256 pages, which means that the size of a block
> >> can vary between 256 KB and 4 MB.
> >> codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/
> >>
> >> Lets continue:
> >> Since block sizes are between 256k-4MB, data smaller than this will
> >> "probably" will not be fragmented in a reasonably empty and trimmed
> >> drive. And for a brand new ssd we may speak of contiguous series
> >> of blocks.
> >>
> >> However, as drive is used more and more and as wear leveling
> >> kicking in (ie. blocks are remapped) the meaning of "contiguous
> >> blocks" will erode. So any file bigger than a block size will be
> >> written to blocks physically apart no matter what their block
> >> addresses says. But my guess is that accessing device blocks
> >> -contiguous or not- are constant time operations. So it would not
> >> contribute performance issues. Right? Comments?
> >>
> >> So your the feeling about fragmentation/performance is probably
> >> related with if the file is spread into less or more blocks. If #
> >> of blocks used is higher than necessary (ie. no empty blocks can be
> >> found. Instead lots of partially empty blocks have to be used
> >> increasing the total # of blocks involved) then we will notice
> >> performance loss.
> >>
> >> Additionally if the filesystem will gonna try something to reduce
> >> the fragmentation for the blocks, it should precisely know where
> >> those blocks are located. Then how about ssd block informations?
> >> Are they available and do filesystems use it?
> >>
> >> Anyway if you can provide some more details about your experiences
> >> on this we can probably have better view on the issue.  
> >
> > What you really want for SSD is not defragmented files but
> > defragmented free space. That increases life time.
> >
> > So, defragmentation on SSD makes sense if it cares more about free
> > space but not file data itself.
> >
> > But of course, over time, fragmentation of file data (be it meta
> > data or content data) may introduce overhead - and in btrfs it
> > probably really makes a difference if I scan through some of the
> > past posts.
> >
> > I don't think it is important for the file system to know where the
> > SSD FTL located a data block. It's just important to keep
> > everything nicely aligned with erase block sizes, reduce rewrite
> > patterns, and free up complete erase blocks as good as possible.
> >
> > Maybe such a process should be called "compaction" and not
> > "defragmentation". In the end, the more continuous blocks of free
> > space there are, the better the chance for proper wear leveling.  
> 
> There is one other thing to consider though.  From a practical 
> perspective, performance on an SSD is a function of the number of 
> requests and what else is happening in the background.  The second 
> aspect isn't easy to eliminate on most systems, but the first is
> pretty easy to mitigate by defragmenting data.
> 
> Reiterating the example I made elsewhere in the thread:
> Assume you have an SSD and storage controller that can use DMA to 
> transfer up to 16MB of data off of the disk in a single operation.
> If you need to load a 16MB file off of this disk and it's properly
> aligned (it usually will be with most modern filesystems if the
> partition is properly aligned) and defragmented, it will take exactly
> one operation (assuming that doesn't get interrupted).  By contrast,
> if you have 16 fragments of 1MB each, that will take at minimum 2
> operations, and more likely 15-16 (depends on where everything is
> on-disk, and how smart the driver is about minimizing the number of
> required operations).  Each request has some amount of overhead to
> set up and complete, so the first case (one single extent) will take
> less total time to transfer the data than the second one.
> 
> This particular effect actually impacts almost any data transfer, not 
> just pulling data off of an SSD (this is why jumbo frames are
> important for high-performance networking, and why a higher latency
> timer on the PCI bus will improve performance (but conversely
> increase latency)), even when fetching data from a traditional hard
> drive (but it's not very noticeable there unless your fragments are
> tightly grouped, because seek latency dominates performance).

I know all this, but many people will be offended by it and insist that
SSDs don't need defragmentation, and that it's even harmful, like "when
you do this, your drive will die tomorrow!" Or at least they will try to
tell you "there's no seek overhead, so fragmentation doesn't matter". And
for most desktop workloads this is probably true. But if your workload
depends on IOPS, it may well not be.

But I believe: if done right, defragmentation will improve lifetime and
performance. And one important factor is keeping free space contiguous
(best achieved not by rewriting data but by encouraging big free-space
blocks in the first place). Most filesystems are already very good at
keeping file fragmentation low. Apparently, btrfs doesn't belong to this
category... at least with some (typical) workloads. And autodefrag adds
"expensive" writes to the SSD. But I'm using it nevertheless; overall
long-term performance is better that way for me.


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-15 12:03         ` Btrfs/SSD Austin S. Hemmelgarn
  2017-05-15 13:09           ` Btrfs/SSD Tomasz Kusmierz
@ 2017-05-15 19:49           ` Kai Krakow
  2017-05-15 20:05             ` Btrfs/SSD Tomasz Torcz
  2017-05-16 11:43             ` Btrfs/SSD Austin S. Hemmelgarn
  1 sibling, 2 replies; 49+ messages in thread
From: Kai Krakow @ 2017-05-15 19:49 UTC (permalink / raw)
  To: linux-btrfs

Am Mon, 15 May 2017 08:03:48 -0400
schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:

> > That's why I don't trust any of my data to them. But I still want
> > the benefit of their speed. So I use SSDs mostly as frontend caches
> > to HDDs. This gives me big storage with fast access. Indeed, I'm
> > using bcache successfully for this. A warm cache is almost as fast
> > as native SSD (at least it feels almost that fast, it will be
> > slower if you threw benchmarks at it).  
> That's to be expected though, most benchmarks don't replicate actual 
> usage patterns for client systems, and using SSD's for caching with 
> bcache or dm-cache for most server workloads except a file server
> will usually get you a performance hit.

You mean "performance boost"? Almost every read-most server workload
should benefit... I file server may be the exact opposite...

Also, I think dm-cache and bcache work very differently and are not
directly comparable. Their benefit depends much on the applied workload.

If I remember right, dm-cache is more about keeping "hot data" in the
flash storage while bcache is more about reducing seeking. So dm-cache
optimizes for bigger throughput of SSDs while bcache optimizes for
almost-zero seek overhead of SSDs. Depending on your underlying
storage, one or the other may even give zero benefit or worsen
performance. Which is what I'd call a "performance hit"... I didn't
ever try dm-cache, tho. For reasons I don't remember exactly, I didn't
like something about how it's implemented, I think it was related to
crash recovery. I don't know if that still holds true with modern
kernels. It may have changed but I never looked back to revise that
decision.


> It's worth noting also that on average, COW filesystems like BTRFS
> (or log-structured-filesystems will not benefit as much as
> traditional filesystems from SSD caching unless the caching is built
> into the filesystem itself, since they don't do in-place rewrites (so
> any new write by definition has to drop other data from the cache).

Yes, I considered that, too. And when I tried, there was almost no
perceivable performance difference between bcache-writearound and
bcache-writeback. But the latency until performance improved was much
longer in writearound mode, so I stuck with writeback mode. Also,
writing random data is faster because bcache will defer it to the
background and do the writeback in sector order. Sequential access
bypasses bcache anyway; harddisks are already good at that.

But of course, the COW nature of btrfs will lower the hit rate I can
get on writes. That's why I see no benefit in using bcache-writethrough
with btrfs.


-- 
Regards,
Kai

Replies to list-only preferred.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-15 19:49           ` Btrfs/SSD Kai Krakow
@ 2017-05-15 20:05             ` Tomasz Torcz
  2017-05-16  1:58               ` Btrfs/SSD Kai Krakow
  2017-05-16 11:43             ` Btrfs/SSD Austin S. Hemmelgarn
  1 sibling, 1 reply; 49+ messages in thread
From: Tomasz Torcz @ 2017-05-15 20:05 UTC (permalink / raw)
  To: linux-btrfs

On Mon, May 15, 2017 at 09:49:38PM +0200, Kai Krakow wrote:
> 
> > It's worth noting also that on average, COW filesystems like BTRFS
> > (or log-structured-filesystems will not benefit as much as
> > traditional filesystems from SSD caching unless the caching is built
> > into the filesystem itself, since they don't do in-place rewrites (so
> > any new write by definition has to drop other data from the cache).
> 
> Yes, I considered that, too. And when I tried, there was almost no
> perceivable performance difference between bcache-writearound and
> bcache-writeback. But the latency of performance improvement was much
> longer in writearound mode, so I sticked to writeback mode. Also,
> writing random data is faster because bcache will defer it to
> background and do writeback in sector order. Sequential access is
> passed around bcache anyway, harddisks are already good at that.

  Let me add my 2 cents.  bcache-writearound does not cache writes
on SSD, so there are fewer writes overall to flash.  It is said
to prolong the life of the flash drive.
  I've recently switched from bcache-writeback to bcache-writearound,
because my SSD caching drive is at the edge of its lifetime. I'm
using bcache in the following configuration: https://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg
My SSD is Samsung SSD 850 EVO 120GB, which I bought exactly 2 years ago.

  Now, according to http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
the warranty for the 120GB and 250GB models only covers 75 TBW (terabytes written).
My drive has  # smartctl -a /dev/sda  | grep LBA
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       136025596053

which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well…
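(For anyone following along, the arithmetic spelled out; the 512-byte unit is
how this drive reports attribute 241, other models may differ:)

total_lbas_written = 136025596053         # raw value of attribute 241 above
bytes_written = total_lbas_written * 512  # 512-byte units on this drive
print(round(bytes_written / 1000**4, 1), "TB")   # -> 69.6 decimal terabytes, as in the datasheet
print(round(bytes_written / 1024**4, 1), "TiB")  # -> 63.3 TiB for comparison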

[35354.697513] sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[35354.697516] sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current] 
[35354.697518] sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
[35354.697522] sd 0:0:0:0: [sda] tag#19 CDB: Read(10) 28 00 0c 30 82 9f 00 00 48 00
[35354.697524] blk_update_request: I/O error, dev sda, sector 204505785

The above started appearing recently.  So I was really surprised that:
- this drive is only rated for 75 TBW
- I went through this limit in only 2 years

  The workload is a lightly utilised home server / media center.

-- 
Tomasz Torcz                Only gods can safely risk perfection,
xmpp: zdzichubg@chrome.pl     it's a dangerous thing for a man.  -- Alia


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-15 20:05             ` Btrfs/SSD Tomasz Torcz
@ 2017-05-16  1:58               ` Kai Krakow
  2017-05-16 12:21                 ` Btrfs/SSD Tomasz Torcz
  0 siblings, 1 reply; 49+ messages in thread
From: Kai Krakow @ 2017-05-16  1:58 UTC (permalink / raw)
  To: linux-btrfs

Am Mon, 15 May 2017 22:05:05 +0200
schrieb Tomasz Torcz <tomek@pipebreaker.pl>:

> On Mon, May 15, 2017 at 09:49:38PM +0200, Kai Krakow wrote:
> >   
> > > It's worth noting also that on average, COW filesystems like BTRFS
> > > (or log-structured-filesystems will not benefit as much as
> > > traditional filesystems from SSD caching unless the caching is
> > > built into the filesystem itself, since they don't do in-place
> > > rewrites (so any new write by definition has to drop other data
> > > from the cache).  
> > 
> > Yes, I considered that, too. And when I tried, there was almost no
> > perceivable performance difference between bcache-writearound and
> > bcache-writeback. But the latency of performance improvement was
> > much longer in writearound mode, so I sticked to writeback mode.
> > Also, writing random data is faster because bcache will defer it to
> > background and do writeback in sector order. Sequential access is
> > passed around bcache anyway, harddisks are already good at that.  
> 
>   Let me add my 2 cents.  bcache-writearound does not cache writes
> on SSD, so there are less writes overall to flash.  It is said
> to prolong the life of the flash drive.
>   I've recently switched from bcache-writeback to bcache-writearound,
> because my SSD caching drive is at the edge of it's lifetime. I'm
> using bcache in following configuration:
> http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg My SSD
> is Samsung SSD 850 EVO 120GB, which I bought exactly 2 years ago.
> 
>   Now, according to
> http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
> 120GB and 250GB warranty only covers 75 TBW (terabytes written).

According to your chart, all your data is written twice to bcache. It
may have been better to buy two drives, one per mirror. I don't think
that SSD firmwares do deduplication - so data is really written twice.

They may do compression, but that would be per-block compression rather
than streaming compression, so it won't help here as a deduplicator. Also,
due to the internal structure, compression would probably work similarly to
how zswap works: by combining compressed blocks into "buddy blocks", so
only compression better than 2:1 will merge compressed blocks into single
blocks. For most of your data, this won't be the case. So effectively, it
has no overall effect. For this reason, I doubt that any firmware bothers
with compression; the benefit is just too low versus the management
overhead and complexity it adds to the already complicated FTL layer.
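The 2:1 threshold is easy to see with a tiny packing sketch (the block size
and the first-fit "buddy" packing are assumptions, just to show why typical
compression ratios don't help):

BLOCK = 4096                              # assumed physical block size

def blocks_needed(compressed_sizes):
    blocks, leftover = 0, 0
    for size in sorted(compressed_sizes):
        if size <= leftover:              # fits into the unused half of a "buddy" block
            leftover -= size
        else:
            blocks += 1
            leftover = BLOCK - size
    return blocks

print(blocks_needed([BLOCK // 2] * 4))        # 2:1 compression -> 2 blocks for 4 chunks
print(blocks_needed([int(BLOCK / 1.5)] * 4))  # ~1.5:1 compression -> still 4 blocks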


> My
> drive has  # smartctl -a /dev/sda  | grep LBA 241
> Total_LBAs_Written      0x0032   099   099   000    Old_age
> Always       -       136025596053

Doesn't say this "99%" remaining? The threshold is far from being
reached...

I'm curious, what is Wear_Leveling_Count reporting?

> which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well…
> 
> [35354.697513] sd 0:0:0:0: [sda] tag#19 FAILED Result:
> hostbyte=DID_OK driverbyte=DRIVER_SENSE [35354.697516] sd 0:0:0:0:
> [sda] tag#19 Sense Key : Medium Error [current] [35354.697518] sd
> 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto
> reallocate failed [35354.697522] sd 0:0:0:0: [sda] tag#19 CDB:
> Read(10) 28 00 0c 30 82 9f 00 00 48 00 [35354.697524]
> blk_update_request: I/O error, dev sda, sector 204505785
> 
> Above started appearing recently.  So, I was really suprised that:
> - this drive is only rated for 120 TBW
> - I went through this limit in only 2 years
> 
>   The workload is lightly utilised home server / media center.

I think bcache is a real SSD killer for drives around 120GB or
below... I saw similar life usage with my previous small SSD after just
one year. But I never had a sense error because I took it out of
service early. And I switched to writearound, too.

I think the write pattern of bcache cannot be handled well by the FTL.
It behaves like a log-structured file system, with new writes only
appended, and garbage collection occasionally done by freeing
complete erase blocks. Maybe it could work better if btrfs could pass
information about freed blocks down to bcache. Btrfs has a lot of these
due to its COW nature.

I wonder if this is already supported when turning on discard in
btrfs? Does anyone know?


-- 
Regards,
Kai

Replies to list-only preferred.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-15 19:12             ` Btrfs/SSD Kai Krakow
@ 2017-05-16  4:48               ` Duncan
  0 siblings, 0 replies; 49+ messages in thread
From: Duncan @ 2017-05-16  4:48 UTC (permalink / raw)
  To: linux-btrfs

Kai Krakow posted on Mon, 15 May 2017 21:12:06 +0200 as excerpted:

> Am Mon, 15 May 2017 14:09:20 +0100
> schrieb Tomasz Kusmierz <tom.kusmierz@gmail.com>:
>> 
>> Not true. When HDD uses 10% (10% is just for easy example) of space
>> as spare than aligment on disk is (US - used sector, SS - spare
>> sector, BS - bad sector)
>> 
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> US US US US US US US US US SS
>> 
>> if failure occurs - drive actually shifts sectors up:
>> 
>> US US US US US US US US US SS
>> US US US BS BS BS US US US US
>> US US US US US US US US US US
>> US US US US US US US US US US
>> US US US US US US US US US SS
>> US US US BS US US US US US US
>> US US US US US US US US US SS
>> US US US US US US US US US SS
> 
> This makes sense... Reserve area somehow implies it is continuous and
> as such located at one far end of the platter. But your image totally
> makes sense.

Thanks Tomasz.  It makes a lot of sense indeed, and had I thought about
it I think I already "knew" it, but I simply hadn't stopped to think about
it that hard, so you disabused me of the vague idea of spares all at one
end of the disk, too. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-15 19:49           ` Btrfs/SSD Kai Krakow
  2017-05-15 20:05             ` Btrfs/SSD Tomasz Torcz
@ 2017-05-16 11:43             ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 49+ messages in thread
From: Austin S. Hemmelgarn @ 2017-05-16 11:43 UTC (permalink / raw)
  To: linux-btrfs

On 2017-05-15 15:49, Kai Krakow wrote:
> Am Mon, 15 May 2017 08:03:48 -0400
> schrieb "Austin S. Hemmelgarn" <ahferroin7@gmail.com>:
>
>>> That's why I don't trust any of my data to them. But I still want
>>> the benefit of their speed. So I use SSDs mostly as frontend caches
>>> to HDDs. This gives me big storage with fast access. Indeed, I'm
>>> using bcache successfully for this. A warm cache is almost as fast
>>> as native SSD (at least it feels almost that fast, it will be
>>> slower if you threw benchmarks at it).
>> That's to be expected though, most benchmarks don't replicate actual
>> usage patterns for client systems, and using SSD's for caching with
>> bcache or dm-cache for most server workloads except a file server
>> will usually get you a performance hit.
>
> You mean "performance boost"? Almost every read-most server workload
> should benefit... I file server may be the exact opposite...
In my experience, short of some types of file server and non-interactive 
websites, read-mostly server workloads are rare.
>
> Also, I think dm-cache and bcache work very differently and are not
> directly comparable. Their benefit depends much on the applied workload.
The low-level framework is different, and much of the internals are 
different, but based on most of the testing I've done, running them in 
the same mode (write-back/write-through/etc) will on average get you 
roughly the same performance.
>
> If I remember right, dm-cache is more about keeping "hot data" in the
> flash storage while bcache is more about reducing seeking. So dm-cache
> optimizes for bigger throughput of SSDs while bcache optimizes for
> almost-zero seek overhead of SSDs. Depending on your underlying
> storage, one or the other may even give zero benefit or worsen
> performance. Which is what I'd call a "performance hit"... I didn't
> ever try dm-cache, tho. For reasons I don't remember exactly, I didn't
> like something about how it's implemented, I think it was related to
> crash recovery. I don't know if that still holds true with modern
> kernels. It may have changed but I never looked back to revise that
> decision.
dm-cache is a bit easier to convert to or from in-place and is in my 
experience a bit more flexible in data handling, but has the issue that 
you can still see the FS on the back-end storage (because it has no 
superblock or anything like that on the back-end storage), which means 
it's almost useless with BTRFS, and it requires a separate cache device 
for each back-end device (as well as an independent metadata device, but 
that's usually tiny since it's largely just used as a bitmap to track 
what blocks are clean in-cache).

bcache is more complicated to set up initially, and _requires_ a kernel 
with bcache support to access even if you aren't doing any caching, but 
it masks the back-end (so it's safe to use with BTRFS (recent versions 
of it are at least)), and it doesn't require a 1:1 mapping of cache 
devices to back-end storage.
>
>
>> It's worth noting also that on average, COW filesystems like BTRFS
>> (or log-structured-filesystems will not benefit as much as
>> traditional filesystems from SSD caching unless the caching is built
>> into the filesystem itself, since they don't do in-place rewrites (so
>> any new write by definition has to drop other data from the cache).
>
> Yes, I considered that, too. And when I tried, there was almost no
> perceivable performance difference between bcache-writearound and
> bcache-writeback. But the latency of performance improvement was much
> longer in writearound mode, so I sticked to writeback mode. Also,
> writing random data is faster because bcache will defer it to
> background and do writeback in sector order. Sequential access is
> passed around bcache anyway, harddisks are already good at that.
>
> But of course, the COW nature of btrfs will lower the hit rate I can
> on writes. That's why I see no benefit in using bcache-writethrough
> with btrfs.
Yeah, on average based on my own testing, write-through mode is 
worthless for COW filesystems, and write-back is only worthwhile if you 
have a large enough cache proportionate to your bandwidth requirements 
(4G should be more than enough for a desktop or workstation, but servers 
may need huge amounts of space), while write-around is only worthwhile 
for stuff that needs read performance but doesn't really care about latency.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-16  1:58               ` Btrfs/SSD Kai Krakow
@ 2017-05-16 12:21                 ` Tomasz Torcz
  2017-05-16 12:35                   ` Btrfs/SSD Austin S. Hemmelgarn
  2017-05-16 17:08                   ` Btrfs/SSD Kai Krakow
  0 siblings, 2 replies; 49+ messages in thread
From: Tomasz Torcz @ 2017-05-16 12:21 UTC (permalink / raw)
  To: linux-btrfs

On Tue, May 16, 2017 at 03:58:41AM +0200, Kai Krakow wrote:
> Am Mon, 15 May 2017 22:05:05 +0200
> schrieb Tomasz Torcz <tomek@pipebreaker.pl>:
> 
> > > Yes, I considered that, too. And when I tried, there was almost no
> > > perceivable performance difference between bcache-writearound and
> > > bcache-writeback. But the latency of performance improvement was
> > > much longer in writearound mode, so I sticked to writeback mode.
> > > Also, writing random data is faster because bcache will defer it to
> > > background and do writeback in sector order. Sequential access is
> > > passed around bcache anyway, harddisks are already good at that.  
> > 
> >   Let me add my 2 cents.  bcache-writearound does not cache writes
> > on SSD, so there are less writes overall to flash.  It is said
> > to prolong the life of the flash drive.
> >   I've recently switched from bcache-writeback to bcache-writearound,
> > because my SSD caching drive is at the edge of it's lifetime. I'm
> > using bcache in following configuration:
> > http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg My SSD
> > is Samsung SSD 850 EVO 120GB, which I bought exactly 2 years ago.
> > 
> >   Now, according to
> > http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
> > 120GB and 250GB warranty only covers 75 TBW (terabytes written).
> 
> According to your chart, all your data is written twice to bcache. It
> may have been better to buy two drives, one per mirror. I don't think
> that SSD firmwares do deduplication - so data is really written twice.

  I'm aware of that, but 50 GB (I've got a 100GB caching partition)
is still plenty to cache my ~, some media files, and two small VMs.
On the other hand, I don't want to overspend. This is just a home
server.
  NB: I'm still waiting for btrfs native SSD caching, which was
planned for the 3.6 kernel 5 years ago :)
( https://oss.oracle.com/~mason/presentation/btrfs-jls-12/btrfs.html#/planned-3.6 )

> 
> 
> > My
> > drive has  # smartctl -a /dev/sda  | grep LBA 241
> > Total_LBAs_Written      0x0032   099   099   000    Old_age
> > Always       -       136025596053
> 
> Doesn't say this "99%" remaining? The threshold is far from being
> reached...
> 
> I'm curious, what is Wear_Leveling_Count reporting?

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       18227
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       29
177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       4916

 Does this 001 mean 1%? If so, SMART contradicts the datasheets. And I
don't think I should see read errors at 1% wear.
 

> > which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well…
> > 

-- 
Tomasz Torcz       ,,(...) today's high-end is tomorrow's embedded processor.''
xmpp: zdzichubg@chrome.pl                      -- Mitchell Blank on LKML


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-16 12:21                 ` Btrfs/SSD Tomasz Torcz
@ 2017-05-16 12:35                   ` Austin S. Hemmelgarn
  2017-05-16 17:08                   ` Btrfs/SSD Kai Krakow
  1 sibling, 0 replies; 49+ messages in thread
From: Austin S. Hemmelgarn @ 2017-05-16 12:35 UTC (permalink / raw)
  To: Tomasz Torcz, linux-btrfs

On 2017-05-16 08:21, Tomasz Torcz wrote:
> On Tue, May 16, 2017 at 03:58:41AM +0200, Kai Krakow wrote:
>> Am Mon, 15 May 2017 22:05:05 +0200
>> schrieb Tomasz Torcz <tomek@pipebreaker.pl>:
>>> My
>>> drive has  # smartctl -a /dev/sda  | grep LBA 241
>>> Total_LBAs_Written      0x0032   099   099   000    Old_age
>>> Always       -       136025596053
>>
>> Doesn't say this "99%" remaining? The threshold is far from being
>> reached...
>>
>> I'm curious, what is Wear_Leveling_Count reporting?
>
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       18227
>  12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       29
> 177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       4916
>
>  Is this 001 mean 1%? If so, SMART contradicts datasheets. And I
> don't think I shoud see read errors for 1% wear.
The 'normalized' values shown in the VALUE, WORST, and THRESH columns 
usually count down to zero (with the notable exception of the thermal 
attributes, which usually match the raw value).  They exist as a way of 
comparing things without having to know what vendor or model the device 
is, as the raw values are (again with limited exceptions) technically 
vendor specific (the various *_Error_Rate counters on traditional HDD's 
are good examples of this).  VALUE is your current value, WORST is a 
peak-detector type thing that monitors the worst it's been, and THRESH 
is the point at which the device manufacturer considers that aspect 
failed (which will usually result in the 'Overall Health Assessment' 
failing as well), though I'm pretty sure that if THRESH is 000, that 
means that the firmware doesn't base its assessment for that attribute 
on the normalized value at all.
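In code form, that convention roughly amounts to the following sketch (an
illustration of the usual interpretation, not of what any particular
firmware actually does):

def smart_status(value, worst, thresh):
    if thresh == 0:
        return "informational only (no pass/fail threshold)"
    if value <= thresh:
        return "FAILING now"
    if worst <= thresh:
        return "failed at some point in the past"
    return f"ok, {value - thresh} normalized points above the threshold"

# the attributes from the smartctl output earlier in the thread
print("Power_On_Hours     :", smart_status(value=96, worst=96, thresh=0))
print("Wear_Leveling_Count:", smart_status(value=1,  worst=1,  thresh=0))
print("made-up attribute  :", smart_status(value=10, worst=12, thresh=20))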

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Btrfs/SSD
  2017-05-16 12:21                 ` Btrfs/SSD Tomasz Torcz
  2017-05-16 12:35                   ` Btrfs/SSD Austin S. Hemmelgarn
@ 2017-05-16 17:08                   ` Kai Krakow
  1 sibling, 0 replies; 49+ messages in thread
From: Kai Krakow @ 2017-05-16 17:08 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 16 May 2017 14:21:20 +0200,
Tomasz Torcz <tomek@pipebreaker.pl> wrote:

> On Tue, May 16, 2017 at 03:58:41AM +0200, Kai Krakow wrote:
> > On Mon, 15 May 2017 22:05:05 +0200,
> > Tomasz Torcz <tomek@pipebreaker.pl> wrote:
> >   
>  [...]  
> > > 
> > >   Let me add my 2 cents.  bcache-writearound does not cache writes
> > > on the SSD, so there are fewer writes overall to the flash.  It is
> > > said to prolong the life of the flash drive.
> > >   I've recently switched from bcache-writeback to
> > > bcache-writearound, because my SSD caching drive is at the edge
> > > of its lifetime. I'm using bcache in the following configuration:
> > > http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg My
> > > SSD is a Samsung SSD 850 EVO 120GB, which I bought exactly 2 years
> > > ago.
> > > 
> > >   Now, according to
> > > http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
> > > the warranty for the 120GB and 250GB models only covers 75 TBW
> > > (terabytes written).
> > 
> > According to your chart, all your data is written to bcache twice.
> > It may have been better to buy two drives, one per mirror. I don't
> > think SSD firmware does deduplication - so the data really is
> > written twice.
> 
>   I'm aware of that, but 50 GB (I've got a 100 GB caching partition)
> is still plenty to cache my ~, some media files, and two small VMs.
> On the other hand, I don't want to overspend. This is just a home
> server.
>   Nb. I'm still waiting for btrfs native SSD caching, which was
> planned for the 3.6 kernel 5 years ago :)
> ( https://oss.oracle.com/~mason/presentation/btrfs-jls-12/btrfs.html#/planned-3.6
> )
> 
> > 
> >   
> > > My drive has
> > > # smartctl -a /dev/sda  | grep LBA
> > > 241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       136025596053
> > 
> > Doesn't this say "99%" remaining? The threshold is far from being
> > reached...
> > 
> > I'm curious, what is Wear_Leveling_Count reporting?  
> 
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       18227
>  12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       29
> 177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       4916
> 
>  Does this 001 mean 1%? If so, SMART contradicts the datasheets. And I
> don't think I should see read errors at 1% wear.

It more likely means 1% left, i.e. 99% wear... Most of these are
counters that run from 100 down to zero, with THRESH being the
threshold at or below which the attribute is considered failed or
failing.

Only a few values work the other way around (like temperature).

Be careful when interpreting raw values: they can be very
manufacturer-specific and are not normalized.

According to Total_LBAs_Written, the manufacturer thinks the drive
could still take about 100x more writes (only 1% used). But your wear
level is at almost 100% (value = 001). I think that value isn't really
designed around flash cell lifetime, but around intermediate components
like caches.

So you need to read most values "backwards": they are not "used"
counters, but "what's left" counters.

What does it tell you about reserved blocks usage? Note that there's a
sort of double negation there: a value of 100 means 100% unused, i.e.
0% used... ;-) Or just put a "minus" in front of those values and think
of them counting up towards zero: on a time axis the drive starts at
-100% of its total lifetime, and 0 is the fail point (or whatever
THRESH says).
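
As a quick back-of-the-envelope check of the numbers quoted above
(just a sketch, assuming 512-byte LBAs and the 75 TBW warranty figure
from the datasheet; the drive's real endurance accounting is certainly
more involved than this):

#!/usr/bin/env python3
# Sanity-check the figures quoted earlier in the thread.
lbas_written = 136025596053              # raw Total_LBAs_Written
tbw_warranty = 75e12                     # 75 TB written, per datasheet

bytes_written = lbas_written * 512       # assumes 512-byte LBAs
print(f"host writes: {bytes_written / 1e12:.1f} TB"
      f" ({100 * bytes_written / tbw_warranty:.0f}% of the warranted TBW)")

# Normalized SMART values count down, so "percent used" is 100 - VALUE:
for name, value in [("Total_LBAs_Written", 99), ("Wear_Leveling_Count", 1)]:
    print(f"{name}: normalized {value:03d} -> roughly {100 - value}% used")

That works out to about 69.6 TB of host writes, i.e. roughly 93% of the
warranted 75 TBW - which lines up with Wear_Leveling_Count sitting at
001, not with Total_LBAs_Written's normalized 099.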


-- 
Regards,
Kai

Replies to list-only preferred.



^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2017-05-16 17:08 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-14 11:02 Btrfs/SSD Imran Geriskovan
2017-04-17 11:53 ` Btrfs/SSD Austin S. Hemmelgarn
2017-04-17 16:58   ` Btrfs/SSD Chris Murphy
2017-04-17 17:13     ` Btrfs/SSD Austin S. Hemmelgarn
2017-04-17 18:24       ` Btrfs/SSD Roman Mamedov
2017-04-17 19:22         ` Btrfs/SSD Imran Geriskovan
2017-04-17 22:55           ` Btrfs/SSD Hans van Kranenburg
2017-04-19 18:10             ` Btrfs/SSD Chris Murphy
2017-04-18 12:26           ` Btrfs/SSD Austin S. Hemmelgarn
2017-04-18  3:23         ` Btrfs/SSD Duncan
2017-04-18  4:58           ` Btrfs/SSD Roman Mamedov
2017-04-17 18:34       ` Btrfs/SSD Chris Murphy
2017-04-17 19:26         ` Btrfs/SSD Austin S. Hemmelgarn
2017-04-17 19:39           ` Btrfs/SSD Chris Murphy
2017-04-18 11:31             ` Btrfs/SSD Austin S. Hemmelgarn
2017-04-18 12:20               ` Btrfs/SSD Hugo Mills
2017-04-18 13:02   ` Btrfs/SSD Imran Geriskovan
2017-04-18 13:39     ` Btrfs/SSD Austin S. Hemmelgarn
2017-05-12 18:27     ` Btrfs/SSD Kai Krakow
2017-05-12 20:31       ` Btrfs/SSD Imran Geriskovan
2017-05-13  9:39       ` Btrfs/SSD Duncan
2017-05-13 11:15         ` Btrfs/SSD Janos Toth F.
2017-05-13 11:34         ` [OT] SSD performance patterns (was: Btrfs/SSD) Kai Krakow
2017-05-14 16:21         ` Btrfs/SSD Chris Murphy
2017-05-14 18:01           ` Btrfs/SSD Tomasz Kusmierz
2017-05-14 20:47             ` Btrfs/SSD (my -o ssd "summary") Hans van Kranenburg
2017-05-14 23:01             ` Btrfs/SSD Imran Geriskovan
2017-05-15  0:23               ` Btrfs/SSD Tomasz Kusmierz
2017-05-15  0:24               ` Btrfs/SSD Tomasz Kusmierz
2017-05-15 11:25                 ` Btrfs/SSD Imran Geriskovan
2017-05-15 11:46       ` Btrfs/SSD Austin S. Hemmelgarn
2017-05-15 19:22         ` Btrfs/SSD Kai Krakow
2017-05-12  4:51   ` Btrfs/SSD Duncan
2017-05-12 13:02     ` Btrfs/SSD Imran Geriskovan
2017-05-12 18:36       ` Btrfs/SSD Kai Krakow
2017-05-13  9:52         ` Btrfs/SSD Roman Mamedov
2017-05-13 10:47           ` Btrfs/SSD Kai Krakow
2017-05-15 12:03         ` Btrfs/SSD Austin S. Hemmelgarn
2017-05-15 13:09           ` Btrfs/SSD Tomasz Kusmierz
2017-05-15 19:12             ` Btrfs/SSD Kai Krakow
2017-05-16  4:48               ` Btrfs/SSD Duncan
2017-05-15 19:49           ` Btrfs/SSD Kai Krakow
2017-05-15 20:05             ` Btrfs/SSD Tomasz Torcz
2017-05-16  1:58               ` Btrfs/SSD Kai Krakow
2017-05-16 12:21                 ` Btrfs/SSD Tomasz Torcz
2017-05-16 12:35                   ` Btrfs/SSD Austin S. Hemmelgarn
2017-05-16 17:08                   ` Btrfs/SSD Kai Krakow
2017-05-16 11:43             ` Btrfs/SSD Austin S. Hemmelgarn
2017-05-14  8:46       ` Btrfs/SSD Duncan
