* [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
@ 2021-06-09 10:53 Ric Wheeler
  2021-06-09 18:05 ` Bart Van Assche
  2021-06-13 20:41 ` [LSF/MM/BPF TOPIC] SSDFS: LFS file system without GC operations + NAND flash devices lifetime prolongation Viacheslav Dubeyko
  0 siblings, 2 replies; 14+ messages in thread
From: Ric Wheeler @ 2021-06-09 10:53 UTC (permalink / raw)
  To: lsf-pc; +Cc: Linux FS Devel, linux-block


Consumer devices are pushed to use the highest capacity emmc class devices, but 
they have horrible write durability.

At the same time, we layer on top of these devices our normal stack - device 
mapper and ext4 or f2fs are common configurations today - which causes write 
amplification and can burn out storage even faster. I think it would be useful 
to discuss how we can minimize the write amplification when we need to run on 
these low end parts & see where the stack needs updating.

Great background paper which inspired me to spend time tormenting emmc parts is:

http://www.cs.unc.edu/~porter/pubs/hotos17-final29.pdf
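
To put rough numbers on the concern, here is a back-of-the-envelope lifetime
estimate; every value below (capacity, rated P/E cycles, daily host writes,
write amplification factors) is an assumption for illustration only, not a
measurement of any particular part:

# Back-of-the-envelope eMMC lifetime model; all inputs are assumptions.
capacity_gb = 16            # advertised capacity of a small consumer part
pe_cycles = 500             # assumed rated program/erase cycles for cheap TLC
raw_endurance_gb = capacity_gb * pe_cycles      # ~8,000 GB of NAND writes

host_gb_per_day = 16.0      # assumed application + filesystem traffic
for wa in (1.3, 3.0):       # write amplification added by the FS/FTL stack
    nand_gb_per_day = host_gb_per_day * wa
    years = raw_endurance_gb / nand_gb_per_day / 365
    print(f"WA {wa}x -> ~{years:.1f} years to rated wear-out")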




* Re: [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
  2021-06-09 10:53 [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!) Ric Wheeler
@ 2021-06-09 18:05 ` Bart Van Assche
  2021-06-09 18:30   ` Matthew Wilcox
  2021-06-13 20:41 ` [LSF/MM/BPF TOPIC] SSDFS: LFS file system without GC operations + NAND flash devices lifetime prolongation Viacheslav Dubeyko
  1 sibling, 1 reply; 14+ messages in thread
From: Bart Van Assche @ 2021-06-09 18:05 UTC (permalink / raw)
  To: Ric Wheeler, lsf-pc; +Cc: Linux FS Devel, linux-block

On 6/9/21 3:53 AM, Ric Wheeler wrote:
> Consumer devices are pushed to use the highest capacity emmc class
> devices, but they have horrible write durability.
> 
> At the same time, we layer on top of these devices our normal stack -
> device mapper and ext4 or f2fs are common configurations today - which
> causes write amplification and can burn out storage even faster. I think
> it would be useful to discuss how we can minimize the write
> amplification when we need to run on these low end parts & see where the
> stack needs updating.
> 
> Great background paper which inspired me to spend time tormenting emmc
> parts is:
> 
> http://www.cs.unc.edu/~porter/pubs/hotos17-final29.pdf

Without having read that paper, has zoned storage been considered? F2FS
already supports zoned block devices. I'm not aware of a better solution
to reduce write amplification for flash devices. Maybe I'm missing
something?

More information is available in this paper:
https://dl.acm.org/doi/pdf/10.1145/3458336.3465300.
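
As a quick aside, whether a given block device exposes the zoned model at all
is visible from sysfs on reasonably recent kernels; a minimal check could look
like this (the device name below is only a placeholder):

# Report whether a block device is zoned, per the queue/zoned sysfs attribute
# ("none", "host-aware" or "host-managed"). Device name is a placeholder.
from pathlib import Path

dev = "sdb"
q = Path(f"/sys/block/{dev}/queue")

model = (q / "zoned").read_text().strip()
print(f"{dev}: zoned model = {model}")
if model != "none":
    print("zones:", (q / "nr_zones").read_text().strip())
    print("zone size (512B sectors):", (q / "chunk_sectors").read_text().strip())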

Thanks,

Bart.


* Re: [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
  2021-06-09 18:05 ` Bart Van Assche
@ 2021-06-09 18:30   ` Matthew Wilcox
  2021-06-09 18:47     ` Bart Van Assche
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2021-06-09 18:30 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Ric Wheeler, lsf-pc, Linux FS Devel, linux-block

On Wed, Jun 09, 2021 at 11:05:22AM -0700, Bart Van Assche wrote:
> On 6/9/21 3:53 AM, Ric Wheeler wrote:
> > Consumer devices are pushed to use the highest capacity emmc class
> > devices, but they have horrible write durability.
> > 
> > At the same time, we layer on top of these devices our normal stack -
> > device mapper and ext4 or f2fs are common configurations today - which
> > causes write amplification and can burn out storage even faster. I think
> > it would be useful to discuss how we can minimize the write
> > amplification when we need to run on these low end parts & see where the
> > stack needs updating.
> > 
> > Great background paper which inspired me to spend time tormenting emmc
> > parts is:
> > 
> > http://www.cs.unc.edu/~porter/pubs/hotos17-final29.pdf
> 
> Without having read that paper, has zoned storage been considered? F2FS
> already supports zoned block devices. I'm not aware of a better solution
> to reduce write amplification for flash devices. Maybe I'm missing
> something?

maybe you should read the paper.

" Thiscomparison demonstrates that using F2FS, a flash-friendly file
sys-tem, does not mitigate the wear-out problem, except inasmuch asit
inadvertently rate limitsallI/O to the device"

> More information is available in this paper:
> https://dl.acm.org/doi/pdf/10.1145/3458336.3465300.
> 
> Thanks,
> 
> Bart.


* Re: [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
  2021-06-09 18:30   ` Matthew Wilcox
@ 2021-06-09 18:47     ` Bart Van Assche
  2021-06-10  0:16       ` Damien Le Moal
                         ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Bart Van Assche @ 2021-06-09 18:47 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Ric Wheeler, lsf-pc, Linux FS Devel, linux-block

On 6/9/21 11:30 AM, Matthew Wilcox wrote:
> maybe you should read the paper.
> 
> " Thiscomparison demonstrates that using F2FS, a flash-friendly file
> sys-tem, does not mitigate the wear-out problem, except inasmuch asit
> inadvertently rate limitsallI/O to the device"

It seems like my email was not clear enough? What I tried to make clear
is that I think that there is no way to solve the flash wear issue with
the traditional block interface. I think that F2FS in combination with
the zone interface is an effective solution.

What is also relevant in this context is that the "Flash drive lifespan
is a problem" paper was published in 2017. I think that the first
commercial SSDs with a zone interface became available at a later time
(summer of 2020?).

Bart.


* Re: [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
  2021-06-09 18:47     ` Bart Van Assche
@ 2021-06-10  0:16       ` Damien Le Moal
  2021-06-10  1:11         ` Ric Wheeler
  2021-06-10  1:20       ` Ric Wheeler
       [not found]       ` <CAOtxgyeRf=+grEoHxVLEaSM=Yfx4KrSG5q96SmztpoWfP=QrDg@mail.gmail.com>
  2 siblings, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2021-06-10  0:16 UTC (permalink / raw)
  To: Bart Van Assche, Matthew Wilcox
  Cc: Ric Wheeler, lsf-pc, Linux FS Devel, linux-block

On 2021/06/10 3:47, Bart Van Assche wrote:
> On 6/9/21 11:30 AM, Matthew Wilcox wrote:
>> maybe you should read the paper.
>>
>> " Thiscomparison demonstrates that using F2FS, a flash-friendly file
>> sys-tem, does not mitigate the wear-out problem, except inasmuch asit
>> inadvertently rate limitsallI/O to the device"
> 
> It seems like my email was not clear enough? What I tried to make clear
> is that I think that there is no way to solve the flash wear issue with
> the traditional block interface. I think that F2FS in combination with
> the zone interface is an effective solution.
> 
> What is also relevant in this context is that the "Flash drive lifespan
> is a problem" paper was published in 2017. I think that the first
> commercial SSDs with a zone interface became available at a later time
> (summer of 2020?).

Yes, zone support in the block layer and f2fs was added with kernel 4.10
released in Feb 2017. So the authors likely did not consider that as a solution,
especially considering that at the time, it was all about SMR HDDs only. Now, we
do have ZNS and things like SD-Express coming which may allow NVMe/ZNS on even
the cheapest of consumer devices.

That said, I do not think that f2fs is an ideal solution yet as is, since all
its metadata needs in-place updates and so is subject to the drive's
implementation of FTL/wear leveling. And the quality of this varies between
devices and vendors...

btrfs zone support improves on that, as even the super blocks are not updated
in place on zoned devices. Everything is copy-on-write, written sequentially
into zones. While the current block allocator is rather simple for now, it
could be tweaked to add some wear leveling awareness, eventually (per-zone wear
leveling is something much easier to do inside the drive though, so the host
should not care).

In the context of zoned storage, the discussion could be around how to best
support file systems. Do we keep modifying one file system after another to
support zones, or implement wear leveling? That is *very* hard to do and
sometimes not reasonably feasible depending on the FS design.

I do remember a Dave Chinner talk back at LSF/MM 2018 (was it?) where he
discussed the idea of having block allocation moved out of FSes and turned into
a kind of library common to many file systems. In the context of consumer flash
wear leveling, and eventually zones (likely with some remapping needed), this
may be something interesting to discuss again.
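
The write rule that all of this revolves around is simple; a toy model of a
single sequential-write-required zone (just the write-pointer behaviour, not
any real kernel or driver API) looks like:

# Toy model of a sequential-write-required zone: data can only be appended at
# the write pointer, and reclaiming space means resetting the whole zone.
class Zone:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.wp = 0                       # write pointer, in blocks

    def append(self, nblocks):
        if self.wp + nblocks > self.capacity:
            raise IOError("zone full: finish it and open another zone")
        start = self.wp                   # data always lands at the write pointer
        self.wp += nblocks
        return start                      # the FS records this as the block address

    def reset(self):
        self.wp = 0                       # whole-zone reset: all data in it is gone

zone = Zone(capacity_blocks=65536)
first = zone.append(256)                  # a COW filesystem just appends new copies
second = zone.append(256)                 # ...and reclaims space one zone at a time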



-- 
Damien Le Moal
Western Digital Research


* Re: [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
  2021-06-10  0:16       ` Damien Le Moal
@ 2021-06-10  1:11         ` Ric Wheeler
  0 siblings, 0 replies; 14+ messages in thread
From: Ric Wheeler @ 2021-06-10  1:11 UTC (permalink / raw)
  To: Damien Le Moal, Bart Van Assche, Matthew Wilcox
  Cc: lsf-pc, Linux FS Devel, linux-block

On 6/9/21 8:16 PM, Damien Le Moal wrote:
> On 2021/06/10 3:47, Bart Van Assche wrote:
>> On 6/9/21 11:30 AM, Matthew Wilcox wrote:
>>> maybe you should read the paper.
>>>
>>> " Thiscomparison demonstrates that using F2FS, a flash-friendly file
>>> sys-tem, does not mitigate the wear-out problem, except inasmuch asit
>>> inadvertently rate limitsallI/O to the device"
>> It seems like my email was not clear enough? What I tried to make clear
>> is that I think that there is no way to solve the flash wear issue with
>> the traditional block interface. I think that F2FS in combination with
>> the zone interface is an effective solution.
>>
>> What is also relevant in this context is that the "Flash drive lifespan
>> is a problem" paper was published in 2017. I think that the first
>> commercial SSDs with a zone interface became available at a later time
>> (summer of 2020?).
> Yes, zone support in the block layer and f2fs was added with kernel 4.10
> released in Feb 2017. So the authors likely did not consider that as a solution,
> especially considering that at the time, it was all about SMR HDDs only. Now, we
> do have ZNS and things like SD-Express coming which may allow NVMe/ZNS on even
> the cheapest of consumer devices.
>
> That said, I do not think that f2fs is an ideal solution yet as is, since all
> its metadata needs in-place updates and so is subject to the drive's
> implementation of FTL/wear leveling. And the quality of this varies between
> devices and vendors...
>
> btrfs zone support improves on that, as even the super blocks are not updated
> in place on zoned devices. Everything is copy-on-write, written sequentially
> into zones. While the current block allocator is rather simple for now, it
> could be tweaked to add some wear leveling awareness, eventually (per-zone wear
> leveling is something much easier to do inside the drive though, so the host
> should not care).
>
> In the context of zoned storage, the discussion could be around how to best
> support file systems. Do we keep modifying one file system after another to
> support zones, or implement wear leveling? That is *very* hard to do and
> sometimes not reasonably feasible depending on the FS design.
>
> I do remember a Dave Chinner talk back at LSF/MM 2018 (was it?) where he
> discussed the idea of having block allocation moved out of FSes and turned into
> a kind of library common to many file systems. In the context of consumer flash
> wear leveling, and eventually zones (likely with some remapping needed), this
> may be something interesting to discuss again.
>
Some of the other bits that make this hard in the embedded space include
layering on top of device mapper - using dm-verity for example - and our usual
problem of having apps that drive too many small I/Os down to service sqlite
transactions.

I am looking to get some measurements done to show the write amplification -
measure the total amount of writes done by applications - and what that
translates into in device requests. Anything done for metadata, logging, etc.
all counts as "write amplification" when viewed this way.

That would be useful for figuring out what the best case durability of parts
would be for specific workloads.

Measuring write amplification inside of a device is often possible as well, so
we could end up getting a pretty clear picture.
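
A rough way to collect those numbers on a running system - what the
application asked to write versus what the block device actually saw - might
look like the sketch below (the device name and pid are placeholders; sample
before and after the workload and take the difference):

# Coarse write amplification sample: per-process write_bytes from /proc/<pid>/io
# vs. 512-byte sectors written from /sys/block/<dev>/stat (field 7).
def app_write_bytes(pid):
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            if line.startswith("write_bytes:"):
                return int(line.split()[1])

def dev_bytes_written(dev):
    with open(f"/sys/block/{dev}/stat") as f:
        return int(f.read().split()[6]) * 512   # sectors written -> bytes

# Usage sketch: sample both counters, run the workload, sample again, then
#   wa = (dev_after - dev_before) / (app_after - app_before)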

Regards,

Ric






* Re: [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
  2021-06-09 18:47     ` Bart Van Assche
  2021-06-10  0:16       ` Damien Le Moal
@ 2021-06-10  1:20       ` Ric Wheeler
  2021-06-10 11:07         ` Tim Walker
       [not found]       ` <CAOtxgyeRf=+grEoHxVLEaSM=Yfx4KrSG5q96SmztpoWfP=QrDg@mail.gmail.com>
  2 siblings, 1 reply; 14+ messages in thread
From: Ric Wheeler @ 2021-06-10  1:20 UTC (permalink / raw)
  To: Bart Van Assche, Matthew Wilcox; +Cc: lsf-pc, Linux FS Devel, linux-block

On 6/9/21 2:47 PM, Bart Van Assche wrote:
> On 6/9/21 11:30 AM, Matthew Wilcox wrote:
>> maybe you should read the paper.
>>
>> " Thiscomparison demonstrates that using F2FS, a flash-friendly file
>> sys-tem, does not mitigate the wear-out problem, except inasmuch asit
>> inadvertently rate limitsallI/O to the device"
> It seems like my email was not clear enough? What I tried to make clear
> is that I think that there is no way to solve the flash wear issue with
> the traditional block interface. I think that F2FS in combination with
> the zone interface is an effective solution.
>
> What is also relevant in this context is that the "Flash drive lifespan
> is a problem" paper was published in 2017. I think that the first
> commercial SSDs with a zone interface became available at a later time
> (summer of 2020?).
>
> Bart.

Just to address the zone interface support, it unfortunately takes a very long
time to make it down into the world of embedded parts (emmc is super common and
very primitive, for example). UFS parts are in higher end devices; I have not
had a chance to look at what they offer.

Ric




* Re: [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
  2021-06-10  1:20       ` Ric Wheeler
@ 2021-06-10 11:07         ` Tim Walker
  2021-06-10 16:38           ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Tim Walker @ 2021-06-10 11:07 UTC (permalink / raw)
  To: Ric Wheeler, Bart Van Assche, Matthew Wilcox
  Cc: lsf-pc, Linux FS Devel, linux-block

Hi all-

On  Wednesday, June 9, 2021 at 9:20:52 PM Ric Wheeler wrote:

>On 6/9/21 2:47 PM, Bart Van Assche wrote:
>> On 6/9/21 11:30 AM, Matthew Wilcox wrote:
>>> maybe you should read the paper.
>>>
>>> " Thiscomparison demonstrates that using F2FS, a flash-friendly file
>>> sys-tem, does not mitigate the wear-out problem, except inasmuch asit
>>> inadvertently rate limitsallI/O to the device"
>> It seems like my email was not clear enough? What I tried to make clear
>> is that I think that there is no way to solve the flash wear issue with
>> the traditional block interface. I think that F2FS in combination with
>> the zone interface is an effective solution.
>>
>> What is also relevant in this context is that the "Flash drive lifespan
>> is a problem" paper was published in 2017. I think that the first
>> commercial SSDs with a zone interface became available at a later time
>> (summer of 2020?).
>>
>> Bart.
>
>Just to address the zone interface support, it unfortunately takes a very long 
>time to make it down into the world of embedded parts (emmc is super common and 
>very primitive for example). UFS parts are in higher end devices, have not had a 
>chance to look at what they offer.
>
>Ric
>
>
>

For zoned block devices, particularly the sequential write zones, maybe it makes more sense for the device to manage wear leveling on a zone-by-zone basis. It seems like it could be pretty easy for a device to decide which head/die to select for a given zone when the zone is initially opened after the last reset write pointer.
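
A toy model of that idea, purely to illustrate it (this is not how any real
firmware is known to behave): keep a per-die erase counter and hand each
freshly opened zone to the least-worn die.

# Toy device-side policy: when a zone is opened after a write pointer reset,
# place it on whichever die currently has the fewest erases. Illustrative only.
erase_counts = {die: 0 for die in range(8)}      # per-die wear counters
zone_to_die = {}

def open_zone(zone_id):
    die = min(erase_counts, key=erase_counts.get)    # least-worn die wins
    zone_to_die[zone_id] = die
    return die

def reset_zone(zone_id):
    erase_counts[zone_to_die.pop(zone_id)] += 1      # a reset costs one erase

open_zone(0)
open_zone(1)
reset_zone(0)
print(open_zone(2))    # lands on a die other than the one that was just erased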

Best regards,
-Tim




* Re: [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
       [not found]       ` <CAOtxgyeRf=+grEoHxVLEaSM=Yfx4KrSG5q96SmztpoWfP=QrDg@mail.gmail.com>
@ 2021-06-10 16:22         ` Ric Wheeler
  2021-06-10 17:06           ` Matthew Wilcox
  2021-06-10 17:57           ` Viacheslav Dubeyko
  0 siblings, 2 replies; 14+ messages in thread
From: Ric Wheeler @ 2021-06-10 16:22 UTC (permalink / raw)
  To: Jaegeuk Kim, Bart Van Assche
  Cc: Matthew Wilcox, lsf-pc, Linux FS Devel, linux-block

On 6/9/21 5:32 PM, Jaegeuk Kim wrote:
> On Wed, Jun 9, 2021 at 11:47 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
>     On 6/9/21 11:30 AM, Matthew Wilcox wrote:
>     > maybe you should read the paper.
>     >
>     > " Thiscomparison demonstrates that using F2FS, a flash-friendly file
>     > sys-tem, does not mitigate the wear-out problem, except inasmuch asit
>     > inadvertently rate limitsallI/O to the device"
>
>
> Do you agree with that statement based on your insight? At least to me, that
> paper is missing the fundamental GC problem which was supposed to be
> evaluated by real workloads instead of using a simple benchmark generating
> 4KB random writes only. And, they had to investigate more details in FTL/IO
> patterns including UNMAP and LBA alignment between host and storage, which
> all affect WAF. Based on that, the point of the zoned device is quite promising
> to me, since it can address LBA alignment entirely and give a way that host
> SW stack can control QoS.

Just a note: using a pretty simple and optimal streaming write pattern, I have
been able to burn out emmc parts in a little over a week.

My test case creates a 1GB file (filled with random data just in case the
device was looking for zero blocks to ignore) and then loops over cp and sync
of that file until the emmc device lifetime was reported as exhausted.

This was a clean, best case sequential write, so this is not just an issue with
small, random writes.

Of course, it is normal to wear them out eventually, but for the super low end
parts, every write our stack adds on top is costly given how little life they
have....
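
For anyone who wants to reproduce something similar, the loop amounts to
something like the sketch below (the paths are placeholders, the life_time
sysfs location varies by platform, and 0x0B there is assumed to be the
"exceeded estimated lifetime" value):

# Sketch of the torture loop described above. Paths are placeholders; the
# life_time attribute (two hex fields, type A and B estimates) is assumed to
# be exposed for the eMMC at this sysfs location on the test platform.
import os, shutil

SRC = "/data/seed.bin"
DST = "/data/copy.bin"
LIFE = "/sys/block/mmcblk0/device/life_time"

with open(SRC, "wb") as f:                 # 1 GB of incompressible data so the
    for _ in range(1024):                  # device cannot cheat on zero blocks
        f.write(os.urandom(1 << 20))

while True:
    shutil.copyfile(SRC, DST)              # best-case, streaming sequential write
    os.sync()                              # push it all the way to the device
    est_a, est_b = open(LIFE).read().split()
    if max(int(est_a, 16), int(est_b, 16)) >= 0x0B:
        break                              # device reports lifetime exhausted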

Regards,


Ric


>
> The topic has been a long-standing issue in flash area for multiple years and
> it'd be exciting to see any new ideas.
>
>
>     It seems like my email was not clear enough? What I tried to make clear
>     is that I think that there is no way to solve the flash wear issue with
>     the traditional block interface. I think that F2FS in combination with
>     the zone interface is an effective solution.
>
>     What is also relevant in this context is that the "Flash drive lifespan
>     is a problem" paper was published in 2017. I think that the first
>     commercial SSDs with a zone interface became available at a later time
>     (summer of 2020?).
>
>     Bart.
>



* Re: [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
  2021-06-10 11:07         ` Tim Walker
@ 2021-06-10 16:38           ` Keith Busch
  0 siblings, 0 replies; 14+ messages in thread
From: Keith Busch @ 2021-06-10 16:38 UTC (permalink / raw)
  To: Tim Walker
  Cc: Ric Wheeler, Bart Van Assche, Matthew Wilcox, lsf-pc,
	Linux FS Devel, linux-block

On Thu, Jun 10, 2021 at 11:07:09AM +0000, Tim Walker wrote:
>  Wednesday, June 9, 2021 at 9:20:52 PM Ric Wheeler wrote:
> >On 6/9/21 2:47 PM, Bart Van Assche wrote:
> >> On 6/9/21 11:30 AM, Matthew Wilcox wrote:
> >>> maybe you should read the paper.
> >>>
> >>> " Thiscomparison demonstrates that using F2FS, a flash-friendly file
> >>> sys-tem, does not mitigate the wear-out problem, except inasmuch asit
> >>> inadvertently rate limitsallI/O to the device"
> >> It seems like my email was not clear enough? What I tried to make clear
> >> is that I think that there is no way to solve the flash wear issue with
> >> the traditional block interface. I think that F2FS in combination with
> >> the zone interface is an effective solution.
> >>
> >> What is also relevant in this context is that the "Flash drive lifespan
> >> is a problem" paper was published in 2017. I think that the first
> >> commercial SSDs with a zone interface became available at a later time
> >> (summer of 2020?).
> >>
> >> Bart.
> >
> >Just to address the zone interface support, it unfortunately takes a very long 
> >time to make it down into the world of embedded parts (emmc is super common and 
> >very primitive for example). UFS parts are in higher end devices, have not had a 
> >chance to look at what they offer.
> >
> >Ric
> 
> For zoned block devices, particularly the sequential write zones,
> maybe it makes more sense for the device to manage wear leveling on a
> zone-by-zone basis. It seems like it could be pretty easy for a device
> to decide which head/die to select for a given zone when the zone is
> initially opened after the last reset write pointer.

I think device-managed wear leveling was the point of zoned SSDs. If the
host were managing that, then that's pretty much an open-channel SSD.


* Re: [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
  2021-06-10 16:22         ` Ric Wheeler
@ 2021-06-10 17:06           ` Matthew Wilcox
  2021-06-10 17:25             ` Ric Wheeler
  2021-06-10 17:57           ` Viacheslav Dubeyko
  1 sibling, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2021-06-10 17:06 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jaegeuk Kim, Bart Van Assche, lsf-pc, Linux FS Devel, linux-block

On Thu, Jun 10, 2021 at 12:22:40PM -0400, Ric Wheeler wrote:
> On 6/9/21 5:32 PM, Jaegeuk Kim wrote:
> > On Wed, Jun 9, 2021 at 11:47 AM Bart Van Assche <bvanassche@acm.org> wrote:
> > 
> >     On 6/9/21 11:30 AM, Matthew Wilcox wrote:
> >     > maybe you should read the paper.
> >     >
> >     > " Thiscomparison demonstrates that using F2FS, a flash-friendly file
> >     > sys-tem, does not mitigate the wear-out problem, except inasmuch asit
> >     > inadvertently rate limitsallI/O to the device"
> > 
> > 
> > Do you agree with that statement based on your insight? At least to me, that
> > paper is missing the fundamental GC problem which was supposed to be
> > evaluated by real workloads instead of using a simple benchmark generating
> > 4KB random writes only. And, they had to investigate more details in FTL/IO
> > patterns including UNMAP and LBA alignment between host and storage, which
> > all affect WAF. Based on that, the point of the zoned device is quite promising
> > to me, since it can address LBA alignment entirely and give a way that host
> > SW stack can control QoS.
> 
> Just a note, using a pretty simple and optimal streaming write pattern, I
> have been able to burn out emmc parts in a little over a week.
> 
> My test case creating a 1GB file (filled with random data just in case the
> device was looking for zero blocks to ignore) and then do a loop to cp and
> sync that file until the emmc device life time was shown as exhausted.
> 
> This was a clean, best case sequential write so this is not just an issue
> with small, random writes.

How many LBAs were you using?  My mental model of a FTL (which may
be out of date) is that it's essentially a log-structured filesystem.
When there are insufficient empty erase-blocks available, the device
finds a suitable victim erase-block, copies all the still-live LBAs into
an active erase-block, updates the FTL and erases the erase-block.

So the key is making sure that LBAs are reused as much as possible.
Short of modifying a filesystem to make this happen, I force it by
short-stroking my SSD.  We can model it statistically, but intuitively,
if there are more "live" LBAs, the higher the write amplification and
wear on the drive will be because the victim erase-blocks will have
more live LBAs to migrate.

This is why the paper intrigued me; it seemed like they were rewriting
a 100MB file in place.  That _shouldn't_ cause ridiculous wear, unless
the emmc device was otherwise almost full.
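
That mental model is easy to poke at numerically. Here is a tiny simulation of
a log-structured FTL with greedy victim selection (all parameters are
arbitrary; the only point is how write amplification grows with the fraction
of live LBAs):

# Minimal log-structured FTL model: random host overwrites, greedy GC victim
# selection, WA = (host programs + GC copy programs) / host programs.
import random

def simulate(blocks=128, pages=128, live_fraction=0.8, host_writes=200_000):
    lbas = int(blocks * pages * live_fraction)      # logical space kept live
    mapping = {}                                    # lba -> (block, page)
    valid = [0] * blocks                            # live pages per erase block
    free = list(range(1, blocks))
    cur, cur_page, total = 0, 0, 0

    def program(lba):
        nonlocal cur, cur_page, total
        total += 1
        if lba in mapping:
            valid[mapping[lba][0]] -= 1             # old copy becomes garbage
        mapping[lba] = (cur, cur_page)
        valid[cur] += 1
        cur_page += 1
        if cur_page == pages:                       # active block full
            cur, cur_page = free.pop(), 0

    for _ in range(host_writes):
        program(random.randrange(lbas))
        while len(free) < 2:                        # out of clean blocks: GC
            candidates = [b for b in range(blocks) if b != cur and b not in free]
            victim = min(candidates, key=lambda b: valid[b])
            for lba, (b, _p) in list(mapping.items()):
                if b == victim:
                    program(lba)                    # migrate still-live LBAs
            valid[victim] = 0
            free.append(victim)                     # erased, clean again
    return total / host_writes

print("WA with 80% of LBAs live:", round(simulate(), 2))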


* Re: [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
  2021-06-10 17:06           ` Matthew Wilcox
@ 2021-06-10 17:25             ` Ric Wheeler
  0 siblings, 0 replies; 14+ messages in thread
From: Ric Wheeler @ 2021-06-10 17:25 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jaegeuk Kim, Bart Van Assche, lsf-pc, Linux FS Devel, linux-block

On 6/10/21 1:06 PM, Matthew Wilcox wrote:
> On Thu, Jun 10, 2021 at 12:22:40PM -0400, Ric Wheeler wrote:
>> On 6/9/21 5:32 PM, Jaegeuk Kim wrote:
>> On Wed, Jun 9, 2021 at 11:47 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>>      On 6/9/21 11:30 AM, Matthew Wilcox wrote:
>>>      > maybe you should read the paper.
>>>      >
>>>      > " Thiscomparison demonstrates that using F2FS, a flash-friendly file
>>>      > sys-tem, does not mitigate the wear-out problem, except inasmuch asit
>>>      > inadvertently rate limitsallI/O to the device"
>>>
>>>
>>> Do you agree with that statement based on your insight? At least to me, that
>>> paper is missing the fundamental GC problem which was supposed to be
>>> evaluated by real workloads instead of using a simple benchmark generating
>>> 4KB random writes only. And, they had to investigate more details in FTL/IO
>>> patterns including UNMAP and LBA alignment between host and storage, which
>>> all affect WAF. Based on that, the point of the zoned device is quite promising
>>> to me, since it can address LBA alignment entirely and give a way that host
>>> SW stack can control QoS.
>> Just a note, using a pretty simple and optimal streaming write pattern, I
>> have been able to burn out emmc parts in a little over a week.
>>
>> My test case creating a 1GB file (filled with random data just in case the
>> device was looking for zero blocks to ignore) and then do a loop to cp and
>> sync that file until the emmc device life time was shown as exhausted.
>>
>> This was a clean, best case sequential write so this is not just an issue
>> with small, random writes.
> How many LBAs were you using?  My mental model of a FTL (which may
> be out of date) is that it's essentially a log-structured filesystem.
> When there are insufficient empty erase-blocks available, the device
> finds a suitable victim erase-block, copies all the still-live LBAs into
> an active erase-block, updates the FTL and erases the erase-block.
>
> So the key is making sure that LBAs are reused as much as possible.
> Short of modifying a filesystem to make this happen, I force it by
> short-stroking my SSD.  We can model it statistically, but intuitively,
> if there are more "live" LBAs, the higher the write amplification and
> wear on the drive will be because the victim erase-blocks will have
> more live LBAs to migrate.
>
> This is why the paper intrigued me; it seemed like they were rewriting
> a 100MB file in place.  That _shouldn't_ cause ridiculous wear, unless
> the emmc device was otherwise almost full.

During the test run, I did not look at which LBAs got written to over the
couple of weeks.

Roughly, I tried to make sure that the file system ranged in fullness from 50%
to 75% (did not let it get too close to full).

Any vendor (especially on the low end parts) might do something really
primitive, but the hope I would have is similar to what you describe - if there
is sufficient free space, the firmware should be able to wear level across all
of the cells in the device. Overwriting in place or writing (and then
freeing/discarding) each LBA *should* be roughly equivalent. Free space here
means LBAs that the device does not know to contain valid, un-discarded data.

It is also important to write enough to flush through any DRAM/SRAM-like cache
a device might have that could absorb tiny writes.

The parts I played with ranged from what seemed to be roughly 3x write
amplification for the workload I ran down to more like 1.3x (measured super
coarsely as app level IO dispatched, so all metadata, etc. counted as "WA" in
my coarse look). I am just trying to figure out, for a given IO/fs stack, how
specific devices handle the user workload.

Regards,

Ric




* Re: [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!)
  2021-06-10 16:22         ` Ric Wheeler
  2021-06-10 17:06           ` Matthew Wilcox
@ 2021-06-10 17:57           ` Viacheslav Dubeyko
  1 sibling, 0 replies; 14+ messages in thread
From: Viacheslav Dubeyko @ 2021-06-10 17:57 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Jaegeuk Kim, Bart Van Assche, Matthew Wilcox, lsf-pc,
	Linux FS Devel, linux-block



> On Jun 10, 2021, at 9:22 AM, Ric Wheeler <ricwheeler@gmail.com> wrote:
> 
> On 6/9/21 5:32 PM, Jaegeuk Kim wrote:
>> On Wed, Jun 9, 2021 at 11:47 AM Bart Van Assche <bvanassche@acm.org> wrote:
>> 
>>    On 6/9/21 11:30 AM, Matthew Wilcox wrote:
>>    > maybe you should read the paper.
>>    >
>>    > " Thiscomparison demonstrates that using F2FS, a flash-friendly file
>>    > sys-tem, does not mitigate the wear-out problem, except inasmuch asit
>>    > inadvertently rate limitsallI/O to the device"
>> 
>> 
>> Do you agree with that statement based on your insight? At least to me, that
>> paper is missing the fundamental GC problem which was supposed to be
>> evaluated by real workloads instead of using a simple benchmark generating
>> 4KB random writes only. And, they had to investigate more details in FTL/IO
>> patterns including UNMAP and LBA alignment between host and storage, which
>> all affect WAF. Based on that, the point of the zoned device is quite promising
>> to me, since it can address LBA alignment entirely and give a way that host
>> SW stack can control QoS.
> 
> Just a note, using a pretty simple and optimal streaming write pattern, I have been able to burn out emmc parts in a little over a week.
> 
> My test case creating a 1GB file (filled with random data just in case the device was looking for zero blocks to ignore) and then do a loop to cp and sync that file until the emmc device life time was shown as exhausted.
> 
> This was a clean, best case sequential write so this is not just an issue with small, random writes.
> 
> Of course, this is normal to wear them out, but for the super low end parts, taking away any of the device writes in our stack is costly given how little life they have....
> 
> Regards,
> 
> 
> Ric
> 

I think that we need to distinguish various cases here. If we have a pretty aged volume, then GC plays an important role in the write amplification issue. I believe that F2FS still does not have a very efficient GC subsystem. And, potentially, there is competition between the FS's GC and the FTL's GC. So, the F2FS GC subsystem can be optimized in some way to reduce write amplification and GC competition. But I believe that the fundamental nature of the F2FS GC subsystem provides no way to exclude the write amplification issue completely. However, if GC is not involved, then this source of write amplification can be excluded from the consideration.

The F2FS in-place update area is another source of write amplification that is expected to be managed by the FTL. This architectural decision doesn't leave much room for optimization here, unless some metadata is moved into an area that lives under a copy-on-write policy. But that could be a hard and time-consuming change.

Another source of write amplification in F2FS is the block mapping technique. Every update of a logical block with user data results in an update of the block mapping metadata. So, this architectural solution still doesn't provide a lot of room for optimization, unless some other metadata structure or mapping technique is introduced.

So, if we exclude GC, the in-place area, the block mapping technique and other architectural decisions, then the next possible direction to decrease write amplification could be to not update logical blocks frequently or to issue a smaller number of write operations. The most obvious solutions for this are: (1) compression, (2) deduplication, (3) combining several small files into one NAND page, (4) using an inline technique to store small files' content in the inode's area.
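
As a toy illustration of option (3) - batching several small files into a single NAND-page-sized write instead of programming a page (or more) per file - with made-up sizes:

# Pack several small files into one NAND-page-sized buffer and keep a tiny
# per-file (offset, length) index. Page size and file contents are made up.
PAGE_SIZE = 4096
files = {"a.conf": b"A" * 700, "b.json": b"B" * 1500, "c.log": b"C" * 900}

page, index, offset = bytearray(PAGE_SIZE), {}, 0
for name, data in files.items():
    if offset + len(data) > PAGE_SIZE:
        break                                  # a real packer would start a new page
    page[offset:offset + len(data)] = data
    index[name] = (offset, len(data))
    offset += len(data)

# One page program now covers three small files; written individually, each of
# them would have consumed at least one full NAND page on its own.
print(index, "bytes used:", offset, "of", PAGE_SIZE)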

I believe that an additional potential issue of F2FS is the metadata reservation technique. I mean here that creation of a volume implies the reservation and initialization of metadata structures. It means that even if the metadata doesn't yet contain any valuable info, the FTL still assumes it is valid data that needs to be managed to guarantee access to this content. Eventually, the FTL will move this data among erase blocks, and that could decrease the lifetime of the device. Especially if we are talking about NAND flash without good endurance, read disturbance could play a significant role.

Thanks,
Slava.








* [LSF/MM/BPF TOPIC] SSDFS: LFS file system without GC operations + NAND flash devices lifetime prolongation
  2021-06-09 10:53 [LSF/MM/BPF TOPIC] durability vs performance for flash devices (especially embedded!) Ric Wheeler
  2021-06-09 18:05 ` Bart Van Assche
@ 2021-06-13 20:41 ` Viacheslav Dubeyko
  1 sibling, 0 replies; 14+ messages in thread
From: Viacheslav Dubeyko @ 2021-06-13 20:41 UTC (permalink / raw)
  To: lsf-pc; +Cc: Linux FS Devel, linux-block, Ric Wheeler

Hello,

I would like to discuss the SSDFS file system [1] architecture. The file system has been designed as an LFS file system that can: (1) exclude the GC overhead, (2) prolong NAND flash devices' lifetime, (3) achieve a good performance balance even if the NAND flash device's lifetime is a priority. I wrote a paper [5] that contains an analysis of possible approaches to prolonging a NAND flash device's lifetime and a deeper explanation of the SSDFS file system architecture.

The fundamental concepts of SSDFS:

[LOGICAL SEGMENT] A logical segment is always located at the same position on the volume, and the file system volume can be imagined as a sequence of logical segments at fixed positions. As a result, every logical block can be described by a logical extent {segment_id, block_index_inside_segment, length}. It means that the metadata describing the position of a logical block on the volume never needs to be updated, because the block will always live in the same logical segment even after an update (COW policy). This concept completely excludes block mapping metadata structure updates and can thereby decrease the write amplification factor, because a COW policy otherwise requires frequent updates of the block mapping metadata structure.

[LOGICAL ERASE BLOCK] Initially, a logical segment is an empty container that can hold one or several erase blocks. A logical erase block can be mapped onto any “physical” erase block, where a “physical” erase block means a contiguous sequence of LBAs aligned on the erase block size. There is a mapping table that manages the association of logical erase blocks (LEBs) with “physical” erase blocks (PEBs). The goal of the LEB and the mapping table is to implement the logical extent concept. The goal of having several LEBs in one segment is to improve the performance of I/O operations, because the PEBs in a segment can be located on different NAND dies of the device and can be accessed through different device channels.
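
A toy sketch of how the logical extent and the LEB -> PEB mapping table fit together (an illustration of the concept only, not SSDFS's actual structures or naming):

# Toy sketch: an extent that never changes even though the data moves, and a
# LEB -> PEB mapping table that absorbs the movement. Not real SSDFS code.
from dataclasses import dataclass

@dataclass(frozen=True)
class LogicalExtent:
    segment_id: int        # logical segment: fixed logical position of the data
    block_index: int       # block index inside that segment
    length: int

class MappingTable:
    def __init__(self):
        self.leb_to_peb = {}               # (segment_id, leb_index) -> peb_id

    def map(self, segment_id, leb_index, peb_id):
        self.leb_to_peb[(segment_id, leb_index)] = peb_id

    def resolve(self, extent: LogicalExtent, leb_index=0):
        # The extent stored in the file's metadata never changes; only this
        # lookup result does when the segment migrates to a fresh PEB.
        return self.leb_to_peb[(extent.segment_id, leb_index)]

tbl = MappingTable()
ext = LogicalExtent(segment_id=42, block_index=7, length=1)
tbl.map(42, 0, peb_id=1005)        # initial placement
tbl.map(42, 0, peb_id=2331)        # after migration: extent metadata untouched
print(tbl.resolve(ext))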

[SEGMENT TYPE] There are several segment types on the volume (superblock, mapping table, segment bitmap, b-tree node, user data). The goal of the various segment types is to make a PEB's “temperature” more predictable and to compact/aggregate several pieces of data into one NAND page. For example, several small files, several compressed logical blocks, or several compressed b-tree nodes can be aggregated into one NAND page. It means that several pieces of data can be aggregated into one write/read (I/O) request, which is a way to decrease the write amplification factor. Making a PEB's “temperature” more predictable means that aggregating the same type of data into one segment gives a more stable/predictable average number of update/read I/O requests for that segment type. As a result, it could decrease GC activity and decrease the write amplification factor.

[LOG] The log is the central part of the techniques to manage the write amplification factor. Every PEB contains one log or a sequence of logs. The goal of the log is to aggregate several pieces of data into one NAND page to decrease the write amplification factor. For example, several small files or several compressed logical blocks can be aggregated into one NAND page. An offset translation table is the metadata structure that converts a logical block ID (LBA) into the offset inside the log where a piece of data is stored. The log is split into several areas (diff-on-write area, journal area, main area) with the goal of storing data of different nature. For example, the main area could store uncompressed logical blocks, the journal area could aggregate small files or compressed logical blocks into one NAND page, and the diff-on-write area could aggregate small updates of different logical blocks into one NAND page. The different area types have the goal of distinguishing the “temperature” of data and averaging the “temperature” of an area. For example, the diff-on-write area could be hotter than the journal area. As a result, it is possible to expect that, for example, the diff-on-write area could be completely invalidated by regular updates of some logical blocks without any need for GC activity.

[MIGRATION SCHEME] The migration scheme is the central technique for implementing the logical extent concept and excluding the need for GC activity. If some PEB is exhausted by logs (no free space), then migration needs to start for this PEB. Because compression and compaction schemes are used for the metadata and user data, the real data volume uses only a portion of the PEB's space. It means that it is possible to reserve another PEB in the mapping table with the goal of associating two PEBs for the migration process (the exhausted PEB is the source and a clean PEB is the destination). Every update of some logical block results in storing the new state in the destination PEB and invalidating the logical block in the exhausted one. Generally speaking, it means that regular I/O operations are capable of completely invalidating the exhausted PEB in the case of “hot” data. Finally, the invalidated PEB can be erased and marked as clean and available for new write operations. Another important point is that even after migration the logical block still lives in the same segment, and there is no need to update the block mapping metadata structure because the logical extent keeps its actual state. The offset translation table keeps the actual position of the logical block in the PEB space.
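
A toy model of the migration pair described above (again, an illustration of the idea rather than the real SSDFS code): ordinary updates drain the exhausted source PEB into the clean destination until the source can simply be erased.

# Toy model: an exhausted source PEB paired with a clean destination PEB.
class PebPair:
    def __init__(self, live_blocks):
        self.src_live = set(live_blocks)   # still-valid blocks in exhausted PEB
        self.dst = {}                      # block -> new data in the clean PEB

    def update(self, block, data):
        self.dst[block] = data             # new state goes to the destination
        self.src_live.discard(block)       # old copy in the source is invalidated

    def try_erase_source(self):
        return not self.src_live           # fully invalidated: erase, no GC copy

pair = PebPair(live_blocks={1, 2, 3})
for blk in (1, 2, 3):                      # regular ("hot") updates arrive
    pair.update(blk, b"new")
assert pair.try_erase_source()             # source emptied by normal I/O alone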

[MIGRATION STIMULATION] However, not every PEB can migrate completely under the pressure of regular I/O operations (for example, in the case of “warm” or “cold” data). So, SSDFS uses the migration stimulation technique as a complement to the migration scheme. It means that if some LEB is under migration, then the flush thread checks for an opportunity to add some additional content into the log under commit. If the flush thread has received a request to commit some log, then it has the content of the logical blocks that were requested to be updated. However, it is possible that the available content cannot fill a whole NAND page completely (for example, it may use only 2 KB). If there are some valid logical blocks in the exhausted PEB, then it is possible to compress and add the content of such logical blocks into the log under commit. Finally, every log commit can result in the migration of additional logical blocks from the exhausted PEB into the new one. As a result, regular update (I/O) operations can completely invalidate the exhausted PEB without the need for any GC activity at all. The important point here is that the compaction technique can decrease the number of write requests, and the exclusion of GC activity decreases the number of I/O operations initiated by GC. It is possible to state that the migration scheme and migration stimulation techniques are capable of significantly decreasing the write amplification factor.

[GC] SSDFS has several GC threads, but the goal of these threads is to check the state of segments, to stimulate slow migration processes, and to destroy in-core segment objects that are no longer in use. There is a segment bitmap metadata structure that tracks the state of segments (clean, using, used, pre-dirty, dirty). Every GC thread is dedicated to checking segments in a similar state (for example, pre-dirty). Sometimes a PEB migration process could start and then get stuck for some time because of the absence of update requests for this particular PEB under migration. The goal of the GC threads is to find such PEBs and to stimulate the migration of valid blocks from the exhausted PEB to the clean one. But the number of GC-initiated I/O requests should be pretty negligible because GC selects the segments that have no consumers right now. The migration scheme and migration stimulation can manage around 90% of all the necessary migration and cleaning operations.

[COLD DATA] SSDFS never moves PEBs with cold data. It means that if some PEB with data is not under migration and doesn't receive update requests, then SSDFS leaves such PEBs untouched, because the FTL can manage error correction and move erase blocks with cold data in the background inside the NAND flash device. This technique can be considered another approach to decreasing the write amplification factor.

[B-TREE] The migration scheme and logical extent concept provide the way to use b-trees. The inodes tree, dentries trees, and extents trees are implemented as b-trees, and this is an important technique for decreasing the write amplification factor. First of all, a b-tree provides the way to exclude metadata reservation because it is possible to add metadata space on a per-node basis. Additionally, SSDFS uses three types of nodes: (1) leaf node, (2) hybrid node, (3) index node. The hybrid node includes both metadata records and index records, which are metadata about other nodes in the tree. So, the hybrid node is a way to decrease the number of nodes for the case of small b-trees. As a result, it can decrease the write amplification factor and decrease NAND flash wearing, which can prolong the NAND flash device's lifetime.

[PURE LFS] SSDFS is a pure LFS file system without any in-place update areas. It follows the COW policy in all areas of the volume. Even superblocks are stored in a dedicated segment as a sequence. Moreover, every log header contains a copy of the superblock, which can be considered a reliability technique. It is possible to use two different techniques for placing superblock segments on the volume: these segments could live in a designated set of segments or could be distributed through the space of the volume. However, the designated set of segments can guarantee a predictable mount time and decrease read disturbance.

[INLINE TECHNIQUES] SSDFS tries to use inline techniques as much as possible. For example, a small inodes tree can be kept in the superblock at first. Small dentries and extents trees can be kept in the inode as inline metadata. A small file's content can be stored in the inode as inline data. It means that there is no need to allocate a dedicated logical block for small metadata or user data. So, such inline techniques are able to combine several metadata (and user data) pieces into one I/O request and to decrease the write amplification factor.

[MINIMUM RESERVATIONS] There are two metadata structures (the mapping table and the segment bitmap) that require reservation on the volume. These metadata structures' size is defined by the volume size and the erase block and segment sizes; as a result, these metadata structures describe the current state of the volume. But the rest of the metadata (inodes tree, dentries trees, xattr trees, and so on) is represented by b-trees and doesn't need to be reserved beforehand, so it can be allocated node by node when the old nodes are exhausted. Finally, the NAND flash device doesn't need to keep reserved metadata space that currently contains nothing. As a result, the FTL doesn't need to manage these NAND pages, which could decrease NAND flash wearing. So, it can be considered a technique to prolong the NAND flash device's lifetime.

[MULTI-THREADED ARCHITECTURE] SSDFS is based on a multi-threaded approach. It means that there are dedicated threads for certain tasks. For example, there is a special thread that sends TRIM or erase operation requests for invalidated PEBs in the background. Another dedicated thread does extents tree invalidation in the background. Also, there are several GC threads (in the background) that track the need to stimulate migration in segments and destroy the in-core segment objects when these segments have no consumers. But this technique is mostly directed at managing performance.

Thanks,
Slava.

[1] www.ssdfs.org
[2] SSDFS tools: https://github.com/dubeyko/ssdfs-tools.git
[3] SSDFS driver: https://github.com/dubeyko/ssdfs-driver.git
[4] SSDFS Linux kernel: https://github.com/dubeyko/linux.git
[5] https://arxiv.org/abs/1907.11825


