All of lore.kernel.org
 help / color / mirror / Atom feed
* Testing devices for discard support properly
@ 2019-05-06 20:56 Ric Wheeler
  2019-05-07  7:10 ` Lukas Czerner
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Ric Wheeler @ 2019-05-06 20:56 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Linux FS Devel, lczerner


(repost without the html spam, sorry!)

Last week at LSF/MM, I suggested we can provide a tool or test suite to 
test discard performance.

Put in the most positive light, it will be useful for drive vendors to 
use to qualify their offerings before sending them out to the world. For 
customers that care, they can use the same set of tests to help during 
selection to weed out any real issues.

Also, community users can run the same tools of course and share the 
results.

Down to the questions part:

 * Do we just need to figure out a workload to feed our existing tools 
like blkdiscard and fio?

* What workloads are key?

Thoughts about what I would start getting timings for:

* Whole device discard at the block level both for a device that has 
been completely written and for one that had already been trimmed

* Discard performance at the block level for 4k discards for a device 
that has been completely written and again the same test for a device 
that has been completely discarded.

* Same test for large discards - say at a megabyte and/or gigabyte size?

* Same test done at the device optimal discard chunk size and alignment

Should the discard pattern be done with a random pattern? Or just 
sequential?

I think the above would give us a solid base, thoughts or comments?

Thanks!

Ric





^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-06 20:56 Testing devices for discard support properly Ric Wheeler
@ 2019-05-07  7:10 ` Lukas Czerner
  2019-05-07  8:48   ` Jan Tulak
  2019-05-07  8:21 ` Nikolay Borisov
  2019-05-07 22:04 ` Dave Chinner
  2 siblings, 1 reply; 30+ messages in thread
From: Lukas Czerner @ 2019-05-07  7:10 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Jens Axboe, linux-block, Linux FS Devel

On Mon, May 06, 2019 at 04:56:44PM -0400, Ric Wheeler wrote:
> 
> (repost without the html spam, sorry!)
> 
> Last week at LSF/MM, I suggested we can provide a tool or test suite to test
> discard performance.
> 
> Put in the most positive light, it will be useful for drive vendors to use
> to qualify their offerings before sending them out to the world. For
> customers that care, they can use the same set of tests to help during
> selection to weed out any real issues.
> 
> Also, community users can run the same tools of course and share the
> results.
> 
> Down to the questions part:
> 
>  * Do we just need to figure out a workload to feed our existing tools like
> blkdiscard and fio?

Hi Ric,

I think being able to specify workload using fio will be very useful
regardless if we'll end up with with a standalone discard testing tool
or not.

> 
> * What workloads are key?
> 
> Thoughts about what I would start getting timings for:

A long time ago I wrote a tool for testing discaed performance. You can
find it here. Keep in mind that it was really long time ago since I even
looked at it, so not sure if it still even compiles.

https://sourceforge.net/projects/test-discard/

You can go through the README file to see what it does but in summary
you can:

- specify size of discard request
- specify range of discard request size to test
- discard already discarded blocks
- can test with sequential or random pattern
- for every discard request size tested it will give you results like

<record_size> <total_size> <min> <max> <avg> <sum> <throughput in MB/s>

> 
> * Whole device discard at the block level both for a device that has been
> completely written and for one that had already been trimmed

Yes, usefull. Also note that a long time ago when I've done the testing
I noticed that after a discard request, especially after whole device
discard, the read/write IO performance went down significanly for some
drives. I am sure things have changed, but I think it would be
interesting to see how does it behave now.

> 
> * Discard performance at the block level for 4k discards for a device that
> has been completely written and again the same test for a device that has
> been completely discarded.
> 
> * Same test for large discards - say at a megabyte and/or gigabyte size?

From my testing (again it was long time ago and things probably changed
since then) most of the drives I've seen had largely the same or similar
timing for discard request regardless of the size (hence, the conclusion
was the bigger the request the better). A small variation I did see
could have been explained by kernel implementation and discard_max_bytes
limitations as well.

> 
> * Same test done at the device optimal discard chunk size and alignment
> 
> Should the discard pattern be done with a random pattern? Or just
> sequential?

I think that all of the above will be interesting. However there are two
sides of it. One is just pure discard performance to figure out what
could be the expectations and the other will be "real" workload
performance. Since from my experience discard can have an impact on
drive IO performance beyond of what's obvious, testing mixed workload
(IO + discard) is going to be very important as well. And that's where
fio workloads can come in (I actually do not know if fio already
supports this or not).

-Lukas

> 
> I think the above would give us a solid base, thoughts or comments?
> 
> Thanks!
> 
> Ric
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-06 20:56 Testing devices for discard support properly Ric Wheeler
  2019-05-07  7:10 ` Lukas Czerner
@ 2019-05-07  8:21 ` Nikolay Borisov
  2019-05-07 22:04 ` Dave Chinner
  2 siblings, 0 replies; 30+ messages in thread
From: Nikolay Borisov @ 2019-05-07  8:21 UTC (permalink / raw)
  To: Ric Wheeler, Jens Axboe, linux-block, Linux FS Devel, lczerner



On 6.05.19 г. 23:56 ч., Ric Wheeler wrote:
> 
> (repost without the html spam, sorry!)
> 
> Last week at LSF/MM, I suggested we can provide a tool or test suite to
> test discard performance.
> 
> Put in the most positive light, it will be useful for drive vendors to
> use to qualify their offerings before sending them out to the world. For
> customers that care, they can use the same set of tests to help during
> selection to weed out any real issues.
> 
> Also, community users can run the same tools of course and share the
> results.
> 
> Down to the questions part:
> 
>  * Do we just need to figure out a workload to feed our existing tools
> like blkdiscard and fio?
> 
> * What workloads are key?
> 
> Thoughts about what I would start getting timings for:
> 
> * Whole device discard at the block level both for a device that has
> been completely written and for one that had already been trimmed
> 
> * Discard performance at the block level for 4k discards for a device
> that has been completely written and again the same test for a device
> that has been completely discarded.
> 
> * Same test for large discards - say at a megabyte and/or gigabyte size?
> 
> * Same test done at the device optimal discard chunk size and alignment
> 
> Should the discard pattern be done with a random pattern? Or just
> sequential?
> 
> I think the above would give us a solid base, thoughts or comments?

I have some vague recollection this was brought up before but how sure
are we that when a discard request is sent down to disk and a response
is returned the actual data has indeed been discarded. What about NCQ
effects i.e "instant completion" while doing work in the background. Or
ignoring the discard request altogether?

> 
> Thanks!
> 
> Ric
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-07  7:10 ` Lukas Czerner
@ 2019-05-07  8:48   ` Jan Tulak
  2019-05-07  9:40     ` Lukas Czerner
  0 siblings, 1 reply; 30+ messages in thread
From: Jan Tulak @ 2019-05-07  8:48 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: Ric Wheeler, Jens Axboe, linux-block, Linux FS Devel, Nikolay Borisov

On Tue, May 7, 2019 at 9:10 AM Lukas Czerner <lczerner@redhat.com> wrote:
>
> On Mon, May 06, 2019 at 04:56:44PM -0400, Ric Wheeler wrote:
> >
...
> >
> > * Whole device discard at the block level both for a device that has been
> > completely written and for one that had already been trimmed
>
> Yes, usefull. Also note that a long time ago when I've done the testing
> I noticed that after a discard request, especially after whole device
> discard, the read/write IO performance went down significanly for some
> drives. I am sure things have changed, but I think it would be
> interesting to see how does it behave now.
>
> >
> > * Discard performance at the block level for 4k discards for a device that
> > has been completely written and again the same test for a device that has
> > been completely discarded.
> >
> > * Same test for large discards - say at a megabyte and/or gigabyte size?
>
> From my testing (again it was long time ago and things probably changed
> since then) most of the drives I've seen had largely the same or similar
> timing for discard request regardless of the size (hence, the conclusion
> was the bigger the request the better). A small variation I did see
> could have been explained by kernel implementation and discard_max_bytes
> limitations as well.
>
> >
> > * Same test done at the device optimal discard chunk size and alignment
> >
> > Should the discard pattern be done with a random pattern? Or just
> > sequential?
>
> I think that all of the above will be interesting. However there are two
> sides of it. One is just pure discard performance to figure out what
> could be the expectations and the other will be "real" workload
> performance. Since from my experience discard can have an impact on
> drive IO performance beyond of what's obvious, testing mixed workload
> (IO + discard) is going to be very important as well. And that's where
> fio workloads can come in (I actually do not know if fio already
> supports this or not).
>

And:

On Tue, May 7, 2019 at 10:22 AM Nikolay Borisov <nborisov@suse.com> wrote:
> I have some vague recollection this was brought up before but how sure
> are we that when a discard request is sent down to disk and a response
> is returned the actual data has indeed been discarded. What about NCQ
> effects i.e "instant completion" while doing work in the background. Or
> ignoring the discard request altogether?


As Nikolay writes in the other thread, I too have a feeling that there
have been a discard-related discussion at LSF/MM before. And if I
remember, there were hints that the drives (sometimes) do asynchronous
trim after returning a success. Which would explain the similar time
for all sizes and IO drop after trim.

So, I think that the mixed workload (IO + discard) is a pretty
important part of the whole topic and a pure discard test doesn't
really tell us anything, at least for some drives.

Jan



-- 
Jan Tulak

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-07  8:48   ` Jan Tulak
@ 2019-05-07  9:40     ` Lukas Czerner
  2019-05-07 12:57       ` Ric Wheeler
  0 siblings, 1 reply; 30+ messages in thread
From: Lukas Czerner @ 2019-05-07  9:40 UTC (permalink / raw)
  To: Jan Tulak
  Cc: Ric Wheeler, Jens Axboe, linux-block, Linux FS Devel, Nikolay Borisov

On Tue, May 07, 2019 at 10:48:55AM +0200, Jan Tulak wrote:
> On Tue, May 7, 2019 at 9:10 AM Lukas Czerner <lczerner@redhat.com> wrote:
> >
> > On Mon, May 06, 2019 at 04:56:44PM -0400, Ric Wheeler wrote:
> > >
> ...
> > >
> > > * Whole device discard at the block level both for a device that has been
> > > completely written and for one that had already been trimmed
> >
> > Yes, usefull. Also note that a long time ago when I've done the testing
> > I noticed that after a discard request, especially after whole device
> > discard, the read/write IO performance went down significanly for some
> > drives. I am sure things have changed, but I think it would be
> > interesting to see how does it behave now.
> >
> > >
> > > * Discard performance at the block level for 4k discards for a device that
> > > has been completely written and again the same test for a device that has
> > > been completely discarded.
> > >
> > > * Same test for large discards - say at a megabyte and/or gigabyte size?
> >
> > From my testing (again it was long time ago and things probably changed
> > since then) most of the drives I've seen had largely the same or similar
> > timing for discard request regardless of the size (hence, the conclusion
> > was the bigger the request the better). A small variation I did see
> > could have been explained by kernel implementation and discard_max_bytes
> > limitations as well.
> >
> > >
> > > * Same test done at the device optimal discard chunk size and alignment
> > >
> > > Should the discard pattern be done with a random pattern? Or just
> > > sequential?
> >
> > I think that all of the above will be interesting. However there are two
> > sides of it. One is just pure discard performance to figure out what
> > could be the expectations and the other will be "real" workload
> > performance. Since from my experience discard can have an impact on
> > drive IO performance beyond of what's obvious, testing mixed workload
> > (IO + discard) is going to be very important as well. And that's where
> > fio workloads can come in (I actually do not know if fio already
> > supports this or not).
> >
> 
> And:
> 
> On Tue, May 7, 2019 at 10:22 AM Nikolay Borisov <nborisov@suse.com> wrote:
> > I have some vague recollection this was brought up before but how sure
> > are we that when a discard request is sent down to disk and a response
> > is returned the actual data has indeed been discarded. What about NCQ
> > effects i.e "instant completion" while doing work in the background. Or
> > ignoring the discard request altogether?
> 
> 
> As Nikolay writes in the other thread, I too have a feeling that there
> have been a discard-related discussion at LSF/MM before. And if I
> remember, there were hints that the drives (sometimes) do asynchronous
> trim after returning a success. Which would explain the similar time
> for all sizes and IO drop after trim.

Yes, that was definitely the case  in the past. It's also why we've
seen IO performance drop after a big (whole device) discard as the
device was busy in the background.

However Nikolay does have a point. IIRC device is free to ignore discard
requests, I do not think there is any reliable way to actually tell that
the data was really discarded. I can even imagine a situation that the
device is not going to do anything unless it's pass some threshold of
free blocks for wear leveling. If that's the case our tests are not
going to be very useful unless they do stress such corner cases. But
that's just my speculation, so someone with a better knowledge of what
vendors are doing might tell us if it's something to worry about or not.

> 
> So, I think that the mixed workload (IO + discard) is a pretty
> important part of the whole topic and a pure discard test doesn't
> really tell us anything, at least for some drives.

I think both are important especially since mixed IO tests are going to
be highly workload specific.

-Lukas

> 
> Jan
> 
> 
> 
> -- 
> Jan Tulak

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-07  9:40     ` Lukas Czerner
@ 2019-05-07 12:57       ` Ric Wheeler
  2019-05-07 15:35         ` Bryan Gurney
  0 siblings, 1 reply; 30+ messages in thread
From: Ric Wheeler @ 2019-05-07 12:57 UTC (permalink / raw)
  To: Lukas Czerner, Jan Tulak
  Cc: Jens Axboe, linux-block, Linux FS Devel, Nikolay Borisov


On 5/7/19 5:40 AM, Lukas Czerner wrote:
> On Tue, May 07, 2019 at 10:48:55AM +0200, Jan Tulak wrote:
>> On Tue, May 7, 2019 at 9:10 AM Lukas Czerner <lczerner@redhat.com> wrote:
>>> On Mon, May 06, 2019 at 04:56:44PM -0400, Ric Wheeler wrote:
>> ...
>>>> * Whole device discard at the block level both for a device that has been
>>>> completely written and for one that had already been trimmed
>>> Yes, usefull. Also note that a long time ago when I've done the testing
>>> I noticed that after a discard request, especially after whole device
>>> discard, the read/write IO performance went down significanly for some
>>> drives. I am sure things have changed, but I think it would be
>>> interesting to see how does it behave now.


My understanding of how drives (not just SSD's but they are the main 
target here) can handle a discard can vary a lot, including:

* just ignore it for any reason and not return a failure - it is just a 
hint by spec.

* update metadata to mark that region as unused and then defer any real 
work to later (like doing wear level stuff, pre-erase for writes, etc). 
This can have a post-discard impact. I think of this kind of like 
updating page table entries for virtual memory - low cost update now, 
all real work deferred.

* do everything as part of the command - this can be relatively slow, 
most of the cost of a write I would guess (i.e., go in and over-write 
the region with zeros or just do the erase of the flash block under the 
region).

Your earlier work supports the need to test IO performance after doing 
the trims/discards - we might want to test it right away, then see if 
waiting 10 minutes, 30 minutes, etc helps?

>>>
>>>> * Discard performance at the block level for 4k discards for a device that
>>>> has been completely written and again the same test for a device that has
>>>> been completely discarded.
>>>>
>>>> * Same test for large discards - say at a megabyte and/or gigabyte size?
>>>  From my testing (again it was long time ago and things probably changed
>>> since then) most of the drives I've seen had largely the same or similar
>>> timing for discard request regardless of the size (hence, the conclusion
>>> was the bigger the request the better). A small variation I did see
>>> could have been explained by kernel implementation and discard_max_bytes
>>> limitations as well.
>>>
>>>> * Same test done at the device optimal discard chunk size and alignment
>>>>
>>>> Should the discard pattern be done with a random pattern? Or just
>>>> sequential?
>>> I think that all of the above will be interesting. However there are two
>>> sides of it. One is just pure discard performance to figure out what
>>> could be the expectations and the other will be "real" workload
>>> performance. Since from my experience discard can have an impact on
>>> drive IO performance beyond of what's obvious, testing mixed workload
>>> (IO + discard) is going to be very important as well. And that's where
>>> fio workloads can come in (I actually do not know if fio already
>>> supports this or not).
>>>

Really good points. I think it is probably best to test just at the 
block device level to eliminate any possible file system interactions 
here.  The lessons learned though might help file systems handle things 
more effectively?

>> And:
>>
>> On Tue, May 7, 2019 at 10:22 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>> I have some vague recollection this was brought up before but how sure
>>> are we that when a discard request is sent down to disk and a response
>>> is returned the actual data has indeed been discarded. What about NCQ
>>> effects i.e "instant completion" while doing work in the background. Or
>>> ignoring the discard request altogether?
>>
>> As Nikolay writes in the other thread, I too have a feeling that there
>> have been a discard-related discussion at LSF/MM before. And if I
>> remember, there were hints that the drives (sometimes) do asynchronous
>> trim after returning a success. Which would explain the similar time
>> for all sizes and IO drop after trim.
> Yes, that was definitely the case  in the past. It's also why we've
> seen IO performance drop after a big (whole device) discard as the
> device was busy in the background.

For SATA specifically, there was a time when the ATA discard command was 
not queued so we had to drain all other pending requests, do the 
discard, and then resume. This was painfully slow then (not clear that 
this was related to the performance impact you saw - it would be an 
impact I think for the next few dozen commands?).

The T13 people (and most drives I hope) fixed this years back to be a 
queued command so we don't have that same concern now I think.

>
> However Nikolay does have a point. IIRC device is free to ignore discard
> requests, I do not think there is any reliable way to actually tell that
> the data was really discarded. I can even imagine a situation that the
> device is not going to do anything unless it's pass some threshold of
> free blocks for wear leveling. If that's the case our tests are not
> going to be very useful unless they do stress such corner cases. But
> that's just my speculation, so someone with a better knowledge of what
> vendors are doing might tell us if it's something to worry about or not.


The way I think of it is our "nirvana" state for discard would be:

* all drives have very low cost discard commands with minimal 
post-discard performance impact on the normal workload which would let 
us issue the in-band discards (-o discard mount option)

* drives might still (and should be expected to) ignore some of these 
commands so freed and "discarded" space might still not be really discarded.

* we will still need to run a periodic (once a day? a week?) fstrim to 
give the drive a chance to clean up anything even when using "mount -o 
discard". Of course, the fstrim size is bigger I expect than the size 
from inband discard so testing larger sizes will be important.

Does this make sense?

Ric


>
>> So, I think that the mixed workload (IO + discard) is a pretty
>> important part of the whole topic and a pure discard test doesn't
>> really tell us anything, at least for some drives.
> I think both are important especially since mixed IO tests are going to
> be highly workload specific.
>
> -Lukas
>
>> Jan
>>
>>
>>
>> -- 
>> Jan Tulak

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-07 12:57       ` Ric Wheeler
@ 2019-05-07 15:35         ` Bryan Gurney
  2019-05-07 15:44           ` Ric Wheeler
  0 siblings, 1 reply; 30+ messages in thread
From: Bryan Gurney @ 2019-05-07 15:35 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Lukas Czerner, Jan Tulak, Jens Axboe, linux-block,
	Linux FS Devel, Nikolay Borisov

On Tue, May 7, 2019 at 8:57 AM Ric Wheeler <ricwheeler@gmail.com> wrote:
>
>
> On 5/7/19 5:40 AM, Lukas Czerner wrote:
> > On Tue, May 07, 2019 at 10:48:55AM +0200, Jan Tulak wrote:
> >> On Tue, May 7, 2019 at 9:10 AM Lukas Czerner <lczerner@redhat.com> wrote:
> >>> On Mon, May 06, 2019 at 04:56:44PM -0400, Ric Wheeler wrote:
> >> ...
> >>>> * Whole device discard at the block level both for a device that has been
> >>>> completely written and for one that had already been trimmed
> >>> Yes, usefull. Also note that a long time ago when I've done the testing
> >>> I noticed that after a discard request, especially after whole device
> >>> discard, the read/write IO performance went down significanly for some
> >>> drives. I am sure things have changed, but I think it would be
> >>> interesting to see how does it behave now.
>
>
> My understanding of how drives (not just SSD's but they are the main
> target here) can handle a discard can vary a lot, including:
>
> * just ignore it for any reason and not return a failure - it is just a
> hint by spec.
>
> * update metadata to mark that region as unused and then defer any real
> work to later (like doing wear level stuff, pre-erase for writes, etc).
> This can have a post-discard impact. I think of this kind of like
> updating page table entries for virtual memory - low cost update now,
> all real work deferred.
>
> * do everything as part of the command - this can be relatively slow,
> most of the cost of a write I would guess (i.e., go in and over-write
> the region with zeros or just do the erase of the flash block under the
> region).
>
> Your earlier work supports the need to test IO performance after doing
> the trims/discards - we might want to test it right away, then see if
> waiting 10 minutes, 30 minutes, etc helps?

Using blktrace / blkparse may be a good way to visualize certain
latency differences of a drive, depending on the scenario.

I tried these quick fio tests in succession while tracing an NVMe device:

- fio --name=writetest --filename=/dev/nvme0n1p1 --rw=write
--bs=1048576 --size=2G --iodepth=32 --ioengine=libaio --direct=1
- fio --name=writetest --filename=/dev/nvme0n1p1 --rw=trim
--bs=1048576 --size=128M --iodepth=32 --ioengine=libaio --direct=1
- fio --name=writetest --filename=/dev/nvme0n1p1 --rw=write
--bs=1048576 --size=2G --iodepth=32 --ioengine=libaio --direct=1

...and if I run "blkparse -t -i nvme0n1p1.blktrace.0", I see output
that looks like this:

(The number after the "sector + size" in parentheses is the "time
delta per IO", which I believe is effectively the "completion latency"
for the IO.)

259,1   23       42     0.130790560 13843  C  WS 2048 + 256 (   69234) [0]
259,1   23       84     0.130832015 13843  C  WS 2304 + 256 (  106529) [0]
259,1   23      110     0.130879691 13843  C  WS 2560 + 256 (  151234) [0]
259,1   23      127     0.130932938 13843  C  WS 2816 + 256 (  201708) [0]
259,1   23      169     0.130985313 13843  C  WS 3072 + 256 (  251695) [0]
259,1   23      244     0.131068599 13843  C  WS 3328 + 256 (  332505) [0]
259,1   23      255     0.131120364 13843  C  WS 3584 + 256 (  382228) [0]
259,1   23      295     0.131169431 13843  C  WS 3840 + 256 (  429079) [0]
259,1   23      337     0.131254437 13843  C  WS 4096 + 256 (  452715) [0]
259,1   23      379     0.131303693 13843  C  WS 4352 + 256 (  498415) [0]
...

259,1   23     2886     0.145571119     0  C  WS 68864 + 256 (12172318) [0]
259,1   23     2887     0.145621801     0  C  WS 69120 + 256 (12220934) [0]
259,1   23     2888     0.145707376     0  C  WS 69376 + 256 (12304282) [0]
259,1   23     2889     0.145758056 13843  C  WS 69632 + 256 (12305257) [0]
259,1   23     2897     0.145806491 13843  C  WS 69888 + 256 (12351416) [0]
259,1   23     2932     0.145855909     0  C  WS 70144 + 256 (12398688) [0]
259,1   23     2933     0.145906931     0  C  WS 70400 + 256 (12447322) [0]
259,1   23     2934     0.145955324     0  C  WS 70656 + 256 (12493640) [0]
259,1   23     2935     0.146047271     0  C  WS 70912 + 256 (12583404) [0]
259,1   23     2936     0.146098918     0  C  WS 71168 + 256 (12633107) [0]
259,1   23     2937     0.146147758     0  C  WS 71424 + 256 (12680779) [0]
259,1   23     2938     0.146199611 13843  C  WS 71680 + 256 (12673451) [0]
259,1   23     2947     0.146248198 13843  C  WS 71936 + 256 (12717754) [0]
...

259,1   19        8     1.654335893     0  C  DS 2048 + 2048 (  703367) [0]
259,1   19       16     1.654407034     0  C  DS 4096 + 2048 (   16801) [0]
259,1   19       24     1.654441037     0  C  DS 6144 + 2048 (   14973) [0]
259,1   19       32     1.654473187     0  C  DS 8192 + 2048 (   18403) [0]
259,1   19       40     1.654508066     0  C  DS 10240 + 2048 (   15949) [0]
259,1   19       48     1.654546974     0  C  DS 12288 + 2048 (   25803) [0]
259,1   19       56     1.654575186     0  C  DS 14336 + 2048 (   15839) [0]
259,1   19       64     1.654602836     0  C  DS 16384 + 2048 (   15449) [0]
259,1   19       72     1.654629376     0  C  DS 18432 + 2048 (   14659) [0]
259,1   19       80     1.654655744     0  C  DS 20480 + 2048 (   14653) [0]
259,1   19       88     1.654682306     0  C  DS 22528 + 2048 (   14769) [0]
259,1   19       96     1.654710616     0  C  DS 24576 + 2048 (   16660) [0]
259,1   19      104     1.654737113     0  C  DS 26624 + 2048 (   14876) [0]
259,1   19      112     1.654763661     0  C  DS 28672 + 2048 (   14707) [0]
259,1   19      120     1.654790141     0  C  DS 30720 + 2048 (   14809) [0]

I can see two things:

1. The writes appear to be limited to 128 kilobytes, which agrees with
the device's "queue/max_hw_sectors_kb" value.
2. The discard completion latency ("C DS" actions) is very low, at
about 16 microseonds.

It's possible to filter this output further:

blkparse -t -i nvme0n1p1.blktrace.0 | grep C\ *[RWD] | tr -d \(\) |
awk '{print $4, $6, $7, $8, $9, $10, $11}'

...to yield output that's more digestible to a graphing program like gnuplot:

0.130790560 C WS 2048 + 256 69234
0.130832015 C WS 2304 + 256 106529
0.130879691 C WS 2560 + 256 151234
0.130932938 C WS 2816 + 256 201708
0.130985313 C WS 3072 + 256 251695
0.131068599 C WS 3328 + 256 332505
0.131120364 C WS 3584 + 256 382228

...at which point you can look at the graph, and see patterns, like
the peak latency during a sustained write, or "two bands" of latency,
as though there are "two queues" on the device, for some reason, and
so on.

I usually create a graph with the timestamp as the X axis, and the
"track-ios" output as the Y axis.

>
> >>>
> >>>> * Discard performance at the block level for 4k discards for a device that
> >>>> has been completely written and again the same test for a device that has
> >>>> been completely discarded.
> >>>>
> >>>> * Same test for large discards - say at a megabyte and/or gigabyte size?
> >>>  From my testing (again it was long time ago and things probably changed
> >>> since then) most of the drives I've seen had largely the same or similar
> >>> timing for discard request regardless of the size (hence, the conclusion
> >>> was the bigger the request the better). A small variation I did see
> >>> could have been explained by kernel implementation and discard_max_bytes
> >>> limitations as well.
> >>>
> >>>> * Same test done at the device optimal discard chunk size and alignment
> >>>>
> >>>> Should the discard pattern be done with a random pattern? Or just
> >>>> sequential?
> >>> I think that all of the above will be interesting. However there are two
> >>> sides of it. One is just pure discard performance to figure out what
> >>> could be the expectations and the other will be "real" workload
> >>> performance. Since from my experience discard can have an impact on
> >>> drive IO performance beyond of what's obvious, testing mixed workload
> >>> (IO + discard) is going to be very important as well. And that's where
> >>> fio workloads can come in (I actually do not know if fio already
> >>> supports this or not).
> >>>
>
> Really good points. I think it is probably best to test just at the
> block device level to eliminate any possible file system interactions
> here.  The lessons learned though might help file systems handle things
> more effectively?
>
> >> And:
> >>
> >> On Tue, May 7, 2019 at 10:22 AM Nikolay Borisov <nborisov@suse.com> wrote:
> >>> I have some vague recollection this was brought up before but how sure
> >>> are we that when a discard request is sent down to disk and a response
> >>> is returned the actual data has indeed been discarded. What about NCQ
> >>> effects i.e "instant completion" while doing work in the background. Or
> >>> ignoring the discard request altogether?
> >>
> >> As Nikolay writes in the other thread, I too have a feeling that there
> >> have been a discard-related discussion at LSF/MM before. And if I
> >> remember, there were hints that the drives (sometimes) do asynchronous
> >> trim after returning a success. Which would explain the similar time
> >> for all sizes and IO drop after trim.
> > Yes, that was definitely the case  in the past. It's also why we've
> > seen IO performance drop after a big (whole device) discard as the
> > device was busy in the background.
>
> For SATA specifically, there was a time when the ATA discard command was
> not queued so we had to drain all other pending requests, do the
> discard, and then resume. This was painfully slow then (not clear that
> this was related to the performance impact you saw - it would be an
> impact I think for the next few dozen commands?).
>
> The T13 people (and most drives I hope) fixed this years back to be a
> queued command so we don't have that same concern now I think.

There are still some ATA devices that are blacklisted due to problems
handling queued trim (ATA_HORKAGE_NO_NCQ_TRIM), as well as problems
handing zero-after-trim (ATA_HORKAGE_ZERO_AFTER_TRIM).  Most newer
drives fixed those problems, but the older drives will still be out in
the field until they get replaced with newer drives.

The "zero after trim" issue might be important to applications that
expect a discard to zero the blocks that were specified in the discard
command.  For drives that "post-process" discards, is there a time
threshold of when those blocks are expected to return zeroes?


Thanks,

Bryan

>
> >
> > However Nikolay does have a point. IIRC device is free to ignore discard
> > requests, I do not think there is any reliable way to actually tell that
> > the data was really discarded. I can even imagine a situation that the
> > device is not going to do anything unless it's pass some threshold of
> > free blocks for wear leveling. If that's the case our tests are not
> > going to be very useful unless they do stress such corner cases. But
> > that's just my speculation, so someone with a better knowledge of what
> > vendors are doing might tell us if it's something to worry about or not.
>
>
> The way I think of it is our "nirvana" state for discard would be:
>
> * all drives have very low cost discard commands with minimal
> post-discard performance impact on the normal workload which would let
> us issue the in-band discards (-o discard mount option)
>
> * drives might still (and should be expected to) ignore some of these
> commands so freed and "discarded" space might still not be really discarded.
>
> * we will still need to run a periodic (once a day? a week?) fstrim to
> give the drive a chance to clean up anything even when using "mount -o
> discard". Of course, the fstrim size is bigger I expect than the size
> from inband discard so testing larger sizes will be important.
>
> Does this make sense?
>
> Ric
>
>
> >
> >> So, I think that the mixed workload (IO + discard) is a pretty
> >> important part of the whole topic and a pure discard test doesn't
> >> really tell us anything, at least for some drives.
> > I think both are important especially since mixed IO tests are going to
> > be highly workload specific.
> >
> > -Lukas
> >
> >> Jan
> >>
> >>
> >>
> >> --
> >> Jan Tulak

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-07 15:35         ` Bryan Gurney
@ 2019-05-07 15:44           ` Ric Wheeler
  2019-05-07 20:09             ` Bryan Gurney
  0 siblings, 1 reply; 30+ messages in thread
From: Ric Wheeler @ 2019-05-07 15:44 UTC (permalink / raw)
  To: Bryan Gurney
  Cc: Lukas Czerner, Jan Tulak, Jens Axboe, linux-block,
	Linux FS Devel, Nikolay Borisov


On 5/7/19 11:35 AM, Bryan Gurney wrote:
> On Tue, May 7, 2019 at 8:57 AM Ric Wheeler <ricwheeler@gmail.com> wrote:
>>
>> On 5/7/19 5:40 AM, Lukas Czerner wrote:
>>> On Tue, May 07, 2019 at 10:48:55AM +0200, Jan Tulak wrote:
>>>> On Tue, May 7, 2019 at 9:10 AM Lukas Czerner <lczerner@redhat.com> wrote:
>>>>> On Mon, May 06, 2019 at 04:56:44PM -0400, Ric Wheeler wrote:
>>>> ...
>>>>>> * Whole device discard at the block level both for a device that has been
>>>>>> completely written and for one that had already been trimmed
>>>>> Yes, usefull. Also note that a long time ago when I've done the testing
>>>>> I noticed that after a discard request, especially after whole device
>>>>> discard, the read/write IO performance went down significanly for some
>>>>> drives. I am sure things have changed, but I think it would be
>>>>> interesting to see how does it behave now.
>>
>> My understanding of how drives (not just SSD's but they are the main
>> target here) can handle a discard can vary a lot, including:
>>
>> * just ignore it for any reason and not return a failure - it is just a
>> hint by spec.
>>
>> * update metadata to mark that region as unused and then defer any real
>> work to later (like doing wear level stuff, pre-erase for writes, etc).
>> This can have a post-discard impact. I think of this kind of like
>> updating page table entries for virtual memory - low cost update now,
>> all real work deferred.
>>
>> * do everything as part of the command - this can be relatively slow,
>> most of the cost of a write I would guess (i.e., go in and over-write
>> the region with zeros or just do the erase of the flash block under the
>> region).
>>
>> Your earlier work supports the need to test IO performance after doing
>> the trims/discards - we might want to test it right away, then see if
>> waiting 10 minutes, 30 minutes, etc helps?
> Using blktrace / blkparse may be a good way to visualize certain
> latency differences of a drive, depending on the scenario.
>
> I tried these quick fio tests in succession while tracing an NVMe device:
>
> - fio --name=writetest --filename=/dev/nvme0n1p1 --rw=write
> --bs=1048576 --size=2G --iodepth=32 --ioengine=libaio --direct=1
> - fio --name=writetest --filename=/dev/nvme0n1p1 --rw=trim
> --bs=1048576 --size=128M --iodepth=32 --ioengine=libaio --direct=1
> - fio --name=writetest --filename=/dev/nvme0n1p1 --rw=write
> --bs=1048576 --size=2G --iodepth=32 --ioengine=libaio --direct=1
>
> ...and if I run "blkparse -t -i nvme0n1p1.blktrace.0", I see output
> that looks like this:
>
> (The number after the "sector + size" in parentheses is the "time
> delta per IO", which I believe is effectively the "completion latency"
> for the IO.)
>
> 259,1   23       42     0.130790560 13843  C  WS 2048 + 256 (   69234) [0]
> 259,1   23       84     0.130832015 13843  C  WS 2304 + 256 (  106529) [0]
> 259,1   23      110     0.130879691 13843  C  WS 2560 + 256 (  151234) [0]
> 259,1   23      127     0.130932938 13843  C  WS 2816 + 256 (  201708) [0]
> 259,1   23      169     0.130985313 13843  C  WS 3072 + 256 (  251695) [0]
> 259,1   23      244     0.131068599 13843  C  WS 3328 + 256 (  332505) [0]
> 259,1   23      255     0.131120364 13843  C  WS 3584 + 256 (  382228) [0]
> 259,1   23      295     0.131169431 13843  C  WS 3840 + 256 (  429079) [0]
> 259,1   23      337     0.131254437 13843  C  WS 4096 + 256 (  452715) [0]
> 259,1   23      379     0.131303693 13843  C  WS 4352 + 256 (  498415) [0]
> ...
>
> 259,1   23     2886     0.145571119     0  C  WS 68864 + 256 (12172318) [0]
> 259,1   23     2887     0.145621801     0  C  WS 69120 + 256 (12220934) [0]
> 259,1   23     2888     0.145707376     0  C  WS 69376 + 256 (12304282) [0]
> 259,1   23     2889     0.145758056 13843  C  WS 69632 + 256 (12305257) [0]
> 259,1   23     2897     0.145806491 13843  C  WS 69888 + 256 (12351416) [0]
> 259,1   23     2932     0.145855909     0  C  WS 70144 + 256 (12398688) [0]
> 259,1   23     2933     0.145906931     0  C  WS 70400 + 256 (12447322) [0]
> 259,1   23     2934     0.145955324     0  C  WS 70656 + 256 (12493640) [0]
> 259,1   23     2935     0.146047271     0  C  WS 70912 + 256 (12583404) [0]
> 259,1   23     2936     0.146098918     0  C  WS 71168 + 256 (12633107) [0]
> 259,1   23     2937     0.146147758     0  C  WS 71424 + 256 (12680779) [0]
> 259,1   23     2938     0.146199611 13843  C  WS 71680 + 256 (12673451) [0]
> 259,1   23     2947     0.146248198 13843  C  WS 71936 + 256 (12717754) [0]
> ...
>
> 259,1   19        8     1.654335893     0  C  DS 2048 + 2048 (  703367) [0]
> 259,1   19       16     1.654407034     0  C  DS 4096 + 2048 (   16801) [0]
> 259,1   19       24     1.654441037     0  C  DS 6144 + 2048 (   14973) [0]
> 259,1   19       32     1.654473187     0  C  DS 8192 + 2048 (   18403) [0]
> 259,1   19       40     1.654508066     0  C  DS 10240 + 2048 (   15949) [0]
> 259,1   19       48     1.654546974     0  C  DS 12288 + 2048 (   25803) [0]
> 259,1   19       56     1.654575186     0  C  DS 14336 + 2048 (   15839) [0]
> 259,1   19       64     1.654602836     0  C  DS 16384 + 2048 (   15449) [0]
> 259,1   19       72     1.654629376     0  C  DS 18432 + 2048 (   14659) [0]
> 259,1   19       80     1.654655744     0  C  DS 20480 + 2048 (   14653) [0]
> 259,1   19       88     1.654682306     0  C  DS 22528 + 2048 (   14769) [0]
> 259,1   19       96     1.654710616     0  C  DS 24576 + 2048 (   16660) [0]
> 259,1   19      104     1.654737113     0  C  DS 26624 + 2048 (   14876) [0]
> 259,1   19      112     1.654763661     0  C  DS 28672 + 2048 (   14707) [0]
> 259,1   19      120     1.654790141     0  C  DS 30720 + 2048 (   14809) [0]
>
> I can see two things:
>
> 1. The writes appear to be limited to 128 kilobytes, which agrees with
> the device's "queue/max_hw_sectors_kb" value.
> 2. The discard completion latency ("C DS" actions) is very low, at
> about 16 microseonds.
>
> It's possible to filter this output further:
>
> blkparse -t -i nvme0n1p1.blktrace.0 | grep C\ *[RWD] | tr -d \(\) |
> awk '{print $4, $6, $7, $8, $9, $10, $11}'
>
> ...to yield output that's more digestible to a graphing program like gnuplot:
>
> 0.130790560 C WS 2048 + 256 69234
> 0.130832015 C WS 2304 + 256 106529
> 0.130879691 C WS 2560 + 256 151234
> 0.130932938 C WS 2816 + 256 201708
> 0.130985313 C WS 3072 + 256 251695
> 0.131068599 C WS 3328 + 256 332505
> 0.131120364 C WS 3584 + 256 382228
>
> ...at which point you can look at the graph, and see patterns, like
> the peak latency during a sustained write, or "two bands" of latency,
> as though there are "two queues" on the device, for some reason, and
> so on.
>
> I usually create a graph with the timestamp as the X axis, and the
> "track-ios" output as the Y axis.


Thanks Bryan - this is very much in line with what I think we need to 
do. If we can get the right mix of jobs for fio to run to verify this, 
it will make it easy for everyone to contribute and for the vendors to 
use internally.

I don't have a lot of time soon, but plan to play with this over the 
next few weeks.

Regards,

Ric


>>>>>> * Discard performance at the block level for 4k discards for a device that
>>>>>> has been completely written and again the same test for a device that has
>>>>>> been completely discarded.
>>>>>>
>>>>>> * Same test for large discards - say at a megabyte and/or gigabyte size?
>>>>>   From my testing (again it was long time ago and things probably changed
>>>>> since then) most of the drives I've seen had largely the same or similar
>>>>> timing for discard request regardless of the size (hence, the conclusion
>>>>> was the bigger the request the better). A small variation I did see
>>>>> could have been explained by kernel implementation and discard_max_bytes
>>>>> limitations as well.
>>>>>
>>>>>> * Same test done at the device optimal discard chunk size and alignment
>>>>>>
>>>>>> Should the discard pattern be done with a random pattern? Or just
>>>>>> sequential?
>>>>> I think that all of the above will be interesting. However there are two
>>>>> sides of it. One is just pure discard performance to figure out what
>>>>> could be the expectations and the other will be "real" workload
>>>>> performance. Since from my experience discard can have an impact on
>>>>> drive IO performance beyond of what's obvious, testing mixed workload
>>>>> (IO + discard) is going to be very important as well. And that's where
>>>>> fio workloads can come in (I actually do not know if fio already
>>>>> supports this or not).
>>>>>
>> Really good points. I think it is probably best to test just at the
>> block device level to eliminate any possible file system interactions
>> here.  The lessons learned though might help file systems handle things
>> more effectively?
>>
>>>> And:
>>>>
>>>> On Tue, May 7, 2019 at 10:22 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>>>> I have some vague recollection this was brought up before but how sure
>>>>> are we that when a discard request is sent down to disk and a response
>>>>> is returned the actual data has indeed been discarded. What about NCQ
>>>>> effects i.e "instant completion" while doing work in the background. Or
>>>>> ignoring the discard request altogether?
>>>> As Nikolay writes in the other thread, I too have a feeling that there
>>>> have been a discard-related discussion at LSF/MM before. And if I
>>>> remember, there were hints that the drives (sometimes) do asynchronous
>>>> trim after returning a success. Which would explain the similar time
>>>> for all sizes and IO drop after trim.
>>> Yes, that was definitely the case  in the past. It's also why we've
>>> seen IO performance drop after a big (whole device) discard as the
>>> device was busy in the background.
>> For SATA specifically, there was a time when the ATA discard command was
>> not queued so we had to drain all other pending requests, do the
>> discard, and then resume. This was painfully slow then (not clear that
>> this was related to the performance impact you saw - it would be an
>> impact I think for the next few dozen commands?).
>>
>> The T13 people (and most drives I hope) fixed this years back to be a
>> queued command so we don't have that same concern now I think.
> There are still some ATA devices that are blacklisted due to problems
> handling queued trim (ATA_HORKAGE_NO_NCQ_TRIM), as well as problems
> handing zero-after-trim (ATA_HORKAGE_ZERO_AFTER_TRIM).  Most newer
> drives fixed those problems, but the older drives will still be out in
> the field until they get replaced with newer drives.
>
> The "zero after trim" issue might be important to applications that
> expect a discard to zero the blocks that were specified in the discard
> command.  For drives that "post-process" discards, is there a time
> threshold of when those blocks are expected to return zeroes?
>
>
> Thanks,
>
> Bryan
>
>>> However Nikolay does have a point. IIRC device is free to ignore discard
>>> requests, I do not think there is any reliable way to actually tell that
>>> the data was really discarded. I can even imagine a situation that the
>>> device is not going to do anything unless it's pass some threshold of
>>> free blocks for wear leveling. If that's the case our tests are not
>>> going to be very useful unless they do stress such corner cases. But
>>> that's just my speculation, so someone with a better knowledge of what
>>> vendors are doing might tell us if it's something to worry about or not.
>>
>> The way I think of it is our "nirvana" state for discard would be:
>>
>> * all drives have very low cost discard commands with minimal
>> post-discard performance impact on the normal workload which would let
>> us issue the in-band discards (-o discard mount option)
>>
>> * drives might still (and should be expected to) ignore some of these
>> commands so freed and "discarded" space might still not be really discarded.
>>
>> * we will still need to run a periodic (once a day? a week?) fstrim to
>> give the drive a chance to clean up anything even when using "mount -o
>> discard". Of course, the fstrim size is bigger I expect than the size
>> from inband discard so testing larger sizes will be important.
>>
>> Does this make sense?
>>
>> Ric
>>
>>
>>>> So, I think that the mixed workload (IO + discard) is a pretty
>>>> important part of the whole topic and a pure discard test doesn't
>>>> really tell us anything, at least for some drives.
>>> I think both are important especially since mixed IO tests are going to
>>> be highly workload specific.
>>>
>>> -Lukas
>>>
>>>> Jan
>>>>
>>>>
>>>>
>>>> --
>>>> Jan Tulak

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-07 15:44           ` Ric Wheeler
@ 2019-05-07 20:09             ` Bryan Gurney
  2019-05-07 21:24               ` Chris Mason
  0 siblings, 1 reply; 30+ messages in thread
From: Bryan Gurney @ 2019-05-07 20:09 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Lukas Czerner, Jan Tulak, Jens Axboe, linux-block,
	Linux FS Devel, Nikolay Borisov

On Tue, May 7, 2019 at 11:44 AM Ric Wheeler <ricwheeler@gmail.com> wrote:
>
>
> On 5/7/19 11:35 AM, Bryan Gurney wrote:
> > On Tue, May 7, 2019 at 8:57 AM Ric Wheeler <ricwheeler@gmail.com> wrote:
> >>
> >> On 5/7/19 5:40 AM, Lukas Czerner wrote:
> >>> On Tue, May 07, 2019 at 10:48:55AM +0200, Jan Tulak wrote:
> >>>> On Tue, May 7, 2019 at 9:10 AM Lukas Czerner <lczerner@redhat.com> wrote:
> >>>>> On Mon, May 06, 2019 at 04:56:44PM -0400, Ric Wheeler wrote:
> >>>> ...
> >>>>>> * Whole device discard at the block level both for a device that has been
> >>>>>> completely written and for one that had already been trimmed
> >>>>> Yes, usefull. Also note that a long time ago when I've done the testing
> >>>>> I noticed that after a discard request, especially after whole device
> >>>>> discard, the read/write IO performance went down significanly for some
> >>>>> drives. I am sure things have changed, but I think it would be
> >>>>> interesting to see how does it behave now.
> >>
> >> My understanding of how drives (not just SSD's but they are the main
> >> target here) can handle a discard can vary a lot, including:
> >>
> >> * just ignore it for any reason and not return a failure - it is just a
> >> hint by spec.
> >>
> >> * update metadata to mark that region as unused and then defer any real
> >> work to later (like doing wear level stuff, pre-erase for writes, etc).
> >> This can have a post-discard impact. I think of this kind of like
> >> updating page table entries for virtual memory - low cost update now,
> >> all real work deferred.
> >>
> >> * do everything as part of the command - this can be relatively slow,
> >> most of the cost of a write I would guess (i.e., go in and over-write
> >> the region with zeros or just do the erase of the flash block under the
> >> region).
> >>
> >> Your earlier work supports the need to test IO performance after doing
> >> the trims/discards - we might want to test it right away, then see if
> >> waiting 10 minutes, 30 minutes, etc helps?
> > Using blktrace / blkparse may be a good way to visualize certain
> > latency differences of a drive, depending on the scenario.
> >
> > I tried these quick fio tests in succession while tracing an NVMe device:
> >
> > - fio --name=writetest --filename=/dev/nvme0n1p1 --rw=write
> > --bs=1048576 --size=2G --iodepth=32 --ioengine=libaio --direct=1
> > - fio --name=writetest --filename=/dev/nvme0n1p1 --rw=trim
> > --bs=1048576 --size=128M --iodepth=32 --ioengine=libaio --direct=1
> > - fio --name=writetest --filename=/dev/nvme0n1p1 --rw=write
> > --bs=1048576 --size=2G --iodepth=32 --ioengine=libaio --direct=1
> >
> > ...and if I run "blkparse -t -i nvme0n1p1.blktrace.0", I see output
> > that looks like this:
> >
> > (The number after the "sector + size" in parentheses is the "time
> > delta per IO", which I believe is effectively the "completion latency"
> > for the IO.)
> >
> > 259,1   23       42     0.130790560 13843  C  WS 2048 + 256 (   69234) [0]
> > 259,1   23       84     0.130832015 13843  C  WS 2304 + 256 (  106529) [0]
> > 259,1   23      110     0.130879691 13843  C  WS 2560 + 256 (  151234) [0]
> > 259,1   23      127     0.130932938 13843  C  WS 2816 + 256 (  201708) [0]
> > 259,1   23      169     0.130985313 13843  C  WS 3072 + 256 (  251695) [0]
> > 259,1   23      244     0.131068599 13843  C  WS 3328 + 256 (  332505) [0]
> > 259,1   23      255     0.131120364 13843  C  WS 3584 + 256 (  382228) [0]
> > 259,1   23      295     0.131169431 13843  C  WS 3840 + 256 (  429079) [0]
> > 259,1   23      337     0.131254437 13843  C  WS 4096 + 256 (  452715) [0]
> > 259,1   23      379     0.131303693 13843  C  WS 4352 + 256 (  498415) [0]
> > ...
> >
> > 259,1   23     2886     0.145571119     0  C  WS 68864 + 256 (12172318) [0]
> > 259,1   23     2887     0.145621801     0  C  WS 69120 + 256 (12220934) [0]
> > 259,1   23     2888     0.145707376     0  C  WS 69376 + 256 (12304282) [0]
> > 259,1   23     2889     0.145758056 13843  C  WS 69632 + 256 (12305257) [0]
> > 259,1   23     2897     0.145806491 13843  C  WS 69888 + 256 (12351416) [0]
> > 259,1   23     2932     0.145855909     0  C  WS 70144 + 256 (12398688) [0]
> > 259,1   23     2933     0.145906931     0  C  WS 70400 + 256 (12447322) [0]
> > 259,1   23     2934     0.145955324     0  C  WS 70656 + 256 (12493640) [0]
> > 259,1   23     2935     0.146047271     0  C  WS 70912 + 256 (12583404) [0]
> > 259,1   23     2936     0.146098918     0  C  WS 71168 + 256 (12633107) [0]
> > 259,1   23     2937     0.146147758     0  C  WS 71424 + 256 (12680779) [0]
> > 259,1   23     2938     0.146199611 13843  C  WS 71680 + 256 (12673451) [0]
> > 259,1   23     2947     0.146248198 13843  C  WS 71936 + 256 (12717754) [0]
> > ...
> >
> > 259,1   19        8     1.654335893     0  C  DS 2048 + 2048 (  703367) [0]
> > 259,1   19       16     1.654407034     0  C  DS 4096 + 2048 (   16801) [0]
> > 259,1   19       24     1.654441037     0  C  DS 6144 + 2048 (   14973) [0]
> > 259,1   19       32     1.654473187     0  C  DS 8192 + 2048 (   18403) [0]
> > 259,1   19       40     1.654508066     0  C  DS 10240 + 2048 (   15949) [0]
> > 259,1   19       48     1.654546974     0  C  DS 12288 + 2048 (   25803) [0]
> > 259,1   19       56     1.654575186     0  C  DS 14336 + 2048 (   15839) [0]
> > 259,1   19       64     1.654602836     0  C  DS 16384 + 2048 (   15449) [0]
> > 259,1   19       72     1.654629376     0  C  DS 18432 + 2048 (   14659) [0]
> > 259,1   19       80     1.654655744     0  C  DS 20480 + 2048 (   14653) [0]
> > 259,1   19       88     1.654682306     0  C  DS 22528 + 2048 (   14769) [0]
> > 259,1   19       96     1.654710616     0  C  DS 24576 + 2048 (   16660) [0]
> > 259,1   19      104     1.654737113     0  C  DS 26624 + 2048 (   14876) [0]
> > 259,1   19      112     1.654763661     0  C  DS 28672 + 2048 (   14707) [0]
> > 259,1   19      120     1.654790141     0  C  DS 30720 + 2048 (   14809) [0]
> >
> > I can see two things:
> >
> > 1. The writes appear to be limited to 128 kilobytes, which agrees with
> > the device's "queue/max_hw_sectors_kb" value.
> > 2. The discard completion latency ("C DS" actions) is very low, at
> > about 16 microseonds.
> >
> > It's possible to filter this output further:
> >
> > blkparse -t -i nvme0n1p1.blktrace.0 | grep C\ *[RWD] | tr -d \(\) |
> > awk '{print $4, $6, $7, $8, $9, $10, $11}'
> >
> > ...to yield output that's more digestible to a graphing program like gnuplot:
> >
> > 0.130790560 C WS 2048 + 256 69234
> > 0.130832015 C WS 2304 + 256 106529
> > 0.130879691 C WS 2560 + 256 151234
> > 0.130932938 C WS 2816 + 256 201708
> > 0.130985313 C WS 3072 + 256 251695
> > 0.131068599 C WS 3328 + 256 332505
> > 0.131120364 C WS 3584 + 256 382228
> >
> > ...at which point you can look at the graph, and see patterns, like
> > the peak latency during a sustained write, or "two bands" of latency,
> > as though there are "two queues" on the device, for some reason, and
> > so on.
> >
> > I usually create a graph with the timestamp as the X axis, and the
> > "track-ios" output as the Y axis.
>
>
> Thanks Bryan - this is very much in line with what I think we need to
> do. If we can get the right mix of jobs for fio to run to verify this,
> it will make it easy for everyone to contribute and for the vendors to
> use internally.
>
> I don't have a lot of time soon, but plan to play with this over the
> next few weeks.
>

I found an example in my trace of the "two bands of latency" behavior.
Consider these three segments of trace data during the writes:

0.218134715 C WS 391168 + 256 14000316
0.218182491 C WS 391424 + 256 14039672
0.218232217 C WS 391680 + 256 14084768
0.218288794 C WS 443392 + 256 1701878
0.218331325 C WS 391936 + 256 14179055
0.218421251 C WS 443648 + 256 1828056
0.218474885 C WS 392192 + 256 14317823
0.218521307 C WS 443904 + 256 1921095
0.218608971 C WS 392448 + 256 14446938
0.218658065 C WS 444160 + 256 2051540
0.218746678 C WS 392704 + 256 14580439
0.218799075 C WS 444416 + 256 2034244
0.218848891 C WS 392960 + 256 14680181
0.218901083 C WS 444672 + 256 2130105

...

0.240683255 C WS 493568 + 256 13397251
0.240732723 C WS 442112 + 256 25521466
0.240823043 C WS 493824 + 256 13531316
0.240916353 C WS 442368 + 256 25349041
0.240965929 C WS 494080 + 256 13668856
0.241013172 C WS 442624 + 256 25437627
0.241099636 C WS 494336 + 256 13797692
0.241148323 C WS 442880 + 256 25567735
0.241199280 C WS 494592 + 256 13888871
0.241287187 C WS 443136 + 256 24708375
0.241335987 C WS 494848 + 256 14020494
0.241411767 C WS 495104 + 256 14091878
0.241458731 C WS 495360 + 256 14136002
0.241511255 C WS 495616 + 256 13464077
0.241564415 C WS 495872 + 256 13512329
0.241612571 C WS 496128 + 256 13555211
0.241664949 C WS 496384 + 256 13602499

...

0.317091373 C WS 828160 + 256 15027149
0.317148521 C WS 828416 + 256 15079407
0.317196720 C WS 828672 + 256 15122334
0.317247118 C WS 828928 + 256 15168420
0.317298052 C WS 829184 + 256 15216417
0.317351888 C WS 884736 + 256 1898567
0.317395234 C WS 829440 + 256 14764060
0.317448311 C WS 884992 + 256 1986978
0.317495724 C WS 829696 + 256 14859221
0.317546906 C WS 885248 + 256 2079970
0.317594701 C WS 829952 + 256 14953233
0.317644606 C WS 885504 + 256 2160782

There's an average latency of 14 milliseconds for these 128 kilobyte
writes.  At 0.218288794 seconds, we can see a sudden appearance of 1.7
millisecond latency times, much lower than the average.

Then we see an alternation of 1.7 millisecond completions and 14
millisecond completions, with these two "latency groups" increasing,
up to about 14 milliseconds and 25 milliseconds at 0.241287187 seconds
into the trace.

At 0.317351888 seconds, we see the pattern start again, with a sudden
appearance of 1.89 millisecond latency write completions, among 14.7
millisecond latency write completions.

If you graph it, it looks like a "triangle wave" pulse, with a
duration of about 23 milliseconds, that repeats after about 100
milliseconds.  In a way, it's like a "heartbeat".  This wouldn't be as
easy to detect with a simple "average" or "percentile" reading.

This was during a simple sequential write at a queue depth of 32, but
what happens with a write after a discard in the same region of
sectors?  This behavior could change, depending on different drive
models, and/or drive controller algorithms.


Thanks,

Bryan

>
>
> >>>>>> * Discard performance at the block level for 4k discards for a device that
> >>>>>> has been completely written and again the same test for a device that has
> >>>>>> been completely discarded.
> >>>>>>
> >>>>>> * Same test for large discards - say at a megabyte and/or gigabyte size?
> >>>>>   From my testing (again it was long time ago and things probably changed
> >>>>> since then) most of the drives I've seen had largely the same or similar
> >>>>> timing for discard request regardless of the size (hence, the conclusion
> >>>>> was the bigger the request the better). A small variation I did see
> >>>>> could have been explained by kernel implementation and discard_max_bytes
> >>>>> limitations as well.
> >>>>>
> >>>>>> * Same test done at the device optimal discard chunk size and alignment
> >>>>>>
> >>>>>> Should the discard pattern be done with a random pattern? Or just
> >>>>>> sequential?
> >>>>> I think that all of the above will be interesting. However there are two
> >>>>> sides of it. One is just pure discard performance to figure out what
> >>>>> could be the expectations and the other will be "real" workload
> >>>>> performance. Since from my experience discard can have an impact on
> >>>>> drive IO performance beyond of what's obvious, testing mixed workload
> >>>>> (IO + discard) is going to be very important as well. And that's where
> >>>>> fio workloads can come in (I actually do not know if fio already
> >>>>> supports this or not).
> >>>>>
> >> Really good points. I think it is probably best to test just at the
> >> block device level to eliminate any possible file system interactions
> >> here.  The lessons learned though might help file systems handle things
> >> more effectively?
> >>
> >>>> And:
> >>>>
> >>>> On Tue, May 7, 2019 at 10:22 AM Nikolay Borisov <nborisov@suse.com> wrote:
> >>>>> I have some vague recollection this was brought up before but how sure
> >>>>> are we that when a discard request is sent down to disk and a response
> >>>>> is returned the actual data has indeed been discarded. What about NCQ
> >>>>> effects i.e "instant completion" while doing work in the background. Or
> >>>>> ignoring the discard request altogether?
> >>>> As Nikolay writes in the other thread, I too have a feeling that there
> >>>> have been a discard-related discussion at LSF/MM before. And if I
> >>>> remember, there were hints that the drives (sometimes) do asynchronous
> >>>> trim after returning a success. Which would explain the similar time
> >>>> for all sizes and IO drop after trim.
> >>> Yes, that was definitely the case  in the past. It's also why we've
> >>> seen IO performance drop after a big (whole device) discard as the
> >>> device was busy in the background.
> >> For SATA specifically, there was a time when the ATA discard command was
> >> not queued so we had to drain all other pending requests, do the
> >> discard, and then resume. This was painfully slow then (not clear that
> >> this was related to the performance impact you saw - it would be an
> >> impact I think for the next few dozen commands?).
> >>
> >> The T13 people (and most drives I hope) fixed this years back to be a
> >> queued command so we don't have that same concern now I think.
> > There are still some ATA devices that are blacklisted due to problems
> > handling queued trim (ATA_HORKAGE_NO_NCQ_TRIM), as well as problems
> > handing zero-after-trim (ATA_HORKAGE_ZERO_AFTER_TRIM).  Most newer
> > drives fixed those problems, but the older drives will still be out in
> > the field until they get replaced with newer drives.
> >
> > The "zero after trim" issue might be important to applications that
> > expect a discard to zero the blocks that were specified in the discard
> > command.  For drives that "post-process" discards, is there a time
> > threshold of when those blocks are expected to return zeroes?
> >
> >
> > Thanks,
> >
> > Bryan
> >
> >>> However Nikolay does have a point. IIRC device is free to ignore discard
> >>> requests, I do not think there is any reliable way to actually tell that
> >>> the data was really discarded. I can even imagine a situation that the
> >>> device is not going to do anything unless it's pass some threshold of
> >>> free blocks for wear leveling. If that's the case our tests are not
> >>> going to be very useful unless they do stress such corner cases. But
> >>> that's just my speculation, so someone with a better knowledge of what
> >>> vendors are doing might tell us if it's something to worry about or not.
> >>
> >> The way I think of it is our "nirvana" state for discard would be:
> >>
> >> * all drives have very low cost discard commands with minimal
> >> post-discard performance impact on the normal workload which would let
> >> us issue the in-band discards (-o discard mount option)
> >>
> >> * drives might still (and should be expected to) ignore some of these
> >> commands so freed and "discarded" space might still not be really discarded.
> >>
> >> * we will still need to run a periodic (once a day? a week?) fstrim to
> >> give the drive a chance to clean up anything even when using "mount -o
> >> discard". Of course, the fstrim size is bigger I expect than the size
> >> from inband discard so testing larger sizes will be important.
> >>
> >> Does this make sense?
> >>
> >> Ric
> >>
> >>
> >>>> So, I think that the mixed workload (IO + discard) is a pretty
> >>>> important part of the whole topic and a pure discard test doesn't
> >>>> really tell us anything, at least for some drives.
> >>> I think both are important especially since mixed IO tests are going to
> >>> be highly workload specific.
> >>>
> >>> -Lukas
> >>>
> >>>> Jan
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Jan Tulak

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-07 20:09             ` Bryan Gurney
@ 2019-05-07 21:24               ` Chris Mason
  2019-06-03 20:01                 ` Ric Wheeler
  0 siblings, 1 reply; 30+ messages in thread
From: Chris Mason @ 2019-05-07 21:24 UTC (permalink / raw)
  To: Bryan Gurney
  Cc: Ric Wheeler, Lukas Czerner, Jan Tulak, Jens Axboe, linux-block,
	Linux FS Devel, Nikolay Borisov, Dennis Zhou

On 7 May 2019, at 16:09, Bryan Gurney wrote:

> I found an example in my trace of the "two bands of latency" behavior.
> Consider these three segments of trace data during the writes:
>

[ ... ]

> There's an average latency of 14 milliseconds for these 128 kilobyte
> writes.  At 0.218288794 seconds, we can see a sudden appearance of 1.7
> millisecond latency times, much lower than the average.
>
> Then we see an alternation of 1.7 millisecond completions and 14
> millisecond completions, with these two "latency groups" increasing,
> up to about 14 milliseconds and 25 milliseconds at 0.241287187 seconds
> into the trace.
>
> At 0.317351888 seconds, we see the pattern start again, with a sudden
> appearance of 1.89 millisecond latency write completions, among 14.7
> millisecond latency write completions.
>
> If you graph it, it looks like a "triangle wave" pulse, with a
> duration of about 23 milliseconds, that repeats after about 100
> milliseconds.  In a way, it's like a "heartbeat".  This wouldn't be as
> easy to detect with a simple "average" or "percentile" reading.
>
> This was during a simple sequential write at a queue depth of 32, but
> what happens with a write after a discard in the same region of
> sectors?  This behavior could change, depending on different drive
> models, and/or drive controller algorithms.
>

I think these are all really interesting, and definitely support the 
idea of a series of tests we do to make sure a drive implements discard 
in the general ways that we expect.

But with that said, I think a more important discussion as filesystem 
developers is how we protect the rest of the filesystem from high 
latencies caused by discards.  For reads and writes, we've been doing 
this for a long time.  IO schedulers have all kinds of checks and 
balances for REQ_META or REQ_SYNC, and we throttle dirty pages and 
readahead and dance around request batching etc etc.

But for discards, we just open the floodgates and hope it works out.  At 
some point we're going to have to figure out how to queue and throttle 
discards as well as we do reads/writes.  That's kind of tricky because 
the FS needs to coordinate when we're allowed to discard something and 
needs to know when the discard is done, and we all have different 
schemes for keeping track.

-chris

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-06 20:56 Testing devices for discard support properly Ric Wheeler
  2019-05-07  7:10 ` Lukas Czerner
  2019-05-07  8:21 ` Nikolay Borisov
@ 2019-05-07 22:04 ` Dave Chinner
  2019-05-08  0:07   ` Ric Wheeler
  2019-05-08 16:16   ` Martin K. Petersen
  2 siblings, 2 replies; 30+ messages in thread
From: Dave Chinner @ 2019-05-07 22:04 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Jens Axboe, linux-block, Linux FS Devel, lczerner

On Mon, May 06, 2019 at 04:56:44PM -0400, Ric Wheeler wrote:
> 
> (repost without the html spam, sorry!)
> 
> Last week at LSF/MM, I suggested we can provide a tool or test suite to test
> discard performance.
> 
> Put in the most positive light, it will be useful for drive vendors to use
> to qualify their offerings before sending them out to the world. For
> customers that care, they can use the same set of tests to help during
> selection to weed out any real issues.
> 
> Also, community users can run the same tools of course and share the
> results.

My big question here is this:

- is "discard" even relevant for future devices?

i.e. before we start saying "we want discard to not suck", perhaps
we should list all the specific uses we ahve for discard, what we
expect to occur, and whether we have better interfaces than
"discard" to acheive that thing.

Indeed, we have fallocate() on block devices now, which means we
have a well defined block device space management API for clearing
and removing allocated block device space. i.e.:

	FALLOC_FL_ZERO_RANGE: Future reads from the range must
	return zero and future writes to the range must not return
	ENOSPC. (i.e. must remain allocated space, can physically
	write zeros to acheive this)

	FALLOC_FL_PUNCH_HOLE: Free the backing store and guarantee
	future reads from the range return zeroes. Future writes to
	the range may return ENOSPC. This operation fails if the
	underlying device cannot do this operation without
	physically writing zeroes.

	FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE: run a
	discard on the range and provide no guarantees about the
	result. It may or may not do anything, and a subsequent read
	could return anything at all.

IMO, trying to "optimise discard" is completely the wrong direction
to take. We should be getting rid of "discard" and it's interfaces
operations - deprecate the ioctls, fix all other kernel callers of
blkdev_issue_discard() to call blkdev_fallocate() and ensure that
drive vendors understand that they need to make FALLOC_FL_ZERO_RANGE
and FALLOC_FL_PUNCH_HOLE work, and that FALLOC_FL_PUNCH_HOLE |
FALLOC_FL_NO_HIDE_STALE is deprecated (like discard) and will be
going away.

So, can we just deprecate blkdev_issue_discard and all the
interfaces that lead to it as a first step?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-07 22:04 ` Dave Chinner
@ 2019-05-08  0:07   ` Ric Wheeler
  2019-05-08  1:14     ` Dave Chinner
  2019-05-08 16:16   ` Martin K. Petersen
  1 sibling, 1 reply; 30+ messages in thread
From: Ric Wheeler @ 2019-05-08  0:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jens Axboe, linux-block, Linux FS Devel, lczerner

On 5/7/19 6:04 PM, Dave Chinner wrote:
> On Mon, May 06, 2019 at 04:56:44PM -0400, Ric Wheeler wrote:
>> (repost without the html spam, sorry!)
>>
>> Last week at LSF/MM, I suggested we can provide a tool or test suite to test
>> discard performance.
>>
>> Put in the most positive light, it will be useful for drive vendors to use
>> to qualify their offerings before sending them out to the world. For
>> customers that care, they can use the same set of tests to help during
>> selection to weed out any real issues.
>>
>> Also, community users can run the same tools of course and share the
>> results.
> My big question here is this:
>
> - is "discard" even relevant for future devices?


Hard to tell - current devices vary greatly.

Keep in mind that discard (or the interfaces you mention below) are not specific 
to SSD devices on flash alone, they are also useful for letting us free up space 
on software block devices. For example, iSCSI targets backed by a file, dm thin 
devices, virtual machines backed by files on the host, etc.

>
> i.e. before we start saying "we want discard to not suck", perhaps
> we should list all the specific uses we ahve for discard, what we
> expect to occur, and whether we have better interfaces than
> "discard" to acheive that thing.
>
> Indeed, we have fallocate() on block devices now, which means we
> have a well defined block device space management API for clearing
> and removing allocated block device space. i.e.:
>
> 	FALLOC_FL_ZERO_RANGE: Future reads from the range must
> 	return zero and future writes to the range must not return
> 	ENOSPC. (i.e. must remain allocated space, can physically
> 	write zeros to acheive this)
>
> 	FALLOC_FL_PUNCH_HOLE: Free the backing store and guarantee
> 	future reads from the range return zeroes. Future writes to
> 	the range may return ENOSPC. This operation fails if the
> 	underlying device cannot do this operation without
> 	physically writing zeroes.
>
> 	FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE: run a
> 	discard on the range and provide no guarantees about the
> 	result. It may or may not do anything, and a subsequent read
> 	could return anything at all.
>
> IMO, trying to "optimise discard" is completely the wrong direction
> to take. We should be getting rid of "discard" and it's interfaces
> operations - deprecate the ioctls, fix all other kernel callers of
> blkdev_issue_discard() to call blkdev_fallocate() and ensure that
> drive vendors understand that they need to make FALLOC_FL_ZERO_RANGE
> and FALLOC_FL_PUNCH_HOLE work, and that FALLOC_FL_PUNCH_HOLE |
> FALLOC_FL_NO_HIDE_STALE is deprecated (like discard) and will be
> going away.
>
> So, can we just deprecate blkdev_issue_discard and all the
> interfaces that lead to it as a first step?


In this case, I think you would lose a couple of things:

* informing the block device on truncate or unlink that the space was freed up 
(or we simply hide that under there some way but then what does this really 
change?). Wouldn't this be the most common source for informing devices of freed 
space?

* the various SCSI/ATA commands are hints - the target device can ignore them - 
so we still need to be able to do clean up passes with something like fstrim I 
think occasionally.

Regards,

Ric


>
> Cheers,
>
> Dave.



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-08  0:07   ` Ric Wheeler
@ 2019-05-08  1:14     ` Dave Chinner
  2019-05-08 15:05       ` Ric Wheeler
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2019-05-08  1:14 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Jens Axboe, linux-block, Linux FS Devel, lczerner

On Tue, May 07, 2019 at 08:07:53PM -0400, Ric Wheeler wrote:
> On 5/7/19 6:04 PM, Dave Chinner wrote:
> > On Mon, May 06, 2019 at 04:56:44PM -0400, Ric Wheeler wrote:
> > > (repost without the html spam, sorry!)
> > > 
> > > Last week at LSF/MM, I suggested we can provide a tool or test suite to test
> > > discard performance.
> > > 
> > > Put in the most positive light, it will be useful for drive vendors to use
> > > to qualify their offerings before sending them out to the world. For
> > > customers that care, they can use the same set of tests to help during
> > > selection to weed out any real issues.
> > > 
> > > Also, community users can run the same tools of course and share the
> > > results.
> > My big question here is this:
> > 
> > - is "discard" even relevant for future devices?
> 
> 
> Hard to tell - current devices vary greatly.
> 
> Keep in mind that discard (or the interfaces you mention below) are not
> specific to SSD devices on flash alone, they are also useful for letting us
> free up space on software block devices. For example, iSCSI targets backed
> by a file, dm thin devices, virtual machines backed by files on the host,
> etc.

Sure, but those use cases are entirely covered by ithe well defined
semantics of FALLOC_FL_ALLOC, FALLOC_FL_ZERO_RANGE and
FALLOC_FL_PUNCH_HOLE.

> > i.e. before we start saying "we want discard to not suck", perhaps
> > we should list all the specific uses we ahve for discard, what we
> > expect to occur, and whether we have better interfaces than
> > "discard" to acheive that thing.
> > 
> > Indeed, we have fallocate() on block devices now, which means we
> > have a well defined block device space management API for clearing
> > and removing allocated block device space. i.e.:
> > 
> > 	FALLOC_FL_ZERO_RANGE: Future reads from the range must
> > 	return zero and future writes to the range must not return
> > 	ENOSPC. (i.e. must remain allocated space, can physically
> > 	write zeros to acheive this)
> > 
> > 	FALLOC_FL_PUNCH_HOLE: Free the backing store and guarantee
> > 	future reads from the range return zeroes. Future writes to
> > 	the range may return ENOSPC. This operation fails if the
> > 	underlying device cannot do this operation without
> > 	physically writing zeroes.
> > 
> > 	FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE: run a
> > 	discard on the range and provide no guarantees about the
> > 	result. It may or may not do anything, and a subsequent read
> > 	could return anything at all.
> > 
> > IMO, trying to "optimise discard" is completely the wrong direction
> > to take. We should be getting rid of "discard" and it's interfaces
> > operations - deprecate the ioctls, fix all other kernel callers of
> > blkdev_issue_discard() to call blkdev_fallocate() and ensure that
> > drive vendors understand that they need to make FALLOC_FL_ZERO_RANGE
> > and FALLOC_FL_PUNCH_HOLE work, and that FALLOC_FL_PUNCH_HOLE |
> > FALLOC_FL_NO_HIDE_STALE is deprecated (like discard) and will be
> > going away.
> > 
> > So, can we just deprecate blkdev_issue_discard and all the
> > interfaces that lead to it as a first step?
> 
> 
> In this case, I think you would lose a couple of things:
> 
> * informing the block device on truncate or unlink that the space was freed
> up (or we simply hide that under there some way but then what does this
> really change?). Wouldn't this be the most common source for informing
> devices of freed space?

Why would we lose that? The filesytem calls
blkdev_fallocate(FALLOC_FL_PUNCH_HOLE) (or a better, async interface
to the same functionality) instead of blkdev_issue_discard().  i.e.
the filesystems use interfaces with guaranteed semantics instead of
"discard".

> * the various SCSI/ATA commands are hints - the target device can ignore
> them - so we still need to be able to do clean up passes with something like
> fstrim I think occasionally.

And that's the problem we need to solve - as long as the hardware
can treat these operations as hints (i.e. as "discards" rather than
"you must free this space and return zeroes") then there is no
motivation for vendors to improve the status quo.

Nobody can rely on discard to do anything. Even ignoring the device
performance/implementation problems, it's an unusable API from an
application perspective. The first step to fixing the discard
problem is at the block device API level.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-08  1:14     ` Dave Chinner
@ 2019-05-08 15:05       ` Ric Wheeler
  2019-05-08 17:03         ` Martin K. Petersen
  0 siblings, 1 reply; 30+ messages in thread
From: Ric Wheeler @ 2019-05-08 15:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jens Axboe, linux-block, Linux FS Devel, lczerner


On 5/7/19 9:14 PM, Dave Chinner wrote:
> On Tue, May 07, 2019 at 08:07:53PM -0400, Ric Wheeler wrote:
>> On 5/7/19 6:04 PM, Dave Chinner wrote:
>>> On Mon, May 06, 2019 at 04:56:44PM -0400, Ric Wheeler wrote:
>>>> (repost without the html spam, sorry!)
>>>>
>>>> Last week at LSF/MM, I suggested we can provide a tool or test suite to test
>>>> discard performance.
>>>>
>>>> Put in the most positive light, it will be useful for drive vendors to use
>>>> to qualify their offerings before sending them out to the world. For
>>>> customers that care, they can use the same set of tests to help during
>>>> selection to weed out any real issues.
>>>>
>>>> Also, community users can run the same tools of course and share the
>>>> results.
>>> My big question here is this:
>>>
>>> - is "discard" even relevant for future devices?
>>
>> Hard to tell - current devices vary greatly.
>>
>> Keep in mind that discard (or the interfaces you mention below) are not
>> specific to SSD devices on flash alone, they are also useful for letting us
>> free up space on software block devices. For example, iSCSI targets backed
>> by a file, dm thin devices, virtual machines backed by files on the host,
>> etc.
> Sure, but those use cases are entirely covered by ithe well defined
> semantics of FALLOC_FL_ALLOC, FALLOC_FL_ZERO_RANGE and
> FALLOC_FL_PUNCH_HOLE.
>
>>> i.e. before we start saying "we want discard to not suck", perhaps
>>> we should list all the specific uses we ahve for discard, what we
>>> expect to occur, and whether we have better interfaces than
>>> "discard" to acheive that thing.
>>>
>>> Indeed, we have fallocate() on block devices now, which means we
>>> have a well defined block device space management API for clearing
>>> and removing allocated block device space. i.e.:
>>>
>>> 	FALLOC_FL_ZERO_RANGE: Future reads from the range must
>>> 	return zero and future writes to the range must not return
>>> 	ENOSPC. (i.e. must remain allocated space, can physically
>>> 	write zeros to acheive this)
>>>
>>> 	FALLOC_FL_PUNCH_HOLE: Free the backing store and guarantee
>>> 	future reads from the range return zeroes. Future writes to
>>> 	the range may return ENOSPC. This operation fails if the
>>> 	underlying device cannot do this operation without
>>> 	physically writing zeroes.
>>>
>>> 	FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE: run a
>>> 	discard on the range and provide no guarantees about the
>>> 	result. It may or may not do anything, and a subsequent read
>>> 	could return anything at all.
>>>
>>> IMO, trying to "optimise discard" is completely the wrong direction
>>> to take. We should be getting rid of "discard" and it's interfaces
>>> operations - deprecate the ioctls, fix all other kernel callers of
>>> blkdev_issue_discard() to call blkdev_fallocate() and ensure that
>>> drive vendors understand that they need to make FALLOC_FL_ZERO_RANGE
>>> and FALLOC_FL_PUNCH_HOLE work, and that FALLOC_FL_PUNCH_HOLE |
>>> FALLOC_FL_NO_HIDE_STALE is deprecated (like discard) and will be
>>> going away.
>>>
>>> So, can we just deprecate blkdev_issue_discard and all the
>>> interfaces that lead to it as a first step?
>>
>> In this case, I think you would lose a couple of things:
>>
>> * informing the block device on truncate or unlink that the space was freed
>> up (or we simply hide that under there some way but then what does this
>> really change?). Wouldn't this be the most common source for informing
>> devices of freed space?
> Why would we lose that? The filesytem calls
> blkdev_fallocate(FALLOC_FL_PUNCH_HOLE) (or a better, async interface
> to the same functionality) instead of blkdev_issue_discard().  i.e.
> the filesystems use interfaces with guaranteed semantics instead of
> "discard".


That all makes sense, but I think it is orthogonal in large part to the 
need to get a good way to measure performance.

>
>> * the various SCSI/ATA commands are hints - the target device can ignore
>> them - so we still need to be able to do clean up passes with something like
>> fstrim I think occasionally.
> And that's the problem we need to solve - as long as the hardware
> can treat these operations as hints (i.e. as "discards" rather than
> "you must free this space and return zeroes") then there is no
> motivation for vendors to improve the status quo.
>
> Nobody can rely on discard to do anything. Even ignoring the device
> performance/implementation problems, it's an unusable API from an
> application perspective. The first step to fixing the discard
> problem is at the block device API level.....
>
> Cheers,
>
> Dave.

For some protocols, there are optional bits that require the device to 
return all zero data on subsequent reads, so there in that case, it is 
not optional now (we just don't use that much I think). In T13 and NVME, 
I think it could be interesting to add those tests specifically. For 
SCSI, I think the "WRITE_SAME" command *might* do discard internally or 
just might end up re-writing large regions of slow, spinning drives so I 
think it is less interesting.

I do think all of the bits you describe seem quite reasonable and 
interesting, but still see use in having simple benchmarks for us (and 
vendors) to use to measure all of this. We do this for drives today for 
write and read, just adding another dimension that needs to be routinely 
measured...

Regards,

Ric




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-07 22:04 ` Dave Chinner
  2019-05-08  0:07   ` Ric Wheeler
@ 2019-05-08 16:16   ` Martin K. Petersen
  2019-05-08 22:31     ` Dave Chinner
  1 sibling, 1 reply; 30+ messages in thread
From: Martin K. Petersen @ 2019-05-08 16:16 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, Jens Axboe, linux-block, Linux FS Devel, lczerner


Hi Dave,

> My big question here is this:
>
> - is "discard" even relevant for future devices?

It's hard to make predictions. Especially about the future. But discard
is definitely relevant on a bunch of current drives across the entire
spectrum from junk to enterprise. Depending on workload,
over-provisioning, media type, etc.

Plus, as Ric pointed out, thin provisioning is also relevant. Different
use case but exactly the same plumbing.

> IMO, trying to "optimise discard" is completely the wrong direction
> to take. We should be getting rid of "discard" and it's interfaces
> operations - deprecate the ioctls, fix all other kernel callers of
> blkdev_issue_discard() to call blkdev_fallocate()

blkdev_fallocate() is implemented using blkdev_issue_discard().

> and ensure that drive vendors understand that they need to make
> FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE work, and that
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE is deprecated (like
> discard) and will be going away.

Fast, cheap, easy. Pick any two.

The issue is that -- from the device perspective -- guaranteeing zeroes
requires substantially more effort than deallocating blocks. To the
point where several vendors have given up making it work altogether and
either report no discard support or silently ignore discard requests
causing you to waste queue slots for no good reason.

So while instant zeroing of a 100TB drive would be nice, I don't think
it's a realistic goal given the architectural limitations of many of
these devices. Conceptually, you'd think it would be as easy as
unlinking an inode. But in practice the devices keep much more (and
different) state around in their FTLs than a filesystem does in its
metadata.

Wrt. device command processing performance:

1. Our expectation is that REQ_DISCARD (FL_PUNCH_HOLE |
   FL_NO_HIDE_STALE), which gets translated into ATA DSM TRIM, NVMe
   DEALLOCATE, SCSI UNMAP, executes in O(1) regardless of the number of
   blocks operated on.

   Due to the ambiguity of ATA DSM TRIM and early SCSI we ended up in a
   situation where the industry applied additional semantics
   (deterministic zeroing) to that particular operation. And that has
   caused grief because devices often end up in the O(n-or-worse) bucket
   when determinism is a requirement.

2. Our expectation for the allocating REQ_ZEROOUT (FL_ZERO_RANGE), which
   gets translated into NVMe WRITE ZEROES, SCSI WRITE SAME, is that the
   command executes in O(n) but that it is faster -- or at least not
   worse -- than doing a regular WRITE to the same block range.

3. Our expectation for the deallocating REQ_ZEROOUT (FL_PUNCH_HOLE),
   which gets translated into ATA DSM TRIM w/ whitelist, NVMe WRITE
   ZEROES w/ DEAC, SCSI WRITE SAME w/ UNMAP, is that the command will
   execute in O(1) for any portion of the block range described by the
   I/O that is aligned to and a multiple of the internal device
   granularity. With an additional small O(n_head_LBs) + O(n_tail_LBs)
   overhead for zeroing any LBs at the beginning and end of the block
   range described by the I/O that do not comprise a full block wrt. the
   internal device granularity.

Does that description make sense?

The problem is that most vendors implement (3) using (1). But can't make
it work well because (3) was -- and still is for ATA -- outside the
scope of what the protocols can express.

And I agree with you that if (3) was implemented correctly in all
devices, we wouldn't need (1) at all. At least not for devices with an
internal granularity << total capacity.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-08 15:05       ` Ric Wheeler
@ 2019-05-08 17:03         ` Martin K. Petersen
  2019-05-08 17:09           ` Ric Wheeler
  0 siblings, 1 reply; 30+ messages in thread
From: Martin K. Petersen @ 2019-05-08 17:03 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Dave Chinner, Jens Axboe, linux-block, Linux FS Devel, lczerner


Ric,

> That all makes sense, but I think it is orthogonal in large part to
> the need to get a good way to measure performance.

There are two parts to the performance puzzle:

 1. How does mixing discards/zeroouts with regular reads and writes
    affect system performance?

 2. How does issuing discards affect the tail latency of the device for
    a given workload? Is it worth it?

Providing tooling for (1) is feasible whereas (2) is highly
workload-specific. So unless we can make the cost of (1) negligible,
we'll have to defer (2) to the user.

> For SCSI, I think the "WRITE_SAME" command *might* do discard
> internally or just might end up re-writing large regions of slow,
> spinning drives so I think it is less interesting.

WRITE SAME has an UNMAP flag that tells the device to deallocate, if
possible. The results are deterministic (unlike the UNMAP command).

WRITE SAME also has an ANCHOR flag which provides a use case we
currently don't have fallocate plumbing for: Allocating blocks without
caring about their contents. I.e. the blocks described by the I/O are
locked down to prevent ENOSPC for future writes.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-08 17:03         ` Martin K. Petersen
@ 2019-05-08 17:09           ` Ric Wheeler
  2019-05-08 17:25             ` Martin K. Petersen
  2019-05-08 21:58             ` Dave Chinner
  0 siblings, 2 replies; 30+ messages in thread
From: Ric Wheeler @ 2019-05-08 17:09 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Dave Chinner, Jens Axboe, linux-block, Linux FS Devel, lczerner


On 5/8/19 1:03 PM, Martin K. Petersen wrote:
> Ric,
>
>> That all makes sense, but I think it is orthogonal in large part to
>> the need to get a good way to measure performance.
> There are two parts to the performance puzzle:
>
>   1. How does mixing discards/zeroouts with regular reads and writes
>      affect system performance?
>
>   2. How does issuing discards affect the tail latency of the device for
>      a given workload? Is it worth it?
>
> Providing tooling for (1) is feasible whereas (2) is highly
> workload-specific. So unless we can make the cost of (1) negligible,
> we'll have to defer (2) to the user.

Agree, but I think that there is also a base level performance question 
- how does the discard/zero perform by itself.

Specifically, we have had to punt the discard of a whole block device 
before mkfs (back at RH) since it tripped up a significant number of 
devices. Similar pain for small discards (say one fs page) - is it too 
slow to do?

>
>> For SCSI, I think the "WRITE_SAME" command *might* do discard
>> internally or just might end up re-writing large regions of slow,
>> spinning drives so I think it is less interesting.
> WRITE SAME has an UNMAP flag that tells the device to deallocate, if
> possible. The results are deterministic (unlike the UNMAP command).
>
> WRITE SAME also has an ANCHOR flag which provides a use case we
> currently don't have fallocate plumbing for: Allocating blocks without
> caring about their contents. I.e. the blocks described by the I/O are
> locked down to prevent ENOSPC for future writes.

Thanks for that detail! Sounds like ANCHOR in this case exposes whatever 
data is there (similar I suppose to normal block device behavior without 
discard for unused space)? Seems like it would be useful for virtually 
provisioned devices (enterprise arrays or something like dm-thin 
targets) more than normal SSD's?

Ric



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-08 17:09           ` Ric Wheeler
@ 2019-05-08 17:25             ` Martin K. Petersen
  2019-05-08 18:12               ` Ric Wheeler
  2019-05-08 21:58             ` Dave Chinner
  1 sibling, 1 reply; 30+ messages in thread
From: Martin K. Petersen @ 2019-05-08 17:25 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Martin K. Petersen, Dave Chinner, Jens Axboe, linux-block,
	Linux FS Devel, lczerner


Ric,

> Agree, but I think that there is also a base level performance
> question - how does the discard/zero perform by itself.  Specifically,
> we have had to punt the discard of a whole block device before mkfs
> (back at RH) since it tripped up a significant number of
> devices. Similar pain for small discards (say one fs page) - is it too
> slow to do?

Sure. Just wanted to emphasize the difference between the performance
cost of executing the command and the potential future performance
impact.

>> WRITE SAME also has an ANCHOR flag which provides a use case we
>> currently don't have fallocate plumbing for: Allocating blocks without
>> caring about their contents. I.e. the blocks described by the I/O are
>> locked down to prevent ENOSPC for future writes.
>
> Thanks for that detail! Sounds like ANCHOR in this case exposes
> whatever data is there (similar I suppose to normal block device
> behavior without discard for unused space)? Seems like it would be
> useful for virtually provisioned devices (enterprise arrays or
> something like dm-thin targets) more than normal SSD's?

It is typically used to pin down important areas to ensure one doesn't
get ENOSPC when writing journal or metadata. However, these are
typically the areas that we deliberately zero to ensure predictable
results. So I think the only case where anchoring makes much sense is on
devices that do zero detection and thus wouldn't actually provision N
blocks full of zeroes.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-08 17:25             ` Martin K. Petersen
@ 2019-05-08 18:12               ` Ric Wheeler
  2019-05-09 16:02                 ` Bryan Gurney
  0 siblings, 1 reply; 30+ messages in thread
From: Ric Wheeler @ 2019-05-08 18:12 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Dave Chinner, Jens Axboe, linux-block, Linux FS Devel, lczerner

(stripped out the html junk, resending)

On 5/8/19 1:25 PM, Martin K. Petersen wrote:
>>> WRITE SAME also has an ANCHOR flag which provides a use case we
>>> currently don't have fallocate plumbing for: Allocating blocks without
>>> caring about their contents. I.e. the blocks described by the I/O are
>>> locked down to prevent ENOSPC for future writes.
>> Thanks for that detail! Sounds like ANCHOR in this case exposes
>> whatever data is there (similar I suppose to normal block device
>> behavior without discard for unused space)? Seems like it would be
>> useful for virtually provisioned devices (enterprise arrays or
>> something like dm-thin targets) more than normal SSD's?
> It is typically used to pin down important areas to ensure one doesn't
> get ENOSPC when writing journal or metadata. However, these are
> typically the areas that we deliberately zero to ensure predictable
> results. So I think the only case where anchoring makes much sense is on
> devices that do zero detection and thus wouldn't actually provision N
> blocks full of zeroes.

This behavior at the block layer might also be interesting for something 
like the VDO device (compression/dedup make it near impossible to 
predict how much space is really there since it is content specific). 
Might be useful as a way to hint to VDO about how to give users a 
promise of "at least this much" space? If the content is good for 
compression or dedup, you would get more, but never see less.

Ric



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-08 17:09           ` Ric Wheeler
  2019-05-08 17:25             ` Martin K. Petersen
@ 2019-05-08 21:58             ` Dave Chinner
  2019-05-09  2:29               ` Martin K. Petersen
  1 sibling, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2019-05-08 21:58 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Martin K. Petersen, Jens Axboe, linux-block, Linux FS Devel, lczerner

On Wed, May 08, 2019 at 01:09:03PM -0400, Ric Wheeler wrote:
> 
> On 5/8/19 1:03 PM, Martin K. Petersen wrote:
> > Ric,
> > 
> > > That all makes sense, but I think it is orthogonal in large part to
> > > the need to get a good way to measure performance.
> > There are two parts to the performance puzzle:
> > 
> >   1. How does mixing discards/zeroouts with regular reads and writes
> >      affect system performance?
> > 
> >   2. How does issuing discards affect the tail latency of the device for
> >      a given workload? Is it worth it?
> > 
> > Providing tooling for (1) is feasible whereas (2) is highly
> > workload-specific. So unless we can make the cost of (1) negligible,
> > we'll have to defer (2) to the user.
> 
> Agree, but I think that there is also a base level performance question -
> how does the discard/zero perform by itself.
> 
> Specifically, we have had to punt the discard of a whole block device before
> mkfs (back at RH) since it tripped up a significant number of devices.
> Similar pain for small discards (say one fs page) - is it too slow to do?

Small discards are already skipped is the device indicates it has
a minumum discard granularity. This is another reason why the "-o
discard" mount option isn't sufficient by itself and fstrim is still
required - filesystems often only free small isolated chunks of
space at a time and hence never may send discards to the device.

> > > For SCSI, I think the "WRITE_SAME" command *might* do discard
> > > internally or just might end up re-writing large regions of slow,
> > > spinning drives so I think it is less interesting.
> > WRITE SAME has an UNMAP flag that tells the device to deallocate, if
> > possible. The results are deterministic (unlike the UNMAP command).

That's kinda what I'm getting at here - we need to define the
behaviour the OS provides users, and then ensure that the behaviour
is standardised correctly so that devices behave correctly. i.e.  we
want devices to support WRITE_SAME w/ UNMAP flag well (because
that's an exact representation of FALLOC_FL_PUNCH_HOLE
requirements), and don't really care about the UNMAP command.

> > WRITE SAME also has an ANCHOR flag which provides a use case we
> > currently don't have fallocate plumbing for: Allocating blocks without
> > caring about their contents. I.e. the blocks described by the I/O are
> > locked down to prevent ENOSPC for future writes.

So WRITE_SAME (0) with an ANCHOR flag does not return zeroes on
subsequent reads? i.e. it is effectively
fallocate(FALLOC_FL_NO_HIDE_STALE) preallocation semantics?

For many use cases cases we actually want zeroed space to be
guaranteed so we don't expose stale data from previous device use
into the new user's visibility - can that be done with WRITE_SAME
and the ANCHOR flag?

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-08 16:16   ` Martin K. Petersen
@ 2019-05-08 22:31     ` Dave Chinner
  2019-05-09  3:55       ` Martin K. Petersen
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2019-05-08 22:31 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Ric Wheeler, Jens Axboe, linux-block, Linux FS Devel, lczerner

On Wed, May 08, 2019 at 12:16:24PM -0400, Martin K. Petersen wrote:
> 
> Hi Dave,
> 
> > My big question here is this:
> >
> > - is "discard" even relevant for future devices?
> 
> It's hard to make predictions. Especially about the future. But discard
> is definitely relevant on a bunch of current drives across the entire
> spectrum from junk to enterprise. Depending on workload,
> over-provisioning, media type, etc.
> 
> Plus, as Ric pointed out, thin provisioning is also relevant. Different
> use case but exactly the same plumbing.
> 
> > IMO, trying to "optimise discard" is completely the wrong direction
> > to take. We should be getting rid of "discard" and it's interfaces
> > operations - deprecate the ioctls, fix all other kernel callers of
> > blkdev_issue_discard() to call blkdev_fallocate()
> 
> blkdev_fallocate() is implemented using blkdev_issue_discard().

Only when told to do PUNCH_HOLE|NO_HIDE_STALE which means "we don't
care what the device does" as this fallcoate command provides no
guarantees for the data returned by subsequent reads. It is,
esssentially, a get out of gaol free mechanism for indeterminate
device capabilities.

> > and ensure that drive vendors understand that they need to make
> > FALLOC_FL_ZERO_RANGE and FALLOC_FL_PUNCH_HOLE work, and that
> > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE is deprecated (like
> > discard) and will be going away.
> 
> Fast, cheap, easy. Pick any two.
> 
> The issue is that -- from the device perspective -- guaranteeing zeroes
> requires substantially more effort than deallocating blocks. To the

People used to make that assertion about filesystems, too. It took
linux filesystem developers years to realise that unwritten extents
are actually very simple and require very little extra code and no
extra space in metadata to implement. If you are already tracking
allocated blocks/space, then you're 99% of the way to efficient
management of logically zeroed disk space.

> point where several vendors have given up making it work altogether and
> either report no discard support or silently ignore discard requests
> causing you to waste queue slots for no good reason.

I call bullshit.

> So while instant zeroing of a 100TB drive would be nice, I don't think
> it's a realistic goal given the architectural limitations of many of
> these devices. Conceptually, you'd think it would be as easy as
> unlinking an inode.

Unlinking an inode is one of the most complex things a filesystem
has to do. Marking allocated space as "contains zeros" is trivial in
comparison.

> But in practice the devices keep much more (and
> different) state around in their FTLs than a filesystem does in its
> metadata.
> 
> Wrt. device command processing performance:
> 
> 1. Our expectation is that REQ_DISCARD (FL_PUNCH_HOLE |
>    FL_NO_HIDE_STALE), which gets translated into ATA DSM TRIM, NVMe
>    DEALLOCATE, SCSI UNMAP, executes in O(1) regardless of the number of
>    blocks operated on.
> 
>    Due to the ambiguity of ATA DSM TRIM and early SCSI we ended up in a
>    situation where the industry applied additional semantics
>    (deterministic zeroing) to that particular operation. And that has
>    caused grief because devices often end up in the O(n-or-worse) bucket
>    when determinism is a requirement.

Which is why I want us to deprecate the use of REQ_DISCARD.


> 2. Our expectation for the allocating REQ_ZEROOUT (FL_ZERO_RANGE), which
>    gets translated into NVMe WRITE ZEROES, SCSI WRITE SAME, is that the
>    command executes in O(n) but that it is faster -- or at least not
>    worse -- than doing a regular WRITE to the same block range.

You're missing the important requirement of fallocate(ZERO_RANGE):
that the space is also allocated and ENOSPC will never be returned
for subsequent writes to that range. i.e. it is allocated but
"unwritten" space that contains zeros.

> 3. Our expectation for the deallocating REQ_ZEROOUT (FL_PUNCH_HOLE),
>    which gets translated into ATA DSM TRIM w/ whitelist, NVMe WRITE
>    ZEROES w/ DEAC, SCSI WRITE SAME w/ UNMAP, is that the command will
>    execute in O(1) for any portion of the block range described by the

FL_PUNCH_HOLE has no O(1) requirement - it has a "all possible space
must be freed" requirement. The larger the range, to longer it will
take.

For example, punching out a range that contains a single extent
might take a couple of hundred microseconds on XFS, but punching out
a range that contains 50 million extents in a single operation (yes,
we see sparse vm image files with that sort of extreme fragmentation
in production systems) can take 5-10 minutes to run.

That is acceptable behaviour for a space deallocation operation.
Expecting space deallocation will always run on O(1) time is ...
insanity. If I were a device vendor being asked to do this, I'd be
saying no, too, because it's simply an unrealistic expectation.

If you're going to suggest any sort of performance guideline, then
O(log N) is the best we can expect for deallocation operations
(where N is the size of the range to be deallocated). This is
possible to implement without significant complexity or requiring
background cleanup and future IO latency impact.....

>    I/O that is aligned to and a multiple of the internal device
>    granularity. With an additional small O(n_head_LBs) + O(n_tail_LBs)
>    overhead for zeroing any LBs at the beginning and end of the block
>    range described by the I/O that do not comprise a full block wrt. the
>    internal device granularity.

That's expected, and exaclty what filesystems do for sub-block punch
and zeroing ranges.

> Does that description make sense?
> 
> The problem is that most vendors implement (3) using (1). But can't make
> it work well because (3) was -- and still is for ATA -- outside the
> scope of what the protocols can express.
> 
> And I agree with you that if (3) was implemented correctly in all
> devices, we wouldn't need (1) at all. At least not for devices with an
> internal granularity << total capacity.

What I'm saying is that we should be pushing standards to ensure (3)
is correctly standardise, certified and implemented because that is
what the "Linux OS" requires from future hardware.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-08 21:58             ` Dave Chinner
@ 2019-05-09  2:29               ` Martin K. Petersen
  2019-05-09  3:20                 ` Dave Chinner
  0 siblings, 1 reply; 30+ messages in thread
From: Martin K. Petersen @ 2019-05-09  2:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ric Wheeler, Martin K. Petersen, Jens Axboe, linux-block,
	Linux FS Devel, lczerner


Dave,

>> > WRITE SAME also has an ANCHOR flag which provides a use case we
>> > currently don't have fallocate plumbing for: Allocating blocks without
>> > caring about their contents. I.e. the blocks described by the I/O are
>> > locked down to prevent ENOSPC for future writes.
>
> So WRITE_SAME (0) with an ANCHOR flag does not return zeroes on
> subsequent reads? i.e. it is effectively
> fallocate(FALLOC_FL_NO_HIDE_STALE) preallocation semantics?

The answer is that it depends. It can return zeroes or a device-specific
initialization pattern (oh joy).

> For many use cases cases we actually want zeroed space to be
> guaranteed so we don't expose stale data from previous device use into
> the new user's visibility - can that be done with WRITE_SAME and the
> ANCHOR flag?

That's just a regular zeroout.

We have:

   Allocate and zero:	FALLOC_FL_ZERO_RANGE
   Deallocate and zero:	FALLOC_FL_PUNCH_HOLE
   Deallocate:		FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE

but are missing:

   Allocate:		FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE

The devices that implement anchor semantics are few and far between. I
have yet to see one.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-09  2:29               ` Martin K. Petersen
@ 2019-05-09  3:20                 ` Dave Chinner
  2019-05-09  4:35                   ` Martin K. Petersen
  0 siblings, 1 reply; 30+ messages in thread
From: Dave Chinner @ 2019-05-09  3:20 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Ric Wheeler, Jens Axboe, linux-block, Linux FS Devel, lczerner

On Wed, May 08, 2019 at 10:29:17PM -0400, Martin K. Petersen wrote:
> 
> Dave,
> 
> >> > WRITE SAME also has an ANCHOR flag which provides a use case we
> >> > currently don't have fallocate plumbing for: Allocating blocks without
> >> > caring about their contents. I.e. the blocks described by the I/O are
> >> > locked down to prevent ENOSPC for future writes.
> >
> > So WRITE_SAME (0) with an ANCHOR flag does not return zeroes on
> > subsequent reads? i.e. it is effectively
> > fallocate(FALLOC_FL_NO_HIDE_STALE) preallocation semantics?
> 
> The answer is that it depends. It can return zeroes or a device-specific
> initialization pattern (oh joy).

So they ignore the "write zeroes" part of the command?

And the standards allow that?

> > For many use cases cases we actually want zeroed space to be
> > guaranteed so we don't expose stale data from previous device use into
> > the new user's visibility - can that be done with WRITE_SAME and the
> > ANCHOR flag?
> 
> That's just a regular zeroout.
> 
> We have:
> 
>    Allocate and zero:	FALLOC_FL_ZERO_RANGE
>    Deallocate and zero:	FALLOC_FL_PUNCH_HOLE
>    Deallocate:		FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE
> but are missing:
> 
>    Allocate:		FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE

So we've defined the fallocate flags to have /completely/ different
behaviour on block devices to filesystems.

<sigh>

We excel at screwing up APIs, don't we?

I give up, we've piled the shit too high on this one to dig it out
now....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-08 22:31     ` Dave Chinner
@ 2019-05-09  3:55       ` Martin K. Petersen
  2019-05-09 13:40         ` Ric Wheeler
  0 siblings, 1 reply; 30+ messages in thread
From: Martin K. Petersen @ 2019-05-09  3:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, Ric Wheeler, Jens Axboe, linux-block,
	Linux FS Devel, lczerner


Dave,

> Only when told to do PUNCH_HOLE|NO_HIDE_STALE which means "we don't
> care what the device does" as this fallcoate command provides no
> guarantees for the data returned by subsequent reads. It is,
> esssentially, a get out of gaol free mechanism for indeterminate
> device capabilities.

Correct. But the point of discard is to be a lightweight mechanism to
convey to the device that a block range is no longer in use. Nothing
more, nothing less.

Not everybody wants the device to spend resources handling unwritten
extents. I understand the importance of that use case for XFS but other
users really just need deallocate semantics.

> People used to make that assertion about filesystems, too. It took
> linux filesystem developers years to realise that unwritten extents
> are actually very simple and require very little extra code and no
> extra space in metadata to implement. If you are already tracking
> allocated blocks/space, then you're 99% of the way to efficient
> management of logically zeroed disk space.

I don't disagree. But since "discard performance" checkmark appears to
be absent from every product requirements document known to man, very
little energy has been devoted to ensuring that discard operations can
coexist with read/write I/O without impeding the performance.

I'm not saying it's impossible. Just that so far it hasn't been a
priority. Even large volume customers have been unable to compel their
suppliers to produce a device that doesn't suffer one way or the other.

On the SSD device side, vendors typically try to strike a suitable
balance between what's handled by the FTL and what's handled by
over-provisioning.

>> 2. Our expectation for the allocating REQ_ZEROOUT (FL_ZERO_RANGE), which
>>    gets translated into NVMe WRITE ZEROES, SCSI WRITE SAME, is that the
>>    command executes in O(n) but that it is faster -- or at least not
>>    worse -- than doing a regular WRITE to the same block range.
>
> You're missing the important requirement of fallocate(ZERO_RANGE):
> that the space is also allocated and ENOSPC will never be returned
> for subsequent writes to that range. i.e. it is allocated but
> "unwritten" space that contains zeros.

That's what I implied when comparing it to a WRITE.

>> 3. Our expectation for the deallocating REQ_ZEROOUT (FL_PUNCH_HOLE),
>>    which gets translated into ATA DSM TRIM w/ whitelist, NVMe WRITE
>>    ZEROES w/ DEAC, SCSI WRITE SAME w/ UNMAP, is that the command will
>>    execute in O(1) for any portion of the block range described by the
>
> FL_PUNCH_HOLE has no O(1) requirement - it has a "all possible space
> must be freed" requirement. The larger the range, to longer it will
> take.

OK, so maybe my O() notation lacked a media access moniker. What I meant
to convey was that no media writes take place for the properly aligned
multiple of the internal granularity. The FTL update takes however long
it takes, but the only potential media accesses would be the head and
tail pieces. For some types of devices, these might be handled in
translation tables. But for others, zeroing blocks on the media is the
only way to do it.

> That's expected, and exaclty what filesystems do for sub-block punch
> and zeroing ranges.

Yep.

> What I'm saying is that we should be pushing standards to ensure (3)
> is correctly standardise, certified and implemented because that is
> what the "Linux OS" requires from future hardware.

That's well-defined for both NVMe and SCSI.

However, I do not agree that a deallocate operation has to imply
zeroing. I think there are valid use cases for pure deallocate.

In an ideal world the performance difference between (1) and (3) would
be negligible and make this distinction moot. However, we have to
support devices that have a wide variety of media and hardware
characteristics. So I don't see pure deallocate going away. Doesn't mean
that I am not pushing vendors to handle (3) because I think it is very
important. And why we defined WRITE ZEROES in the first place.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-09  3:20                 ` Dave Chinner
@ 2019-05-09  4:35                   ` Martin K. Petersen
  0 siblings, 0 replies; 30+ messages in thread
From: Martin K. Petersen @ 2019-05-09  4:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, Ric Wheeler, Jens Axboe, linux-block,
	Linux FS Devel, lczerner


Dave,

>> The answer is that it depends. It can return zeroes or a
>> device-specific initialization pattern (oh joy).
>
> So they ignore the "write zeroes" part of the command?

I'd have to look to see how ANCHOR and NDOB interact on a WRITE
SAME. That's the closest thing SCSI has to WRITE ZEROES.

You can check whether a device has a non-standard initialization
pattern. It's a bit convoluted given that devices can autonomously
transition blocks between different states based on the initialization
pattern. But again, I don't think anybody has actually implemented this
part of the spec.

>> We have:
>> 
>>    Allocate and zero:	FALLOC_FL_ZERO_RANGE
>>    Deallocate and zero:	FALLOC_FL_PUNCH_HOLE
>>    Deallocate:		FALLOC_FL_PUNCH_HOLE | FALLOC_FL_NO_HIDE_STALE
>> but are missing:
>> 
>>    Allocate:		FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE

Copy and paste error. "Allocate:" would be FALLOC_FL_NO_HIDE_STALE in
the ANCHOR case. It's really just a preallocation but the blocks could
contain something other than zeroes depending on the device.

> So we've defined the fallocate flags to have /completely/ different
> behaviour on block devices to filesystems.

Are you referring to the "Allocate" case or something else? From
fallocate(2):

"Specifying the FALLOC_FL_ZERO_RANGE flag [...] zeroes space [...].
Within the specified range, blocks are preallocated for the regions that
span the holes in the file.  After a successful call, subsequent reads
from this range will return zeroes."

"Specifying the FALLOC_FL_PUNCH_HOLE flag [...] deallocates space [...].
Within the specified range, partial filesystem blocks are zeroed, and
whole filesystem blocks are removed from the file.  After a successful
call, subsequent reads from this range will return zeroes."

That matches the block device behavior as far as I'm concerned.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-09  3:55       ` Martin K. Petersen
@ 2019-05-09 13:40         ` Ric Wheeler
  0 siblings, 0 replies; 30+ messages in thread
From: Ric Wheeler @ 2019-05-09 13:40 UTC (permalink / raw)
  To: Martin K. Petersen, Dave Chinner
  Cc: Jens Axboe, linux-block, Linux FS Devel, lczerner



On 5/8/19 11:55 PM, Martin K. Petersen wrote:
> 
> Dave,
> 
>> Only when told to do PUNCH_HOLE|NO_HIDE_STALE which means "we don't
>> care what the device does" as this fallcoate command provides no
>> guarantees for the data returned by subsequent reads. It is,
>> esssentially, a get out of gaol free mechanism for indeterminate
>> device capabilities.
> 
> Correct. But the point of discard is to be a lightweight mechanism to
> convey to the device that a block range is no longer in use. Nothing
> more, nothing less.
> 
> Not everybody wants the device to spend resources handling unwritten
> extents. I understand the importance of that use case for XFS but other
> users really just need deallocate semantics.
> 
>> People used to make that assertion about filesystems, too. It took
>> linux filesystem developers years to realise that unwritten extents
>> are actually very simple and require very little extra code and no
>> extra space in metadata to implement. If you are already tracking
>> allocated blocks/space, then you're 99% of the way to efficient
>> management of logically zeroed disk space.
> 
> I don't disagree. But since "discard performance" checkmark appears to
> be absent from every product requirements document known to man, very
> little energy has been devoted to ensuring that discard operations can
> coexist with read/write I/O without impeding the performance.
> 
> I'm not saying it's impossible. Just that so far it hasn't been a
> priority. Even large volume customers have been unable to compel their
> suppliers to produce a device that doesn't suffer one way or the other.
> 
> On the SSD device side, vendors typically try to strike a suitable
> balance between what's handled by the FTL and what's handled by
> over-provisioning.
> 
>>> 2. Our expectation for the allocating REQ_ZEROOUT (FL_ZERO_RANGE), which
>>>     gets translated into NVMe WRITE ZEROES, SCSI WRITE SAME, is that the
>>>     command executes in O(n) but that it is faster -- or at least not
>>>     worse -- than doing a regular WRITE to the same block range.
>>
>> You're missing the important requirement of fallocate(ZERO_RANGE):
>> that the space is also allocated and ENOSPC will never be returned
>> for subsequent writes to that range. i.e. it is allocated but
>> "unwritten" space that contains zeros.
> 
> That's what I implied when comparing it to a WRITE.
> 
>>> 3. Our expectation for the deallocating REQ_ZEROOUT (FL_PUNCH_HOLE),
>>>     which gets translated into ATA DSM TRIM w/ whitelist, NVMe WRITE
>>>     ZEROES w/ DEAC, SCSI WRITE SAME w/ UNMAP, is that the command will
>>>     execute in O(1) for any portion of the block range described by the
>>
>> FL_PUNCH_HOLE has no O(1) requirement - it has a "all possible space
>> must be freed" requirement. The larger the range, to longer it will
>> take.
> 
> OK, so maybe my O() notation lacked a media access moniker. What I meant
> to convey was that no media writes take place for the properly aligned
> multiple of the internal granularity. The FTL update takes however long
> it takes, but the only potential media accesses would be the head and
> tail pieces. For some types of devices, these might be handled in
> translation tables. But for others, zeroing blocks on the media is the
> only way to do it.
> 
>> That's expected, and exaclty what filesystems do for sub-block punch
>> and zeroing ranges.
> 
> Yep.
> 
>> What I'm saying is that we should be pushing standards to ensure (3)
>> is correctly standardise, certified and implemented because that is
>> what the "Linux OS" requires from future hardware.
> 
> That's well-defined for both NVMe and SCSI.
> 
> However, I do not agree that a deallocate operation has to imply
> zeroing. I think there are valid use cases for pure deallocate.
> 
> In an ideal world the performance difference between (1) and (3) would
> be negligible and make this distinction moot. However, we have to
> support devices that have a wide variety of media and hardware
> characteristics. So I don't see pure deallocate going away. Doesn't mean
> that I am not pushing vendors to handle (3) because I think it is very
> important. And why we defined WRITE ZEROES in the first place.
> 

All of this makes sense to me.

I think that we can get value out of measuring how close various devices 
come to realizing the above assumptions.  Clearly, file systems (as 
Chris mentioned) do have to adapt to varying device performance issues, 
but I think today the variation can be orders of magnitude for large 
(whole device) discards and that it not something that is easy to 
tolerate....

ric

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-08 18:12               ` Ric Wheeler
@ 2019-05-09 16:02                 ` Bryan Gurney
  2019-05-09 17:27                   ` Ric Wheeler
  0 siblings, 1 reply; 30+ messages in thread
From: Bryan Gurney @ 2019-05-09 16:02 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Martin K. Petersen, Dave Chinner, Jens Axboe, linux-block,
	Linux FS Devel, Lukáš Czerner

On Wed, May 8, 2019 at 2:12 PM Ric Wheeler <ricwheeler@gmail.com> wrote:
>
> (stripped out the html junk, resending)
>
> On 5/8/19 1:25 PM, Martin K. Petersen wrote:
> >>> WRITE SAME also has an ANCHOR flag which provides a use case we
> >>> currently don't have fallocate plumbing for: Allocating blocks without
> >>> caring about their contents. I.e. the blocks described by the I/O are
> >>> locked down to prevent ENOSPC for future writes.
> >> Thanks for that detail! Sounds like ANCHOR in this case exposes
> >> whatever data is there (similar I suppose to normal block device
> >> behavior without discard for unused space)? Seems like it would be
> >> useful for virtually provisioned devices (enterprise arrays or
> >> something like dm-thin targets) more than normal SSD's?
> > It is typically used to pin down important areas to ensure one doesn't
> > get ENOSPC when writing journal or metadata. However, these are
> > typically the areas that we deliberately zero to ensure predictable
> > results. So I think the only case where anchoring makes much sense is on
> > devices that do zero detection and thus wouldn't actually provision N
> > blocks full of zeroes.
>
> This behavior at the block layer might also be interesting for something
> like the VDO device (compression/dedup make it near impossible to
> predict how much space is really there since it is content specific).
> Might be useful as a way to hint to VDO about how to give users a
> promise of "at least this much" space? If the content is good for
> compression or dedup, you would get more, but never see less.
>

In the case of VDO, writing zeroed blocks will not consume space, due
to the zero block elimination in VDO.  However, that also means that
it won't "reserve" space, either.  The WRITE SAME command with the
ANCHOR flag is SCSI, so it won't apply to a bio-based device.

Space savings also results in a write of N blocks having a fair chance
of the end result ultimately using "less than N" blocks, depending on
how much space savings can be achieved.  Likewise, a discard of N
blocks has a chance of reclaiming "less than N" blocks.


Thanks,

Bryan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-09 16:02                 ` Bryan Gurney
@ 2019-05-09 17:27                   ` Ric Wheeler
  2019-05-09 20:35                     ` Bryan Gurney
  0 siblings, 1 reply; 30+ messages in thread
From: Ric Wheeler @ 2019-05-09 17:27 UTC (permalink / raw)
  To: Bryan Gurney
  Cc: Martin K. Petersen, Dave Chinner, Jens Axboe, linux-block,
	Linux FS Devel, Lukáš Czerner



On 5/9/19 12:02 PM, Bryan Gurney wrote:
> On Wed, May 8, 2019 at 2:12 PM Ric Wheeler <ricwheeler@gmail.com> wrote:
>>
>> (stripped out the html junk, resending)
>>
>> On 5/8/19 1:25 PM, Martin K. Petersen wrote:
>>>>> WRITE SAME also has an ANCHOR flag which provides a use case we
>>>>> currently don't have fallocate plumbing for: Allocating blocks without
>>>>> caring about their contents. I.e. the blocks described by the I/O are
>>>>> locked down to prevent ENOSPC for future writes.
>>>> Thanks for that detail! Sounds like ANCHOR in this case exposes
>>>> whatever data is there (similar I suppose to normal block device
>>>> behavior without discard for unused space)? Seems like it would be
>>>> useful for virtually provisioned devices (enterprise arrays or
>>>> something like dm-thin targets) more than normal SSD's?
>>> It is typically used to pin down important areas to ensure one doesn't
>>> get ENOSPC when writing journal or metadata. However, these are
>>> typically the areas that we deliberately zero to ensure predictable
>>> results. So I think the only case where anchoring makes much sense is on
>>> devices that do zero detection and thus wouldn't actually provision N
>>> blocks full of zeroes.
>>
>> This behavior at the block layer might also be interesting for something
>> like the VDO device (compression/dedup make it near impossible to
>> predict how much space is really there since it is content specific).
>> Might be useful as a way to hint to VDO about how to give users a
>> promise of "at least this much" space? If the content is good for
>> compression or dedup, you would get more, but never see less.
>>
> 
> In the case of VDO, writing zeroed blocks will not consume space, due
> to the zero block elimination in VDO.  However, that also means that
> it won't "reserve" space, either.  The WRITE SAME command with the
> ANCHOR flag is SCSI, so it won't apply to a bio-based device.
> 
> Space savings also results in a write of N blocks having a fair chance
> of the end result ultimately using "less than N" blocks, depending on
> how much space savings can be achieved.  Likewise, a discard of N
> blocks has a chance of reclaiming "less than N" blocks.
> 

Are there other API's that let you allocate a minimum set of physical 
blocks to a VDO device?

Thanks!

Ric



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-09 17:27                   ` Ric Wheeler
@ 2019-05-09 20:35                     ` Bryan Gurney
  0 siblings, 0 replies; 30+ messages in thread
From: Bryan Gurney @ 2019-05-09 20:35 UTC (permalink / raw)
  To: Ric Wheeler
  Cc: Martin K. Petersen, Dave Chinner, Jens Axboe, linux-block,
	Linux FS Devel, Lukáš Czerner

On Thu, May 9, 2019 at 1:27 PM Ric Wheeler <ricwheeler@gmail.com> wrote:
>
>
>
> On 5/9/19 12:02 PM, Bryan Gurney wrote:
> > On Wed, May 8, 2019 at 2:12 PM Ric Wheeler <ricwheeler@gmail.com> wrote:
> >>
> >> (stripped out the html junk, resending)
> >>
> >> On 5/8/19 1:25 PM, Martin K. Petersen wrote:
> >>>>> WRITE SAME also has an ANCHOR flag which provides a use case we
> >>>>> currently don't have fallocate plumbing for: Allocating blocks without
> >>>>> caring about their contents. I.e. the blocks described by the I/O are
> >>>>> locked down to prevent ENOSPC for future writes.
> >>>> Thanks for that detail! Sounds like ANCHOR in this case exposes
> >>>> whatever data is there (similar I suppose to normal block device
> >>>> behavior without discard for unused space)? Seems like it would be
> >>>> useful for virtually provisioned devices (enterprise arrays or
> >>>> something like dm-thin targets) more than normal SSD's?
> >>> It is typically used to pin down important areas to ensure one doesn't
> >>> get ENOSPC when writing journal or metadata. However, these are
> >>> typically the areas that we deliberately zero to ensure predictable
> >>> results. So I think the only case where anchoring makes much sense is on
> >>> devices that do zero detection and thus wouldn't actually provision N
> >>> blocks full of zeroes.
> >>
> >> This behavior at the block layer might also be interesting for something
> >> like the VDO device (compression/dedup make it near impossible to
> >> predict how much space is really there since it is content specific).
> >> Might be useful as a way to hint to VDO about how to give users a
> >> promise of "at least this much" space? If the content is good for
> >> compression or dedup, you would get more, but never see less.
> >>
> >
> > In the case of VDO, writing zeroed blocks will not consume space, due
> > to the zero block elimination in VDO.  However, that also means that
> > it won't "reserve" space, either.  The WRITE SAME command with the
> > ANCHOR flag is SCSI, so it won't apply to a bio-based device.
> >
> > Space savings also results in a write of N blocks having a fair chance
> > of the end result ultimately using "less than N" blocks, depending on
> > how much space savings can be achieved.  Likewise, a discard of N
> > blocks has a chance of reclaiming "less than N" blocks.
> >
>
> Are there other API's that let you allocate a minimum set of physical
> blocks to a VDO device?
>

As far as I know: no, not currently.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Testing devices for discard support properly
  2019-05-07 21:24               ` Chris Mason
@ 2019-06-03 20:01                 ` Ric Wheeler
  0 siblings, 0 replies; 30+ messages in thread
From: Ric Wheeler @ 2019-06-03 20:01 UTC (permalink / raw)
  To: Chris Mason, Bryan Gurney
  Cc: Lukas Czerner, Jan Tulak, Jens Axboe, linux-block,
	Linux FS Devel, Nikolay Borisov, Dennis Zhou



On 5/7/19 5:24 PM, Chris Mason wrote:
> On 7 May 2019, at 16:09, Bryan Gurney wrote:
> 
>> I found an example in my trace of the "two bands of latency" behavior.
>> Consider these three segments of trace data during the writes:
>>
> 
> [ ... ]
> 
>> There's an average latency of 14 milliseconds for these 128 kilobyte
>> writes.  At 0.218288794 seconds, we can see a sudden appearance of 1.7
>> millisecond latency times, much lower than the average.
>>
>> Then we see an alternation of 1.7 millisecond completions and 14
>> millisecond completions, with these two "latency groups" increasing,
>> up to about 14 milliseconds and 25 milliseconds at 0.241287187 seconds
>> into the trace.
>>
>> At 0.317351888 seconds, we see the pattern start again, with a sudden
>> appearance of 1.89 millisecond latency write completions, among 14.7
>> millisecond latency write completions.
>>
>> If you graph it, it looks like a "triangle wave" pulse, with a
>> duration of about 23 milliseconds, that repeats after about 100
>> milliseconds.  In a way, it's like a "heartbeat".  This wouldn't be as
>> easy to detect with a simple "average" or "percentile" reading.
>>
>> This was during a simple sequential write at a queue depth of 32, but
>> what happens with a write after a discard in the same region of
>> sectors?  This behavior could change, depending on different drive
>> models, and/or drive controller algorithms.
>>
> 
> I think these are all really interesting, and definitely support the
> idea of a series of tests we do to make sure a drive implements discard
> in the general ways that we expect.
> 
> But with that said, I think a more important discussion as filesystem
> developers is how we protect the rest of the filesystem from high
> latencies caused by discards.  For reads and writes, we've been doing
> this for a long time.  IO schedulers have all kinds of checks and
> balances for REQ_META or REQ_SYNC, and we throttle dirty pages and
> readahead and dance around request batching etc etc.
> 
> But for discards, we just open the floodgates and hope it works out.  At
> some point we're going to have to figure out how to queue and throttle
> discards as well as we do reads/writes.  That's kind of tricky because
> the FS needs to coordinate when we're allowed to discard something and
> needs to know when the discard is done, and we all have different
> schemes for keeping track.
> 
> -chris
> 

Trying to summarize my thoughts here after weeks of other stuff.

We really have two (intertwined) questions:

* does issuing a discard on a device do anything useful - restore 
flagging performance, enhance the life space of the device, etc?

* what is the performance impact of doing a discard & does it vary based 
on the size of the region? (Can we use it to discard a whole device, do 
it for small discards, etc)

To answer the first question, we need a test that can verify that 
without discards (mount with nodiscard), we see a decline in 
performance. For example, multiple overwrites of the entire surface of 
the device (2 -3 full device writes) to make sure all of the spare 
capacity has been consumed, run the target workload we want to measure, 
then do discards of the whole space and run that same target workload.

If the discard does something useful, we should see better performance 
in that second test run.

If discard does not do anything useful, we are "done" with that device - 
no real need to measure performance of a useless mechanism. (Punting on 
the device longevity stuff here, seems like that should be left to the 
hardware vendors).

To answer the second question, we need to measure the performance of the 
discard implementation.

We still have to work to get any device into a well known state - do 
multiple, full device writes without discards. 2-3 passes should do it.

Then run our specific discard test workload - measure the performance of 
large discards (cap the size by the max permitted by the device) and 
small, single page discards. Important to capture min/max/average times 
of the discard. I think it would be best to do this on the block device 
to avoid any file system layer performance impact of deleting 
files/tweaking extents/etc.

Probably easiest to do separate tests for interesting discard sizes 
(each time, doing the full device writes to get back to a known state 
ahead of the test).

This is not meant to be a comprehensive tests/validation, but I think 
that doing the above would be a way to get a good sense of the 
effectiveness and performance of the device mechanism.

Make sense? Did I leave something out?

Ric

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-06-03 20:01 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-06 20:56 Testing devices for discard support properly Ric Wheeler
2019-05-07  7:10 ` Lukas Czerner
2019-05-07  8:48   ` Jan Tulak
2019-05-07  9:40     ` Lukas Czerner
2019-05-07 12:57       ` Ric Wheeler
2019-05-07 15:35         ` Bryan Gurney
2019-05-07 15:44           ` Ric Wheeler
2019-05-07 20:09             ` Bryan Gurney
2019-05-07 21:24               ` Chris Mason
2019-06-03 20:01                 ` Ric Wheeler
2019-05-07  8:21 ` Nikolay Borisov
2019-05-07 22:04 ` Dave Chinner
2019-05-08  0:07   ` Ric Wheeler
2019-05-08  1:14     ` Dave Chinner
2019-05-08 15:05       ` Ric Wheeler
2019-05-08 17:03         ` Martin K. Petersen
2019-05-08 17:09           ` Ric Wheeler
2019-05-08 17:25             ` Martin K. Petersen
2019-05-08 18:12               ` Ric Wheeler
2019-05-09 16:02                 ` Bryan Gurney
2019-05-09 17:27                   ` Ric Wheeler
2019-05-09 20:35                     ` Bryan Gurney
2019-05-08 21:58             ` Dave Chinner
2019-05-09  2:29               ` Martin K. Petersen
2019-05-09  3:20                 ` Dave Chinner
2019-05-09  4:35                   ` Martin K. Petersen
2019-05-08 16:16   ` Martin K. Petersen
2019-05-08 22:31     ` Dave Chinner
2019-05-09  3:55       ` Martin K. Petersen
2019-05-09 13:40         ` Ric Wheeler

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.