* fstrim takes a long time on Btrfs and NVMe
@ 2019-12-21  6:24 Chris Murphy
  2019-12-21  8:38 ` Andrea Gelmini
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Chris Murphy @ 2019-12-21  6:24 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi,

On recent kernels (I think since 5.1 or 5.2; tested today on 5.3.18,
5.4.5, and 5.5.0-rc2), `fstrim /` takes quite a long time to complete:
just over 1 minute.

Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p7  178G   16G  161G   9% /

fstrim stops on this for pretty much the entire time:
ioctl(3, FITRIM, {start=0, len=0xffffffffffffffff, minlen=0}) = 0
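
For reference, everything fstrim does is that one ioctl; a minimal
sketch of the equivalent call (struct fstrim_range comes from
<linux/fs.h>; error handling kept to a minimum):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>                 /* FITRIM, struct fstrim_range */

int main(void)
{
    struct fstrim_range range = {
        .start  = 0,
        .len    = UINT64_MAX,         /* whole filesystem */
        .minlen = 0,                  /* don't skip any free range */
    };
    int fd = open("/", O_RDONLY);     /* any fd on the mounted fs */

    if (fd < 0 || ioctl(fd, FITRIM, &range) < 0) {
        perror("FITRIM");
        return 1;
    }
    /* on success the kernel writes back the number of bytes trimmed */
    printf("trimmed %llu bytes\n", (unsigned long long)range.len);
    return 0;
}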

top shows the fstrim process itself isn't consuming much CPU, about
2-3%. The top five entries in perf top aren't much more revealing.

Samples: 220K of event 'cycles', 4000 Hz, Event count (approx.):
3463316966 lost: 0/0 drop: 0/0
Overhead  Shared Object                    Symbol
   1.62%  [kernel]                         [k] find_next_zero_bit
   1.59%  perf                             [.] 0x00000000002ae063
   1.52%  [kernel]                         [k] psi_task_change
   1.41%  [kernel]                         [k] update_blocked_averages
   1.33%  [unknown]                        [.] 0000000000000000

On a different system, with older Samsung 840 SATA SSD, and a fresh
Btrfs, I can't reproduce. It takes less than 1s. Not sure how to get
more information.


-- 
Chris Murphy


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-21  6:24 fstrim takes a long time on Btrfs and NVMe Chris Murphy
@ 2019-12-21  8:38 ` Andrea Gelmini
  2019-12-21  9:27 ` Nikolay Borisov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Andrea Gelmini @ 2019-12-21  8:38 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Fri, Dec 20, 2019 at 11:24:24PM -0700, Chris Murphy wrote:
> Hi,
> 
> On recent kernels (I think since 5.1 or 5.2; tested today on 5.3.18,
> 5.4.5, and 5.5.0-rc2), `fstrim /` takes quite a long time to complete:
> just over 1 minute.

Same effect here, for more than a year now (and across all kernel
updates).

It happens with my laptop devices:
nvme: TOSHIBA KBG30ZMT128G, 128.04 GB / 128.04 GB (firmware 0108ADLA)
SATA SSD: Samsung SSD 860 EVO 4TB (firmware RVT02B6Q)

I guess it doesn't matter, anyway:
nvme: btrfs
ssd: ext4 on cryptsetup + lvm

Ciao,
Gelma


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-21  6:24 fstrim takes a long time on Btrfs and NVMe Chris Murphy
  2019-12-21  8:38 ` Andrea Gelmini
@ 2019-12-21  9:27 ` Nikolay Borisov
  2019-12-22  3:43   ` Chris Murphy
  2019-12-22 10:40 ` Nikolay Borisov
  2019-12-22 17:43 ` Josef Bacik
  3 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2019-12-21  9:27 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS



On 21.12.19 8:24, Chris Murphy wrote:
> Hi,
> 
> On recent kernels (I think since 5.1 or 5.2; tested today on 5.3.18,
> 5.4.5, and 5.5.0-rc2), `fstrim /` takes quite a long time to complete:
> just over 1 minute.
> 
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/nvme0n1p7  178G   16G  161G   9% /
> 
> fstrim stops on this for pretty much the entire time:
> ioctl(3, FITRIM, {start=0, len=0xffffffffffffffff, minlen=0}) = 0
> 
> top shows the fstrim process itself isn't consuming much CPU, about
> 2-3%. The top five entries in perf top aren't much more revealing.
> 
> Samples: 220K of event 'cycles', 4000 Hz, Event count (approx.):
> 3463316966 lost: 0/0 drop: 0/0
> Overhead  Shared Object                    Symbol
>    1.62%  [kernel]                         [k] find_next_zero_bit
>    1.59%  perf                             [.] 0x00000000002ae063
>    1.52%  [kernel]                         [k] psi_task_change
>    1.41%  [kernel]                         [k] update_blocked_averages
>    1.33%  [unknown]                        [.] 0000000000000000
> 
> On a different system, with older Samsung 840 SATA SSD, and a fresh
> Btrfs, I can't reproduce. It takes less than 1s. Not sure how to get
> more information.


Trim implementations are a black box and specific to the particular
hardware. Can you try a different filesystem on the same drive? When
implementing the FITRIM ioctl there isn't much the filesystem can do,
since discard requests are simply sent to the disk.

Providing blktrace output might yield more insight into where the
requests spend most of their time.

> 
> 


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-21  9:27 ` Nikolay Borisov
@ 2019-12-22  3:43   ` Chris Murphy
  2019-12-22  3:59     ` Chris Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Murphy @ 2019-12-22  3:43 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Btrfs BTRFS

On Sat, Dec 21, 2019 at 2:27 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 21.12.19 8:24, Chris Murphy wrote:
> > Hi,
> >
> > On recent kernels (I think since 5.1 or 5.2; tested today on 5.3.18,
> > 5.4.5, and 5.5.0-rc2), `fstrim /` takes quite a long time to complete:
> > just over 1 minute.
> >
> > Filesystem      Size  Used Avail Use% Mounted on
> > /dev/nvme0n1p7  178G   16G  161G   9% /
> >
> > fstrim stops on this for pretty much the entire time:
> > ioctl(3, FITRIM, {start=0, len=0xffffffffffffffff, minlen=0}) = 0
> >
> > top shows the fstrim process itself isn't consuming much CPU, about
> > 2-3%. The top five entries in perf top aren't much more revealing.
> >
> > Samples: 220K of event 'cycles', 4000 Hz, Event count (approx.):
> > 3463316966 lost: 0/0 drop: 0/0
> > Overhead  Shared Object                    Symbol
> >    1.62%  [kernel]                         [k] find_next_zero_bit
> >    1.59%  perf                             [.] 0x00000000002ae063
> >    1.52%  [kernel]                         [k] psi_task_change
> >    1.41%  [kernel]                         [k] update_blocked_averages
> >    1.33%  [unknown]                        [.] 0000000000000000
> >
> > On a different system, with older Samsung 840 SATA SSD, and a fresh
> > Btrfs, I can't reproduce. It takes less than 1s. Not sure how to get
> > more information.
>
>
> Trim implementations are a black box and specific to the particular
> hardware. Can you try a different filesystem on the same drive? When
> implementing the FITRIM ioctl there isn't much the filesystem can do,
> since discard requests are simply sent to the disk.
>
> Providing blktrace output might yield more insight into where the
> requests spend most of their time.

Roughly 90% of each CPU's trace file looks like very small block
discards, if I'm reading this correctly at all...

259,0    3   117655    85.094469086  3057  A  DS 233804904 + 688 <-
(259,7) 110910568

Quite a lot are +32 and +64 (blktrace lengths are in 512-byte sectors,
so those are 16KiB and 32KiB discards). Only about 85% of the way
through the parsed file do I see large values like:

259,0    3   127448    91.214170783  3057  A   D 473292774 + 8388607
<- (259,7) 350398438

The bulk of the space is unallocated, which I'm guessing accounts for
the large block discards. And as I think about it, back when fstrim was
fast on this same hardware, the amount discarded exactly matched the
unallocated space, as if unused space inside block groups was not
discarded. So this slowness might be related to finding all of those
free space blocks. Further, I'm using space_cache=v2. And all the tests
I do on new file systems don't show this, probably because they aren't
aged like this one is.


    Device size:               178.00GiB
    Device allocated:           52.04GiB
    Device unallocated:        125.96GiB
    Device missing:                0.00B
    Used:                       15.15GiB
    Free (estimated):          160.36GiB    (min: 160.36GiB)


-- 
Chris Murphy


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-22  3:43   ` Chris Murphy
@ 2019-12-22  3:59     ` Chris Murphy
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2019-12-22  3:59 UTC (permalink / raw)
  To: Btrfs BTRFS; +Cc: Nikolay Borisov

nvme0n1p7.blktrace.0.txt.tar.zst
https://drive.google.com/open?id=1IKS6_4Kij1dcRqAezTqMfUwMDggtnrmD

I'm not sure whether this NVMe drive supports queued trim, or whether
it's intentionally slow to avoid the pausing problems seen on drives
with non-queued trim. Offhand I'm not experiencing any hangs while
issuing this trim, even while scrubbing the file system at the same
time (seems ill advised, but I like misery). However, I do get a
circular locking warning...


Chris Murphy


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-21  6:24 fstrim takes a long time on Btrfs and NVMe Chris Murphy
  2019-12-21  8:38 ` Andrea Gelmini
  2019-12-21  9:27 ` Nikolay Borisov
@ 2019-12-22 10:40 ` Nikolay Borisov
  2019-12-22 17:43 ` Josef Bacik
  3 siblings, 0 replies; 17+ messages in thread
From: Nikolay Borisov @ 2019-12-22 10:40 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS



On 21.12.19 8:24, Chris Murphy wrote:
> Hi,
> 
> On recent kernels (I think since 5.1 or 5.2; tested today on 5.3.18,
> 5.4.5, and 5.5.0-rc2), `fstrim /` takes quite a long time to complete:
> just over 1 minute.
> 
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/nvme0n1p7  178G   16G  161G   9% /
> 
> fstrim stops on this for pretty much the entire time:
> ioctl(3, FITRIM, {start=0, len=0xffffffffffffffff, minlen=0}) = 0
> 
> top shows the fstrim process itself isn't consuming much CPU, about
> 2-3%. The top five entries in perf top aren't much more revealing.
> 
> Samples: 220K of event 'cycles', 4000 Hz, Event count (approx.):
> 3463316966 lost: 0/0 drop: 0/0
> Overhead  Shared Object                    Symbol
>    1.62%  [kernel]                         [k] find_next_zero_bit
>    1.59%  perf                             [.] 0x00000000002ae063
>    1.52%  [kernel]                         [k] psi_task_change
>    1.41%  [kernel]                         [k] update_blocked_averages
>    1.33%  [unknown]                        [.] 0000000000000000
> 
> On a different system, with older Samsung 840 SATA SSD, and a fresh
> Btrfs, I can't reproduce. It takes less than 1s. Not sure how to get
> more information.
> 
> 


Ok, indeed your device seems to be taking a long time to do discards;
perhaps it's not using NCQ. OTOH, btrfs currently issues synchronous
discards, since it uses blkdev_issue_discard(), so naturally every
discard request blocks.

The solution would be to rework how discard works in btrfs by using
the asynchronous discard interface, __blkdev_issue_discard() et al.
However, that needs more careful analysis, since free space that is
still waiting to be discarded might be allocated in the meantime; for
correctness, some sort of synchronization would have to be devised.
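
For reference, the two block layer interfaces (prototypes as of 5.x;
the first builds the discard bios and waits for them per call, the
second only appends them to a caller-owned chain):

/* Synchronous: submit the discard bio(s) for this range and sleep
 * until they complete -- one wait per range. */
int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
                         sector_t nr_sects, gfp_t gfp_mask,
                         unsigned long flags);

/* Asynchronous building block: append the discard bio(s) to *biop and
 * return without waiting, so the caller can chain many ranges and do
 * a single wait at the end. */
int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
                           sector_t nr_sects, gfp_t gfp_mask, int flags,
                           struct bio **biop);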


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-21  6:24 fstrim takes a long time on Btrfs and NVMe Chris Murphy
                   ` (2 preceding siblings ...)
  2019-12-22 10:40 ` Nikolay Borisov
@ 2019-12-22 17:43 ` Josef Bacik
  2019-12-22 17:49   ` Nikolay Borisov
  3 siblings, 1 reply; 17+ messages in thread
From: Josef Bacik @ 2019-12-22 17:43 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS

On 12/21/19 1:24 AM, Chris Murphy wrote:
> Hi,
> 
> On recent kernels (I think since 5.1 or 5.2; tested today on 5.3.18,
> 5.4.5, and 5.5.0-rc2), `fstrim /` takes quite a long time to complete:
> just over 1 minute.
> 
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/nvme0n1p7  178G   16G  161G   9% /
> 
> fstrim stops on this for pretty much the entire time:
> ioctl(3, FITRIM, {start=0, len=0xffffffffffffffff, minlen=0}) = 0
> 
> top shows the fstrim process itself isn't consuming much CPU, about
> 2-3%. The top five entries in perf top aren't much more revealing.
> 
> Samples: 220K of event 'cycles', 4000 Hz, Event count (approx.):
> 3463316966 lost: 0/0 drop: 0/0
> Overhead  Shared Object                    Symbol
>     1.62%  [kernel]                         [k] find_next_zero_bit
>     1.59%  perf                             [.] 0x00000000002ae063
>     1.52%  [kernel]                         [k] psi_task_change
>     1.41%  [kernel]                         [k] update_blocked_averages
>     1.33%  [unknown]                        [.] 0000000000000000
> 
> On a different system, with older Samsung 840 SATA SSD, and a fresh
> Btrfs, I can't reproduce. It takes less than 1s. Not sure how to get
> more information.
> 
> 

You want to try Dennis's async discard stuff?  That should fix these problems
for you; the patches are in Dave's tree.  Thanks,

Josef



* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-22 17:43 ` Josef Bacik
@ 2019-12-22 17:49   ` Nikolay Borisov
  2019-12-22 18:00     ` Josef Bacik
  0 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2019-12-22 17:49 UTC (permalink / raw)
  To: Josef Bacik, Chris Murphy, Btrfs BTRFS



On 22.12.19 19:43, Josef Bacik wrote:
> On 12/21/19 1:24 AM, Chris Murphy wrote:
>> Hi,
>>
>> On recent kernels (I think since 5.1 or 5.2; tested today on 5.3.18,
>> 5.4.5, and 5.5.0-rc2), `fstrim /` takes quite a long time to complete:
>> just over 1 minute.
>>
>> Filesystem      Size  Used Avail Use% Mounted on
>> /dev/nvme0n1p7  178G   16G  161G   9% /
>>
>> fstrim stops on this for pretty much the entire time:
>> ioctl(3, FITRIM, {start=0, len=0xffffffffffffffff, minlen=0}) = 0
>>
>> top shows the fstrim process itself isn't consuming much CPU, about
>> 2-3%. The top five entries in perf top aren't much more revealing.
>>
>> Samples: 220K of event 'cycles', 4000 Hz, Event count (approx.):
>> 3463316966 lost: 0/0 drop: 0/0
>> Overhead  Shared Object                    Symbol
>>     1.62%  [kernel]                         [k] find_next_zero_bit
>>     1.59%  perf                             [.] 0x00000000002ae063
>>     1.52%  [kernel]                         [k] psi_task_change
>>     1.41%  [kernel]                         [k] update_blocked_averages
>>     1.33%  [unknown]                        [.] 0000000000000000
>>
>> On a different system, with older Samsung 840 SATA SSD, and a fresh
>> Btrfs, I can't reproduce. It takes less than 1s. Not sure how to get
>> more information.
>>
>>
> 
> You want to try Dennis's async discard stuff?  That should fix these
> problems for you; the patches are in Dave's tree.  Thanks,

But aren't those only for inline discards, e.g. when you have explicitly
mounted with -o discard? The use case here is the FITRIM ioctl; does
Dennis' stuff fix that?

> 
> Josef
> 


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-22 17:49   ` Nikolay Borisov
@ 2019-12-22 18:00     ` Josef Bacik
  2019-12-22 18:06       ` Nikolay Borisov
  2019-12-22 18:50       ` Chris Murphy
  0 siblings, 2 replies; 17+ messages in thread
From: Josef Bacik @ 2019-12-22 18:00 UTC (permalink / raw)
  To: Nikolay Borisov, Chris Murphy, Btrfs BTRFS

On 12/22/19 12:49 PM, Nikolay Borisov wrote:
> 
> 
> On 22.12.19 19:43, Josef Bacik wrote:
>> On 12/21/19 1:24 AM, Chris Murphy wrote:
>>> Hi,
>>>
>>> On recent kernels (I think since 5.1 or 5.2; tested today on 5.3.18,
>>> 5.4.5, and 5.5.0-rc2), `fstrim /` takes quite a long time to complete:
>>> just over 1 minute.
>>>
>>> Filesystem      Size  Used Avail Use% Mounted on
>>> /dev/nvme0n1p7  178G   16G  161G   9% /
>>>
>>> fstrim stops on this for pretty much the entire time:
>>> ioctl(3, FITRIM, {start=0, len=0xffffffffffffffff, minlen=0}) = 0
>>>
>>> top shows the fstrim process itself isn't consuming much CPU, about
>>> 2-3%. The top five entries in perf top aren't much more revealing.
>>>
>>> Samples: 220K of event 'cycles', 4000 Hz, Event count (approx.):
>>> 3463316966 lost: 0/0 drop: 0/0
>>> Overhead  Shared Object                    Symbol
>>>      1.62%  [kernel]                         [k] find_next_zero_bit
>>>      1.59%  perf                             [.] 0x00000000002ae063
>>>      1.52%  [kernel]                         [k] psi_task_change
>>>      1.41%  [kernel]                         [k] update_blocked_averages
>>>      1.33%  [unknown]                        [.] 0000000000000000
>>>
>>> On a different system, with older Samsung 840 SATA SSD, and a fresh
>>> Btrfs, I can't reproduce. It takes less than 1s. Not sure how to get
>>> more information.
>>>
>>>
>>
>> You want to try Dennis's async discard stuff?  That should fix these
>> problems for you; the patches are in Dave's tree.  Thanks,
> 
> But aren't those only for inline discards, e.g. when you have explicitly
> mounted with -o discard? The use case here is the FITRIM ioctl; does
> Dennis' stuff fix that?
> 

I definitely misread the email; I thought he was talking about the commits being
slow.  The async discard stuff won't help with fitrim taking forever; there's
only so much we can do in the face of shitty SSDs.  Thanks,

Josef


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-22 18:00     ` Josef Bacik
@ 2019-12-22 18:06       ` Nikolay Borisov
  2019-12-22 19:08         ` Chris Murphy
  2019-12-22 19:15         ` Roman Mamedov
  2019-12-22 18:50       ` Chris Murphy
  1 sibling, 2 replies; 17+ messages in thread
From: Nikolay Borisov @ 2019-12-22 18:06 UTC (permalink / raw)
  To: Josef Bacik, Chris Murphy, Btrfs BTRFS



On 22.12.19 20:00, Josef Bacik wrote:
> On 12/22/19 12:49 PM, Nikolay Borisov wrote:
>>
>>
>> On 22.12.19 19:43, Josef Bacik wrote:
>>> On 12/21/19 1:24 AM, Chris Murphy wrote:
>>>> Hi,
>>>>
>>>> On recent kernels (I think since 5.1 or 5.2; tested today on 5.3.18,
>>>> 5.4.5, and 5.5.0-rc2), `fstrim /` takes quite a long time to complete:
>>>> just over 1 minute.
>>>>
>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>> /dev/nvme0n1p7  178G   16G  161G   9% /
>>>>
>>>> fstrim stops on this for pretty much the entire time:
>>>> ioctl(3, FITRIM, {start=0, len=0xffffffffffffffff, minlen=0}) = 0
>>>>
>>>> top shows the fstrim process itself isn't consuming much CPU, about
>>>> 2-3%. The top five entries in perf top aren't much more revealing.
>>>>
>>>> Samples: 220K of event 'cycles', 4000 Hz, Event count (approx.):
>>>> 3463316966 lost: 0/0 drop: 0/0
>>>> Overhead  Shared Object                    Symbol
>>>>      1.62%  [kernel]                         [k] find_next_zero_bit
>>>>      1.59%  perf                             [.] 0x00000000002ae063
>>>>      1.52%  [kernel]                         [k] psi_task_change
>>>>      1.41%  [kernel]                         [k]
>>>> update_blocked_averages
>>>>      1.33%  [unknown]                        [.] 0000000000000000
>>>>
>>>> On a different system, with older Samsung 840 SATA SSD, and a fresh
>>>> Btrfs, I can't reproduce. It takes less than 1s. Not sure how to get
>>>> more information.
>>>>
>>>>
>>>
>>> You want to try Dennis's async discard stuff?  That should fix these
>>> problems for you; the patches are in Dave's tree.  Thanks,
>>
>> But aren't those only for inline discards, e.g. when you have explicitly
>> mounted with -o discard? The use case here is the FITRIM ioctl; does
>> Dennis' stuff fix that?
>>
> 
> I definitely misread the email; I thought he was talking about the
> commits being slow.  The async discard stuff won't help with fitrim
> taking forever; there's only so much we can do in the face of shitty
> SSDs.  Thanks,

Well, if we rework how fitrim is implemented, e.g. make discards async
and add some sort of locking to keep queued extents from being
allocated, we can alleviate the problem somewhat.
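
Something along these lines; a purely hypothetical sketch, where the
mark/clear helpers don't exist in btrfs today and just illustrate the
required exclusion between queued discards and the allocator:

/* hypothetical sketch -- the *_trimming helpers are made up */
struct bio *chain = NULL;

for_each_free_extent(block_group, start, len) {
        /* allocator must not hand this range out while it's queued */
        mark_free_extent_trimming(block_group, start, len);
        __blkdev_issue_discard(bdev, to_sector(start), to_sector(len),
                               GFP_NOFS, 0, &chain);  /* queue, no wait */
}
if (chain)
        submit_bio_wait(chain);                       /* one wait for all */
for_each_trimming_extent(block_group, start, len)
        clear_free_extent_trimming(block_group, start, len);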

> 
> Josef


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-22 18:00     ` Josef Bacik
  2019-12-22 18:06       ` Nikolay Borisov
@ 2019-12-22 18:50       ` Chris Murphy
  1 sibling, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2019-12-22 18:50 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Nikolay Borisov, Chris Murphy, Btrfs BTRFS

On Sun, Dec 22, 2019 at 11:00 AM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On 12/22/19 12:49 PM, Nikolay Borisov wrote:
> > But aren't those only for inline discards, e.g. when you have explicitly
> > mounted with -o discard? The use case here is the FITRIM ioctl; does
> > Dennis' stuff fix that?
> >
>
> I definitely misread the email; I thought he was talking about the commits being
> slow.  The async discard stuff won't help with fitrim taking forever; there's
> only so much we can do in the face of shitty SSDs.  Thanks,

My concern isn't this particular built-in Samsung NVMe in an HP
laptop, but whether it's sane to enable a weekly fstrim by default in
Fedora 32, via util-linux's fstrim.timer. This can't be an uncommon
situation; it's commodity name-brand hardware. And openSUSE and
Ubuntu, at least, have enabled this timer by default for years.

While the timer is scheduled for Monday at midnight, it's actually
likely to run during the first boot on Monday. Is it plausible there's
a setup whereby startup is blocked for the duration of this fstrim? If
so, that's way more shitty than having a shitty SSD. And if it's not
plausible, then does it matter if fstrim takes 5 minutes to run once a
week on some SSDs? I'm not seeing blocking, but what about other
shitty SSDs?

Blast from the past, this is 9 years old now: "At any rate, I
definitely think both the online trim and the FITRIM have their uses.
One thing that has burnt us in the past is coding too much for the
performance of the current crop of ssds when the next crop ends up
making our optimizations useless. This is the main reason I think the
online trim is going to be better and better. " Chris Mason
https://lwn.net/Articles/417809/

Really the industry has been schizo about this issue: totally mixed
messaging about the problem, the solution, and providing an interface
for it. In one way or another, most SSDs are shitty, depending on your
metric. Perhaps we're only just now coming out of the SSD stone age,
into the bronze age.

But what I don't want to do is make a "one size fits all" weekly timed
fstrim cause worse problems.


-- 
Chris Murphy


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-22 18:06       ` Nikolay Borisov
@ 2019-12-22 19:08         ` Chris Murphy
  2019-12-22 19:15         ` Roman Mamedov
  1 sibling, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2019-12-22 19:08 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, Chris Murphy, Btrfs BTRFS

On Sun, Dec 22, 2019 at 11:06 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
> Well, if we rework how fitrim is implemented, e.g. make discards async
> and add some sort of locking to keep queued extents from being
> allocated, we can alleviate the problem somewhat.

Is it really helpful to fitrim every single lone 4K block? That's one
extreme. The other extreme is only issuing FITRIM to (Btrfs)
unallocated space, missing a bunch of unused space in block groups. Is
there some happy compromise? How about filtering for a minimum
contiguous range of 1MiB?
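
The FITRIM interface can already express that filter: the minlen field
of struct fstrim_range, which fstrim exposes as -m. A sketch of the
1MiB cutoff:

/* equivalent of `fstrim -m 1M /`: skip free ranges smaller than 1MiB */
struct fstrim_range range = {
        .start  = 0,
        .len    = UINT64_MAX,
        .minlen = 1024 * 1024,
};
ioctl(fd, FITRIM, &range);    /* fd: any open fd on the mountpoint */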

And leave online discards for the single lonely 4k block case, with
devices that use queued trim?

My understanding is that any SSD that's suitably over-provisioned
for its intended workload doesn't need unused block hinting of any
kind. It's really the devices that can exhaust their reserve of erased
cells, resulting in writes that are dog slow, that can effectively
take advantage of trim. But does this class of device really benefit
from every possible unused block being reported by trim to the
firmware? Or is there some sane minimum size, bigger than 4K, that
would be useful to report, and also a lot faster to report?



-- 
Chris Murphy


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-22 18:06       ` Nikolay Borisov
  2019-12-22 19:08         ` Chris Murphy
@ 2019-12-22 19:15         ` Roman Mamedov
  2019-12-22 22:11           ` Chris Murphy
  1 sibling, 1 reply; 17+ messages in thread
From: Roman Mamedov @ 2019-12-22 19:15 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Josef Bacik, Chris Murphy, Btrfs BTRFS

On Sun, 22 Dec 2019 20:06:57 +0200
Nikolay Borisov <nborisov@suse.com> wrote:

> Well, if we rework how fitrim is implemented, e.g. make discards async
> and add some sort of locking to keep queued extents from being
> allocated, we can alleviate the problem somewhat.

Please keep fstrim synchronous; in many cases TRIM is expected to have completed
by the time it returns, e.g. before making a snapshot of a thin LV for backup,
shutting down a VM for migration, and so on.

I don't think many really care how long fstrim takes; it's not a typical
interactive end-user task. By all means it's great to speed it up if possible,
but not by faking it with unexpected tricks of continuing to TRIM in the
background. Sure, SSDs already do that under the hood, but keep in mind that
SSDs are not the only application of TRIM (the others being, as mentioned, thin
LVs and sparsely allocated disk images).

-- 
With respect,
Roman


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-22 19:15         ` Roman Mamedov
@ 2019-12-22 22:11           ` Chris Murphy
  2019-12-22 22:29             ` Nikolay Borisov
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Murphy @ 2019-12-22 22:11 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: Nikolay Borisov, Josef Bacik, Chris Murphy, Btrfs BTRFS

On Sun, Dec 22, 2019 at 12:15 PM Roman Mamedov <rm@romanrm.net> wrote:
>
> On Sun, 22 Dec 2019 20:06:57 +0200
> Nikolay Borisov <nborisov@suse.com> wrote:
>
> > Well, if we rework how fitrim is implemented, e.g. make discards async
> > and add some sort of locking to keep queued extents from being
> > allocated, we can alleviate the problem somewhat.
>
> Please keep fstrim synchronous; in many cases TRIM is expected to have completed
> by the time it returns, e.g. before making a snapshot of a thin LV for backup,
> shutting down a VM for migration, and so on.

XFS already does async discards. What's the effect of FIFREEZE on
discards? An LV snapshot freezes the file system on the LV just prior
to the snapshot.

> I don't think many really care how long fstrim takes; it's not a typical
> interactive end-user task.

I only care if I notice it affecting user space (excepting my timed
use of fstrim for testing).

Speculation: If a scheduled fstrim can block startup, that's not OK. I
don't have enough data to know if it's possible, let alone likely. But
when fstrim takes a minute to discard the unused blocks in only 51GiB
of used block groups (likely highly fragmented free space), and only a
fraction of a second to discard the unused block *groups*, I'm
suspicious startup delays may be possible.

Found this, from 2019 LSFMM
https://lwn.net/Articles/787272/



-- 
Chris Murphy


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-22 22:11           ` Chris Murphy
@ 2019-12-22 22:29             ` Nikolay Borisov
  2019-12-22 23:14               ` Chris Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Nikolay Borisov @ 2019-12-22 22:29 UTC (permalink / raw)
  To: Chris Murphy, Roman Mamedov; +Cc: Josef Bacik, Btrfs BTRFS



On 23.12.19 0:11, Chris Murphy wrote:
> On Sun, Dec 22, 2019 at 12:15 PM Roman Mamedov <rm@romanrm.net> wrote:
>>
>> On Sun, 22 Dec 2019 20:06:57 +0200
>> Nikolay Borisov <nborisov@suse.com> wrote:
>>
>>> Well, if we rework how fitrim is implemented, e.g. make discards async
>>> and add some sort of locking to keep queued extents from being
>>> allocated, we can alleviate the problem somewhat.
>>
>> Please keep fstrim synchronous; in many cases TRIM is expected to have completed
>> by the time it returns, e.g. before making a snapshot of a thin LV for backup,
>> shutting down a VM for migration, and so on.
> 
> XFS already does async discards. What's the effect of FIFREEZE on
> discards? An LV snapshot freezes the file system on the LV just prior
> to the snapshot.

Actually, XFS issues synchronous discards for the FITRIM ioctl, i.e.
xfs_trim_extents() calls blkdev_issue_discard(), same as Btrfs. And
Dennis' patches implement async runtime discards (which is what XFS
uses by default).

> 
>> I don't think many really care how long fstrim takes; it's not a typical
>> interactive end-user task.
> 
> I only care if I notice it affecting user space (excepting my timed
> use of fstrim for testing).
> 
> Speculation: If a scheduled fstrim can block startup, that's not OK. I
> don't have enough data to know if it's possible, let alone likely. But
> when fstrim takes a minute to discard the unused blocks in only 51GiB
> of used block groups (likely highly fragmented free space), and only a
> fraction of a second to discard the unused block *groups*, I'm
> suspicious startup delays may be possible.

If it takes that long then it's the drive's implementation that is at
fault. Whatever we do in software will only mask the latency, which
might be a workable solution for some but not for others.

> 
> Found this, from 2019 LSFMM
> https://lwn.net/Articles/787272/
> 
> 
> 


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-22 22:29             ` Nikolay Borisov
@ 2019-12-22 23:14               ` Chris Murphy
  2019-12-22 23:23                 ` Chris Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Chris Murphy @ 2019-12-22 23:14 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Roman Mamedov, Josef Bacik, Btrfs BTRFS

On Sun, Dec 22, 2019 at 3:29 PM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 23.12.19 0:11, Chris Murphy wrote:
> > On Sun, Dec 22, 2019 at 12:15 PM Roman Mamedov <rm@romanrm.net> wrote:
> >>
> >> On Sun, 22 Dec 2019 20:06:57 +0200
> >> Nikolay Borisov <nborisov@suse.com> wrote:
> >>
> >>> Well, if we rework how fitrim is implemented, e.g. make discards async
> >>> and add some sort of locking to keep queued extents from being
> >>> allocated, we can alleviate the problem somewhat.
> >>
> >> Please keep fstrim synchronous; in many cases TRIM is expected to have completed
> >> by the time it returns, e.g. before making a snapshot of a thin LV for backup,
> >> shutting down a VM for migration, and so on.
> >
> > XFS already does async discards. What's the effect of FIFREEZE on
> > discards? An LV snapshot freezes the file system on the LV just prior
> > to the snapshot.
>
> Actually, XFS issues synchronous discards for the FITRIM ioctl, i.e.
> xfs_trim_extents() calls blkdev_issue_discard(), same as Btrfs. And
> Dennis' patches implement async runtime discards (which is what XFS
> uses by default).
>
> >
> >> I don't think many really care how long fstrim takes; it's not a typical
> >> interactive end-user task.
> >
> > I only care if I notice it affecting user space (excepting my timed
> > use of fstrim for testing).
> >
> > Speculation: If a scheduled fstrim can block startup, that's not OK. I
> > don't have enough data to know if it's possible, let alone likely. But
> > when fstrim takes a minute to discard the unused blocks in only 51GiB
> > of used block groups (likely highly fragmented free space), and only a
> > fraction of a second to discard the unused block *groups*, I'm
> > suspicious startup delays may be possible.
>
> If it takes that long then it's the drive's implementation that is at
> fault. Whatever we do in software will only mask the latency, which
> might be a workable solution for some but not for others.

The point of bringing it up is to drive home that we don't even
understand the scope of the problem, especially if this behavior is
surprising. It's common hardware.

fstrim on this file system results in 53618 discards being issued. 35
of these are + 8388607 in size. blktrace lengths are in 512-byte
sectors, so that's 4GiB minus 512 bytes per discard, which looks like
a maximum discard size capped just under 4GiB.

Those 35 large discard ranges take only 0.016951402 seconds. The
remaining 53000+ discards take the overwhelming bulk of the time, over
a minute. I have no idea whether this delay is lookup/computation for
Btrfs to figure out what the unused blocks are, or a device delay. It
comes to about 568 discards per second. Is that really unreasonable
drive performance?

Also...

259,0    3   127259    91.202441239  3057  A  DS 177594367 + 1 <-
(259,7) 54700031

A single 512-byte discard? That's suspicious. Btrfs doesn't work in
512-byte increments; the minimum unit is the 4K (Btrfs) sector size,
isn't it?



-- 
Chris Murphy


* Re: fstrim takes a long time on Btrfs and NVMe
  2019-12-22 23:14               ` Chris Murphy
@ 2019-12-22 23:23                 ` Chris Murphy
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2019-12-22 23:23 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nikolay Borisov, Roman Mamedov, Josef Bacik, Btrfs BTRFS

On Sun, Dec 22, 2019 at 4:14 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> fstrim on this file system results in 53618 discards being issued.

Sorry, that's only for one of the CPUs; blktrace recorded four files,
so there are quite a few more discards than that.



-- 
Chris Murphy



Thread overview: 17+ messages
2019-12-21  6:24 fstrim takes a long time on Btrfs and NVMe Chris Murphy
2019-12-21  8:38 ` Andrea Gelmini
2019-12-21  9:27 ` Nikolay Borisov
2019-12-22  3:43   ` Chris Murphy
2019-12-22  3:59     ` Chris Murphy
2019-12-22 10:40 ` Nikolay Borisov
2019-12-22 17:43 ` Josef Bacik
2019-12-22 17:49   ` Nikolay Borisov
2019-12-22 18:00     ` Josef Bacik
2019-12-22 18:06       ` Nikolay Borisov
2019-12-22 19:08         ` Chris Murphy
2019-12-22 19:15         ` Roman Mamedov
2019-12-22 22:11           ` Chris Murphy
2019-12-22 22:29             ` Nikolay Borisov
2019-12-22 23:14               ` Chris Murphy
2019-12-22 23:23                 ` Chris Murphy
2019-12-22 18:50       ` Chris Murphy
