* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
@ 2023-03-20 22:40 Christopher Price
  2023-03-21 21:26 ` Josef Bacik
  0 siblings, 1 reply; 27+ messages in thread

From: Christopher Price @ 2023-03-20 22:40 UTC (permalink / raw)
To: slyich
Cc: anand.jain, boris, clm, dsterba, josef, linux-btrfs, regressions, regressions

I can also confirm the issue occurred on and after kernel 6.2, and
confirm that the workaround of setting the btrfs discard iops_limit works
for my device:

08:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd
NVMe SSD Controller PM9A1/PM9A3/980PRO

I had to use ~6000 for my device, instead of 1000.
I'm currently on kernel 6.2.1.

I am also interested to know if this is expected.

^ permalink raw reply	[flat|nested] 27+ messages in thread
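[Editor's note: for readers wanting to try the workaround above, the iops_limit knob lives in sysfs under the filesystem UUID. A minimal sketch; the mount point is an example, and 6000 is just the value that worked for this reporter's drive, not a general recommendation:

```shell
# Locate the UUID of the mounted btrfs filesystem (mount point "/" is an example).
UUID=$(findmnt -no UUID /)

# Show the current async-discard rate limit (discard requests per second).
cat "/sys/fs/btrfs/$UUID/discard/iops_limit"

# Raise the limit so the queued discards drain faster; needs root.
echo 6000 | sudo tee "/sys/fs/btrfs/$UUID/discard/iops_limit"
```

The setting does not persist across reboots; a udev rule or boot script is needed to make it stick.]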
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-03-20 22:40 [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async Christopher Price
@ 2023-03-21 21:26 ` Josef Bacik
  2023-03-22  8:38   ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread

From: Josef Bacik @ 2023-03-21 21:26 UTC (permalink / raw)
To: Christopher Price
Cc: slyich, anand.jain, boris, clm, dsterba, linux-btrfs, regressions, regressions

On Mon, Mar 20, 2023 at 6:41 PM Christopher Price <pricechrispy@gmail.com> wrote:
>
> I can also confirm the issue occurred on and after kernel 6.2, and
> confirm that the workaround of setting the btrfs discard iops_limit
> works for my device:
>
> 08:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd
> NVMe SSD Controller PM9A1/PM9A3/980PRO
>
> I had to use ~6000 for my device, instead of 1000.
> I'm currently on kernel 6.2.1.
>
> I am also interested to know if this is expected.

We got the defaults based on our testing with our workloads inside of
FB.  Clearly this isn't representative of normal desktop usage, but
we've also got a lot of workloads, so we figured if it made the whole
fleet happy it would probably be fine everywhere.

That being said, this is tunable for a reason; your workload seems to
generate a lot of freed extents and discards.  We can probably adjust
the async code to pause discarding if there's no other activity
happening on the device at the moment, but tuning it to let more
discards through at a time is also completely valid.  Thanks,

Josef
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-03-21 21:26 ` Josef Bacik
@ 2023-03-22  8:38   ` Christoph Hellwig
  2023-03-23 22:26     ` Sergei Trofimovich
  0 siblings, 1 reply; 27+ messages in thread

From: Christoph Hellwig @ 2023-03-22 8:38 UTC (permalink / raw)
To: Josef Bacik
Cc: Christopher Price, slyich, anand.jain, boris, clm, dsterba, linux-btrfs, regressions, regressions

On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote:
> We got the defaults based on our testing with our workloads inside of
> FB.  Clearly this isn't representative of normal desktop usage, but
> we've also got a lot of workloads, so we figured if it made the whole
> fleet happy it would probably be fine everywhere.
>
> That being said, this is tunable for a reason; your workload seems to
> generate a lot of freed extents and discards.  We can probably adjust
> the async code to pause discarding if there's no other activity
> happening on the device at the moment, but tuning it to let more
> discards through at a time is also completely valid.  Thanks,

FYI, discard performance differs a lot between different SSDs.
It used to be pretty horrible for most devices early on, and then a
certain hyperscaler started requiring decent performance for enterprise
drives, so many of them are good now.  A lot less so for the typical
consumer drive, especially at the lower end of the spectrum.

And that's just NVMe; the still-shipping SATA SSDs are a different
story again, not helped by the fact that we don't even support ranged
discards for them in Linux.
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-03-22  8:38   ` Christoph Hellwig
@ 2023-03-23 22:26     ` Sergei Trofimovich
  2023-04-04 10:49       ` Linux regression tracking (Thorsten Leemhuis)
  0 siblings, 1 reply; 27+ messages in thread

From: Sergei Trofimovich @ 2023-03-23 22:26 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs, regressions, regressions

On Wed, 22 Mar 2023 01:38:42 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote:
> > We got the defaults based on our testing with our workloads inside of
> > FB.  Clearly this isn't representative of normal desktop usage, but
> > we've also got a lot of workloads, so we figured if it made the whole
> > fleet happy it would probably be fine everywhere.
> >
> > That being said, this is tunable for a reason; your workload seems to
> > generate a lot of freed extents and discards.  We can probably adjust
> > the async code to pause discarding if there's no other activity
> > happening on the device at the moment, but tuning it to let more
> > discards through at a time is also completely valid.  Thanks,
>
> FYI, discard performance differs a lot between different SSDs.
> It used to be pretty horrible for most devices early on, and then a
> certain hyperscaler started requiring decent performance for enterprise
> drives, so many of them are good now.  A lot less so for the typical
> consumer drive, especially at the lower end of the spectrum.
>
> And that's just NVMe; the still-shipping SATA SSDs are a different
> story again, not helped by the fact that we don't even support ranged
> discards for them in Linux.

Josef, what did you use as a signal to decide what value was good
enough?  Did you crank up the number until the discard backlog cleared
in a reasonable time?

I still don't understand what I should take into account to change the
default, or whether I should change it at all.  Is it fine if it takes a
week to get through the discard backlog?  (Or a day?  An hour?  A minute?)

Is it fine to send discards as fast as the device allows instead of doing
10 IOPS?  Does the IOPS limit account for a device-wear tradeoff?  Then a
low IOPS limit makes sense.  Or is the IOPS limit just a way to reserve
most of the bandwidth for non-discard workloads?  Then I would say
unlimited IOPS would make more sense as a default for btrfs.

--

  Sergei
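[Editor's note: to put Sergei's "a week? a day? a minute?" question in concrete terms, here is a back-of-envelope sketch. The backlog size of one million extents is hypothetical; only the 10 IOPS default and the ~6000 tuned value come from this thread:

```shell
# Estimated hours to drain a discard backlog of N requests at a fixed
# rate of iops_limit discard requests per second.
drain_hours() {
    awk -v n="$1" -v iops="$2" 'BEGIN { printf "%.2f\n", n / iops / 3600 }'
}

drain_hours 1000000 10      # default limit:  prints 27.78 (over a day)
drain_hours 1000000 6000    # tuned limit:    prints 0.05  (a few minutes)
```

So at the shipped default, even a moderate backlog is metered out over many hours of low-level background I/O, which is exactly the "storm on idle" being reported.]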
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-03-23 22:26     ` Sergei Trofimovich
@ 2023-04-04 10:49       ` Linux regression tracking (Thorsten Leemhuis)
  2023-04-04 16:04         ` Christoph Hellwig
  2023-04-04 18:23         ` Boris Burkov
  0 siblings, 2 replies; 27+ messages in thread

From: Linux regression tracking (Thorsten Leemhuis) @ 2023-04-04 10:49 UTC (permalink / raw)
To: Sergei Trofimovich, Christoph Hellwig
Cc: Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs, regressions

On 23.03.23 23:26, Sergei Trofimovich wrote:
> On Wed, 22 Mar 2023 01:38:42 -0700
> Christoph Hellwig <hch@infradead.org> wrote:
>
>> On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote:
>>> We got the defaults based on our testing with our workloads inside of
>>> FB.  Clearly this isn't representative of normal desktop usage, but
>>> we've also got a lot of workloads, so we figured if it made the whole
>>> fleet happy it would probably be fine everywhere.
>>>
>>> That being said, this is tunable for a reason; your workload seems to
>>> generate a lot of freed extents and discards.  We can probably adjust
>>> the async code to pause discarding if there's no other activity
>>> happening on the device at the moment, but tuning it to let more
>>> discards through at a time is also completely valid.  Thanks,

BTW, there is another report about this issue here:
https://bugzilla.redhat.com/show_bug.cgi?id=2182228
/me wonders if there is an openSUSE report as well, but a quick search
didn't find one

And as a fun fact, or for completeness: the issue even made it to
reddit, too:
https://www.reddit.com/r/archlinux/comments/121htxn/btrfs_discard_storm_on_62x_kernel/

>> FYI, discard performance differs a lot between different SSDs.
>> It used to be pretty horrible for most devices early on, and then a
>> certain hyperscaler started requiring decent performance for enterprise
>> drives, so many of them are good now.  A lot less so for the typical
>> consumer drive, especially at the lower end of the spectrum.
>>
>> And that's just NVMe; the still-shipping SATA SSDs are a different
>> story again, not helped by the fact that we don't even support ranged
>> discards for them in Linux.

Thx for your comments Christoph.  Quick question, just to be sure I
understand things properly:

I assume on some of those problematic devices these discard storms will
lead to a performance regression?

I also heard people saying these discard storms might reduce the
lifetime of some devices - is that true?

If the answer to at least one of these is "yes", I'd say it might be
best to revert 63a7cb130718 for now.

> Josef, what did you use as a signal to decide what value was good
> enough?  Did you crank up the number until the discard backlog cleared
> in a reasonable time?
>
> I still don't understand what I should take into account to change the
> default, or whether I should change it at all.  Is it fine if it takes a
> week to get through the discard backlog?  (Or a day?  An hour?  A minute?)
>
> Is it fine to send discards as fast as the device allows instead of doing
> 10 IOPS?  Does the IOPS limit account for a device-wear tradeoff?  Then a
> low IOPS limit makes sense.  Or is the IOPS limit just a way to reserve
> most of the bandwidth for non-discard workloads?  Then I would say
> unlimited IOPS would make more sense as a default for btrfs.

/me would be interested in answers to these questions as well

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 10:49       ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-04-04 16:04         ` Christoph Hellwig
  2023-04-04 16:20           ` Roman Mamedov
  2023-04-04 18:15           ` Chris Mason
  0 siblings, 2 replies; 27+ messages in thread

From: Christoph Hellwig @ 2023-04-04 16:04 UTC (permalink / raw)
To: Linux regressions mailing list
Cc: Sergei Trofimovich, Christoph Hellwig, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs

On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> >> And that's just NVMe; the still-shipping SATA SSDs are a different
> >> story again, not helped by the fact that we don't even support ranged
> >> discards for them in Linux.
>
> Thx for your comments Christoph.  Quick question, just to be sure I
> understand things properly:
>
> I assume on some of those problematic devices these discard storms will
> lead to a performance regression?

Probably.

> I also heard people saying these discard storms might reduce the
> lifetime of some devices - is that true?

Also very much possible.  There are various SSDs that treat a discard
as a write zeroes and always return zeroes from all discarded blocks.
If the discards are smaller than or not aligned to the internal erase
(super)blocks, this will actually cause additional writes.

> If the answer to at least one of these is "yes", I'd say it might be
> best to revert 63a7cb130718 for now.

I don't think enabling it is a very smart idea for most consumer
devices.
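[Editor's note: Christoph's point about discards being smaller than, or misaligned to, the internal erase blocks can be sanity-checked against what a device advertises to the kernel. A quick look, with device names as examples (note the advertised granularity is what the driver reports, which does not necessarily match the flash's real erase-block size):

```shell
# Per-device discard properties: DISC-GRAN is the advertised discard
# granularity, DISC-MAX the largest single discard the device accepts.
lsblk --discard

# The same information via sysfs (nvme0n1 is an example device name).
cat /sys/block/nvme0n1/queue/discard_granularity
cat /sys/block/nvme0n1/queue/discard_max_bytes
```

A granularity of 0 means the device does not support discard at all.]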
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 16:04         ` Christoph Hellwig
@ 2023-04-04 16:20           ` Roman Mamedov
  2023-04-04 16:27             ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread

From: Roman Mamedov @ 2023-04-04 16:20 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs

On Tue, 4 Apr 2023 09:04:03 -0700
Christoph Hellwig <hch@infradead.org> wrote:

> > I also heard people saying these discard storms might reduce the
> > lifetime of some devices - is that true?
>
> Also very much possible.  There are various SSDs that treat a discard
> as a write zeroes and always return zeroes from all discarded blocks.
> If the discards are smaller than or not aligned to the internal erase
> (super)blocks, this will actually cause additional writes.

SSDs do not physically erase blocks on discard; that would be very slow.

Instead they nuke the corresponding records in the Flash Translation Layer
(FTL) tables, so that the discarded areas point "nowhere" instead of at the
actual stored blocks.  When facing such pointers while resolving read
requests, the controller knows to just return zeroes.

Of course there can be varying behaviors per SSD; e.g. I know of some that
return random garbage instead of zeroes, and some which for a puzzling
reason prefer to return the byte FF instead.  But I think the first point
above should be universal; I am pretty certain there are none where a
discard/TRIM would take time comparable to "dd if=/dev/zero of=/dev/ssd"
(which would make it unusable in practice).

--
With respect,
Roman
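[Editor's note: whether a drive promises zeroes, a stable pattern, or nothing in particular after TRIM is actually advertised in the ATA identify data. For a SATA device it can be inspected like this; the device name is an example and the exact wording of hdparm's output varies between versions:

```shell
# "Deterministic read ZEROs after TRIM" (RZAT) means reads of trimmed
# ranges return zeroes; "Deterministic read data after TRIM" (DRAT)
# only promises a stable value, not necessarily zeroes.  Needs root.
sudo hdparm -I /dev/sda | grep -i trim
```

NVMe exposes the equivalent via the DLFEAT field of the namespace identify data.]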
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 16:20           ` Roman Mamedov
@ 2023-04-04 16:27             ` Christoph Hellwig
  2023-04-04 23:37               ` Damien Le Moal
  0 siblings, 1 reply; 27+ messages in thread

From: Christoph Hellwig @ 2023-04-04 16:27 UTC (permalink / raw)
To: Roman Mamedov
Cc: Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs

On Tue, Apr 04, 2023 at 09:20:27PM +0500, Roman Mamedov wrote:
> SSDs do not physically erase blocks on discard; that would be very slow.
>
> Instead they nuke the corresponding records in the Flash Translation Layer
> (FTL) tables, so that the discarded areas point "nowhere" instead of at the
> actual stored blocks.  When facing such pointers while resolving read
> requests, the controller knows to just return zeroes.

Of course they don't erase blocks on every discard (although if you look
long enough you'll probably find a worst-case implementation that does
this anyway).  But you still need to persist your FTL changes, and the
zeroing, if any, must be done by the time you get a FLUSH command,
because without that you'd return different data when reading a block
after a powerfail (i.e. the old data) than before (zeros or a pattern),
which is a no-go.

> Of course there can be varying behaviors per SSD; e.g. I know of some that
> return random garbage instead of zeroes, and some which for a puzzling
> reason prefer to return the byte FF instead.

All of that is valid behavior per the relevant standards.

> But I think the first point above should be universal; I am pretty
> certain there are none where a discard/TRIM would take time comparable
> to "dd if=/dev/zero of=/dev/ssd" (which would make it unusable in
> practice).

This is wishful thinking :)  SSDs generally optimize the fast path very
heavily, so slow-path commands, even when they should in theory be faster
due to the underlying optimizations, might not be, as they are processed
in software instead of hardware offloads, moved to slower cores, etc.

For discard, things have gotten a lot better in the last few years, but
for many older devices performance can be outright horrible.

For SATA SSDs, the fact that classic TRIM isn't a queued command adds
insult to injury, as it always means draining the queue first and not
issuing any I/O until the TRIM command is done.  There is an FPDMA-queued
version now, but I don't think it was all that widely implemented before
SATA SSDs fell out of favour.
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 16:27             ` Christoph Hellwig
@ 2023-04-04 23:37               ` Damien Le Moal
  0 siblings, 0 replies; 27+ messages in thread

From: Damien Le Moal @ 2023-04-04 23:37 UTC (permalink / raw)
To: Christoph Hellwig, Roman Mamedov
Cc: Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs

On 4/5/23 01:27, Christoph Hellwig wrote:
> On Tue, Apr 04, 2023 at 09:20:27PM +0500, Roman Mamedov wrote:
>> SSDs do not physically erase blocks on discard; that would be very slow.
>>
>> Instead they nuke the corresponding records in the Flash Translation Layer
>> (FTL) tables, so that the discarded areas point "nowhere" instead of at the
>> actual stored blocks.  When facing such pointers while resolving read
>> requests, the controller knows to just return zeroes.
>
> Of course they don't erase blocks on every discard (although if you look
> long enough you'll probably find a worst-case implementation that does
> this anyway).  But you still need to persist your FTL changes, and the
> zeroing, if any, must be done by the time you get a FLUSH command,
> because without that you'd return different data when reading a block
> after a powerfail (i.e. the old data) than before (zeros or a pattern),
> which is a no-go.
>
>> Of course there can be varying behaviors per SSD; e.g. I know of some that
>> return random garbage instead of zeroes, and some which for a puzzling
>> reason prefer to return the byte FF instead.
>
> All of that is valid behavior per the relevant standards.
>
>> But I think the first point above should be universal; I am pretty
>> certain there are none where a discard/TRIM would take time comparable
>> to "dd if=/dev/zero of=/dev/ssd" (which would make it unusable in
>> practice).
>
> This is wishful thinking :)  SSDs generally optimize the fast path very
> heavily, so slow-path commands, even when they should in theory be faster
> due to the underlying optimizations, might not be, as they are processed
> in software instead of hardware offloads, moved to slower cores, etc.
>
> For discard, things have gotten a lot better in the last few years, but
> for many older devices performance can be outright horrible.
>
> For SATA SSDs, the fact that classic TRIM isn't a queued command adds
> insult to injury, as it always means draining the queue first and not
> issuing any I/O until the TRIM command is done.  There is an FPDMA-queued
> version now, but I don't think it was all that widely implemented before
> SATA SSDs fell out of favour.

Not to mention that many of the SATA SSDs actually implementing queued
trim are buggy.  See the ATA horkage flag ATA_HORKAGE_NO_NCQ_TRIM and
the many consumer SSDs that need that flag.
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 16:04         ` Christoph Hellwig
  2023-04-04 16:20           ` Roman Mamedov
@ 2023-04-04 18:15           ` Chris Mason
  2023-04-04 18:51             ` Boris Burkov
  ` (2 more replies)
  1 sibling, 3 replies; 27+ messages in thread

From: Chris Mason @ 2023-04-04 18:15 UTC (permalink / raw)
To: Christoph Hellwig, Linux regressions mailing list
Cc: Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs

On 4/4/23 12:04 PM, Christoph Hellwig wrote:
> On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>> And that's just NVMe; the still-shipping SATA SSDs are a different
>>>> story again, not helped by the fact that we don't even support ranged
>>>> discards for them in Linux.
>>
>> Thx for your comments Christoph.  Quick question, just to be sure I
>> understand things properly:
>>
>> I assume on some of those problematic devices these discard storms will
>> lead to a performance regression?

I'm searching through the various threads, but I don't think I've seen
the discard storm quantified?

Boris sent me this:
https://lore.kernel.org/linux-btrfs/ZCxP%2Fll7YjPdb9Ou@infradead.org/T/#m65851e5b8b0caa5320d2b7e322805dd200686f01

Which seems to match the 10 discards per second setting?  We should be
doing more of a dribble than a storm, so I'd like to understand if this
is a separate bug that should be fixed.

>
> Probably.
>
>> I also heard people saying these discard storms might reduce the
>> lifetime of some devices - is that true?
>
> Also very much possible.  There are various SSDs that treat a discard
> as a write zeroes and always return zeroes from all discarded blocks.
> If the discards are smaller than or not aligned to the internal erase
> (super)blocks, this will actually cause additional writes.
>
>> If the answer to at least one of these is "yes", I'd say it might be
>> best to revert 63a7cb130718 for now.
>
> I don't think enabling it is a very smart idea for most consumer
> devices.

It seems like a good time to talk through the variations of discard
usage in FB data centers.  We run a pretty wide variety of hardware,
from consumer-grade SSDs to enterprise SSDs, and we've run these on
ext4/btrfs/xfs.

(Christoph knows most of this already, so I'm only partially replying
to him here.)

First, there was synchronous discard.  These were pretty dark times,
because all three of our filesystems would build a batch of synchronous
discards and then wait for them during filesystem commit.  There were
long tail latencies across all of our workloads, so workload owners
would turn off discard and declare victory over terrible latencies.

Of course this predictably ends up with GC on the drives leading to
terrible latencies because we weren't discarding anymore, and nightly
trims are the obvious answer.  Different workloads would gyrate through
the variations, and the only consistent result was unhappiness.

Some places in the fleet still do this, and it can be a pretty simple
tradeoff between the IO impact of full-drive trims vs the latency
impact of built-up GC vs over-provisioning.  It works for consistent
workloads, but honestly there aren't many of those.

Along the way both btrfs and xfs have grown variations of async discard.
The XFS one (sorry if I'm out of date here) didn't include any kind of
rate limiting, so if you were bulk-deleting a lot of data, XFS would
effectively queue up so many discards that it actually saturated the
device for a long time, starving reads and writes.  If your workload did
a constant stream of allocation and deletion, the async discards would
just saturate the drive forever.

The workloads that care about latencies on XFS ended up going back to
synchronous discards, and they do a slow-rm hack that nibbles away at
the ends of files with periodic fsyncs mixed in until the file is zero
length.  They love this and it makes me cry.

The btrfs async discard feature was meant to address both of these
cases.  The primary features:

- Get rid of the transaction commit latency
- Enable allocations to steal from discards, reducing discard IO
- Avoid saturating the devices with discards by metering them out

Christoph mentions that modern enterprise drives are much better at
discarding, and we see this in production too.  But we still have
workloads that switched from XFS to btrfs because the async discard
feature did a better job of reducing drive write-amp and latencies.

So, honestly, from my POV async discard is best suited to consumer
devices.  Our defaults are probably wrong, because no matter what you
choose there's a drive out there that makes it look bad.  Also, laptops
probably don't want the slow dribble.

I know Boris has some ideas on how to make the defaults better, so I'll
let him chime in there.

-chris
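[Editor's note: on Chris's question of quantifying the storm: both the queued discard work and the rate at which discards actually hit the device can be observed from userspace. A sketch; the mount point and device name are examples, and the sysfs file names are from the btrfs per-filesystem discard directory:

```shell
UUID=$(findmnt -no UUID /)

# How much work is currently queued for async discard.
cat "/sys/fs/btrfs/$UUID/discard/discardable_extents"
cat "/sys/fs/btrfs/$UUID/discard/discardable_bytes"

# Watch discards reaching the device: with a recent sysstat, iostat's
# extended output includes d/s (discard requests per second).
iostat -dx 1 /dev/nvme0n1
```

At the default 10 IOPS limit one would expect d/s to hover around 10 while the backlog above drains, which matches the "dribble, not storm" behavior Chris describes.]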
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 18:15           ` Chris Mason
@ 2023-04-04 18:51             ` Boris Burkov
  2023-04-04 19:22               ` David Sterba
  2023-04-10  2:03               ` Michael Bromilow
  0 siblings, 2 replies; 27+ messages in thread

From: Boris Burkov @ 2023-04-04 18:51 UTC (permalink / raw)
To: Chris Mason
Cc: Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs

On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote:
> On 4/4/23 12:04 PM, Christoph Hellwig wrote:
> > I don't think enabling it is a very smart idea for most consumer
> > devices.
>
> It seems like a good time to talk through the variations of discard
> usage in FB data centers.  We run a pretty wide variety of hardware,
> from consumer-grade SSDs to enterprise SSDs, and we've run these on
> ext4/btrfs/xfs.
>
> First, there was synchronous discard.  These were pretty dark times,
> because all three of our filesystems would build a batch of synchronous
> discards and then wait for them during filesystem commit.  There were
> long tail latencies across all of our workloads, so workload owners
> would turn off discard and declare victory over terrible latencies.
>
> Of course this predictably ends up with GC on the drives leading to
> terrible latencies because we weren't discarding anymore, and nightly
> trims are the obvious answer.  Different workloads would gyrate through
> the variations, and the only consistent result was unhappiness.
>
> Some places in the fleet still do this, and it can be a pretty simple
> tradeoff between the IO impact of full-drive trims vs the latency
> impact of built-up GC vs over-provisioning.  It works for consistent
> workloads, but honestly there aren't many of those.
>
> Along the way both btrfs and xfs have grown variations of async discard.
> The XFS one (sorry if I'm out of date here) didn't include any kind of
> rate limiting, so if you were bulk-deleting a lot of data, XFS would
> effectively queue up so many discards that it actually saturated the
> device for a long time, starving reads and writes.  If your workload did
> a constant stream of allocation and deletion, the async discards would
> just saturate the drive forever.
>
> The workloads that care about latencies on XFS ended up going back to
> synchronous discards, and they do a slow-rm hack that nibbles away at
> the ends of files with periodic fsyncs mixed in until the file is zero
> length.  They love this and it makes me cry.
>
> The btrfs async discard feature was meant to address both of these
> cases.  The primary features:
>
> - Get rid of the transaction commit latency
> - Enable allocations to steal from discards, reducing discard IO
> - Avoid saturating the devices with discards by metering them out
>
> Christoph mentions that modern enterprise drives are much better at
> discarding, and we see this in production too.  But we still have
> workloads that switched from XFS to btrfs because the async discard
> feature did a better job of reducing drive write-amp and latencies.
>
> So, honestly, from my POV async discard is best suited to consumer
> devices.  Our defaults are probably wrong, because no matter what you
> choose there's a drive out there that makes it look bad.  Also, laptops
> probably don't want the slow dribble.
>
> I know Boris has some ideas on how to make the defaults better, so I'll
> let him chime in there.
>
> -chris

Our reasonable options, as I see them:
- Go back to nodiscard and rely on periodic trims from the OS.
- Leave the low iops_limit, accept that drives stay busy unexpectedly
  long, conclude that that's OK, and communicate the tuning/measurement
  options better.
- Set a high iops_limit (e.g. 1000) so drives get back to idle faster.
- Change an unset iops_limit to mean truly unlimited async discard, set
  that as the default, and let anyone who cares to meter it set an
  iops_limit.

The regression here is in drive idle time, due to a modest amount of
discard getting metered out over minutes rather than dealt with
relatively quickly.  So I would favor the unlimited async discard mode,
and will send a patch to that effect which we can discuss.

IMO, the periodic discard cron screwing up your box once a week or once
a day or whatever is a pretty bad user experience as well, as is
randomly hitting bad latencies because you haven't been discarding
often enough.

Boris
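[Editor's note: the "periodic trims from the OS" option Boris mentions is typically the systemd fstrim timer (or an equivalent cron job). For reference, a sketch of how it is enabled and exercised on a systemd distro; the mount point is an example:

```shell
# Trim all mounted filesystems that support discard, once a week by default.
sudo systemctl enable --now fstrim.timer

# One-shot, verbose trim of a single filesystem, useful for seeing how
# much discardable space has accumulated.
sudo fstrim -v /
```

This is the batch model whose once-a-week latency spike Boris contrasts with continuous async discard.]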
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
  2023-04-04 18:51             ` Boris Burkov
@ 2023-04-04 19:22               ` David Sterba
  2023-04-04 19:39                 ` Boris Burkov
  1 sibling, 1 reply; 27+ messages in thread

From: David Sterba @ 2023-04-04 19:22 UTC (permalink / raw)
To: Boris Burkov
Cc: Chris Mason, Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs

On Tue, Apr 04, 2023 at 11:51:51AM -0700, Boris Burkov wrote:
> Our reasonable options, as I see them:
> - Go back to nodiscard and rely on periodic trims from the OS.

We had that before and it's a fallback in case we can't fix it, but the
problem would still persist for anyone enabling async discard, so it's
only limiting the impact.

> - Leave the low iops_limit, accept that drives stay busy unexpectedly
>   long, conclude that that's OK, and communicate the tuning/measurement
>   options better.

This does not sound user friendly; tuning should be possible but not
required by default.  We already have enough other things that users need
to decide, and in this case I don't know if there's enough information to
even make a good decision upfront.

> - Set a high iops_limit (e.g. 1000) so drives get back to idle faster.
> - Change an unset iops_limit to mean truly unlimited async discard, set
>   that as the default, and let anyone who cares to meter it set an
>   iops_limit.
>
> The regression here is in drive idle time, due to a modest amount of
> discard getting metered out over minutes rather than dealt with
> relatively quickly.  So I would favor the unlimited async discard mode,
> and will send a patch to that effect which we can discuss.

Can we do options 3 and 4, i.e. set a high iops limit so that the batch
gets processed faster, and (4) have the manual override to drop the
limit completely?
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 19:22 ` David Sterba @ 2023-04-04 19:39 ` Boris Burkov 2023-04-05 8:17 ` Linux regression tracking (Thorsten Leemhuis) 0 siblings, 1 reply; 27+ messages in thread From: Boris Burkov @ 2023-04-04 19:39 UTC (permalink / raw) To: David Sterba Cc: Chris Mason, Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs On Tue, Apr 04, 2023 at 09:22:05PM +0200, David Sterba wrote: > On Tue, Apr 04, 2023 at 11:51:51AM -0700, Boris Burkov wrote: > > Our reasonable options, as I see them: > > - back to nodiscard, rely on periodic trims from the OS. > > We had that before and it's a fallback in case we can't fix it but still > the problem would persist for anyone enabling async discard so it's only > limiting the impact. > > > - leave low iops_limit, drives stay busy unexpectedly long, conclude that > > that's OK, and communicate the tuning/measurement options better. > > This does not sound user friendly, tuning should be possible but not > required by default. We already have enough other things that users need > to decide and in this case I don't know if there's enough information to > even make a good decision upfront. > > > - set a high iops_limit (e.g. 1000) drives will get back to idle faster. > > - change an unset iops_limit to mean truly unlimited async discard, set > > that as the default, and anyone who cares to meter it can set an > > iops_limit. > > > > The regression here is in drive idle time due to modest discard getting > > metered out over minutes rather than dealt with relatively quickly. So > > I would favor the unlimited async discard mode and will send a patch to > > that effect which we can discuss. > > Can we do options 3 and 4, i.e. set a high iops so that the batch gets > processed faster and (4) that there's the manual override to drop the > limit completely? 
Yup, I'm on it. Chris also had a good idea for a "time since commit" check to further limit how long async discard would crank for. This would look something like "go as fast as I'm allowed for 5s after every transaction". It could lead to a big buildup that requires a periodic trim to clear. I'm playing with adding that as an additional tunable which would apply if set. Boris ^ permalink raw reply [flat|nested] 27+ messages in thread
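Boris's "time since commit" idea — run discard at full speed for a short window after each transaction commit, then fall back to the metered rate — could be sketched roughly as below. This is a hypothetical model of a tunable that does not exist yet; the class and method names are invented:

```python
import time

class CommitWindowGate:
    """Allow unmetered discard for `window_s` seconds after each
    transaction commit, then fall back to the metered rate."""

    def __init__(self, window_s=5.0):
        self.window_s = window_s
        self.last_commit = float("-inf")  # no commit seen yet

    def on_commit(self):
        # Called when a transaction commits; opens the full-speed window.
        self.last_commit = time.monotonic()

    def full_speed_allowed(self):
        # True while we are inside the post-commit window.
        return time.monotonic() - self.last_commit < self.window_s
```

As Boris notes, metering everything outside such a window could let a backlog build up that only a periodic trim would clear, which is why it would be an additional tunable rather than a replacement.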
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 19:39 ` Boris Burkov @ 2023-04-05 8:17 ` Linux regression tracking (Thorsten Leemhuis) 0 siblings, 0 replies; 27+ messages in thread From: Linux regression tracking (Thorsten Leemhuis) @ 2023-04-05 8:17 UTC (permalink / raw) To: Boris Burkov, David Sterba Cc: Chris Mason, Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs On 04.04.23 21:39, Boris Burkov wrote: > On Tue, Apr 04, 2023 at 09:22:05PM +0200, David Sterba wrote: >> On Tue, Apr 04, 2023 at 11:51:51AM -0700, Boris Burkov wrote: >>> Our reasonable options, as I see them: >>> - back to nodiscard, rely on periodic trims from the OS. >> >> We had that before and it's a fallback in case we can't fix it but still >> the problem would persist for anyone enabling async discard so it's only >> limiting the impact. >> >>> - leave low iops_limit, drives stay busy unexpectedly long, conclude that >>> that's OK, and communicate the tuning/measurement options better. >> >> This does not sound user friendly, tuning should be possible but not >> required by default. We already have enough other things that users need >> to decide and in this case I don't know if there's enough information to >> even make a good decision upfront. >> >>> - set a high iops_limit (e.g. 1000) drives will get back to idle faster. >>> - change an unset iops_limit to mean truly unlimited async discard, set >>> that as the default, and anyone who cares to meter it can set an >>> iops_limit. >>> >>> The regression here is in drive idle time due to modest discard getting >>> metered out over minutes rather than dealt with relatively quickly. So >>> I would favor the unlimited async discard mode and will send a patch to >>> that effect which we can discuss. >> >> Can we do options 3 and 4, i.e. 
set a high iops so that the batch gets >> processed faster and (4) that there's the manual override to drop the >> limit completely? > > Yup, I'm on it. > [...] Thx everyone for looking into this and the fruitful discussion yesterday, much appreciated. /me will now get back to watching this only from afar and leave things to the experts Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 18:51 ` Boris Burkov 2023-04-04 19:22 ` David Sterba @ 2023-04-10 2:03 ` Michael Bromilow 2023-04-11 17:52 ` David Sterba 1 sibling, 1 reply; 27+ messages in thread From: Michael Bromilow @ 2023-04-10 2:03 UTC (permalink / raw) To: Boris Burkov, Chris Mason Cc: Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs On 04/04/2023 19:51, Boris Burkov wrote: > On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote: >> So, honestly from my POV the async discard is best suited to consumer >> devices. Our defaults are probably wrong because no matter what you >> choose there's a drive out there that makes it look bad. Also, laptops >> probably don't want the slow dribble. > > Our reasonable options, as I see them: > - back to nodiscard, rely on periodic trims from the OS. > - leave low iops_limit, drives stay busy unexpectedly long, conclude that > that's OK, and communicate the tuning/measurement options better. > - set a high iops_limit (e.g. 1000) drives will get back to idle faster. > - change an unset iops_limit to mean truly unlimited async discard, set > that as the default, and anyone who cares to meter it can set an > iops_limit. > > The regression here is in drive idle time due to modest discard getting > metered out over minutes rather than dealt with relatively quickly. So > I would favor the unlimited async discard mode and will send a patch to > that effect which we can discuss. Will power usage be taken into consideration? I only noticed this regression myself when my laptop started to draw a constant extra ~10W from the constant drive activity, so I imagine other laptop users would also prefer a default which avoids this if possible. 
If "relatively quickly" means anything more than a few seconds I could see that causing rather significant battery life reductions. - Michael ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-10 2:03 ` Michael Bromilow @ 2023-04-11 17:52 ` David Sterba 2023-04-11 18:15 ` Linux regression tracking (Thorsten Leemhuis) 0 siblings, 1 reply; 27+ messages in thread From: David Sterba @ 2023-04-11 17:52 UTC (permalink / raw) To: Michael Bromilow Cc: Boris Burkov, Chris Mason, Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs On Mon, Apr 10, 2023 at 03:03:37AM +0100, Michael Bromilow wrote: > On 04/04/2023 19:51, Boris Burkov wrote: > > On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote: > >> So, honestly from my POV the async discard is best suited to consumer > >> devices. Our defaults are probably wrong because no matter what you > >> choose there's a drive out there that makes it look bad. Also, laptops > >> probably don't want the slow dribble. > > > > Our reasonable options, as I see them: > > - back to nodiscard, rely on periodic trims from the OS. > > - leave low iops_limit, drives stay busy unexpectedly long, conclude that > > that's OK, and communicate the tuning/measurement options better. > > - set a high iops_limit (e.g. 1000) drives will get back to idle faster. > > - change an unset iops_limit to mean truly unlimited async discard, set > > that as the default, and anyone who cares to meter it can set an > > iops_limit. > > > > The regression here is in drive idle time due to modest discard getting > > metered out over minutes rather than dealt with relatively quickly. So > > I would favor the unlimited async discard mode and will send a patch to > > that effect which we can discuss. > > Will power usage be taken into consideration? 
I only noticed this regression > myself when my laptop started to draw a constant extra ~10W from the constant > drive activity, so I imagine other laptop users would also prefer a default > which avoids this if possible. If "relatively quickly" means anything more > than a few seconds I could see that causing rather significant battery life > reductions. This may need to be configured from userspace, e.g. from laptop-mode-tools, I'm not sure if any of the power usage info is available from kernel. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-11 17:52 ` David Sterba @ 2023-04-11 18:15 ` Linux regression tracking (Thorsten Leemhuis) 0 siblings, 0 replies; 27+ messages in thread From: Linux regression tracking (Thorsten Leemhuis) @ 2023-04-11 18:15 UTC (permalink / raw) To: dsterba, Michael Bromilow Cc: Boris Burkov, Chris Mason, Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs On 11.04.23 19:52, David Sterba wrote: > On Mon, Apr 10, 2023 at 03:03:37AM +0100, Michael Bromilow wrote: >> On 04/04/2023 19:51, Boris Burkov wrote: >>> On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote: >>>> So, honestly from my POV the async discard is best suited to consumer >>>> devices. Our defaults are probably wrong because no matter what you >>>> choose there's a drive out there that makes it look bad. Also, laptops >>>> probably don't want the slow dribble. >>> >>> Our reasonable options, as I see them: >>> - back to nodiscard, rely on periodic trims from the OS. >>> - leave low iops_limit, drives stay busy unexpectedly long, conclude that >>> that's OK, and communicate the tuning/measurement options better. >>> - set a high iops_limit (e.g. 1000) drives will get back to idle faster. >>> - change an unset iops_limit to mean truly unlimited async discard, set >>> that as the default, and anyone who cares to meter it can set an >>> iops_limit. >>> >>> The regression here is in drive idle time due to modest discard getting >>> metered out over minutes rather than dealt with relatively quickly. So >>> I would favor the unlimited async discard mode and will send a patch to >>> that effect which we can discuss. >> >> Will power usage be taken into consideration? 
I only noticed this regression >> myself when my laptop started to draw a constant extra ~10W from the constant >> drive activity, so I imagine other laptop users would also prefer a default >> which avoids this if possible. If "relatively quickly" means anything more >> than a few seconds I could see that causing rather significant battery life >> reductions. > > This may need to be configured from userspace, e.g. from > laptop-mode-tools, I'm not sure if any of the power usage info is > available from kernel. Well, from the point of the regression tracker this looks a bit different: if a system draws a constant extra ~10W after a kernel update, it's a regression that needs to be fixed. But in the end it's of course Linus' opinion that matters. But whatever, Boris afaics is looking for an improved solution that hopefully avoids the extra power consumption and at the same time does the right thing on modern enterprise storage devices, so that the defaults work for the majority of the users. Hopefully this will work out. :-D Ciao, Thorsten ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 18:15 ` Chris Mason 2023-04-04 18:51 ` Boris Burkov @ 2023-04-04 19:08 ` Sergei Trofimovich 2023-04-05 6:18 ` Christoph Hellwig 2 siblings, 0 replies; 27+ messages in thread From: Sergei Trofimovich @ 2023-04-04 19:08 UTC (permalink / raw) To: Chris Mason Cc: Christoph Hellwig, Linux regressions mailing list, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs On Tue, 4 Apr 2023 14:15:38 -0400 Chris Mason <clm@meta.com> wrote: > On 4/4/23 12:04 PM, Christoph Hellwig wrote: > > On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote: > >>>> And that jut NVMe, the still shipping SATA SSDs are another different > >>>> story. Not helped by the fact that we don't even support ranged > >>>> discards for them in Linux. > >> > >> Thx for your comments Christoph. Quick question, just to be sure I > >> understand things properly: > >> > >> I assume on some of those problematic devices these discard storms will > >> lead to a performance regression? > > I'm searching through the various threads, but I don't think I've seen > the discard storm quantified? > > Boris sent me this: > https://lore.kernel.org/linux-btrfs/ZCxP%2Fll7YjPdb9Ou@infradead.org/T/#m65851e5b8b0caa5320d2b7e322805dd200686f01 > > Which seems to match the 10 discards per second setting? We should be > doing more of a dribble than a storm, so I'd like to understand if this > is a separate bug that should be fixed. I called it a storm based on the following LED behaviour: https://trofi.github.io/posts.data/281-a-signal-from-the-stars/01-shimmer.gif It's not a saturated device (as you can see the flashes on and off). But it's not idle either. "storm" was an exaggeration. Note that it takes linux 3-4 days to go through a backlog of 300GB of deleted files. Is 3-4 days expected? Or my setup is somehow unusual? 
Maybe `compress=zstd` makes things disproportionally problematic? > > Probably. > > > >> I also heard people saying these discard storms might reduce the life > >> time of some devices - is that true? > > > > Also very much possible. There are various SSDs that treat a discard > > as a write zeroes and always return zeroes from all discarded blocks. > > If the discards are smaller than or not aligned to the internal erase > > (super)blocks, this will actually cause additional writes. > > > >> If the answer to at least one of these is "yes" I'd say we it might be > >> best to revert 63a7cb130718 for now. > > > > I don't think enabling it is a very a smart idea for most consumer > > devices. > > It seems like a good time to talk through a variations of discard usage > in fb data centers. We run a pretty wide variety of hardware from > consumer grade ssds to enterprise ssds, and we've run these on > ext4/btrfs/xfs. > > (Christoph knows most of this already, so I'm only partially replying to > him here) > > First, there was synchronous discard. These were pretty dark times > because all three of our filesystems would build a batch of synchronous > discards and then wait for them during filesystem commit. There were > long tail latencies across all of our workloads, and so workload owners > would turn off discard and declare victory over terrible latencies. > > Of course this predictably ends up with GC on the drives leading to > terrible latencies because we weren't discarding anymore, and nightly > trims are the obvious answer. Different workloads would gyrate through > the variations and the only consistent result was unhappiness. > > Some places in the fleet still do this, and it can be a pretty simple > tradeoff between the IO impacts of full drive trims vs the latency > impact of built up GC vs over-provisioning. It works for consistent > workloads, but honestly there aren't many of those. 
> > Along the way both btrfs and xfs have grown variations of async discard. > The XFS one (sorry if I'm out of date here), didn't include any kind of > rate limiting, so if you were bulk deleting a lot of data, XFS would > effectively queue up so many discards that it actually saturated the > device for a long time, starving reads and writes. If your workload did > a constant stream of allocation and deletion, the async discards would > just saturate the drive forever. > > The workloads that care about latencies on XFS ended up going back to > synchronous discards, and they do a slow-rm hack that nibbles away at > the ends of files with periodic fsyncs mixed in until the file is zero > length. They love this and it makes me cry. > > The btrfs async discard feature was meant to address both of these > cases. The primary features: > > - Get rid of the transaction commit latency > - Enable allocations to steal from discards, reducing discard IO > - Avoid saturating the devices with discards by metering them out > > Christoph mentions that modern enterprise drives are much better at > discarding, and we see this in production too. But, we still have > workloads that switched from XFS to Btrfs because the async discard > feature did a better job of reducing drive write-amp and latencies. > > So, honestly from my POV the async discard is best suited to consumer > devices. Our defaults are probably wrong because no matter what you > choose there's a drive out there that makes it look bad. Also, laptops > probably don't want the slow dribble. > > I know Boris has some ideas on how to make the defaults better, so I'll > let him chime in there. That makes sense. Thank you! -- Sergei ^ permalink raw reply [flat|nested] 27+ messages in thread
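Sergei's 3-4 day figure above is roughly what the default metering would predict under one hypothetical assumption not stated in the thread: that the 300GB of deleted compressed data was fragmented into extents near 128 KiB (btrfs's maximum compressed extent size), and that each freed extent costs one discard at the default rate of ~10 discards per second:

```python
def drain_days(backlog_bytes, extent_bytes, discards_per_sec):
    """Days to drain a discard backlog at a fixed discard rate,
    assuming one discard per freed extent (a simplification)."""
    extents = backlog_bytes / extent_bytes
    return extents / discards_per_sec / 86_400  # seconds per day

# Hypothetical numbers: 300 GB backlog, 128 KiB compressed extents,
# default metering of ~10 discards per second.
days = drain_days(300e9, 128 * 1024, 10)  # ~2.6 days
```

Under those assumptions the estimate lands in the same ballpark as the observed 3-4 days, which would also explain why `compress=zstd` makes it worse: compression caps extent size and so multiplies the number of extents to discard.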
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 18:15 ` Chris Mason 2023-04-04 18:51 ` Boris Burkov 2023-04-04 19:08 ` Sergei Trofimovich @ 2023-04-05 6:18 ` Christoph Hellwig 2023-04-05 12:01 ` Chris Mason 2 siblings, 1 reply; 27+ messages in thread From: Christoph Hellwig @ 2023-04-05 6:18 UTC (permalink / raw) To: Chris Mason Cc: Christoph Hellwig, Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote: > It seems like a good time to talk through a variations of discard usage > in fb data centers. We run a pretty wide variety of hardware from > consumer grade ssds to enterprise ssds, and we've run these on > ext4/btrfs/xfs. Remember that what you guys call consumer grade is top-of-the-shelf consumer grade SSDs, or in fact low-end enterprise ones that are based on the latest and greatest top-shelf SSDs with firmware specifically tweaked for you. That is a whole different level from a random crappy no-name SSD from 5 years ago. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-05 6:18 ` Christoph Hellwig @ 2023-04-05 12:01 ` Chris Mason 0 siblings, 0 replies; 27+ messages in thread From: Chris Mason @ 2023-04-05 12:01 UTC (permalink / raw) To: Christoph Hellwig Cc: Linux regressions mailing list, Sergei Trofimovich, Josef Bacik, Christopher Price, anand.jain, boris, clm, dsterba, linux-btrfs On 4/5/23 2:18 AM, Christoph Hellwig wrote: > On Tue, Apr 04, 2023 at 02:15:38PM -0400, Chris Mason wrote: >> It seems like a good time to talk through a variations of discard usage >> in fb data centers. We run a pretty wide variety of hardware from >> consumer grade ssds to enterprise ssds, and we've run these on >> ext4/btrfs/xfs. > > Remember what you guys call consumer grade is the top of the shelve > consumer grade SSDs, or in fact low-end enterprise ones that are based > on the latest and greatest top shelve SSDs with firmware specifically > tweaked for you. That is a whole differnet level from random crappy > no-name SSD from 5 years ago. > This is definitely true, our providers are reliable and support the products well. But, you also have to factor in the write cycles, drive lifetimes, and just the constant threat of statistics at scale. Drives might come in off something close to the top shelf, but datacenters are pretty hard on storage. Going back to the fedora thread, we've definitely had ssds with reliability issues caused by trims, but we haven't had any I can think of where mount -o discard was fundamentally less reliable than periodic full device trimming. -chris ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 10:49 ` Linux regression tracking (Thorsten Leemhuis) 2023-04-04 16:04 ` Christoph Hellwig @ 2023-04-04 18:23 ` Boris Burkov 2023-04-04 19:12 ` Sergei Trofimovich 1 sibling, 1 reply; 27+ messages in thread From: Boris Burkov @ 2023-04-04 18:23 UTC (permalink / raw) To: Linux regression tracking (Thorsten Leemhuis) Cc: Sergei Trofimovich, Christoph Hellwig, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs, regressions On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote: > On 23.03.23 23:26, Sergei Trofimovich wrote: > > On Wed, 22 Mar 2023 01:38:42 -0700 > > Christoph Hellwig <hch@infradead.org> wrote: > > > >> On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote: > >>> We got the defaults based on our testing with our workloads inside of > >>> FB. Clearly this isn't representative of a normal desktop usage, but > >>> we've also got a lot of workloads so figured if it made the whole > >>> fleet happy it would probably be fine everywhere. > >>> > >>> That being said this is tunable for a reason, your workload seems to > >>> generate a lot of free'd extents and discards. We can probably mess > >>> with the async stuff to maybe pause discarding if there's no other > >>> activity happening on the device at the moment, but tuning it to let > >>> more discards through at a time is also completely valid. 
Thanks, > > BTW, there is another report about this issue here: > https://bugzilla.redhat.com/show_bug.cgi?id=2182228 > > /me wonders if there is a opensuse report as well, but a quick search > didn't find one > > And as fun fact or for completeness, the issue even made it to reddit, too: > https://www.reddit.com/r/archlinux/comments/121htxn/btrfs_discard_storm_on_62x_kernel/ Good find, but also: https://www.reddit.com/r/Fedora/comments/vjfpkv/periodic_trim_freezes_ssd/ So without harder data, there is a bit of bias inherent in cherrypicking negative impressions from the internet. > > >> FYI, discard performance differs a lot between different SSDs. > >> It used to be pretty horrible for most devices early on, and then a > >> certain hyperscaler started requiring decent performance for enterprise > >> drives, so many of them are good now. A lot less so for the typical > >> consumer drive, especially at the lower end of the spectrum. > >> > >> And that jut NVMe, the still shipping SATA SSDs are another different > >> story. Not helped by the fact that we don't even support ranged > >> discards for them in Linux. > > Thx for your comments Christoph. Quick question, just to be sure I > understand things properly: > > I assume on some of those problematic devices these discard storms will > lead to a performance regression? > > I also heard people saying these discard storms might reduce the life > time of some devices - is that true? > > If the answer to at least one of these is "yes" I'd say we it might be > best to revert 63a7cb130718 for now. > > > Josef, what did you use as a signal to detect what value was good > > enough? Did you crank up the number until discard backlog clears up in a > > reasonable time? Josef is OOO, so I'll try to clarify some things around async discard, hopefully it's helpful to anyone wondering how to tune it. 
Like you guessed, our tuning basically consists of looking at the discardable_extents/discardable_bytes metric in the fleet and ensuring it looks sane, and that we see an improvement in I/O tail latencies or fix some concrete instances of bad tail latencies. e.g. with iops_limit=10, we see concrete cases of bad latency go away and don't see a steady buildup of discardable_extents. > > > > I still don't understand what I should take into account to change the > > default and whether I should change it at all. Is it fine if the discard > > backlog takes a week to get through it? (Or a day? An hour? A minute?) I believe the relevant metrics are: - number of trims issued/bytes trimmed (you would measure this by tracing and by looking at discard_extent_bytes and discard_bitmap_bytes) - bytes "wrongly" trimmed. (extents that were reallocated without getting trimmed are exposed in discard_bytes_saved, so if that drops, you are maybe trimming things you may not have needed to) - discardable_extents/discardable_bytes (in sysfs; the outstanding work) - tail latency of file system operations - disk idle time By doing periodic trim you trade off better bytes_saved and better disk idle time (one big trim a week vs. "trim all the time") against worse tail latency during the trim itself, and you risk trimming too infrequently, leading to worse latency on a drive that needs a trim. > > > > Is it fine to send discards as fast as device allows instead of doing 10 > > IOPS? Does IOPS limit consider a device wearing tradeoff? Then low IOPS > > makes sense. Or IOPS limit is just a way to reserve most bandwidth to > > non-discard workloads? Then I would say unlimited IOPS as a default > > would make more sense for btrfs. Unfortunately, btrfs currently doesn't have a "fully unlimited" async discard no matter how you tune it. 
The basic premise of btrfs async discard is to meter out the discards at a steady rate, asynchronously from file system operations, so the effect of the tunables is to set that delay between discard operations. The delay is clamped between 1ms and 1000ms, so iops_limit > 1000 is the same as iops_limit = 1000. iops_limit=0 does not lead to unmetered discards, but rather hits a hardcoded case of metering them out over 6 hours. (no clue why, I don't personally love that...) Hope that's somewhat helpful, Boris > > /me would be interested in answers to these questions as well > > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) > -- > Everything you wanna know about Linux kernel regression tracking: > https://linux-regtracking.leemhuis.info/about/#tldr > If I did something stupid, please tell me, as explained on that page. ^ permalink raw reply [flat|nested] 27+ messages in thread
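The knobs and counters Boris names live in per-filesystem sysfs files; a small helper like the one below can collect them for monitoring. The directory path is an assumption based on the thread's description (`/sys/fs/btrfs/<fsid>/discard`), and the helper simply skips files a given kernel doesn't expose:

```python
import os

# Knobs and counters named in the thread; which ones exist depends on
# the kernel version.
DISCARD_FILES = [
    "iops_limit", "kbps_limit",
    "discardable_extents", "discardable_bytes",
    "discard_extent_bytes", "discard_bitmap_bytes",
    "discard_bytes_saved",
]

def read_discard_stats(base):
    """Collect whichever discard files exist under `base`, e.g. the
    assumed /sys/fs/btrfs/<fsid>/discard directory, as integers."""
    stats = {}
    for name in DISCARD_FILES:
        path = os.path.join(base, name)
        if os.path.isfile(path):
            with open(path) as fh:
                stats[name] = int(fh.read().strip())
    return stats
```

Sampling these periodically (watching for a steady buildup of discardable_extents, or a drop in discard_bytes_saved) mirrors the fleet-level tuning Boris describes.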
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-04-04 18:23 ` Boris Burkov @ 2023-04-04 19:12 ` Sergei Trofimovich 0 siblings, 0 replies; 27+ messages in thread From: Sergei Trofimovich @ 2023-04-04 19:12 UTC (permalink / raw) To: Boris Burkov Cc: Linux regression tracking (Thorsten Leemhuis), Christoph Hellwig, Josef Bacik, Christopher Price, anand.jain, clm, dsterba, linux-btrfs, regressions On Tue, 4 Apr 2023 11:23:12 -0700 Boris Burkov <boris@bur.io> wrote: > On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote: > > On 23.03.23 23:26, Sergei Trofimovich wrote: > > > On Wed, 22 Mar 2023 01:38:42 -0700 > > > Christoph Hellwig <hch@infradead.org> wrote: > > > > > >> On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote: > > >>> We got the defaults based on our testing with our workloads inside of > > >>> FB. Clearly this isn't representative of a normal desktop usage, but > > >>> we've also got a lot of workloads so figured if it made the whole > > >>> fleet happy it would probably be fine everywhere. > > >>> > > >>> That being said this is tunable for a reason, your workload seems to > > >>> generate a lot of free'd extents and discards. We can probably mess > > >>> with the async stuff to maybe pause discarding if there's no other > > >>> activity happening on the device at the moment, but tuning it to let > > >>> more discards through at a time is also completely valid. 
Thanks, > > > > BTW, there is another report about this issue here: > > https://bugzilla.redhat.com/show_bug.cgi?id=2182228 > > > > /me wonders if there is a opensuse report as well, but a quick search > > didn't find one > > > > And as fun fact or for completeness, the issue even made it to reddit, too: > > https://www.reddit.com/r/archlinux/comments/121htxn/btrfs_discard_storm_on_62x_kernel/ > > Good find, but also: > https://www.reddit.com/r/Fedora/comments/vjfpkv/periodic_trim_freezes_ssd/ > So without harder data, there is a bit of bias inherent in cherrypicking > negative impressions from the internet. > > > > > >> FYI, discard performance differs a lot between different SSDs. > > >> It used to be pretty horrible for most devices early on, and then a > > >> certain hyperscaler started requiring decent performance for enterprise > > >> drives, so many of them are good now. A lot less so for the typical > > >> consumer drive, especially at the lower end of the spectrum. > > >> > > >> And that jut NVMe, the still shipping SATA SSDs are another different > > >> story. Not helped by the fact that we don't even support ranged > > >> discards for them in Linux. > > > > Thx for your comments Christoph. Quick question, just to be sure I > > understand things properly: > > > > I assume on some of those problematic devices these discard storms will > > lead to a performance regression? > > > > I also heard people saying these discard storms might reduce the life > > time of some devices - is that true? > > > > If the answer to at least one of these is "yes" I'd say we it might be > > best to revert 63a7cb130718 for now. > > > > > Josef, what did you use as a signal to detect what value was good > > > enough? Did you crank up the number until discard backlog clears up in a > > > reasonable time? > > Josef is OOO, so I'll try to clarify some things around async discard, > hopefully it's helpful to anyone wondering how to tune it. 
> > Like you guessed, our tuning basically consists of looking at the > discardable_extents/discardable_bytes metric in the fleet and ensuring > it looks sane, and that we see an improvement in I/O tail latencies or > fix some concrete instances of bad tail latencies. e.g. with > iops_limit=10, we see concrete cases of bad latency go away and don't > see a steady buildup of discardable_extents. > > > > > > > I still don't understand what I should take into account to change the > > > default and whether I should change it at all. Is it fine if the discard > > > backlog takes a week to get through it? (Or a day? An hour? A minute?) > > I believe the relevant metrics are: > > - number of trims issued/bytes trimmed (you would measure this by tracing > and by looking at discard_extent_bytes and discard_bitmap_bytes) > - bytes "wrongly" trimmed. (extents that were reallocated without getting > trimmed are exposed in discard_bytes_saved, so if that drops, you are > maybe trimming things that you may have not needed to) > - discardable_extents/discardable_bytes (in sysfs; the outstanding work) > - tail latency of file system operations > - disk idle time > > By doing periodic trim you tradeoff better bytes_saved and better disk > idle time (big trim once a week, vs. "trim all the time" against worse > tail latency during the trim itself, and risking trimming too > infrequently, leading to worse latency on a drive that needs a trim. > > > > > > > Is it fine to send discards as fast as device allows instead of doing 10 > > > IOPS? Does IOPS limit consider a device wearing tradeoff? Then low IOPS > > > makes sense. Or IOPS limit is just a way to reserve most bandwidth to > > > non-discard workloads? Then I would say unlimited IOPS as a default > > > would make more sense for btrfs. > > Unfortunately, btrfs currently doesn't have a "fully unlimited" async > discard no matter how you tune it. 
Ignoring kbps_limit, which only > serves to increase the delay, iops_limit has an effective range between > 1 and 1000. The basic premise of btrfs async discard is to meter out > the discards at a steady rate, asynchronously from file system > operations, so the effect of the tunables is to set that delay between > discard operations. The delay is clamped between 1ms and 1000ms, so > iops_limit > 1000 is the same as iops_limit = 1000. iops_limit=0 does > not lead to unmetered discards, but rather hits a hardcoded case of > metering them out over 6 hours. (no clue why, I don't personally love > that...) > > Hope that's somewhat helpful, Thank you, Boris! That is very helpful. `discard_bytes_saved` is a good (and not very obvious to understand!) signal. > Boris > > > > > /me would be interested in answers to these questions as well > > > > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) > > -- > > Everything you wanna know about Linux kernel regression tracking: > > https://linux-regtracking.leemhuis.info/about/#tldr > > If I did something stupid, please tell me, as explained on that page. -- Sergei ^ permalink raw reply [flat|nested] 27+ messages in thread
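Boris's explanation above pins down the effective semantics of iops_limit: the delay between discard work items is 1000 ms / iops_limit, clamped to [1 ms, 1000 ms]. A small back-of-envelope sketch of that metering (a reader's model of the behaviour described, not the kernel code itself) makes the clamping concrete:

```python
# Model of the async-discard metering described above: the delay
# between discard work items is 1000ms / iops_limit, clamped to
# [1ms, 1000ms]. iops_limit=0 is a special case (backlog metered out
# over ~6 hours) and is deliberately not modelled here.

def discard_delay_ms(iops_limit: int) -> float:
    """Approximate delay between two discards, in milliseconds."""
    assert iops_limit >= 1, "iops_limit=0 is handled specially by btrfs"
    ideal = 1000.0 / iops_limit          # delay needed to hit the target IOPS
    return min(max(ideal, 1.0), 1000.0)  # clamp to [1ms, 1000ms]

# iops_limit=10 (the default discussed in this thread) -> 100ms delay;
# any value above 1000 clamps to the same 1ms delay as 1000.
for limit in (1, 10, 1000, 6000):
    print(f"iops_limit={limit}: delay {discard_delay_ms(limit):.0f} ms")
```

Under this model Christopher's ~6000 setting would behave identically to 1000 (both clamp to a 1 ms delay); if the two measurably differ in practice, the model above is too simple.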
[parent not found: <Y/+n1wS/4XAH7X1p@nz>]
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async [not found] <Y/+n1wS/4XAH7X1p@nz> @ 2023-03-02 8:04 ` Linux regression tracking #adding (Thorsten Leemhuis) 2023-04-04 10:52 ` Linux regression tracking #update (Thorsten Leemhuis) 2023-04-21 13:56 ` Linux regression tracking #update (Thorsten Leemhuis) [not found] ` <94cf49d0-fa2d-cc2c-240e-222706d69eb3@oracle.com> 1 sibling, 2 replies; 27+ messages in thread From: Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-03-02 8:04 UTC (permalink / raw) To: Sergei Trofimovich, linux-btrfs Cc: David Sterba, Boris Burkov, Chris Mason, Josef Bacik, Linux kernel regressions list [CCing the regression list, as it should be in the loop for regressions: https://docs.kernel.org/admin-guide/reporting-regressions.html] [TLDR: I'm adding this report to the list of tracked Linux kernel regressions; the text you find below is based on a few templates paragraphs you might have encountered already in similar form. See link in footer if these mails annoy you.] On 01.03.23 20:30, Sergei Trofimovich wrote: > Hi btrfs maintainers! > > Tl;DR: > > After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I > see constant DISCARD storm towards my NVME device be it idle or not. > > No storm: v6.1 and older > Has storm: v6.2 and newer Thanks for the report. To be sure the issue doesn't fall through the cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression tracking bot: #regzbot ^introduced 63a7cb13071842 #regzbot title btrfs: DISCARD storm towards NVME device be it idle or not #regzbot ignore-activity This isn't a regression? This issue or a fix for it are already discussed somewhere else? It was fixed already? You want to clarify when the regression started to happen? Or point out I got the title or something else totally wrong? 
Then just reply and tell me -- ideally while also telling regzbot about it, as explained by the page listed in the footer of this mail. Developers: When fixing the issue, remember to add 'Link:' tags pointing to the report (the parent of this mail). See page linked in footer for details. Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr That page also explains what to do if mails like this annoy you. > More words: > > After upgrade from 6.1 to 6.2 I noticed that Disk led on my desktop > started flashing incessantly regardless of present or absent workload. > > I think I confirmed the storm with `perf`: led flashes align with output > of: > > # perf ftrace -a -T 'nvme_setup*' | cat > > kworker/6:1H-298 [006] 2569.645201: nvme_setup_cmd <-nvme_queue_rq > kworker/6:1H-298 [006] 2569.645205: nvme_setup_discard <-nvme_setup_cmd > kworker/6:1H-298 [006] 2569.749198: nvme_setup_cmd <-nvme_queue_rq > kworker/6:1H-298 [006] 2569.749202: nvme_setup_discard <-nvme_setup_cmd > kworker/6:1H-298 [006] 2569.853204: nvme_setup_cmd <-nvme_queue_rq > kworker/6:1H-298 [006] 2569.853209: nvme_setup_discard <-nvme_setup_cmd > kworker/6:1H-298 [006] 2569.958198: nvme_setup_cmd <-nvme_queue_rq > kworker/6:1H-298 [006] 2569.958202: nvme_setup_discard <-nvme_setup_cmd > > `iotop` shows no read/write IO at all (expected). > > I was able to bisect it down to this commit: > > $ git bisect good > 63a7cb13071842966c1ce931edacbc23573aada5 is the first bad commit > commit 63a7cb13071842966c1ce931edacbc23573aada5 > Author: David Sterba <dsterba@suse.com> > Date: Tue Jul 26 20:54:10 2022 +0200 > > btrfs: auto enable discard=async when possible > > There's a request to automatically enable async discard for capable > devices. 
We can do that, the async mode is designed to wait for larger > freed extents and is not intrusive, with limits to iops, kbps or latency. > > The status and tunables will be exported in /sys/fs/btrfs/FSID/discard . > > The automatic selection is done if there's at least one discard capable > device in the filesystem (not capable devices are skipped). Mounting > with any other discard option will honor that option, notably mounting > with nodiscard will keep it disabled. > > Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/ > Reviewed-by: Boris Burkov <boris@bur.io> > Signed-off-by: David Sterba <dsterba@suse.com> > > fs/btrfs/ctree.h | 1 + > fs/btrfs/disk-io.c | 14 ++++++++++++++ > fs/btrfs/super.c | 2 ++ > fs/btrfs/volumes.c | 3 +++ > fs/btrfs/volumes.h | 2 ++ > 5 files changed, 22 insertions(+) > > Is this storm a known issue? I did not dig too much into the patch. But > glancing at it this bit looks slightly off: > > + if (bdev_max_discard_sectors(bdev)) > + fs_devices->discardable = true; > > Is it expected that there is no `= false` assignment? > > This is the list of `btrfs` filesystems I have: > > $ cat /proc/mounts | fgrep btrfs > /dev/nvme0n1p3 / btrfs rw,noatime,compress=zstd:3,ssd,space_cache,subvolid=848,subvol=/nixos 0 0 > /dev/sda3 /mnt/archive btrfs rw,noatime,compress=zstd:3,space_cache,subvolid=5,subvol=/ 0 0 > # skipped bind mounts > > The device is: > > $ lspci | fgrep -i Solid > 01:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. XPG SX8200 Pro PCIe Gen3x4 M.2 2280 Solid State Drive (rev 03) > > Can you help me debug the source of discards storm? > > Is it an expected discard storm? > > Is it problematic for SSD life span? > > Thank you! > ^ permalink raw reply [flat|nested] 27+ messages in thread
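The per-event timestamps in a trace like the one in the report are enough to estimate the discard rate directly. A quick sketch (sample lines copied from the report above; a throwaway illustration, not a general ftrace parser):

```python
import re

# Sample `perf ftrace` lines from the report above.
trace = """\
kworker/6:1H-298 [006] 2569.645205: nvme_setup_discard <-nvme_setup_cmd
kworker/6:1H-298 [006] 2569.749202: nvme_setup_discard <-nvme_setup_cmd
kworker/6:1H-298 [006] 2569.853209: nvme_setup_discard <-nvme_setup_cmd
kworker/6:1H-298 [006] 2569.958202: nvme_setup_discard <-nvme_setup_cmd
"""

# Pull out the timestamp of every nvme_setup_discard event.
stamps = [float(m.group(1))
          for m in re.finditer(r"(\d+\.\d+): nvme_setup_discard", trace)]
gaps = [b - a for a, b in zip(stamps, stamps[1:])]
avg_gap_ms = 1000 * sum(gaps) / len(gaps)
print(f"{len(stamps)} discards, average gap {avg_gap_ms:.0f} ms")
```

The roughly 104 ms average gap between discards in these samples lines up with the iops_limit=10 default (one discard per ~100 ms) discussed later in the thread.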
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-03-02 8:04 ` Linux regression tracking #adding (Thorsten Leemhuis) @ 2023-04-04 10:52 ` Linux regression tracking #update (Thorsten Leemhuis) 2023-04-21 13:56 ` Linux regression tracking #update (Thorsten Leemhuis) 1 sibling, 0 replies; 27+ messages in thread From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-04-04 10:52 UTC (permalink / raw) To: linux-btrfs; +Cc: Linux kernel regressions list On 02.03.23 09:04, Linux regression tracking #adding (Thorsten Leemhuis) wrote: > On 01.03.23 20:30, Sergei Trofimovich wrote: >> Hi btrfs maintainers! >> >> Tl;DR: >> >> After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I >> see constant DISCARD storm towards my NVME device be it idle or not. >> >> No storm: v6.1 and older >> Has storm: v6.2 and newer > > Thanks for the report. To be sure the issue doesn't fall through the > cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression > tracking bot: > > #regzbot ^introduced 63a7cb13071842 > #regzbot title btrfs: DISCARD storm towards NVME device be it idle or not > #regzbot ignore-activity #regzbot monitor: https://lore.kernel.org/all/CAHmG9huwQcQXvy3HS0OP9bKFxwUa3aQj9MXZCr74emn0U+efqQ@mail.gmail.com/ #regzbot ignore-activity Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr That page also explains what to do if mails like this annoy you. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-03-02 8:04 ` Linux regression tracking #adding (Thorsten Leemhuis) 2023-04-04 10:52 ` Linux regression tracking #update (Thorsten Leemhuis) @ 2023-04-21 13:56 ` Linux regression tracking #update (Thorsten Leemhuis) 1 sibling, 0 replies; 27+ messages in thread From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-04-21 13:56 UTC (permalink / raw) To: linux-btrfs; +Cc: Linux kernel regressions list [TLDR: This mail is primarily relevant for Linux kernel regression tracking. See link in footer if these mails annoy you.] On 02.03.23 09:04, Linux regression tracking #adding (Thorsten Leemhuis) wrote: > On 01.03.23 20:30, Sergei Trofimovich wrote: >> >> Tl;DR: >> >> After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I >> see constant DISCARD storm towards my NVME device be it idle or not. >> >> No storm: v6.1 and older >> Has storm: v6.2 and newer > > Thanks for the report. To be sure the issue doesn't fall through the > cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression > tracking bot: > > #regzbot ^introduced 63a7cb13071842 > #regzbot title btrfs: DISCARD storm towards NVME device be it idle or not > #regzbot ignore-activity FWIW, the fix is in next now, regzbot just missed it due to a URL-encoding bug: #regzbot monitor: https://lore.kernel.org/linux-btrfs/cover.1680723651.git.boris@bur.io/T/#m21c1856ccc8fd24ccb6fbad7d9b54f0bae115d25 #regzbot fix: 2e55571fddf #regzbot ignore-activity Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr That page also explains what to do if mails like this annoy you. ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <94cf49d0-fa2d-cc2c-240e-222706d69eb3@oracle.com>]
[parent not found: <20230302105406.2cd367f7@nz>]
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async [not found] ` <20230302105406.2cd367f7@nz> @ 2023-03-15 11:44 ` Linux regression tracking (Thorsten Leemhuis) 2023-03-15 16:34 ` Sergei Trofimovich 0 siblings, 1 reply; 27+ messages in thread From: Linux regression tracking (Thorsten Leemhuis) @ 2023-03-15 11:44 UTC (permalink / raw) To: Sergei Trofimovich, Anand Jain Cc: linux-btrfs, David Sterba, Boris Burkov, Chris Mason, Josef Bacik, Linux kernel regressions list Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting for once, to make this easily accessible to everyone. I added this to the tracking, but it seems nothing happened for nearly two weeks. Sergei, did you find a workaround that works for you? Or is this something that other people might run into as well and thus better should be fixed? Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page. #regzbot poke On 02.03.23 11:54, Sergei Trofimovich wrote: > On Thu, 2 Mar 2023 17:12:27 +0800 > Anand Jain <anand.jain@oracle.com> wrote: > >> On 3/2/23 03:30, Sergei Trofimovich wrote: >>> Hi btrfs maintainers! >>> >>> Tl;DR: >>> >>> After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I >>> see constant DISCARD storm towards my NVME device be it idle or not. >>> >>> No storm: v6.1 and older >>> Has storm: v6.2 and newer >>> >>> More words: >>> >>> After upgrade from 6.1 to 6.2 I noticed that Disk led on my desktop >>> started flashing incessantly regardless of present or absent workload. 
>>> >>> I think I confirmed the storm with `perf`: led flashes align with output >>> of: >>> >>> # perf ftrace -a -T 'nvme_setup*' | cat >>> >>> kworker/6:1H-298 [006] 2569.645201: nvme_setup_cmd <-nvme_queue_rq >>> kworker/6:1H-298 [006] 2569.645205: nvme_setup_discard <-nvme_setup_cmd >>> kworker/6:1H-298 [006] 2569.749198: nvme_setup_cmd <-nvme_queue_rq >>> kworker/6:1H-298 [006] 2569.749202: nvme_setup_discard <-nvme_setup_cmd >>> kworker/6:1H-298 [006] 2569.853204: nvme_setup_cmd <-nvme_queue_rq >>> kworker/6:1H-298 [006] 2569.853209: nvme_setup_discard <-nvme_setup_cmd >>> kworker/6:1H-298 [006] 2569.958198: nvme_setup_cmd <-nvme_queue_rq >>> kworker/6:1H-298 [006] 2569.958202: nvme_setup_discard <-nvme_setup_cmd >>> >>> `iotop` shows no read/write IO at all (expected). >>> >>> I was able to bisect it down to this commit: >>> >>> $ git bisect good >>> 63a7cb13071842966c1ce931edacbc23573aada5 is the first bad commit >>> commit 63a7cb13071842966c1ce931edacbc23573aada5 >>> Author: David Sterba <dsterba@suse.com> >>> Date: Tue Jul 26 20:54:10 2022 +0200 >>> >>> btrfs: auto enable discard=async when possible >>> >>> There's a request to automatically enable async discard for capable >>> devices. We can do that, the async mode is designed to wait for larger >>> freed extents and is not intrusive, with limits to iops, kbps or latency. >>> >>> The status and tunables will be exported in /sys/fs/btrfs/FSID/discard . >>> >>> The automatic selection is done if there's at least one discard capable >>> device in the filesystem (not capable devices are skipped). Mounting >>> with any other discard option will honor that option, notably mounting >>> with nodiscard will keep it disabled. 
>>> >>> Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/ >>> Reviewed-by: Boris Burkov <boris@bur.io> >>> Signed-off-by: David Sterba <dsterba@suse.com> >>> >>> fs/btrfs/ctree.h | 1 + >>> fs/btrfs/disk-io.c | 14 ++++++++++++++ >>> fs/btrfs/super.c | 2 ++ >>> fs/btrfs/volumes.c | 3 +++ >>> fs/btrfs/volumes.h | 2 ++ >>> 5 files changed, 22 insertions(+) >>> >>> Is this storm a known issue? I did not dig too much into the patch. But >>> glancing at it this bit looks slightly off: >>> >>> + if (bdev_max_discard_sectors(bdev)) >>> + fs_devices->discardable = true; >>> >>> Is it expected that there is no `= false` assignment? >>> >>> This is the list of `btrfs` filesystems I have: >>> >>> $ cat /proc/mounts | fgrep btrfs >>> /dev/nvme0n1p3 / btrfs rw,noatime,compress=zstd:3,ssd,space_cache,subvolid=848,subvol=/nixos 0 0 >>> /dev/sda3 /mnt/archive btrfs rw,noatime,compress=zstd:3,space_cache,subvolid=5,subvol=/ 0 0 >>> # skipped bind mounts >>> >> >> >> >>> The device is: >>> >>> $ lspci | fgrep -i Solid >>> 01:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. XPG SX8200 Pro PCIe Gen3x4 M.2 2280 Solid State Drive (rev 03) >> >> >> It is a SSD device with NVME interface, that needs regular discard. >> Why not try tune io intensity using >> >> /sys/fs/btrfs/<uuid>/discard >> >> options? >> >> Maybe not all discardable sectors are not issued at once. It is a good >> idea to try with a fresh mkfs (which runs discard at mkfs) to see if >> discard is being issued even if there are no fs activities. > > Ah, thank you Anand! I poked a bit more in `perf ftrace` and I think I > see a "slow" pass through the discard backlog: > > /sys/fs/btrfs/<UUID>/discard$ cat iops_limit > 10 > > Twice a minute I get a short burst of file creates/deletes that produces > a bit of free space in many block groups. That enqueues hundreds of work > items. 
> > $ sudo perf ftrace -a -T 'btrfs_discard_workfn' -T 'btrfs_issue_discard' -T 'btrfs_discard_queue_work' > btrfs-transacti-407 [011] 42800.424027: btrfs_discard_queue_work <-__btrfs_add_free_space > btrfs-transacti-407 [011] 42800.424070: btrfs_discard_queue_work <-__btrfs_add_free_space > ... > btrfs-transacti-407 [011] 42800.425053: btrfs_discard_queue_work <-__btrfs_add_free_space > btrfs-transacti-407 [011] 42800.425055: btrfs_discard_queue_work <-__btrfs_add_free_space > > 193 entries of btrfs_discard_queue_work. > It took 1ms to enqueue all of the work into the workqueue. > > kworker/u64:1-2379115 [000] 42800.487010: btrfs_discard_workfn <-process_one_work > kworker/u64:1-2379115 [000] 42800.487028: btrfs_issue_discard <-btrfs_discard_extent > kworker/u64:1-2379115 [005] 42800.594010: btrfs_discard_workfn <-process_one_work > kworker/u64:1-2379115 [005] 42800.594031: btrfs_issue_discard <-btrfs_discard_extent > ... > kworker/u64:15-2396822 [007] 42830.441487: btrfs_discard_workfn <-process_one_work > kworker/u64:15-2396822 [007] 42830.441502: btrfs_issue_discard <-btrfs_discard_extent > kworker/u64:15-2396822 [000] 42830.546497: btrfs_discard_workfn <-process_one_work > kworker/u64:15-2396822 [000] 42830.546524: btrfs_issue_discard <-btrfs_discard_extent > > 286 pairs of btrfs_discard_workfn / btrfs_issue_discard. > Each pair takes about 100ms to process, which seems to match iops_limit=10. > That means I can get about 10 discards per second (roughly 300 per > 30-second commit interval) at most. > > btrfs-transacti-407 [002] 42830.634216: btrfs_discard_queue_work <-__btrfs_add_free_space > btrfs-transacti-407 [002] 42830.634228: btrfs_discard_queue_work <-__btrfs_add_free_space > ... > > The next transaction started 30 seconds later, which is the default commit > interval. > > My file system is 512GB in size. My guess is that I get about one discard entry > per block group on each transaction. > > Does my system keep up with the scheduled discard backlog? Can I peek at > the workqueue size? 
> > Is iops_limit=10 a reasonable default for discard=async? It feels like > for larger file systems it will not be enough even for this idle state. > ^ permalink raw reply [flat|nested] 27+ messages in thread
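The numbers in the message above allow a rough keep-up calculation, assuming ~193 queued work items per transaction, a 30-second commit interval, and one discard per ~100 ms at iops_limit=10 (all figures taken from the trace; this is illustrative arithmetic, not a measurement):

```python
queued_per_txn = 193      # btrfs_discard_queue_work calls seen per commit
commit_interval_s = 30.0  # default btrfs commit interval
discards_per_s = 10       # iops_limit=10 -> one discard every ~100ms

capacity_per_txn = discards_per_s * commit_interval_s  # max discards per interval
busy_s = queued_per_txn / discards_per_s               # time spent discarding

print(f"capacity per interval: {capacity_per_txn:.0f} discards")
print(f"backlog per interval:  {queued_per_txn} discards")
print(f"disk busy discarding:  {busy_s:.1f}s of every {commit_interval_s:.0f}s")
```

Under these assumptions the queue does drain (193 < 300 per interval), but the disk would spend roughly 19 of every 30 seconds issuing discards, which on an otherwise idle system would look exactly like the constant-LED "storm" reported at the top of the thread.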
* Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async 2023-03-15 11:44 ` Linux regression tracking (Thorsten Leemhuis) @ 2023-03-15 16:34 ` Sergei Trofimovich 0 siblings, 0 replies; 27+ messages in thread From: Sergei Trofimovich @ 2023-03-15 16:34 UTC (permalink / raw) To: Linux regression tracking (Thorsten Leemhuis) Cc: Linux regressions mailing list, Anand Jain, linux-btrfs, David Sterba, Boris Burkov, Chris Mason, Josef Bacik [-- Attachment #1: Type: text/plain, Size: 8348 bytes --] On Wed, 15 Mar 2023 12:44:34 +0100 "Linux regression tracking (Thorsten Leemhuis)" <regressions@leemhuis.info> wrote: > Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting > for once, to make this easily accessible to everyone. > > I added this to the tracking, but it seems nothing happened for nearly > two weeks. > > Sergei, did you find a workaround that works for you? Or is this > something that other people might run into as well and thus better > should be fixed? I used the workaround of cranking up the discard IOPS limit from 10 to 1000: # echo 1000 > /sys/fs/btrfs/<UUID>/discard/iops_limit But I am not sure if it's a safe or a reasonable fix. I would prefer someone from btrfs to comment on whether it's expected kernel behaviour. > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) > -- > Everything you wanna know about Linux kernel regression tracking: > https://linux-regtracking.leemhuis.info/about/#tldr > If I did something stupid, please tell me, as explained on that page. > > #regzbot poke > > On 02.03.23 11:54, Sergei Trofimovich wrote: > > On Thu, 2 Mar 2023 17:12:27 +0800 > > Anand Jain <anand.jain@oracle.com> wrote: > >> On 3/2/23 03:30, Sergei Trofimovich wrote: > >>> Hi btrfs maintainers! > >>> > >>> Tl;DR: > >>> > >>> After 63a7cb13071842 "btrfs: auto enable discard=async when possible" I > >>> see constant DISCARD storm towards my NVME device be it idle or not. 
> >>> > >>> No storm: v6.1 and older > >>> Has storm: v6.2 and newer > >>> > >>> More words: > >>> > >>> After upgrade from 6.1 to 6.2 I noticed that Disk led on my desktop > >>> started flashing incessantly regardless of present or absent workload. > >>> > >>> I think I confirmed the storm with `perf`: led flashes align with output > >>> of: > >>> > >>> # perf ftrace -a -T 'nvme_setup*' | cat > >>> > >>> kworker/6:1H-298 [006] 2569.645201: nvme_setup_cmd <-nvme_queue_rq > >>> kworker/6:1H-298 [006] 2569.645205: nvme_setup_discard <-nvme_setup_cmd > >>> kworker/6:1H-298 [006] 2569.749198: nvme_setup_cmd <-nvme_queue_rq > >>> kworker/6:1H-298 [006] 2569.749202: nvme_setup_discard <-nvme_setup_cmd > >>> kworker/6:1H-298 [006] 2569.853204: nvme_setup_cmd <-nvme_queue_rq > >>> kworker/6:1H-298 [006] 2569.853209: nvme_setup_discard <-nvme_setup_cmd > >>> kworker/6:1H-298 [006] 2569.958198: nvme_setup_cmd <-nvme_queue_rq > >>> kworker/6:1H-298 [006] 2569.958202: nvme_setup_discard <-nvme_setup_cmd > >>> > >>> `iotop` shows no read/write IO at all (expected). > >>> > >>> I was able to bisect it down to this commit: > >>> > >>> $ git bisect good > >>> 63a7cb13071842966c1ce931edacbc23573aada5 is the first bad commit > >>> commit 63a7cb13071842966c1ce931edacbc23573aada5 > >>> Author: David Sterba <dsterba@suse.com> > >>> Date: Tue Jul 26 20:54:10 2022 +0200 > >>> > >>> btrfs: auto enable discard=async when possible > >>> > >>> There's a request to automatically enable async discard for capable > >>> devices. We can do that, the async mode is designed to wait for larger > >>> freed extents and is not intrusive, with limits to iops, kbps or latency. > >>> > >>> The status and tunables will be exported in /sys/fs/btrfs/FSID/discard . > >>> > >>> The automatic selection is done if there's at least one discard capable > >>> device in the filesystem (not capable devices are skipped). 
Mounting > >>> with any other discard option will honor that option, notably mounting > >>> with nodiscard will keep it disabled. > >>> > >>> Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/ > >>> Reviewed-by: Boris Burkov <boris@bur.io> > >>> Signed-off-by: David Sterba <dsterba@suse.com> > >>> > >>> fs/btrfs/ctree.h | 1 + > >>> fs/btrfs/disk-io.c | 14 ++++++++++++++ > >>> fs/btrfs/super.c | 2 ++ > >>> fs/btrfs/volumes.c | 3 +++ > >>> fs/btrfs/volumes.h | 2 ++ > >>> 5 files changed, 22 insertions(+) > >>> > >>> Is this storm a known issue? I did not dig too much into the patch. But > >>> glancing at it this bit looks slightly off: > >>> > >>> + if (bdev_max_discard_sectors(bdev)) > >>> + fs_devices->discardable = true; > >>> > >>> Is it expected that there is no `= false` assignment? > >>> > >>> This is the list of `btrfs` filesystems I have: > >>> > >>> $ cat /proc/mounts | fgrep btrfs > >>> /dev/nvme0n1p3 / btrfs rw,noatime,compress=zstd:3,ssd,space_cache,subvolid=848,subvol=/nixos 0 0 > >>> /dev/sda3 /mnt/archive btrfs rw,noatime,compress=zstd:3,space_cache,subvolid=5,subvol=/ 0 0 > >>> # skipped bind mounts > >>> > >> > >> > >> > >>> The device is: > >>> > >>> $ lspci | fgrep -i Solid > >>> 01:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. XPG SX8200 Pro PCIe Gen3x4 M.2 2280 Solid State Drive (rev 03) > >> > >> > >> It is a SSD device with NVME interface, that needs regular discard. > >> Why not try tune io intensity using > >> > >> /sys/fs/btrfs/<uuid>/discard > >> > >> options? > >> > >> Maybe not all discardable sectors are not issued at once. It is a good > >> idea to try with a fresh mkfs (which runs discard at mkfs) to see if > >> discard is being issued even if there are no fs activities. > > > > Ah, thank you Anand! 
I poked a bit more in `perf ftrace` and I think I > > see a "slow" pass through the discard backlog: > > > > /sys/fs/btrfs/<UUID>/discard$ cat iops_limit > > 10 > > > > Twice a minute I get a short burst of file creates/deletes that produces > > a bit of free space in many block groups. That enqueues hundreds of work > > items. > > > > $ sudo perf ftrace -a -T 'btrfs_discard_workfn' -T 'btrfs_issue_discard' -T 'btrfs_discard_queue_work' > > btrfs-transacti-407 [011] 42800.424027: btrfs_discard_queue_work <-__btrfs_add_free_space > > btrfs-transacti-407 [011] 42800.424070: btrfs_discard_queue_work <-__btrfs_add_free_space > > ... > > btrfs-transacti-407 [011] 42800.425053: btrfs_discard_queue_work <-__btrfs_add_free_space > > btrfs-transacti-407 [011] 42800.425055: btrfs_discard_queue_work <-__btrfs_add_free_space > > > > 193 entries of btrfs_discard_queue_work. > > It took 1ms to enqueue all of the work into the workqueue. > > > > kworker/u64:1-2379115 [000] 42800.487010: btrfs_discard_workfn <-process_one_work > > kworker/u64:1-2379115 [000] 42800.487028: btrfs_issue_discard <-btrfs_discard_extent > > kworker/u64:1-2379115 [005] 42800.594010: btrfs_discard_workfn <-process_one_work > > kworker/u64:1-2379115 [005] 42800.594031: btrfs_issue_discard <-btrfs_discard_extent > > ... > > kworker/u64:15-2396822 [007] 42830.441487: btrfs_discard_workfn <-process_one_work > > kworker/u64:15-2396822 [007] 42830.441502: btrfs_issue_discard <-btrfs_discard_extent > > kworker/u64:15-2396822 [000] 42830.546497: btrfs_discard_workfn <-process_one_work > > kworker/u64:15-2396822 [000] 42830.546524: btrfs_issue_discard <-btrfs_discard_extent > > > > 286 pairs of btrfs_discard_workfn / btrfs_issue_discard. > > Each pair takes 10ms to process, which seems to match iops_limit=10. > > That means I can get about 300 discards per second max. 
> > > > btrfs-transacti-407 [002] 42830.634216: btrfs_discard_queue_work <-__btrfs_add_free_space > > > > btrfs-transacti-407 [002] 42830.634228: btrfs_discard_queue_work <-__btrfs_add_free_space > > > > ... > > > > > > The next transaction started 30 seconds later, which is the default commit > > interval. > > > > My file system is 512GB in size. My guess is that I get about one discard entry > > per block group on each transaction. > > > > Does my system keep up with the scheduled discard backlog? Can I peek at > > the workqueue size? > > > > Is iops_limit=10 a reasonable default for discard=async? It feels like > > for larger file systems it will not be enough even for this idle state. > > -- Sergei [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 981 bytes --] ^ permalink raw reply [flat|nested] 27+ messages in thread
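Under the same clamped-delay model discussed earlier in the thread (delay = 1000 ms / iops_limit, clamped to [1 ms, 1000 ms]), the workaround of raising iops_limit mainly changes how long the ~193-item per-transaction backlog takes to drain. A hypothetical comparison under those assumptions, not a benchmark:

```python
def drain_seconds(items: int, iops_limit: int) -> float:
    """Time to issue `items` discards at the clamped metered rate."""
    delay_ms = min(max(1000.0 / iops_limit, 1.0), 1000.0)
    return items * delay_ms / 1000.0

# Default (10) vs. the workaround from this message (1000) vs. the
# ~6000 mentioned elsewhere in the thread (clamps to the same as 1000).
for limit in (10, 1000, 6000):
    print(f"iops_limit={limit:>4}: ~{drain_seconds(193, limit):.2f}s per backlog")
```

At the default the disk is busy discarding for most of each 30-second commit interval; at 1000 the same backlog clears in well under a second, matching the observation that the constant disk activity disappears after tuning.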
end of thread, other threads:[~2023-04-21 13:56 UTC | newest] Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2023-03-20 22:40 [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async Christopher Price 2023-03-21 21:26 ` Josef Bacik 2023-03-22 8:38 ` Christoph Hellwig 2023-03-23 22:26 ` Sergei Trofimovich 2023-04-04 10:49 ` Linux regression tracking (Thorsten Leemhuis) 2023-04-04 16:04 ` Christoph Hellwig 2023-04-04 16:20 ` Roman Mamedov 2023-04-04 16:27 ` Christoph Hellwig 2023-04-04 23:37 ` Damien Le Moal 2023-04-04 18:15 ` Chris Mason 2023-04-04 18:51 ` Boris Burkov 2023-04-04 19:22 ` David Sterba 2023-04-04 19:39 ` Boris Burkov 2023-04-05 8:17 ` Linux regression tracking (Thorsten Leemhuis) 2023-04-10 2:03 ` Michael Bromilow 2023-04-11 17:52 ` David Sterba 2023-04-11 18:15 ` Linux regression tracking (Thorsten Leemhuis) 2023-04-04 19:08 ` Sergei Trofimovich 2023-04-05 6:18 ` Christoph Hellwig 2023-04-05 12:01 ` Chris Mason 2023-04-04 18:23 ` Boris Burkov 2023-04-04 19:12 ` Sergei Trofimovich [not found] <Y/+n1wS/4XAH7X1p@nz> 2023-03-02 8:04 ` Linux regression tracking #adding (Thorsten Leemhuis) 2023-04-04 10:52 ` Linux regression tracking #update (Thorsten Leemhuis) 2023-04-21 13:56 ` Linux regression tracking #update (Thorsten Leemhuis) [not found] ` <94cf49d0-fa2d-cc2c-240e-222706d69eb3@oracle.com> [not found] ` <20230302105406.2cd367f7@nz> 2023-03-15 11:44 ` Linux regression tracking (Thorsten Leemhuis) 2023-03-15 16:34 ` Sergei Trofimovich