Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async

From: Boris Burkov <boris@bur.io>
To: "Linux regression tracking (Thorsten Leemhuis)"
	<regressions@leemhuis.info>
Cc: Sergei Trofimovich <slyich@gmail.com>,
	Christoph Hellwig <hch@infradead.org>,
	Josef Bacik <josef@toxicpanda.com>,
	Christopher Price <pricechrispy@gmail.com>,
	anand.jain@oracle.com, clm@fb.com, dsterba@suse.com,
	linux-btrfs@vger.kernel.org, regressions@lists.linux.dev
Subject: Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
Date: Tue, 4 Apr 2023 11:23:12 -0700
Message-ID: <20230404182256.GA344341@zen>
In-Reply-To: <20d85dc4-b6c2-dac1-fdc6-94e44b43692a@leemhuis.info>

On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 23.03.23 23:26, Sergei Trofimovich wrote:
> > On Wed, 22 Mar 2023 01:38:42 -0700
> > Christoph Hellwig <hch@infradead.org> wrote:
> > 
> >> On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote:
> >>> We got the defaults based on our testing with our workloads inside of
> >>> FB.  Clearly this isn't representative of a normal desktop usage, but
> >>> we've also got a lot of workloads so figured if it made the whole
> >>> fleet happy it would probably be fine everywhere.
> >>>
> >>> That being said this is tunable for a reason, your workload seems to
> >>> generate a lot of free'd extents and discards.  We can probably mess
> >>> with the async stuff to maybe pause discarding if there's no other
> >>> activity happening on the device at the moment, but tuning it to let
> >>> more discards through at a time is also completely valid.  Thanks,  
> 
> BTW, there is another report about this issue here:
> https://bugzilla.redhat.com/show_bug.cgi?id=2182228
> 
> /me wonders if there is an openSUSE report as well, but a quick search
> didn't find one
> 
> And as fun fact or for completeness, the issue even made it to reddit, too:
> https://www.reddit.com/r/archlinux/comments/121htxn/btrfs_discard_storm_on_62x_kernel/

Good find, but also:
https://www.reddit.com/r/Fedora/comments/vjfpkv/periodic_trim_freezes_ssd/
So without harder data, there is a bit of bias inherent in cherry-picking
negative impressions from the internet.

> 
> >> FYI, discard performance differs a lot between different SSDs.
> >> It used to be pretty horrible for most devices early on, and then a
> >> certain hyperscaler started requiring decent performance for enterprise
> >> drives, so many of them are good now.  A lot less so for the typical
> >> consumer drive, especially at the lower end of the spectrum.
> >>
> >> And that's just NVMe, the still shipping SATA SSDs are another different
> >> story.  Not helped by the fact that we don't even support ranged
> >> discards for them in Linux.
> 
> Thx for your comments Christoph. Quick question, just to be sure I
> understand things properly:
> 
> I assume on some of those problematic devices these discard storms will
> lead to a performance regression?
> 
> I also heard people saying these discard storms might reduce the life
> time of some devices - is that true?
> 
> If the answer to at least one of these is "yes" I'd say it might be
> best to revert 63a7cb130718 for now.
> 
> > Josef, what did you use as a signal to detect what value was good
> > enough? Did you crank up the number until discard backlog clears up in a
> > reasonable time?

Josef is OOO, so I'll try to clarify some things around async discard.
Hopefully this is helpful to anyone wondering how to tune it.

Like you guessed, our tuning basically consists of looking at the
discardable_extents/discardable_bytes metrics in the fleet and ensuring
they look sane, and checking that we see an improvement in I/O tail
latencies or fix some concrete instances of bad tail latencies. For
example, with iops_limit=10, we see concrete cases of bad latency go away
and don't see a steady buildup of discardable_extents.
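
To make "no steady buildup" a bit more concrete, here is a minimal
sketch of one way to check it from periodic samples of
discardable_extents. The function name, the half-window comparison and
the 10% tolerance are all hypothetical choices for illustration, not
what we run in the fleet:

```python
from typing import Sequence


def is_building_up(samples: Sequence[int], tolerance: float = 0.10) -> bool:
    """Heuristic check for a growing discard backlog.

    `samples` are discardable_extents readings taken at a fixed interval,
    oldest first. The heuristic (an assumption, not a btrfs rule) compares
    the mean of the first half of the window with the mean of the last half.
    """
    if len(samples) < 4:
        return False  # too few samples to say anything useful

    half = len(samples) // 2
    early = sum(samples[:half]) / half
    late = sum(samples[-half:]) / half

    # Report a buildup only if the later mean exceeds the earlier mean
    # by more than the tolerance.
    return late > early * (1.0 + tolerance)
```

A backlog that hovers around the same value is not flagged, while one
that keeps climbing across the window is.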

> > 
> > I still don't understand what I should take into account to change the
> > default and whether I should change it at all. Is it fine if the discard
> > backlog takes a week to get through it? (Or a day? An hour? A minute?)

I believe the relevant metrics are:

- number of trims issued/bytes trimmed (you would measure this by tracing
  and by looking at discard_extent_bytes and discard_bitmap_bytes)
- bytes "wrongly" trimmed (extents that were reallocated without getting
  trimmed are exposed in discard_bytes_saved, so if that drops, you may
  be trimming things that you did not need to)
- discardable_extents/discardable_bytes (in sysfs; the outstanding work)
- tail latency of file system operations
- disk idle time
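
A small sketch of collecting the sysfs counters named above. The
directory layout is an assumption on my part (something like
/sys/fs/btrfs/<UUID>/discard/), and the helper name is made up, so check
it against your kernel before relying on it:

```python
from pathlib import Path
from typing import Dict, Optional

# Counter names taken from the list above; their location in sysfs is
# an assumption and is passed in by the caller.
DISCARD_FILES = (
    "discardable_extents",
    "discardable_bytes",
    "discard_extent_bytes",
    "discard_bitmap_bytes",
    "discard_bytes_saved",
)


def read_discard_stats(discard_dir: str) -> Dict[str, Optional[int]]:
    """Read each counter as an integer; None if missing or unparsable."""
    stats: Dict[str, Optional[int]] = {}
    for name in DISCARD_FILES:
        path = Path(discard_dir) / name
        try:
            stats[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            stats[name] = None
    return stats
```

Calling read_discard_stats() on that directory at a fixed interval gives
you the time series for the backlog and the bytes_saved trend.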

By doing periodic trim, you trade better bytes_saved and better disk
idle time (one big trim once a week vs. "trim all the time") against
worse tail latency during the trim itself. You also risk trimming too
infrequently, which leads to worse latency on a drive that needs a trim.

> > 
> > Is it fine to send discards as fast as device allows instead of doing 10
> > IOPS? Does IOPS limit consider a device wearing tradeoff? Then low IOPS
> > makes sense. Or IOPS limit is just a way to reserve most bandwidth to
> > non-discard workloads? Then I would say unlimited IOPS as a default
> > would make more sense for btrfs.

Unfortunately, btrfs currently doesn't have a "fully unlimited" async
discard no matter how you tune it. Ignoring kbps_limit, which only
serves to increase the delay, iops_limit has an effective range between
1 and 1000. The basic premise of btrfs async discard is to meter out
the discards at a steady rate, asynchronously from file system
operations, so the effect of the tunables is to set that delay between
discard operations. The delay is clamped between 1ms and 1000ms, so
iops_limit > 1000 is the same as iops_limit = 1000. iops_limit=0 does
not lead to unmetered discards, but rather hits a hardcoded case of
metering them out over 6 hours. (no clue why, I don't personally love
that...)
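
As a rough illustration of the above, here is a sketch of the delay
calculation in Python rather than the actual kernel code. The names are
made up. Two things are assumptions: that the 6 hour case divides the
window evenly by the number of discardable extents, and that the result
is clamped the same way. kbps_limit is ignored here.

```python
MIN_DELAY_MS = 1
MAX_DELAY_MS = 1000
# Hardcoded target for the iops_limit=0 case: 6 hours, in milliseconds.
ZERO_LIMIT_TARGET_MS = 6 * 60 * 60 * 1000


def discard_delay_ms(iops_limit: int, discardable_extents: int = 0) -> int:
    """Delay between discard operations, in milliseconds (sketch)."""
    if iops_limit > 0:
        # N discards per second means 1000/N ms between discards.
        delay = 1000 // iops_limit
    elif discardable_extents > 0:
        # Assumption: spread the outstanding extents over 6 hours.
        delay = ZERO_LIMIT_TARGET_MS // discardable_extents
    else:
        # Assumption: nothing queued, fall back to the slowest rate.
        delay = MAX_DELAY_MS

    # Clamp to [1ms, 1000ms]; this is why iops_limit > 1000 behaves
    # the same as iops_limit = 1000.
    return max(MIN_DELAY_MS, min(MAX_DELAY_MS, delay))
```

With this, iops_limit=10 gives a 100ms delay, and iops_limit=1000 and
iops_limit=5000 both give 1ms.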

Hope that's somewhat helpful,
Boris

> 
> /me would be interested in answers to these questions as well
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.