From: Sergei Trofimovich <slyich@gmail.com>
To: Boris Burkov <boris@bur.io>
Cc: "Linux regression tracking (Thorsten Leemhuis)"
<regressions@leemhuis.info>,
Christoph Hellwig <hch@infradead.org>,
Josef Bacik <josef@toxicpanda.com>,
Christopher Price <pricechrispy@gmail.com>,
anand.jain@oracle.com, clm@fb.com, dsterba@suse.com,
linux-btrfs@vger.kernel.org, regressions@lists.linux.dev
Subject: Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
Date: Tue, 4 Apr 2023 20:12:53 +0100
Message-ID: <20230404201253.096c9c09@nz>
In-Reply-To: <20230404182256.GA344341@zen>
On Tue, 4 Apr 2023 11:23:12 -0700
Boris Burkov <boris@bur.io> wrote:
> On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> > On 23.03.23 23:26, Sergei Trofimovich wrote:
> > > On Wed, 22 Mar 2023 01:38:42 -0700
> > > Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > >> On Tue, Mar 21, 2023 at 05:26:49PM -0400, Josef Bacik wrote:
> > >>> We got the defaults based on our testing with our workloads inside of
> > >>> FB. Clearly this isn't representative of a normal desktop usage, but
> > >>> we've also got a lot of workloads so figured if it made the whole
> > >>> fleet happy it would probably be fine everywhere.
> > >>>
> > >>> That being said this is tunable for a reason, your workload seems to
> > >>> generate a lot of free'd extents and discards. We can probably mess
> > >>> with the async stuff to maybe pause discarding if there's no other
> > >>> activity happening on the device at the moment, but tuning it to let
> > >>> more discards through at a time is also completely valid. Thanks,
> >
> > BTW, there is another report about this issue here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=2182228
> >
> > /me wonders if there is an openSUSE report as well, but a quick search
> > didn't find one
> >
> > And as fun fact or for completeness, the issue even made it to reddit, too:
> > https://www.reddit.com/r/archlinux/comments/121htxn/btrfs_discard_storm_on_62x_kernel/
>
> Good find, but also:
> https://www.reddit.com/r/Fedora/comments/vjfpkv/periodic_trim_freezes_ssd/
> So without harder data, there is a bit of bias inherent in cherrypicking
> negative impressions from the internet.
>
> >
> > >> FYI, discard performance differs a lot between different SSDs.
> > >> It used to be pretty horrible for most devices early on, and then a
> > >> certain hyperscaler started requiring decent performance for enterprise
> > >> drives, so many of them are good now. A lot less so for the typical
> > >> consumer drive, especially at the lower end of the spectrum.
> > >>
> > >> And that's just NVMe, the still shipping SATA SSDs are another different
> > >> story. Not helped by the fact that we don't even support ranged
> > >> discards for them in Linux.
> >
> > Thx for your comments Christoph. Quick question, just to be sure I
> > understand things properly:
> >
> > I assume on some of those problematic devices these discard storms will
> > lead to a performance regression?
> >
> > I also heard people saying these discard storms might reduce the life
> > time of some devices - is that true?
> >
> > If the answer to at least one of these is "yes" I'd say it might be
> > best to revert 63a7cb130718 for now.
> >
> > > Josef, what did you use as a signal to detect what value was good
> > > enough? Did you crank up the number until discard backlog clears up in a
> > > reasonable time?
>
> Josef is OOO, so I'll try to clarify some things around async discard,
> hopefully it's helpful to anyone wondering how to tune it.
>
> Like you guessed, our tuning basically consists of looking at the
> discardable_extents/discardable_bytes metric in the fleet and ensuring
> it looks sane, and that we see an improvement in I/O tail latencies or
> fix some concrete instances of bad tail latencies. e.g. with
> iops_limit=10, we see concrete cases of bad latency go away and don't
> see a steady buildup of discardable_extents.
>
> > >
> > > I still don't understand what I should take into account to change the
> > > default and whether I should change it at all. Is it fine if the discard
> > > backlog takes a week to get through it? (Or a day? An hour? A minute?)
>
> I believe the relevant metrics are:
>
> - number of trims issued/bytes trimmed (you would measure this by tracing
> and by looking at discard_extent_bytes and discard_bitmap_bytes)
> - bytes "wrongly" trimmed. (extents that were reallocated without getting
> trimmed are exposed in discard_bytes_saved, so if that drops, you are
> maybe trimming things that you may not have needed to)
> - discardable_extents/discardable_bytes (in sysfs; the outstanding work)
> - tail latency of file system operations
> - disk idle time
>
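The sysfs counters in the list above can be sampled with a small script. This is a minimal sketch, not a supported tool: it assumes the counters live as plain integer files under /sys/fs/btrfs/<FSID>/discard/, and the helper names read_discard_metrics and snapshot_all are made up for illustration. Tail latency and disk idle time are not covered here; they need tracing or iostat-style tooling.

```python
from pathlib import Path

# Counter names taken from the list above. The sysfs directory layout
# (/sys/fs/btrfs/<FSID>/discard/<name>) is an assumption of this sketch.
METRICS = (
    "discardable_extents",
    "discardable_bytes",
    "discard_extent_bytes",
    "discard_bitmap_bytes",
    "discard_bytes_saved",
)


def read_discard_metrics(discard_dir):
    """Return {metric: int} for every counter file present in discard_dir."""
    stats = {}
    for name in METRICS:
        path = Path(discard_dir) / name
        if path.is_file():
            stats[name] = int(path.read_text().strip())
    return stats


def snapshot_all(sysfs_root="/sys/fs/btrfs"):
    """Return {fsid: metrics} for each filesystem that has a discard/ directory."""
    result = {}
    for discard_dir in sorted(Path(sysfs_root).glob("*/discard")):
        result[discard_dir.parent.name] = read_discard_metrics(discard_dir)
    return result


if __name__ == "__main__":
    for fsid, stats in snapshot_all().items():
        print(fsid, stats)
```

Running it periodically and comparing snapshots shows whether discardable_extents keeps growing (the backlog is not draining) and whether discard_bytes_saved drops after a tuning change.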
> By doing periodic trim you trade off better bytes_saved and better disk
> idle time (big trim once a week, vs. "trim all the time") against worse
> tail latency during the trim itself, and risk trimming too
> infrequently, leading to worse latency on a drive that needs a trim.
>
> > >
> > > Is it fine to send discards as fast as device allows instead of doing 10
> > > IOPS? Does IOPS limit consider a device wearing tradeoff? Then low IOPS
> > > makes sense. Or IOPS limit is just a way to reserve most bandwidth to
> > > non-discard workloads? Then I would say unlimited IOPS as a default
> > > would make more sense for btrfs.
>
> Unfortunately, btrfs currently doesn't have a "fully unlimited" async
> discard no matter how you tune it. Ignoring kbps_limit, which only
> serves to increase the delay, iops_limit has an effective range between
> 1 and 1000. The basic premise of btrfs async discard is to meter out
> the discards at a steady rate, asynchronously from file system
> operations, so the effect of the tunables is to set that delay between
> discard operations. The delay is clamped between 1ms and 1000ms, so
> iops_limit > 1000 is the same as iops_limit = 1000. iops_limit=0 does
> not lead to unmetered discards, but rather hits a hardcoded case of
> metering them out over 6 hours. (no clue why, I don't personally love
> that...)
>
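To make the arithmetic above concrete, here is a rough model of the delay calculation. It is a sketch of the description in this mail, not the kernel code: it ignores kbps_limit, the function and constant names are hypothetical, and two things are assumptions of the sketch rather than statements from the thread: that the iops_limit=0 case spreads the current backlog evenly over 6 hours, and that the 1ms..1000ms clamp applies to that case as well. The drain estimate further assumes one extent per discard operation and no new extents arriving.

```python
MIN_DELAY_MS = 1        # lower clamp described above
MAX_DELAY_MS = 1000     # upper clamp described above
ZERO_LIMIT_TARGET_MS = 6 * 60 * 60 * 1000  # the hardcoded "6 hours" case


def discard_delay_ms(iops_limit, discardable_extents=0):
    """Model the delay between two discard operations, in milliseconds."""
    if iops_limit > 0:
        delay = 1000 // iops_limit
    elif discardable_extents > 0:
        # Assumption: with iops_limit=0 the backlog is spread over 6 hours.
        delay = ZERO_LIMIT_TARGET_MS // discardable_extents
    else:
        delay = MAX_DELAY_MS
    # Assumption: the clamp is applied in every case.
    return max(MIN_DELAY_MS, min(MAX_DELAY_MS, delay))


def drain_time_s(iops_limit, discardable_extents):
    """Estimate seconds to clear a backlog, one extent per discard operation."""
    delay = discard_delay_ms(iops_limit, discardable_extents)
    return discardable_extents * delay / 1000


if __name__ == "__main__":
    for limit in (1, 10, 1000, 5000):
        print(limit, discard_delay_ms(limit), drain_time_s(limit, 36000))
```

Under this model iops_limit=10 gives a 100ms delay, so a backlog of 36,000 extents takes about an hour to clear, and any iops_limit above 1000 behaves exactly like 1000.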
> Hope that's somewhat helpful,
Thank you, Boris! That is very helpful. `discard_bytes_saved` is a good
(and not very obvious to understand!) signal.
> Boris
>
> >
> > /me would be interested in answers to these questions as well
> >
> > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> > --
> > Everything you wanna know about Linux kernel regression tracking:
> > https://linux-regtracking.leemhuis.info/about/#tldr
> > If I did something stupid, please tell me, as explained on that page.
--
Sergei