From: Chris Mason <clm@meta.com>
To: Christoph Hellwig <hch@infradead.org>,
	Linux regressions mailing list <regressions@lists.linux.dev>
Cc: Sergei Trofimovich <slyich@gmail.com>,
	Josef Bacik <josef@toxicpanda.com>,
	Christopher Price <pricechrispy@gmail.com>,
	anand.jain@oracle.com, boris@bur.io, clm@fb.com,
	dsterba@suse.com, linux-btrfs@vger.kernel.org
Subject: Re: [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async
Date: Tue, 4 Apr 2023 14:15:38 -0400
Message-ID: <41141706-2685-1b32-8624-c895a3b219ea@meta.com>
In-Reply-To: <ZCxKc5ZzP3Np71IC@infradead.org>

On 4/4/23 12:04 PM, Christoph Hellwig wrote:
> On Tue, Apr 04, 2023 at 12:49:40PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>> And that's just NVMe, the still shipping SATA SSDs are another different
>>>> story.  Not helped by the fact that we don't even support ranged
>>>> discards for them in Linux.
>>
>> Thx for your comments Christoph. Quick question, just to be sure I
>> understand things properly:
>>
>> I assume on some of those problematic devices these discard storms will
>> lead to a performance regression?

I'm searching through the various threads, but I don't think I've seen
the discard storm quantified?

Boris sent me this:
https://lore.kernel.org/linux-btrfs/ZCxP%2Fll7YjPdb9Ou@infradead.org/T/#m65851e5b8b0caa5320d2b7e322805dd200686f01

Which seems to match the 10 discards per second setting?  We should be
doing more of a dribble than a storm, so I'd like to understand if this
is a separate bug that should be fixed.
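
For anyone who wants to quantify it locally, a rough sketch like the
one below can sample the knobs and counters btrfs exposes under sysfs
and turn them into an actual rate.  (The file names are the ones I'd
expect in the per-filesystem discard sysfs directory; treat this as an
illustration and double-check the paths on your kernel.)

#!/usr/bin/env python3
# Rough sketch: sample the per-filesystem async discard counters from
# sysfs and estimate how fast the filesystem is actually discarding.
# The knob/counter names below are assumptions; adjust if your kernel
# lays things out differently.
import glob
import os
import time

KNOBS = ("iops_limit", "kbps_limit", "max_discard_size")
COUNTERS = ("discardable_extents", "discardable_bytes")

def read(path):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

for disc in glob.glob("/sys/fs/btrfs/*/discard"):
    fsid = os.path.basename(os.path.dirname(disc))
    print("filesystem", fsid)
    for name in KNOBS + COUNTERS:
        print("  %-22s %s" % (name, read(os.path.join(disc, name))))
    # Sample the discarded-bytes counter twice, 10 seconds apart, to
    # estimate the real rate.
    counter = os.path.join(disc, "discard_extent_bytes")
    before = read(counter)
    time.sleep(10)
    after = read(counter)
    if before is not None and after is not None:
        print("  ~%.0f bytes/sec discarded over the last 10s"
              % ((int(after) - int(before)) / 10.0))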

> 
> Probably.
> 
>> I also heard people saying these discard storms might reduce the life
>> time of some devices - is that true?
> 
> Also very much possible.  There are various SSDs that treat a discard
> as a write zeroes and always return zeroes from all discarded blocks.
> If the discards are smaller than or not aligned to the internal erase
> (super)blocks, this will actually cause additional writes.
> 
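
To put rough numbers on that point (purely illustrative figures, not
any specific drive): if a drive emulates discard as write-zeroes at
erase-block granularity, a discard that only partially covers an erase
block can force the untouched remainder of that block to be rewritten.

# Back-of-the-envelope model with a hypothetical 4 MiB erase block.
ERASE_BLOCK = 4 * 1024 * 1024

def extra_internal_writes(offset, length, erase_block=ERASE_BLOCK):
    # Bytes of erase blocks the discard touches, rounded out to
    # erase-block boundaries...
    start = offset - (offset % erase_block)
    end = -(-(offset + length) // erase_block) * erase_block
    # ...minus the bytes actually discarded = data the drive may have
    # to rewrite so it keeps returning zeroes for the discarded range.
    return (end - start) - length

# A 1 MiB discard in the middle of one 4 MiB erase block costs about
# 3 MiB of internal writes for 1 MiB of "freed" space.
print(extra_internal_writes(offset=1024 * 1024, length=1024 * 1024))
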
>> If the answer to at least one of these is "yes" I'd say it might be
>> best to revert 63a7cb130718 for now.
> 
> I don't think enabling it is a very smart idea for most consumer
> devices.

It seems like a good time to talk through the variations of discard
usage in fb data centers.  We run a pretty wide variety of hardware,
from consumer-grade ssds to enterprise ssds, and we've run these on
ext4/btrfs/xfs.

(Christoph knows most of this already, so I'm only partially replying to
him here)

First, there was synchronous discard.  These were pretty dark times
because all three of our filesystems would build a batch of synchronous
discards and then wait for them during filesystem commit.  There were
long tail latencies across all of our workloads, and so workload owners
would turn off discard and declare victory over terrible latencies.

Of course this predictably ended up with GC on the drives leading to
terrible latencies because we weren't discarding anymore, and nightly
trims became the obvious answer.  Different workloads would gyrate
through the variations, and the only consistent result was unhappiness.

Some places in the fleet still do this, and it can be a pretty simple
tradeoff between the IO impacts of full drive trims vs the latency
impact of built up GC vs over-provisioning.  It works for consistent
workloads, but honestly there aren't many of those.

Along the way both btrfs and xfs have grown variations of async discard.
The XFS one (sorry if I'm out of date here) didn't include any kind of
rate limiting, so if you were bulk deleting a lot of data, XFS would
effectively queue up so many discards that it actually saturated the
device for a long time, starving reads and writes.  If your workload did
a constant stream of allocation and deletion, the async discards would
just saturate the drive forever.

The workloads that care about latencies on XFS ended up going back to
synchronous discards, and they do a slow-rm hack that nibbles away at
the ends of files with periodic fsyncs mixed in until the file is zero
length.  They love this and it makes me cry.
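
For the record, the hack is roughly this shape (my reconstruction of
the idea, not their actual tool; the step size and pause are made-up
numbers):

# Nibble a file away from the end, fsyncing as we go, so the blocks
# get freed (and discarded) a little at a time instead of all at once
# on unlink.
import os
import time

def slow_rm(path, step=64 * 1024 * 1024, pause=0.1):
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        while size > 0:
            size = max(0, size - step)
            os.ftruncate(fd, size)   # free the tail of the file
            os.fsync(fd)             # push the commit (and discards) now
            time.sleep(pause)        # spread the work out over time
    finally:
        os.close(fd)
    os.unlink(path)

# slow_rm("/path/to/huge-file")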

The btrfs async discard feature was meant to address both of these
cases.  The primary features (a toy model of the idea follows the
list):

- Get rid of the transaction commit latency
- Enable allocations to steal from discards, reducing discard IO
- Avoid saturating the devices with discards by metering them out
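
If it helps, here's a toy model of the last two points (a deliberate
simplification to show the idea, not the kernel code; the class and
names below are made up for illustration):

import collections
import time

class AsyncDiscardQueue:
    def __init__(self, iops_limit=10):
        self.iops_limit = iops_limit              # discards issued per second
        self.pending = collections.OrderedDict()  # start -> length, FIFO

    def free_extent(self, start, length):
        # Freeing an extent just queues the discard; commit doesn't wait.
        self.pending[start] = length

    def steal_extent(self, start):
        # The allocator reuses a queued extent: its discard never happens.
        return self.pending.pop(start, None)

    def background_worker(self, issue_discard, run_for_s=1.0):
        # Dribble queued discards out at no more than iops_limit per second.
        deadline = time.monotonic() + run_for_s
        delay = 1.0 / self.iops_limit
        while self.pending and time.monotonic() < deadline:
            start, length = self.pending.popitem(last=False)
            issue_discard(start, length)          # e.g. wrap BLKDISCARD here
            time.sleep(delay)

# 100 freed extents, 30 reused before the worker gets to them, and a
# worker that only runs for half a second at 10 discards/sec.
q = AsyncDiscardQueue(iops_limit=10)
for i in range(100):
    q.free_extent(i * 1_000_000, 1_000_000)
for i in range(30):
    q.steal_extent(i * 1_000_000)
q.background_worker(lambda start, length: None, run_for_s=0.5)
print(len(q.pending), "extents still queued for later")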

Christoph mentions that modern enterprise drives are much better at
discarding, and we see this in production too.  But, we still have
workloads that switched from XFS to Btrfs because the async discard
feature did a better job of reducing drive write-amp and latencies.

So, honestly from my POV the async discard is best suited to consumer
devices.  Our defaults are probably wrong because no matter what you
choose there's a drive out there that makes it look bad.  Also, laptops
probably don't want the slow dribble.

I know Boris has some ideas on how to make the defaults better, so I'll
let him chime in there.

-chris


Thread overview: 27+ messages
2023-03-20 22:40 [6.2 regression][bisected]discard storm on idle since v6.1-rc8-59-g63a7cb130718 discard=async Christopher Price
2023-03-21 21:26 ` Josef Bacik
2023-03-22  8:38   ` Christoph Hellwig
2023-03-23 22:26     ` Sergei Trofimovich
2023-04-04 10:49       ` Linux regression tracking (Thorsten Leemhuis)
2023-04-04 16:04         ` Christoph Hellwig
2023-04-04 16:20           ` Roman Mamedov
2023-04-04 16:27             ` Christoph Hellwig
2023-04-04 23:37               ` Damien Le Moal
2023-04-04 18:15           ` Chris Mason [this message]
2023-04-04 18:51             ` Boris Burkov
2023-04-04 19:22               ` David Sterba
2023-04-04 19:39                 ` Boris Burkov
2023-04-05  8:17                   ` Linux regression tracking (Thorsten Leemhuis)
2023-04-10  2:03               ` Michael Bromilow
2023-04-11 17:52                 ` David Sterba
2023-04-11 18:15                   ` Linux regression tracking (Thorsten Leemhuis)
2023-04-04 19:08             ` Sergei Trofimovich
2023-04-05  6:18             ` Christoph Hellwig
2023-04-05 12:01               ` Chris Mason
2023-04-04 18:23         ` Boris Burkov
2023-04-04 19:12           ` Sergei Trofimovich
     [not found] <Y/+n1wS/4XAH7X1p@nz>
2023-03-02  8:04 ` Linux regression tracking #adding (Thorsten Leemhuis)
2023-04-04 10:52   ` Linux regression tracking #update (Thorsten Leemhuis)
2023-04-21 13:56   ` Linux regression tracking #update (Thorsten Leemhuis)
     [not found] ` <94cf49d0-fa2d-cc2c-240e-222706d69eb3@oracle.com>
     [not found]   ` <20230302105406.2cd367f7@nz>
2023-03-15 11:44     ` Linux regression tracking (Thorsten Leemhuis)
2023-03-15 16:34       ` Sergei Trofimovich
