From: Dennis Zhou <dennis@kernel.org>
To: dsterba@suse.cz, Dennis Zhou <dennis@kernel.org>,
	David Sterba <dsterba@suse.com>, Chris Mason <clm@fb.com>,
	Josef Bacik <josef@toxicpanda.com>,
	Omar Sandoval <osandov@osandov.com>,
	kernel-team@fb.com, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v6 00/22] btrfs: async discard support
Date: Thu, 19 Dec 2019 15:17:55 -0600
Message-ID: <20191219211755.GA38803@dennisz-mbp.dhcp.thefacebook.com>
In-Reply-To: <20191219203438.GS3929@twin.jikos.cz>

On Thu, Dec 19, 2019 at 09:34:38PM +0100, David Sterba wrote:
> On Tue, Dec 17, 2019 at 07:06:00PM -0500, Dennis Zhou wrote:
> > > Regarding the slow io submission, I tried to increase the iops value,
> > > default was 10, but 100 and 1000 made no change. Increasing the maximum
> > > discard request size to 128M worked (when there was such long extent
> > > ready). I was expecting a burst of like 4 consecutive IOs after a 600MB
> > > file is deleted.  I did not try to tweak bps_limit because there was
> > > nothing to limit.
> > 
> > Ah, there's actually a maximum time between discards, set to 10
> > seconds, because the overall timeout is calculated over a 6 hour
> > window. So if we only have 6 extents, we'd discard roughly one per
> > hour (ish, given it decays), but that gets clamped to 10 seconds.
> > 
> > At least on our servers, we seem to discard at a reasonable rate to
> > prevent performance penalties during heavy reads and writes while
> > keeping write amplification reasonable. Also, metadata blocks aren't
> > tracked, so when a whole metadata block group is freed (outside of
> > relocation), we'll trickle discards slightly slower than expected.
> 
> So after watching the sysfs numbers, my observation is that the overall
> strategy of the async discard is to wait for larger ranges and discard
> one range every 10 seconds. This is a slow process, but it makes sense
> when there are reads or writes going on, since the discard IO penalty
> stays marginal.
> 

Yeah, (un)fortunately on our systems we're running chef fairly
frequently, which generates a lot of IO on top of package deployment.
That drives the system to a fairly high steady-state number of
untrimmed extents and results in a somewhat faster-paced discard rate.
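
To spell that out, the delay between discards works out roughly like
the sketch below. This is not the exact kernel code; the constant
names and defaults are my approximations of what the series does:

/*
 * Spread the discardable extents over a ~6 hour target window,
 * respect the iops limit, and clamp so we never wait more than
 * 10 seconds between discards.
 */
#define DISCARD_TARGET_MSEC	(6UL * 60 * 60 * 1000)	/* 6 hours */
#define DISCARD_MIN_DELAY_MSEC	1UL
#define DISCARD_MAX_DELAY_MSEC	10000UL			/* 10 seconds */

static unsigned long discard_delay_msec(unsigned long nr_extents,
					unsigned long iops_limit)
{
	unsigned long lower = DISCARD_MIN_DELAY_MSEC;
	unsigned long delay;

	if (!nr_extents)
		return DISCARD_MAX_DELAY_MSEC;

	/* An iops limit of N means at least 1000 / N msec between IOs. */
	if (iops_limit && 1000UL / iops_limit > lower)
		lower = 1000UL / iops_limit;

	/* e.g. 6 extents -> ~1 per hour, which the clamp cuts to 10s. */
	delay = DISCARD_TARGET_MSEC / nr_extents;

	if (delay < lower)
		delay = lower;
	if (delay > DISCARD_MAX_DELAY_MSEC)
		delay = DISCARD_MAX_DELAY_MSEC;
	return delay;
}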

> Running full fstrim will flush all the discardable extents, so there's
> a way to reset the discardable queue. What I still don't see as optimal
> is the single discard request sent per period, especially given that
> there's the iops_limit knob.
> 

Yeah, it's not really ideal at the moment for slower-paced systems such
as our own laptops. Adding persistence would also make a big difference
here.

> My idea is that on each timeout, 'iops_limit' times 'max_discard_size'
> worth of discards is issued, so the batches are large in total.
> However, this has an impact on reads and writes and also on the device
> itself, and I'm not sure whether overly frequent discards would make
> things worse (as this is a known problem).
> 

I spent a bit of time looking at the impact of discard on some drives,
and my conclusion was that the iops rate is more impactful than the
size of the discards (within reason, which is why there's
max_discard_size). On one particular drive, I noticed that if I went
over 10 discard iops on a sustained simple read/write workload, the
latencies would double. That's where the 10 iops limit comes from, and
given that latency impact, it's why this trickles discards down in
pieces rather than issuing them as a larger batch.
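
To make the trickling concrete, here's a toy sketch of one discard per
wakeup capped at a max size; the 64M figure and names are illustrative,
not the exact kernel defaults:

/*
 * One capped discard per wakeup, so a large extent is trimmed over
 * several intervals instead of in one big request.
 */
#define MAX_DISCARD_SIZE	(64ULL << 20)	/* assumed 64M cap */

/*
 * Return the bytes to discard this interval and advance *start; the
 * worker keeps rescheduling until *start reaches end.
 */
static unsigned long long next_discard_len(unsigned long long *start,
					   unsigned long long end)
{
	unsigned long long len = end - *start;

	if (len > MAX_DISCARD_SIZE)
		len = MAX_DISCARD_SIZE;
	*start += len;
	return len;
}

With a 600MB extent and the 10 second clamp, that's ~10 chunks, so on
the order of 100 seconds to trim the whole thing, which would line up
with the slow submission you observed.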

> I'm interested in more strategies that you may have tested in your
> setups, either bps based, rate limited, etc. The current one seems ok
> for a first implementation, but we might want to tune it once we get
> feedback from more users.

Definitely. One of the things I want to do is experiment with different
limits and see how they correlate with write amplification. I'm sure
there's some happy medium we can identify that's a lot less arbitrary
than what's currently set forth. I imagine it should result in a graph
correlating the delay and rate of discarding with a particular write
amp for a fixed workload.
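
And just so we're measuring the same thing, the write amp I mean is the
usual ratio; a placeholder sketch, since the counter sources are device
specific and not a real API:

/*
 * Write amplification: bytes physically written by the device (NAND)
 * divided by bytes submitted by the host over the same window.
 */
static double write_amp(double nand_bytes, double host_bytes)
{
	return host_bytes ? nand_bytes / host_bytes : 0.0;
}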

Thanks,
Dennis


Thread overview: 49+ messages
2019-12-14  0:22 [PATCH v6 00/22] btrfs: async discard support Dennis Zhou
2019-12-14  0:22 ` [PATCH 01/22] bitmap: genericize percpu bitmap region iterators Dennis Zhou
2019-12-14  0:22 ` [PATCH 02/22] btrfs: rename DISCARD opt to DISCARD_SYNC Dennis Zhou
2019-12-14  0:22 ` [PATCH 03/22] btrfs: keep track of which extents have been discarded Dennis Zhou
2019-12-14  0:22 ` [PATCH 04/22] btrfs: keep track of cleanliness of the bitmap Dennis Zhou
2019-12-14  0:22 ` [PATCH 05/22] btrfs: add the beginning of async discard, discard workqueue Dennis Zhou
2019-12-14  0:22 ` [PATCH 06/22] btrfs: handle empty block_group removal Dennis Zhou
2019-12-14  0:22 ` [PATCH 07/22] btrfs: discard one region at a time in async discard Dennis Zhou
2019-12-14  0:22 ` [PATCH 08/22] btrfs: add removal calls for sysfs debug/ Dennis Zhou
2019-12-18 11:45   ` Anand Jain
2019-12-14  0:22 ` [PATCH 09/22] btrfs: make UUID/debug have its own kobject Dennis Zhou
2019-12-18 11:45   ` Anand Jain
2019-12-14  0:22 ` [PATCH 10/22] btrfs: add discard sysfs directory Dennis Zhou
2019-12-18 11:45   ` Anand Jain
2019-12-14  0:22 ` [PATCH 11/22] btrfs: track discardable extents for async discard Dennis Zhou
2019-12-14  0:22 ` [PATCH 12/22] btrfs: keep track of discardable_bytes Dennis Zhou
2019-12-14  0:22 ` [PATCH 13/22] btrfs: calculate discard delay based on number of extents Dennis Zhou
2019-12-30 16:50   ` David Sterba
2020-01-02 16:45     ` Dennis Zhou
2019-12-14  0:22 ` [PATCH 14/22] btrfs: add bps discard rate limit Dennis Zhou
2019-12-30 17:58   ` David Sterba
2020-01-02 16:46     ` Dennis Zhou
2019-12-14  0:22 ` [PATCH 15/22] btrfs: limit max discard size for async discard Dennis Zhou
2019-12-30 18:00   ` David Sterba
2019-12-30 18:08   ` David Sterba
2020-01-02 16:48     ` Dennis Zhou
2019-12-14  0:22 ` [PATCH 16/22] btrfs: make max async discard size tunable Dennis Zhou
2019-12-30 18:05   ` David Sterba
2020-01-02 16:50     ` Dennis Zhou
2019-12-14  0:22 ` [PATCH 17/22] btrfs: have multiple discard lists Dennis Zhou
2019-12-14  0:22 ` [PATCH 18/22] btrfs: only keep track of data extents for async discard Dennis Zhou
2019-12-30 17:39   ` David Sterba
2020-01-02 16:55     ` Dennis Zhou
2019-12-14  0:22 ` [PATCH 19/22] btrfs: keep track of discard reuse stats Dennis Zhou
2019-12-30 17:33   ` David Sterba
2020-01-02 16:57     ` Dennis Zhou
2019-12-14  0:22 ` [PATCH 20/22] btrfs: add async discard header Dennis Zhou
2019-12-14  0:22 ` [PATCH 21/22] btrfs: increase the metadata allowance for the free_space_cache Dennis Zhou
2019-12-14  0:22 ` [PATCH 22/22] btrfs: make smaller extents more likely to go into bitmaps Dennis Zhou
2019-12-17 14:55 ` [PATCH v6 00/22] btrfs: async discard support David Sterba
2019-12-18  0:06   ` Dennis Zhou
2019-12-19  2:03     ` Dennis Zhou
2019-12-19 20:06       ` David Sterba
2019-12-19 21:22         ` Dennis Zhou
2019-12-19 20:34     ` David Sterba
2019-12-19 21:17       ` Dennis Zhou [this message]
2019-12-30 18:13 ` David Sterba
2019-12-30 18:49   ` Dennis Zhou
2020-01-02 13:22     ` David Sterba
