From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-xfs@vger.kernel.org, "Darrick J. Wong" <djwong@kernel.org>,
	Dave Chinner <dchinner@redhat.com>
Subject: Re: Regression in XFS for fsync heavy workload
Date: Thu, 17 Mar 2022 10:38:28 +1100
Message-ID: <20220316233828.GU3927073@dread.disaster.area>
In-Reply-To: <20220316095437.ogwo2fxfpddaerie@quack3.lan>

On Wed, Mar 16, 2022 at 10:54:37AM +0100, Jan Kara wrote:
> On Wed 16-03-22 12:06:27, Dave Chinner wrote:
> > On Tue, Mar 15, 2022 at 01:49:43PM +0100, Jan Kara wrote:
> > > Hello,
> > > 
> > > I was tracking down a regression in dbench workload on XFS we have
> > > identified during our performance testing. These are results from one of
> > > our test machine (server with 64GB of RAM, 48 CPUs, SATA SSD for the test
> > > disk):
> > > 
> > > 			       good		       bad
> > > Amean     1        64.29 (   0.00%)       73.11 * -13.70%*
> > > Amean     2        84.71 (   0.00%)       98.05 * -15.75%*
> > > Amean     4       146.97 (   0.00%)      148.29 *  -0.90%*
> > > Amean     8       252.94 (   0.00%)      254.91 *  -0.78%*
> > > Amean     16      454.79 (   0.00%)      456.70 *  -0.42%*
> > > Amean     32      858.84 (   0.00%)      857.74 (   0.13%)
> > > Amean     64     1828.72 (   0.00%)     1865.99 *  -2.04%*
> > > 
> > > Note that the numbers are actually times to complete workload, not
> > > traditional dbench throughput numbers so lower is better.
> > 
> > How does this work? Dbench is a fixed time workload - the only
> > variability from run to run is the time it takes to run the cleanup
> > phase. Is this some hacked around version of dbench?
> 
> Yes, dbench is a fixed time workload, but in fact there is a workload file
> that gets executed in a loop. We run a modified version of dbench
> (https://github.com/mcgrof/dbench) which also reports the time spent by each
> execution of the workload file (the --show-execute-time option). This has
> much better statistical properties than the throughput dbench normally
> reports (which, for example, is completely ignorant of metadata operations,
> and that leads to big fluctuations in the reported numbers, especially for
> high client counts).

The high client count fluctuations are actually meaningful and very
enlightening if you know why the fluctuations are occurring. :)

e.g. at 280-300 clients on a maximally sized XFS log we run out of
log reservation space and fall off the lockless fast path. At this
point throughput is determined by metadata IO throughput and
transaction reservation latency, not page cache write IO throughput.
IOWs, variations in performance directly reflect the latency impact
of full cycle metadata operations, not just journal fsync
throughput.
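
To illustrate the "lockless fast path" part of that, here is a minimal
sketch of the pattern (not the actual XFS grant head code; the names
are made up): log space is claimed with a single atomic
compare-and-swap, and only when that fails does the caller have to
queue up and sleep for space, which is where the extra latency at high
client counts comes from.

#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reservation pool, not an XFS structure. */
struct resv_pool {
	atomic_long	free_bytes;	/* unreserved space left in the log */
};

/*
 * Fast path: claim space with one CAS, no locks, no sleeping.
 * Returns false when the pool is exhausted and the caller must fall
 * back to a slow path that waits for space to be freed.
 */
static bool resv_try_fast(struct resv_pool *pool, long need)
{
	long old = atomic_load(&pool->free_bytes);

	while (old >= need) {
		if (atomic_compare_exchange_weak(&pool->free_bytes,
						 &old, old - need))
			return true;	/* reserved locklessly */
		/* CAS failed: 'old' was reloaded with the current value, retry */
	}
	return false;			/* off the fast path: block for space */
}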

At 512 clients, the page cache footprint of dbench is about 10GB.
Hence somewhere around ~700-800 clients on a 16GB RAM machine the
workload will no longer fit in the page cache, and so memory
reclaim and page cache repopulation start to affect measured throughput.
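(Roughly: 10GB / 512 clients is ~20MB of page cache per client, so
~800 clients works out to about 16GB, i.e. all of RAM.)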

So, yeah, I tend to run bandwidth measurement up to very high client
counts because it gives much more insight into full subsystem cycle
behaviour than just running the "does fsync scale" aspect that low
client count testing exercises....

> > When doing this work, I didn't count cache flushes. What I looked at
> > was the number of log forces vs the number of sleeps waiting on log
> > forces vs log writes vs the number of stalls waiting for log writes.
> > These numbers showed improvements across the board, so any increase
> > in overhead from physical cache flushes was not reflected in the
> > throughput increases I was measuring at the "fsync drives log
> > forces" level.
> 
> Thanks for the detailed explanation! I'd just note that e.g. for a machine
> with 8 CPUs, 32 GB of RAM and an Intel SSD behind a megaraid_sas controller
> (a Dell PowerEdge server) we see even larger regressions like:
> 
>                     good                      bad
> Amean 	1	97.93	( 0.00%)	135.67	( -38.54%)
> Amean 	2	147.69	( 0.00%)	194.82	( -31.91%)
> Amean 	4	242.82	( 0.00%)	352.98	( -45.36%)
> Amean 	8	375.36	( 0.00%)	591.03	( -57.45%)
> 
> I didn't investigate on this machine (it was running some other tests and I
> had another machine in my hands which also showed a regression, although a
> smaller one) but now, reading your explanations, I'm curious why the
> regression grows with the number of threads on that machine. Maybe the
> culprit is different there, or the dynamics just aren't as we imagine on
> that storage controller... I guess I'll borrow the machine and check it.

That sounds more like a poor caching implementation in the hardware
RAID controller than anything else.

> > > The reason, as far as I understand it, is that
> > > xlog_cil_push_work() never actually ends up writing the iclog (I can see
> > > this in the traces) because it is writing just very small amounts (my
> > > debugging shows xlog_cil_push_work() tends to add 300-1000 bytes to the
> > > iclog; 4000 bytes is the largest number I've seen) and very frequent
> > > fsync(2) calls from dbench always end up forcing the iclog before it
> > > gets filled. So the cache flushes issued by xlog_cil_push_work() are
> > > just pointless overhead for this workload AFAIU.
> > 
> > It's not quite that straightforward.
> > 
> > Keep in mind that the block layer is supposed to merge new flush
> > requests that occur while there is still a flush in progress. Hence
> > the only time this async flush should cause extra flush requests to
> > physically occur is if you have storage that either ignores flush
> > requests (in which case we don't care because bio_submit() aborts
> > real quick) or is really, really fast, so that cache flush requests
> > complete before we start hitting the block layer merge case or
> > slowing down other IO. If storage is slow and there's any amount of
> > concurrency, then we're going to be waiting on merged flush requests
> > in the block layer, so the impact is fairly well bounded there, too.
> >
> > Hence cache flush latency is only going to impact on very
> > low concurrency workloads where any additional wait time directly
> > translates to reduced throughput. That's pretty much what your
> > numbers indicate, too.
> 
> Yes, for higher thread counts I agree flush merging should mitigate the
> impact. But note that there is still some overhead from the additional
> flushes because the block layer will merge only with flushes that are queued
> and not yet issued to the device. If there is a flush in progress, a new
> flush will be queued and submitted once the first one completes. It is only
> the third flush that gets merged with the second one.

Yup, and in this workload we'll generally have 4 concurrent CIL push
works being run, so we're over that threshold most of the time on
heavy concurrent fsync workloads.
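
As a minimal sketch of that merging behaviour (not the real blk-flush
machinery; the structure and names are made up): the first flush goes
straight to the device, the second is queued behind it, and only the
third and later requests merge into the already-queued one.

#include <stdbool.h>

/* Toy per-device flush state. */
struct flush_state {
	bool	in_flight;	/* a flush has been issued to the device */
	bool	queued;		/* a follow-up flush waits for it to finish */
	int	merged;		/* later requests riding on the queued flush */
};

/* Returns true if this request causes a new flush to be issued or queued. */
static bool flush_request(struct flush_state *fs)
{
	if (!fs->in_flight) {
		fs->in_flight = true;	/* 1st flush: issued immediately */
		return true;
	}
	if (!fs->queued) {
		fs->queued = true;	/* 2nd flush: queued until the 1st completes */
		return true;
	}
	fs->merged++;			/* 3rd and later: merged into the queued one */
	return false;
}

With ~4 concurrent CIL pushes plus the fsync-driven log forces, we are
usually past that third-flush threshold, so most of the extra flushes
end up in the merged bucket and the overhead stays bounded.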

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
