From: Jan Kara <jack@suse.cz>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-xfs@vger.kernel.org, "Darrick J. Wong" <djwong@kernel.org>,
	Dave Chinner <dchinner@redhat.com>
Subject: Re: Regression in XFS for fsync heavy workload
Date: Wed, 16 Mar 2022 20:38:40 +0100
Message-ID: <20220316193840.3t2ahjxnkvmk6okz@quack3.lan>
In-Reply-To: <20220316100934.6bcg75zcfvoyizzl@quack3.lan>

On Wed 16-03-22 11:09:34, Jan Kara wrote:
> On Wed 16-03-22 18:44:59, Dave Chinner wrote:
> > On Wed, Mar 16, 2022 at 12:06:27PM +1100, Dave Chinner wrote:
> > > On Tue, Mar 15, 2022 at 01:49:43PM +0100, Jan Kara wrote:
> > > > Hello,
> > > > 
> > > > I was tracking down a regression in the dbench workload on XFS that we
> > > > identified during our performance testing. These are results from one of
> > > > our test machines (a server with 64 GB of RAM, 48 CPUs, and a SATA SSD as
> > > > the test disk):
> > > > 
> > > > 			       good		       bad
> > > > Amean     1        64.29 (   0.00%)       73.11 * -13.70%*
> > > > Amean     2        84.71 (   0.00%)       98.05 * -15.75%*
> > > > Amean     4       146.97 (   0.00%)      148.29 *  -0.90%*
> > > > Amean     8       252.94 (   0.00%)      254.91 *  -0.78%*
> > > > Amean     16      454.79 (   0.00%)      456.70 *  -0.42%*
> > > > Amean     32      858.84 (   0.00%)      857.74 (   0.13%)
> > > > Amean     64     1828.72 (   0.00%)     1865.99 *  -2.04%*
> > > > 
> > > > Note that the numbers are actually times to complete the workload, not
> > > > traditional dbench throughput numbers, so lower is better.
> > ....
> > 
> > > > This should still
> > > > submit it rather early to provide the latency advantage. Otherwise postpone
> > > > the flush to the moment we know we are going to flush the iclog to save
> > > > pointless flushes. But we would have to record whether the flush happened
> > > > or not in the iclog and it would all get a bit hairy...
> > > 
> > > I think we can just set the NEED_FLUSH flag appropriately.
> > > 
> > > However, given all this, I'm wondering if the async cache flush was
> > > really a case of premature optimisation. That is, we don't really
> > > gain anything by reducing the flush latency of the first iclog write
> > > when we are writing 100-1000 iclogs before the commit record, and it
> > > can be harmful to some workloads by issuing more flushes than we
> > > need to.
> > > 
> > > So perhaps the right thing to do is just get rid of it and always
> > > mark the first iclog in a checkpoint as NEED_FLUSH....
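
[ For context, what "mark the first iclog in a checkpoint as NEED_FLUSH"
  boils down to is the first hunk of the patch quoted further down, in
  xlog_cil_set_ctx_write_state(); this is just a minimal sketch of that
  hunk, not separate new code:

	/*
	 * This is the first iclog of the checkpoint: make sure the
	 * metadata we are about to overwrite in the log has been
	 * flushed to stable storage before this iclog is issued.
	 */
	spin_lock(&cil->xc_log->l_icloglock);
	iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
	spin_unlock(&cil->xc_log->l_icloglock);
]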
> > 
> > So I've run some tests on code that does this, and the storage I've
> > tested it on shows largely no difference in streaming CIL commit and
> > fsync-heavy workloads when comparing sync vs async cache flushes. One
> > set of tests was against high-speed NVMe SSDs, the other against old,
> > slower SATA SSDs.
> > 
> > Jan, can you run the patch below (against 5.17-rc8) and see what
> > results you get on your modified dbench test?
> 
> Sure, I'll run the test. I forgot to mention that in the vanilla upstream
> kernel I could see the difference in the number of cache flushes caused by
> the XFS changes but no actual change in the dbench numbers (they were still
> comparable to the bad ones from my test). The XFS change made a material
> difference to dbench performance only together with the scheduler / CPU
> idling / frequency scaling fixes we have in our SLE kernel (I didn't try to
> pin down which ones exactly - I guess I can work around that by using the
> performance cpufreq governor and disabling low C-states so that I can test
> stock vanilla kernels). Thanks for the patch!

Yup, so with C-states limited and the performance cpufreq governor in use,
I can see that your patch significantly helps dbench performance:

                   5.17-rc8-vanilla       5.17-rc8-patched
Amean     1        71.22 (   0.00%)       64.94 *   8.81%*
Amean     2        93.03 (   0.00%)       84.80 *   8.85%*
Amean     4       150.54 (   0.00%)      137.51 *   8.66%*
Amean     8       252.53 (   0.00%)      242.24 *   4.08%*
Amean     16      454.13 (   0.00%)      439.08 *   3.31%*
Amean     32      835.24 (   0.00%)      829.74 *   0.66%*
Amean     64     1740.59 (   0.00%)     1686.73 *   3.09%*
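
(For reference, the percentage column appears to be relative to the vanilla
column, i.e. (vanilla - patched) / vanilla; for one client that is
(71.22 - 64.94) / 71.22 ~= 8.8%, matching the reported 8.81% up to rounding
of the displayed means, and positive means the patched kernel completes the
workload faster.)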

Both the performance and the number of cache flushes are restored to the
values from before commit bad77c375e8d ("xfs: CIL checkpoint flushes caches
unconditionally").


								Honza

> > xfs: drop async cache flushes from CIL commits.
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > As discussed here:
> > 
> > https://lore.kernel.org/linux-xfs/20220316010627.GO3927073@dread.disaster.area/T/#t
> > 
> > This is a prototype for removing async cache flushes from the CIL
> > checkpoint path. First, results on fast NVMe storage:
> > 
> > From `dbench -t 30`, current TOT:
> > 
> > clients		async			sync
> > 		BW	Latency		BW	Latency
> > 1		 767.14   0.858		 792.10   0.812
> > 8		2231.18   5.607		2263.24  10.985
> > 16		3023.25   5.779		2999.16   7.087
> > 32		3712.80  11.468		3649.19   9.967
> > 128		5997.98  13.719		6973.84  12.620
> > 512		4256.29 104.050		5089.61  97.548
> > 
> > From `dbench -t 30`, CIL scale:
> > 
> > clients		async			sync
> > 		BW	Latency		BW	Latency
> > 1		 935.18   0.855		 915.64   0.903
> > 8		2404.51   6.873		2341.77   6.511
> > 16		3003.42   6.460		2931.57   6.529
> > 32		3697.23   7.939		3596.28   7.894
> > 128		7237.43  15.495		7217.74  11.588
> > 512		5079.24  90.587		5167.08  95.822
> > 
> > fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
> > 
> > 	create		chown		unlink
> > async   1m41s		1m16s		2m03s
> > sync	1m40s		1m19s		1m54s
> > 
> > async log iops: up to 17kiops
> > async log b/w: up to 570MB/s
> > 
> > sync log iops: up to 18kiops
> > sync log b/w: up to 600MB/s
> > 
> > Ok, so there's really no difference from async flushes on really
> > fast storage.
> > 
> > Slower storage:
> > 
> > From `dbench -t 30`, CIL scale:
> > 
> > clients		async			sync
> > 		BW	Latency		BW	Latency
> > 1		  78.59  15.792		  83.78  10.729
> > 8		 367.88  92.067		 404.63  59.943
> > 16		 564.51  72.524		 602.71  76.089
> > 32		 831.66 105.984		 870.26 110.482
> > 128		1659.76 102.969		1624.73  91.356
> > 512		2135.91 223.054		2603.07 161.160
> > 
> > fsmark, 16 threads, create w/32k logbsize
> > 
> > 	create		unlink
> > async   5m06s		4m15s
> > sync	5m00s		4m22s
> > 
> > 
> > Mostly no change here, either. Possibly a bit better fsync overload
> > behaviour with sync flushes.
> > 
> > I think we can probably just get rid of async flushes altogether for
> > the moment. It looked necessary when developing the code, but seems
> > to be complexity we don't actually need now that it's settled down a
> > bit and all the bugs have been flushed out.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c | 27 +++++++++++++++++++--------
> >  1 file changed, 19 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 83a039762b81..14746253805b 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -705,11 +705,21 @@ xlog_cil_set_ctx_write_state(
> >  		 * The LSN we need to pass to the log items on transaction
> >  		 * commit is the LSN reported by the first log vector write, not
> >  		 * the commit lsn. If we use the commit record lsn then we can
> > -		 * move the tail beyond the grant write head.
> > +		 * move the grant write head beyond the tail LSN and overwrite
> > +		 * it.
> >  		 */
> >  		ctx->start_lsn = lsn;
> >  		wake_up_all(&cil->xc_start_wait);
> >  		spin_unlock(&cil->xc_push_lock);
> > +
> > +		/*
> > +		 * Make sure the metadata we are about to overwrite in the log
> > +		 * has been flushed to stable storage before this iclog is
> > +		 * issued.
> > +		 */
> > +		spin_lock(&cil->xc_log->l_icloglock);
> > +		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> > +		spin_unlock(&cil->xc_log->l_icloglock);
> >  		return;
> >  	}
> >  
> > @@ -888,10 +898,10 @@ xlog_cil_push_work(
> >  	struct xfs_trans_header thdr;
> >  	struct xfs_log_iovec	lhdr;
> >  	struct xfs_log_vec	lvhdr = { NULL };
> > -	xfs_lsn_t		preflush_tail_lsn;
> > +//	xfs_lsn_t		preflush_tail_lsn;
> >  	xfs_csn_t		push_seq;
> >  	struct bio		bio;
> > -	DECLARE_COMPLETION_ONSTACK(bdev_flush);
> > +//	DECLARE_COMPLETION_ONSTACK(bdev_flush);
> >  	bool			push_commit_stable;
> >  
> >  	new_ctx = xlog_cil_ctx_alloc();
> > @@ -974,9 +984,9 @@ xlog_cil_push_work(
> >  	 * before the iclog write. To detect whether the log tail moves, sample
> >  	 * the tail LSN *before* we issue the flush.
> >  	 */
> > -	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
> > -	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
> > -				&bdev_flush);
> > +//	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
> > +//	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
> > +//				&bdev_flush);
> >  
> >  	/*
> >  	 * Pull all the log vectors off the items in the CIL, and remove the
> > @@ -1058,7 +1068,7 @@ xlog_cil_push_work(
> >  	 * Before we format and submit the first iclog, we have to ensure that
> >  	 * the metadata writeback ordering cache flush is complete.
> >  	 */
> > -	wait_for_completion(&bdev_flush);
> > +//	wait_for_completion(&bdev_flush);
> >  
> >  	error = xlog_cil_write_chain(ctx, &lvhdr);
> >  	if (error)
> > @@ -1118,7 +1128,8 @@ xlog_cil_push_work(
> >  	if (push_commit_stable &&
> >  	    ctx->commit_iclog->ic_state == XLOG_STATE_ACTIVE)
> >  		xlog_state_switch_iclogs(log, ctx->commit_iclog, 0);
> > -	xlog_state_release_iclog(log, ctx->commit_iclog, preflush_tail_lsn);
> > +//	xlog_state_release_iclog(log, ctx->commit_iclog, preflush_tail_lsn);
> > +	xlog_state_release_iclog(log, ctx->commit_iclog, 0);
> >  
> >  	/* Not safe to reference ctx now! */
> >  
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Thread overview: 9+ messages
2022-03-15 12:49 Regression in XFS for fsync heavy workload Jan Kara
2022-03-16  1:06 ` Dave Chinner
2022-03-16  7:44   ` Dave Chinner
2022-03-16 10:09     ` Jan Kara
2022-03-16 19:38       ` Jan Kara [this message]
2022-03-16 23:26         ` Dave Chinner
2022-03-16  9:54   ` Jan Kara
2022-03-16 23:38     ` Dave Chinner
2022-03-17 11:56       ` Jan Kara
