* Regression in XFS for fsync heavy workload
@ 2022-03-15 12:49 Jan Kara
  2022-03-16  1:06 ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2022-03-15 12:49 UTC (permalink / raw)
  To: linux-xfs; +Cc: Darrick J. Wong, Dave Chinner

Hello,

I was tracking down a regression in a dbench workload on XFS that we
identified during our performance testing. These are results from one of
our test machines (a server with 64GB of RAM, 48 CPUs, and a SATA SSD for
the test disk):

			       good		       bad
Amean     1        64.29 (   0.00%)       73.11 * -13.70%*
Amean     2        84.71 (   0.00%)       98.05 * -15.75%*
Amean     4       146.97 (   0.00%)      148.29 *  -0.90%*
Amean     8       252.94 (   0.00%)      254.91 *  -0.78%*
Amean     16      454.79 (   0.00%)      456.70 *  -0.42%*
Amean     32      858.84 (   0.00%)      857.74 (   0.13%)
Amean     64     1828.72 (   0.00%)     1865.99 *  -2.04%*

Note that the numbers are actually times to complete the workload, not
traditional dbench throughput numbers, so lower is better. Eventually I
tracked the problem down to commit bad77c375e8d ("xfs: CIL checkpoint
flushes caches unconditionally"). Before this commit we submit ~63k cache
flush requests during the dbench run; after this commit we submit ~150k
cache flush requests, and the additional cache flushes are coming from
xlog_cil_push_work(). The reason, as far as I understand it, is that
xlog_cil_push_work() never actually ends up writing the iclog (I can see
this in the traces) because it adds only very small amounts of data (my
debugging shows xlog_cil_push_work() tends to add 300-1000 bytes to the
iclog, 4000 bytes being the largest number I've seen) and the very
frequent fsync(2) calls from dbench always end up forcing the iclog before
it gets filled. So the cache flushes issued by xlog_cil_push_work() are
just pointless overhead for this workload AFAIU.

Is there a way we could improve this? One idea I had is to call
xfs_flush_bdev_async() only once we find enough items while walking the
cil->xc_cil list that we think an iclog write is likely. This should still
submit it early enough to provide the latency advantage. Otherwise,
postpone the flush to the moment we know we are going to flush the iclog,
to save the pointless flushes. But then we would have to record in the
iclog whether the flush happened or not, and it would all get a bit
hairy... I'm definitely not an expert in the XFS logging code, so I'm just
writing up my current findings here in case people have some ideas.
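
Very roughly, the kind of thing I have in mind looks like the sketch
below. This is untested pseudo-C just to illustrate the idea; the
per-item byte accounting and the threshold are only placeholders for
whatever the CIL already tracks, and locking is omitted:

/*
 * Sketch only: start the async cache flush early only once the
 * accumulated log vector size suggests we are likely to fill at least
 * one iclog. Otherwise leave the flush to the eventual iclog write.
 */
static void
xlog_cil_maybe_flush_bdev(
	struct xfs_cil		*cil,
	struct xlog		*log,
	struct bio		*bio,
	struct completion	*done)
{
	struct xfs_log_item	*lip;
	uint32_t		bytes = 0;

	list_for_each_entry(lip, &cil->xc_cil, li_cil) {
		bytes += lip->li_lv->lv_bytes;	/* illustrative accounting */
		if (bytes >= log->l_iclog_size) {
			xfs_flush_bdev_async(bio,
					log->l_mp->m_ddev_targp->bt_bdev,
					done);
			return;
		}
	}
}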

								Honza


-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: Regression in XFS for fsync heavy workload
  2022-03-15 12:49 Regression in XFS for fsync heavy workload Jan Kara
@ 2022-03-16  1:06 ` Dave Chinner
  2022-03-16  7:44   ` Dave Chinner
  2022-03-16  9:54   ` Jan Kara
  0 siblings, 2 replies; 9+ messages in thread
From: Dave Chinner @ 2022-03-16  1:06 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-xfs, Darrick J. Wong, Dave Chinner

On Tue, Mar 15, 2022 at 01:49:43PM +0100, Jan Kara wrote:
> Hello,
> 
> I was tracking down a regression in dbench workload on XFS we have
> identified during our performance testing. These are results from one of
> our test machine (server with 64GB of RAM, 48 CPUs, SATA SSD for the test
> disk):
> 
> 			       good		       bad
> Amean     1        64.29 (   0.00%)       73.11 * -13.70%*
> Amean     2        84.71 (   0.00%)       98.05 * -15.75%*
> Amean     4       146.97 (   0.00%)      148.29 *  -0.90%*
> Amean     8       252.94 (   0.00%)      254.91 *  -0.78%*
> Amean     16      454.79 (   0.00%)      456.70 *  -0.42%*
> Amean     32      858.84 (   0.00%)      857.74 (   0.13%)
> Amean     64     1828.72 (   0.00%)     1865.99 *  -2.04%*
> 
> Note that the numbers are actually times to complete workload, not
> traditional dbench throughput numbers so lower is better.

How does this work? Dbench is a fixed time workload - the only
variability from run to run is the time it takes to run the cleanup
phase. Is this some hacked around version of dbench?

> Eventually I have
> tracked down the problem to commit bad77c375e8d ("xfs: CIL checkpoint
> flushes caches unconditionally").  Before this commit we submit ~63k cache
> flush requests during the dbench run, after this commit we submit ~150k
> cache flush requests. And the additional cache flushes are coming from
> xlog_cil_push_work().

Yup, that was a known trade-off with this series of changes to the
REQ_FLUSH/REQ_FUA behaviour of the CIL - fsync-heavy workloads could
see higher cache flush counts, and low concurrency rates would see
that as increased runtime because there isn't the journal write load
needed to completely mitigate the impact of more frequent cache
flushes.

The trade-off is, OTOH, that other workloads see 10-1000x lower cache
flush rates when streaming large CIL checkpoints through the log as
fast as it can issue sequential IO to the disk. e.g. a 32MB CIL
threshold (>256MB journal) with 32KB iclogbuf size (default) on an
async workload will write ~1000 iclogs to flush a 32MB CIL
checkpoint. Prior to this series, every one of those iclog writes
would be issued as REQ_FLUSH|REQ_FUA. After this patch series,
we do a REQ_FLUSH from xfs_flush_bdev_async(), the first iclog is
REQ_FUA, and the last iclog is REQ_FLUSH|REQ_FUA. IOW, we reduce
the number of FUA + cache flushes by 3 orders of magnitude for these
cases....
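
To put rough numbers on that (illustrative arithmetic only, matching
the 32MB/32KB example above):

#include <stdio.h>

/*
 * Illustrative flush accounting for one 32MB checkpoint with 32KB
 * iclogs (rough numbers only).
 */
int main(void)
{
	unsigned int iclogs = (32u << 20) / (32u << 10);	/* 1024 iclog writes */
	unsigned int before = iclogs;	/* REQ_FLUSH|REQ_FUA on every iclog */
	unsigned int after  = 3;	/* bdev flush + first iclog REQ_FUA +
					 * commit record REQ_FLUSH|REQ_FUA */

	printf("flush/FUA operations per checkpoint: %u -> %u\n", before, after);
	return 0;
}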

There's a heap more detail of the changes and the complexity in
the commit message for the commit two patches further along in
that series where these benefits are realised - see commit
eef983ffeae7 ("xfs: journal IO cache flush reductions") for more
details.

When doing this work, I didn't count cache flushes. What I looked at
was the number of log forces vs the number of sleeps waiting on log
forces vs log writes vs the number of stalls waiting for log writes.
These numbers showed improvements across the board, so any increase
in overhead from physical cache flushes was not reflected in the
throughput increases I was measuring at the "fsync drives log
forces" level.
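
(For reference, those counters come from the XFS stats. A minimal
sketch of pulling the log force/sleep numbers out of /proc/fs/xfs/stat
is below; the assumed "log" line layout - writes, blocks, noiclogs,
force, force_sleep - should be checked against the xfs stats
documentation for your kernel.)

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/fs/xfs/stat", "r");
	char line[256];
	unsigned long writes, blocks, noiclogs, force, force_sleep;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* assumed layout: log writes blocks noiclogs force force_sleep */
		if (sscanf(line, "log %lu %lu %lu %lu %lu", &writes, &blocks,
			   &noiclogs, &force, &force_sleep) == 5) {
			printf("log writes %lu, iclog stalls %lu, forces %lu, "
			       "force sleeps %lu\n",
			       writes, noiclogs, force, force_sleep);
			break;
		}
	}
	fclose(f);
	return 0;
}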

> The reason as far as I understand it is that
> xlog_cil_push_work() never actually ends up writing the iclog (I can see
> this in the traces) because it is writing just very small amounts (my
> debugging shows xlog_cil_push_work() tends to add 300-1000 bytes to iclog,
> 4000 bytes is the largest number I've seen) and very frequent fsync(2)
> calls from dbench always end up forcing iclog before it gets filled. So the
> cache flushes issued by xlog_cil_push_work() are just pointless overhead
> for this workload AFAIU.

It's not quite that straightforward.

Keep in mind that the block layer is supposed to merge new flush
requests that occur while there is still a flush in progress. Hence
this async flush shouldn't cause extra flush requests to physically
occur unless you have storage that either ignores flush requests (in
which case we don't care because bio_submit() aborts real quick) or
is really, really fast, so that cache flush requests complete before
we start hitting the block layer merge case or slowing down other IO.
If storage is slow and there's any amount of concurrency, then we're
going to be waiting on merged flush requests in the block layer, so
the impact is fairly well bound there, too.

Hence cache flush latency is only going to impact on very
low concurrency workloads where any additional wait time directly
translates to reduced throughput. That's pretty much what your
numbers indicate, too.

> Is there a way we could help this? I had some idea like call
> xfs_flush_bdev_async() only once we find enough items while walking the
> cil->xc_cil list that we think iclog write is likely.

That might be possible, but I don't see that a list walk can
determine this effectively. I've been putting off trying to optimise
this stuff because the infrastructure needed to make decisions like
this efficiently is still backed up waiting on merge.

We need the xlog_write() rework merged, because this patch:

https://lore.kernel.org/linux-xfs/20220309052937.2696447-1-david@fromorbit.com/T/#mf335766c6c17dbf9c438ed30fa0b7e15d355a6be

provides the information we need to estimate how many iclogs we are
likely to need, and hence whether to issue a sync or an async flush.

The resultant rewrite of xlog_write() also gives us a callback for
the first iclog in any write to set state on that iclog appropriate
for the current context. That can easily be used to set the
XLOG_ICL_NEED_FLUSH flag if an async flush has not been completed.
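
i.e. something along the lines of the sketch below. This is only a
hand-waving illustration - the callback name and the completion check
are hypothetical - but the flag manipulation mirrors the existing
XLOG_ICL_NEED_FLUSH handling under l_icloglock:

/*
 * Hypothetical first-iclog callback: if the async bdev flush hasn't
 * completed by the time we write the first iclog of the checkpoint,
 * have that iclog issue the cache flush itself.
 */
static void
xlog_cil_first_iclog_cb(
	struct xlog		*log,
	struct xlog_in_core	*iclog,
	struct completion	*bdev_flush)
{
	if (completion_done(bdev_flush))
		return;

	spin_lock(&log->l_icloglock);
	iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
	spin_unlock(&log->l_icloglock);
}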

> This should still
> submit it rather early to provide the latency advantage. Otherwise postpone
> the flush to the moment we know we are going to flush the iclog to save
> pointless flushes. But we would have to record whether the flush happened
> or not in the iclog and it would all get a bit hairy...

I think we can just set the NEED_FLUSH flag appropriately.

However, given all this, I'm wondering if the async cache flush was
really a case of premature optimisation. That is, we don't really
gain anything by reducing the flush latency of the first iclog write
when we are writing 100-1000 iclogs before the commit record, and it
can be harmful to some workloads by issuing more flushes than we
need to.

So perhaps the right thing to do is just get rid of it and always
mark the first iclog in a checkpoint as NEED_FLUSH....

> I'm definitely not
> an expert in XFS logging code so that's why I'm just writing here my
> current findings if people have some ideas.

Thanks for running the tests and doing some investigation, Jan. We
should be able to mitigate some of this impact, so let me run a
couple of experiments here and I'll get back to you.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Regression in XFS for fsync heavy workload
  2022-03-16  1:06 ` Dave Chinner
@ 2022-03-16  7:44   ` Dave Chinner
  2022-03-16 10:09     ` Jan Kara
  2022-03-16  9:54   ` Jan Kara
  1 sibling, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2022-03-16  7:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-xfs, Darrick J. Wong, Dave Chinner

On Wed, Mar 16, 2022 at 12:06:27PM +1100, Dave Chinner wrote:
> On Tue, Mar 15, 2022 at 01:49:43PM +0100, Jan Kara wrote:
> > Hello,
> > 
> > I was tracking down a regression in dbench workload on XFS we have
> > identified during our performance testing. These are results from one of
> > our test machine (server with 64GB of RAM, 48 CPUs, SATA SSD for the test
> > disk):
> > 
> > 			       good		       bad
> > Amean     1        64.29 (   0.00%)       73.11 * -13.70%*
> > Amean     2        84.71 (   0.00%)       98.05 * -15.75%*
> > Amean     4       146.97 (   0.00%)      148.29 *  -0.90%*
> > Amean     8       252.94 (   0.00%)      254.91 *  -0.78%*
> > Amean     16      454.79 (   0.00%)      456.70 *  -0.42%*
> > Amean     32      858.84 (   0.00%)      857.74 (   0.13%)
> > Amean     64     1828.72 (   0.00%)     1865.99 *  -2.04%*
> > 
> > Note that the numbers are actually times to complete workload, not
> > traditional dbench throughput numbers so lower is better.
....

> > This should still
> > submit it rather early to provide the latency advantage. Otherwise postpone
> > the flush to the moment we know we are going to flush the iclog to save
> > pointless flushes. But we would have to record whether the flush happened
> > or not in the iclog and it would all get a bit hairy...
> 
> I think we can just set the NEED_FLUSH flag appropriately.
> 
> However, given all this, I'm wondering if the async cache flush was
> really a case of premature optimisation. That is, we don't really
> gain anything by reducing the flush latency of the first iclog write
> wehn we are writing 100-1000 iclogs before the commit record, and it
> can be harmful to some workloads by issuing more flushes than we
> need to.
> 
> So perhaps the right thing to do is just get rid of it and always
> mark the first iclog in a checkpoint as NEED_FLUSH....

So I've run some tests on code that does this, and the storage I've
tested it on shows largely no difference in streaming CIL commit and
fsync-heavy workloads when comparing sync vs async cache flushes. One
set of tests was against high speed NVMe SSDs, the other against
old, slower SATA SSDs.

Jan, can you run the patch below (against 5.17-rc8) and see what
results you get on your modified dbench test?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: drop async cache flushes from CIL commits.

From: Dave Chinner <dchinner@redhat.com>

As discussed here:

https://lore.kernel.org/linux-xfs/20220316010627.GO3927073@dread.disaster.area/T/#t

This is a prototype for removing async cache flushes from the CIL
checkpoint path. Results on fast NVMe storage first:

From `dbench -t 30`, current TOT:

clients		async			sync
		BW	Latency		BW	Latency
1		 767.14   0.858		 792.10   0.812
8		2231.18   5.607		2263.24  10.985
16		3023.25   5.779		2999.16   7.087
32		3712.80  11.468		3649.19   9.967
128		5997.98  13.719		6973.84  12.620
512		4256.29 104.050		5089.61  97.548

From `dbench -t 30`, CIL scale:

clients		async			sync
		BW	Latency		BW	Latency
1		 935.18   0.855		 915.64   0.903
8		2404.51   6.873		2341.77   6.511
16		3003.42   6.460		2931.57   6.529
32		3697.23   7.939		3596.28   7.894
128		7237.43  15.495		7217.74  11.588
512		5079.24  90.587		5167.08  95.822

fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize

	create		chown		unlink
async   1m41s		1m16s		2m03s
sync	1m40s		1m19s		1m54s

async log iops: up to 17kiops
async log b/w: up to 570MB/s

sync log iops: up to 18kiops
sync log b/w: up to 600MB/s

Ok, so there's really no difference from async flushes on really
fast storage.

Slower storage:

From `dbench -t 30`, CIL scale:

clients		async			sync
		BW	Latency		BW	Latency
1		  78.59  15.792		  83.78  10.729
8		 367.88  92.067		 404.63  59.943
16		 564.51  72.524		 602.71  76.089
32		 831.66 105.984		 870.26 110.482
128		1659.76 102.969		1624.73  91.356
512		2135.91 223.054		2603.07 161.160

fsmark, 16 threads, create w/32k logbsize

	create		unlink
async   5m06s		4m15s
sync	5m00s		4m22s


Mostly no change here, either. Possibly a bit better fsync overload
behaviour with sync flushes.

I think we can probably just get rid of async flushes altogether for
the moment. It looked necessary when developing the code, but seems
to be complexity we don't actually need now that it's settled down a
bit and all the bugs have been flushed out.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 83a039762b81..14746253805b 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -705,11 +705,21 @@ xlog_cil_set_ctx_write_state(
 		 * The LSN we need to pass to the log items on transaction
 		 * commit is the LSN reported by the first log vector write, not
 		 * the commit lsn. If we use the commit record lsn then we can
-		 * move the tail beyond the grant write head.
+		 * move the grant write head beyond the tail LSN and overwrite
+		 * it.
 		 */
 		ctx->start_lsn = lsn;
 		wake_up_all(&cil->xc_start_wait);
 		spin_unlock(&cil->xc_push_lock);
+
+		/*
+		 * Make sure the metadata we are about to overwrite in the log
+		 * has been flushed to stable storage before this iclog is
+		 * issued.
+		 */
+		spin_lock(&cil->xc_log->l_icloglock);
+		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
+		spin_unlock(&cil->xc_log->l_icloglock);
 		return;
 	}
 
@@ -888,10 +898,10 @@ xlog_cil_push_work(
 	struct xfs_trans_header thdr;
 	struct xfs_log_iovec	lhdr;
 	struct xfs_log_vec	lvhdr = { NULL };
-	xfs_lsn_t		preflush_tail_lsn;
+//	xfs_lsn_t		preflush_tail_lsn;
 	xfs_csn_t		push_seq;
 	struct bio		bio;
-	DECLARE_COMPLETION_ONSTACK(bdev_flush);
+//	DECLARE_COMPLETION_ONSTACK(bdev_flush);
 	bool			push_commit_stable;
 
 	new_ctx = xlog_cil_ctx_alloc();
@@ -974,9 +984,9 @@ xlog_cil_push_work(
 	 * before the iclog write. To detect whether the log tail moves, sample
 	 * the tail LSN *before* we issue the flush.
 	 */
-	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
-	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
-				&bdev_flush);
+//	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
+//	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
+//				&bdev_flush);
 
 	/*
 	 * Pull all the log vectors off the items in the CIL, and remove the
@@ -1058,7 +1068,7 @@ xlog_cil_push_work(
 	 * Before we format and submit the first iclog, we have to ensure that
 	 * the metadata writeback ordering cache flush is complete.
 	 */
-	wait_for_completion(&bdev_flush);
+//	wait_for_completion(&bdev_flush);
 
 	error = xlog_cil_write_chain(ctx, &lvhdr);
 	if (error)
@@ -1118,7 +1128,8 @@ xlog_cil_push_work(
 	if (push_commit_stable &&
 	    ctx->commit_iclog->ic_state == XLOG_STATE_ACTIVE)
 		xlog_state_switch_iclogs(log, ctx->commit_iclog, 0);
-	xlog_state_release_iclog(log, ctx->commit_iclog, preflush_tail_lsn);
+//	xlog_state_release_iclog(log, ctx->commit_iclog, preflush_tail_lsn);
+	xlog_state_release_iclog(log, ctx->commit_iclog, 0);
 
 	/* Not safe to reference ctx now! */
 

* Re: Regression in XFS for fsync heavy workload
  2022-03-16  1:06 ` Dave Chinner
  2022-03-16  7:44   ` Dave Chinner
@ 2022-03-16  9:54   ` Jan Kara
  2022-03-16 23:38     ` Dave Chinner
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2022-03-16  9:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Darrick J. Wong, Dave Chinner

On Wed 16-03-22 12:06:27, Dave Chinner wrote:
> On Tue, Mar 15, 2022 at 01:49:43PM +0100, Jan Kara wrote:
> > Hello,
> > 
> > I was tracking down a regression in dbench workload on XFS we have
> > identified during our performance testing. These are results from one of
> > our test machine (server with 64GB of RAM, 48 CPUs, SATA SSD for the test
> > disk):
> > 
> > 			       good		       bad
> > Amean     1        64.29 (   0.00%)       73.11 * -13.70%*
> > Amean     2        84.71 (   0.00%)       98.05 * -15.75%*
> > Amean     4       146.97 (   0.00%)      148.29 *  -0.90%*
> > Amean     8       252.94 (   0.00%)      254.91 *  -0.78%*
> > Amean     16      454.79 (   0.00%)      456.70 *  -0.42%*
> > Amean     32      858.84 (   0.00%)      857.74 (   0.13%)
> > Amean     64     1828.72 (   0.00%)     1865.99 *  -2.04%*
> > 
> > Note that the numbers are actually times to complete workload, not
> > traditional dbench throughput numbers so lower is better.
> 
> How does this work? Dbench is a fixed time workload - the only
> variability from run to run is the time it takes to run the cleanup
> phase. Is this some hacked around version of dbench?

Yes, dbench is a fixed-time workload, but in fact there is a workload file
that gets executed in a loop. We run a modified version of dbench
(https://github.com/mcgrof/dbench) which also reports the time spent by
each execution of the workload file (the --show-execute-time option). That
has much better statistical properties than the throughput dbench normally
reports (which is, for example, completely ignorant of metadata operations,
and that leads to big fluctuations in the reported numbers, especially for
high client counts).

> > Eventually I have
> > tracked down the problem to commit bad77c375e8d ("xfs: CIL checkpoint
> > flushes caches unconditionally").  Before this commit we submit ~63k cache
> > flush requests during the dbench run, after this commit we submit ~150k
> > cache flush requests. And the additional cache flushes are coming from
> > xlog_cil_push_work().
> 
> Yup, that was a known trade-off with this series of changes to the
> REQ_FLUSH/REQ_FUA behaviour of the CIL - fsync heavy workloads could
> see higher cache flush counts, and low concurrency rates would see
> that as increased runtime because there isn't the journal write load
> needed to compeltely mitigate the impact of more frequent cache
> flushes.
> 
> The trade-off is, OTOH, that other workloads see 10-1000x lower cache
> flush rates when streaming large CIL checkpoints through the log as
> fast as it can issue sequential IO to the disk. e.g. a 32MB CIL
> threshold (>256MB journal) with 32KB iclogbuf size (default) on an
> async workload will write ~1000 iclogs to flush a 32MB CIL
> checkpoint. Prior to this series, every one of those iclog writes
> would be issued as REQ_FLUSH|REQ_FUA. After this patch series,
> we do a REQ_FLUSH from xfs_flush_bdev_async(), the first iclog is
> REQ_FUA, and the last iclog is REQ_FLUSH|REQ_FUA. IOW, we reduce
> the number of FUA + cache flushes by 3 orders or magnitude for these
> cases....
> 
> There's a heap more detail of the changes and the complexity in
> the commit message for the commit two patches further along in
> that series where these benefits are realised - see commit
> eef983ffeae7 ("xfs: journal IO cache flush reductions") for more
> details.
> 
> When doing this work, I didn't count cache flushes. What I looked at
> was the number of log forces vs the number of sleeps waiting on log
> forces vs log writes vs the number of stalls waiting for log writes.
> These numbers showed improvements across the board, so any increase
> in overhead from physical cache flushes was not reflected in the
> throughput increases I was measuring at the "fsync drives log
> forces" level.

Thanks for the detailed explanation! I'd just note that e.g. for a machine
with 8 CPUs, 32GB of RAM and an Intel SSD behind a megaraid_sas controller
(it is some Dell PowerEdge server) we see even larger regressions:

                    good                      bad
Amean 	1	97.93	( 0.00%)	135.67	( -38.54%)
Amean 	2	147.69	( 0.00%)	194.82	( -31.91%)
Amean 	4	242.82	( 0.00%)	352.98	( -45.36%)
Amean 	8	375.36	( 0.00%)	591.03	( -57.45%)

I didn't investigate on this machine (it was busy with some other tests
and I had another machine in my hands which also showed a regression,
although a smaller one) but now, reading your explanations, I'm curious
why the regression grows with the number of threads on that machine. Maybe
the culprit is different there, or the dynamics just aren't what we
imagine on that storage controller... I guess I'll borrow the machine and
check it.

> > The reason as far as I understand it is that
> > xlog_cil_push_work() never actually ends up writing the iclog (I can see
> > this in the traces) because it is writing just very small amounts (my
> > debugging shows xlog_cil_push_work() tends to add 300-1000 bytes to iclog,
> > 4000 bytes is the largest number I've seen) and very frequent fsync(2)
> > calls from dbench always end up forcing iclog before it gets filled. So the
> > cache flushes issued by xlog_cil_push_work() are just pointless overhead
> > for this workload AFAIU.
> 
> It's not quite that straight forward.
> 
> Keep in mind that the block layer is supposed to merge new flush
> requests that occur while there is still a flush in progress. Hence
> this async flush shouldn't cause extra flush requests to physically
> occur unless you have storage that either ignores flush requests (in
> which case we don't care because bio_submit() aborts real quick) or
> is really, really fast, so that cache flush requests complete before
> we start hitting the block layer merge case or slowing down other IO.
> If storage is slow and there's any amount of concurrency, then we're
> going to be waiting on merged flush requests in the block layer, so
> the impact is fairly well bound there, too.
>
> Hence cache flush latency is only going to impact on very
> low concurrency workloads where any additional wait time directly
> translates to reduced throughput. That's pretty much what your
> numbers indicate, too.

Yes, for higher thread counts I agree flush merging should mitigate the
impact. But note that there is still some overhead from the additional
flushes, because the block layer will merge only with flushes that are
queued and not yet issued to the device. If there is a flush in progress,
a new flush will be queued and will get submitted once the first one
completes. It is only the third flush that gets merged into the second
one.
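
To spell out the sequence I mean (an illustrative timeline, assuming
the device processes one flush at a time):

	t0: flush A issued to the device        (in flight)
	t1: flush B arrives -> queued behind A  (no merge with in-flight A)
	t2: flush C arrives -> merged into the still-queued B
	t3: A completes     -> B (+C) issued

So the second flush still pays a full device round trip; only from the
third flush onwards does merging start to amortise the cost.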

> > This should still
> > submit it rather early to provide the latency advantage. Otherwise postpone
> > the flush to the moment we know we are going to flush the iclog to save
> > pointless flushes. But we would have to record whether the flush happened
> > or not in the iclog and it would all get a bit hairy...
> 
> I think we can just set the NEED_FLUSH flag appropriately.
> 
> However, given all this, I'm wondering if the async cache flush was
> really a case of premature optimisation. That is, we don't really
> gain anything by reducing the flush latency of the first iclog write
> wehn we are writing 100-1000 iclogs before the commit record, and it
> can be harmful to some workloads by issuing more flushes than we
> need to.
> 
> So perhaps the right thing to do is just get rid of it and always
> mark the first iclog in a checkpoint as NEED_FLUSH....
> 
> > I'm definitely not
> > an expert in XFS logging code so that's why I'm just writing here my
> > current findings if people have some ideas.
> 
> Thanks for running the tests and doing some investigation, Jan. We
> should be able to mitigate some of this impact, so let me run a
> couple of experiments here and I'll get back to you.

Cool, thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: Regression in XFS for fsync heavy workload
  2022-03-16  7:44   ` Dave Chinner
@ 2022-03-16 10:09     ` Jan Kara
  2022-03-16 19:38       ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2022-03-16 10:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Darrick J. Wong, Dave Chinner

On Wed 16-03-22 18:44:59, Dave Chinner wrote:
> On Wed, Mar 16, 2022 at 12:06:27PM +1100, Dave Chinner wrote:
> > On Tue, Mar 15, 2022 at 01:49:43PM +0100, Jan Kara wrote:
> > > Hello,
> > > 
> > > I was tracking down a regression in dbench workload on XFS we have
> > > identified during our performance testing. These are results from one of
> > > our test machine (server with 64GB of RAM, 48 CPUs, SATA SSD for the test
> > > disk):
> > > 
> > > 			       good		       bad
> > > Amean     1        64.29 (   0.00%)       73.11 * -13.70%*
> > > Amean     2        84.71 (   0.00%)       98.05 * -15.75%*
> > > Amean     4       146.97 (   0.00%)      148.29 *  -0.90%*
> > > Amean     8       252.94 (   0.00%)      254.91 *  -0.78%*
> > > Amean     16      454.79 (   0.00%)      456.70 *  -0.42%*
> > > Amean     32      858.84 (   0.00%)      857.74 (   0.13%)
> > > Amean     64     1828.72 (   0.00%)     1865.99 *  -2.04%*
> > > 
> > > Note that the numbers are actually times to complete workload, not
> > > traditional dbench throughput numbers so lower is better.
> ....
> 
> > > This should still
> > > submit it rather early to provide the latency advantage. Otherwise postpone
> > > the flush to the moment we know we are going to flush the iclog to save
> > > pointless flushes. But we would have to record whether the flush happened
> > > or not in the iclog and it would all get a bit hairy...
> > 
> > I think we can just set the NEED_FLUSH flag appropriately.
> > 
> > However, given all this, I'm wondering if the async cache flush was
> > really a case of premature optimisation. That is, we don't really
> > gain anything by reducing the flush latency of the first iclog write
> > wehn we are writing 100-1000 iclogs before the commit record, and it
> > can be harmful to some workloads by issuing more flushes than we
> > need to.
> > 
> > So perhaps the right thing to do is just get rid of it and always
> > mark the first iclog in a checkpoint as NEED_FLUSH....
> 
> So I've run some tests on code that does this, and the storage I've
> tested it on shows largely no difference in stream CIL commit and
> fsync heavy workloads when comparing synv vs as cache flushes. On
> set of tests was against high speed NVMe ssds, the other against
> old, slower SATA SSDs.
> 
> Jan, can you run the patch below (against 5.17-rc8) and see what
> results you get on your modified dbench test?

Sure, I'll run the test. I forgot to mention that in a vanilla upstream
kernel I could see the difference in the number of cache flushes caused by
the XFS changes but no actual change in the dbench numbers (they were still
comparable to the bad ones from my test). The XFS change made a material
difference to dbench performance only together with the scheduler /
cpuidle / frequency scaling fixes we have in our SLE kernel (I didn't try
to pin down which exactly - I guess I can try working around that by using
the performance cpufreq governor and disabling low C-states so that I can
test stock vanilla kernels). Thanks for the patch!

								Honza

> xfs: drop async cache flushes from CIL commits.
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> As discussed here:
> 
> https://lore.kernel.org/linux-xfs/20220316010627.GO3927073@dread.disaster.area/T/#t
> 
> This is a prototype for removing async cache flushes from the CIL
> checkpoint path. Fast NVME storage.
> 
> From `dbench -t 30`, current TOT:
> 
> clients		async			sync
> 		BW	Latency		BW	Latency
> 1		 767.14   0.858		 792.10   0.812
> 8		2231.18   5.607		2263.24  10.985
> 16		3023.25   5.779		2999.16   7.087
> 32		3712.80  11.468		3649.19   9.967
> 128		5997.98  13.719		6973.84  12.620
> 512		4256.29 104.050		5089.61  97.548
> 
> From `dbench -t 30`, CIL scale:
> 
> clients		async			sync
> 		BW	Latency		BW	Latency
> 1		 935.18   0.855		 915.64   0.903
> 8		2404.51   6.873		2341.77   6.511
> 16		3003.42   6.460		2931.57   6.529
> 32		3697.23   7.939		3596.28   7.894
> 128		7237.43  15.495		7217.74  11.588
> 512		5079.24  90.587		5167.08  95.822
> 
> fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
> 
> 	create		chown		unlink
> async   1m41s		1m16s		2m03s
> sync	1m40s		1m19s		1m54s
> 
> async log iops: up to 17kiops
> async log b/w: up to 570MB/s
> 
> sync log iops: up to 18kiops
> sync log b/w: up to 600MB/s
> 
> Ok, so there's really no difference from async flushes on really
> fast storage.
> 
> Slower storage:
> 
> From `dbench -t 30`, CIL scale:
> 
> clients		async			sync
> 		BW	Latency		BW	Latency
> 1		  78.59  15.792		  83.78  10.729
> 8		 367.88  92.067		 404.63  59.943
> 16		 564.51  72.524		 602.71  76.089
> 32		 831.66 105.984		 870.26 110.482
> 128		1659.76 102.969		1624.73  91.356
> 512		2135.91 223.054		2603.07 161.160
> 
> fsmark, 16 threads, create w/32k logbsize
> 
> 	create		unlink
> async   5m06s		4m15s
> sync	5m00s		4m22s
> 
> 
> Mostly no change here, either. Possibly a bit better fsync overload
> behaviour with sync flushes.
> 
> I think we can probably just get rid of async flushes altogether for
> the moment. It looked necessary when developing the code, but seems
> to be complexity we don't actually need now that it's settled down a
> bit and all the bugs have been flushed out.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 83a039762b81..14746253805b 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -705,11 +705,21 @@ xlog_cil_set_ctx_write_state(
>  		 * The LSN we need to pass to the log items on transaction
>  		 * commit is the LSN reported by the first log vector write, not
>  		 * the commit lsn. If we use the commit record lsn then we can
> -		 * move the tail beyond the grant write head.
> +		 * move the grant write head beyond the tail LSN and overwrite
> +		 * it.
>  		 */
>  		ctx->start_lsn = lsn;
>  		wake_up_all(&cil->xc_start_wait);
>  		spin_unlock(&cil->xc_push_lock);
> +
> +		/*
> +		 * Make sure the metadata we are about to overwrite in the log
> +		 * has been flushed to stable storage before this iclog is
> +		 * issued.
> +		 */
> +		spin_lock(&cil->xc_log->l_icloglock);
> +		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> +		spin_unlock(&cil->xc_log->l_icloglock);
>  		return;
>  	}
>  
> @@ -888,10 +898,10 @@ xlog_cil_push_work(
>  	struct xfs_trans_header thdr;
>  	struct xfs_log_iovec	lhdr;
>  	struct xfs_log_vec	lvhdr = { NULL };
> -	xfs_lsn_t		preflush_tail_lsn;
> +//	xfs_lsn_t		preflush_tail_lsn;
>  	xfs_csn_t		push_seq;
>  	struct bio		bio;
> -	DECLARE_COMPLETION_ONSTACK(bdev_flush);
> +//	DECLARE_COMPLETION_ONSTACK(bdev_flush);
>  	bool			push_commit_stable;
>  
>  	new_ctx = xlog_cil_ctx_alloc();
> @@ -974,9 +984,9 @@ xlog_cil_push_work(
>  	 * before the iclog write. To detect whether the log tail moves, sample
>  	 * the tail LSN *before* we issue the flush.
>  	 */
> -	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
> -	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
> -				&bdev_flush);
> +//	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
> +//	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
> +//				&bdev_flush);
>  
>  	/*
>  	 * Pull all the log vectors off the items in the CIL, and remove the
> @@ -1058,7 +1068,7 @@ xlog_cil_push_work(
>  	 * Before we format and submit the first iclog, we have to ensure that
>  	 * the metadata writeback ordering cache flush is complete.
>  	 */
> -	wait_for_completion(&bdev_flush);
> +//	wait_for_completion(&bdev_flush);
>  
>  	error = xlog_cil_write_chain(ctx, &lvhdr);
>  	if (error)
> @@ -1118,7 +1128,8 @@ xlog_cil_push_work(
>  	if (push_commit_stable &&
>  	    ctx->commit_iclog->ic_state == XLOG_STATE_ACTIVE)
>  		xlog_state_switch_iclogs(log, ctx->commit_iclog, 0);
> -	xlog_state_release_iclog(log, ctx->commit_iclog, preflush_tail_lsn);
> +//	xlog_state_release_iclog(log, ctx->commit_iclog, preflush_tail_lsn);
> +	xlog_state_release_iclog(log, ctx->commit_iclog, 0);
>  
>  	/* Not safe to reference ctx now! */
>  
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: Regression in XFS for fsync heavy workload
  2022-03-16 10:09     ` Jan Kara
@ 2022-03-16 19:38       ` Jan Kara
  2022-03-16 23:26         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2022-03-16 19:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Darrick J. Wong, Dave Chinner

On Wed 16-03-22 11:09:34, Jan Kara wrote:
> On Wed 16-03-22 18:44:59, Dave Chinner wrote:
> > On Wed, Mar 16, 2022 at 12:06:27PM +1100, Dave Chinner wrote:
> > > On Tue, Mar 15, 2022 at 01:49:43PM +0100, Jan Kara wrote:
> > > > Hello,
> > > > 
> > > > I was tracking down a regression in dbench workload on XFS we have
> > > > identified during our performance testing. These are results from one of
> > > > our test machine (server with 64GB of RAM, 48 CPUs, SATA SSD for the test
> > > > disk):
> > > > 
> > > > 			       good		       bad
> > > > Amean     1        64.29 (   0.00%)       73.11 * -13.70%*
> > > > Amean     2        84.71 (   0.00%)       98.05 * -15.75%*
> > > > Amean     4       146.97 (   0.00%)      148.29 *  -0.90%*
> > > > Amean     8       252.94 (   0.00%)      254.91 *  -0.78%*
> > > > Amean     16      454.79 (   0.00%)      456.70 *  -0.42%*
> > > > Amean     32      858.84 (   0.00%)      857.74 (   0.13%)
> > > > Amean     64     1828.72 (   0.00%)     1865.99 *  -2.04%*
> > > > 
> > > > Note that the numbers are actually times to complete workload, not
> > > > traditional dbench throughput numbers so lower is better.
> > ....
> > 
> > > > This should still
> > > > submit it rather early to provide the latency advantage. Otherwise postpone
> > > > the flush to the moment we know we are going to flush the iclog to save
> > > > pointless flushes. But we would have to record whether the flush happened
> > > > or not in the iclog and it would all get a bit hairy...
> > > 
> > > I think we can just set the NEED_FLUSH flag appropriately.
> > > 
> > > However, given all this, I'm wondering if the async cache flush was
> > > really a case of premature optimisation. That is, we don't really
> > > gain anything by reducing the flush latency of the first iclog write
> > > wehn we are writing 100-1000 iclogs before the commit record, and it
> > > can be harmful to some workloads by issuing more flushes than we
> > > need to.
> > > 
> > > So perhaps the right thing to do is just get rid of it and always
> > > mark the first iclog in a checkpoint as NEED_FLUSH....
> > 
> > So I've run some tests on code that does this, and the storage I've
> > tested it on shows largely no difference in stream CIL commit and
> > fsync heavy workloads when comparing synv vs as cache flushes. On
> > set of tests was against high speed NVMe ssds, the other against
> > old, slower SATA SSDs.
> > 
> > Jan, can you run the patch below (against 5.17-rc8) and see what
> > results you get on your modified dbench test?
> 
> Sure, I'll run the test. I forgot to mention that in vanilla upstream kernel
> I could see the difference in the number of cache flushes caused by the
> XFS changes but not actual change in dbench numbers (they were still
> comparable to the bad ones from my test). The XFS change made material
> difference to dbench performance only together with scheduler / cpuidling /
> frequency scaling fixes we have in our SLE kernel (I didn't try to pin down
> which exactly - I guess I can try working around that by using performance
> cpufreq governor and disabling low cstates so that I can test stock
> vanilla kernels). Thanks for the patch!

Yup, so with limited C-states and the performance cpufreq governor I can
see that your patch helps dbench performance significantly:

                   5.17-rc8-vanilla       5.17-rc8-patched
Amean     1        71.22 (   0.00%)       64.94 *   8.81%*
Amean     2        93.03 (   0.00%)       84.80 *   8.85%*
Amean     4       150.54 (   0.00%)      137.51 *   8.66%*
Amean     8       252.53 (   0.00%)      242.24 *   4.08%*
Amean     16      454.13 (   0.00%)      439.08 *   3.31%*
Amean     32      835.24 (   0.00%)      829.74 *   0.66%*
Amean     64     1740.59 (   0.00%)     1686.73 *   3.09%*

The performance is restored to the values from before commit bad77c375e8d
("xfs: CIL checkpoint flushes caches unconditionally"), and so is the
number of flushes.


								Honza

> > xfs: drop async cache flushes from CIL commits.
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > As discussed here:
> > 
> > https://lore.kernel.org/linux-xfs/20220316010627.GO3927073@dread.disaster.area/T/#t
> > 
> > This is a prototype for removing async cache flushes from the CIL
> > checkpoint path. Fast NVME storage.
> > 
> > From `dbench -t 30`, current TOT:
> > 
> > clients		async			sync
> > 		BW	Latency		BW	Latency
> > 1		 767.14   0.858		 792.10   0.812
> > 8		2231.18   5.607		2263.24  10.985
> > 16		3023.25   5.779		2999.16   7.087
> > 32		3712.80  11.468		3649.19   9.967
> > 128		5997.98  13.719		6973.84  12.620
> > 512		4256.29 104.050		5089.61  97.548
> > 
> > From `dbench -t 30`, CIL scale:
> > 
> > clients		async			sync
> > 		BW	Latency		BW	Latency
> > 1		 935.18   0.855		 915.64   0.903
> > 8		2404.51   6.873		2341.77   6.511
> > 16		3003.42   6.460		2931.57   6.529
> > 32		3697.23   7.939		3596.28   7.894
> > 128		7237.43  15.495		7217.74  11.588
> > 512		5079.24  90.587		5167.08  95.822
> > 
> > fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
> > 
> > 	create		chown		unlink
> > async   1m41s		1m16s		2m03s
> > sync	1m40s		1m19s		1m54s
> > 
> > async log iops: up to 17kiops
> > async log b/w: up to 570MB/s
> > 
> > sync log iops: up to 18kiops
> > sync log b/w: up to 600MB/s
> > 
> > Ok, so there's really no difference from async flushes on really
> > fast storage.
> > 
> > Slower storage:
> > 
> > From `dbench -t 30`, CIL scale:
> > 
> > clients		async			sync
> > 		BW	Latency		BW	Latency
> > 1		  78.59  15.792		  83.78  10.729
> > 8		 367.88  92.067		 404.63  59.943
> > 16		 564.51  72.524		 602.71  76.089
> > 32		 831.66 105.984		 870.26 110.482
> > 128		1659.76 102.969		1624.73  91.356
> > 512		2135.91 223.054		2603.07 161.160
> > 
> > fsmark, 16 threads, create w/32k logbsize
> > 
> > 	create		unlink
> > async   5m06s		4m15s
> > sync	5m00s		4m22s
> > 
> > 
> > Mostly no change here, either. Possibly a bit better fsync overload
> > behaviour with sync flushes.
> > 
> > I think we can probably just get rid of async flushes altogether for
> > the moment. It looked necessary when developing the code, but seems
> > to be complexity we don't actually need now that it's settled down a
> > bit and all the bugs have been flushed out.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c | 27 +++++++++++++++++++--------
> >  1 file changed, 19 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 83a039762b81..14746253805b 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -705,11 +705,21 @@ xlog_cil_set_ctx_write_state(
> >  		 * The LSN we need to pass to the log items on transaction
> >  		 * commit is the LSN reported by the first log vector write, not
> >  		 * the commit lsn. If we use the commit record lsn then we can
> > -		 * move the tail beyond the grant write head.
> > +		 * move the grant write head beyond the tail LSN and overwrite
> > +		 * it.
> >  		 */
> >  		ctx->start_lsn = lsn;
> >  		wake_up_all(&cil->xc_start_wait);
> >  		spin_unlock(&cil->xc_push_lock);
> > +
> > +		/*
> > +		 * Make sure the metadata we are about to overwrite in the log
> > +		 * has been flushed to stable storage before this iclog is
> > +		 * issued.
> > +		 */
> > +		spin_lock(&cil->xc_log->l_icloglock);
> > +		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
> > +		spin_unlock(&cil->xc_log->l_icloglock);
> >  		return;
> >  	}
> >  
> > @@ -888,10 +898,10 @@ xlog_cil_push_work(
> >  	struct xfs_trans_header thdr;
> >  	struct xfs_log_iovec	lhdr;
> >  	struct xfs_log_vec	lvhdr = { NULL };
> > -	xfs_lsn_t		preflush_tail_lsn;
> > +//	xfs_lsn_t		preflush_tail_lsn;
> >  	xfs_csn_t		push_seq;
> >  	struct bio		bio;
> > -	DECLARE_COMPLETION_ONSTACK(bdev_flush);
> > +//	DECLARE_COMPLETION_ONSTACK(bdev_flush);
> >  	bool			push_commit_stable;
> >  
> >  	new_ctx = xlog_cil_ctx_alloc();
> > @@ -974,9 +984,9 @@ xlog_cil_push_work(
> >  	 * before the iclog write. To detect whether the log tail moves, sample
> >  	 * the tail LSN *before* we issue the flush.
> >  	 */
> > -	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
> > -	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
> > -				&bdev_flush);
> > +//	preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);
> > +//	xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
> > +//				&bdev_flush);
> >  
> >  	/*
> >  	 * Pull all the log vectors off the items in the CIL, and remove the
> > @@ -1058,7 +1068,7 @@ xlog_cil_push_work(
> >  	 * Before we format and submit the first iclog, we have to ensure that
> >  	 * the metadata writeback ordering cache flush is complete.
> >  	 */
> > -	wait_for_completion(&bdev_flush);
> > +//	wait_for_completion(&bdev_flush);
> >  
> >  	error = xlog_cil_write_chain(ctx, &lvhdr);
> >  	if (error)
> > @@ -1118,7 +1128,8 @@ xlog_cil_push_work(
> >  	if (push_commit_stable &&
> >  	    ctx->commit_iclog->ic_state == XLOG_STATE_ACTIVE)
> >  		xlog_state_switch_iclogs(log, ctx->commit_iclog, 0);
> > -	xlog_state_release_iclog(log, ctx->commit_iclog, preflush_tail_lsn);
> > +//	xlog_state_release_iclog(log, ctx->commit_iclog, preflush_tail_lsn);
> > +	xlog_state_release_iclog(log, ctx->commit_iclog, 0);
> >  
> >  	/* Not safe to reference ctx now! */
> >  
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: Regression in XFS for fsync heavy workload
  2022-03-16 19:38       ` Jan Kara
@ 2022-03-16 23:26         ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Chinner @ 2022-03-16 23:26 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-xfs, Darrick J. Wong, Dave Chinner

On Wed, Mar 16, 2022 at 08:38:40PM +0100, Jan Kara wrote:
> On Wed 16-03-22 11:09:34, Jan Kara wrote:
> > On Wed 16-03-22 18:44:59, Dave Chinner wrote:
> > > On Wed, Mar 16, 2022 at 12:06:27PM +1100, Dave Chinner wrote:
> > > > On Tue, Mar 15, 2022 at 01:49:43PM +0100, Jan Kara wrote:
> > > > > Hello,
> > > > > 
> > > > > I was tracking down a regression in dbench workload on XFS we have
> > > > > identified during our performance testing. These are results from one of
> > > > > our test machine (server with 64GB of RAM, 48 CPUs, SATA SSD for the test
> > > > > disk):
> > > > > 
> > > > > 			       good		       bad
> > > > > Amean     1        64.29 (   0.00%)       73.11 * -13.70%*
> > > > > Amean     2        84.71 (   0.00%)       98.05 * -15.75%*
> > > > > Amean     4       146.97 (   0.00%)      148.29 *  -0.90%*
> > > > > Amean     8       252.94 (   0.00%)      254.91 *  -0.78%*
> > > > > Amean     16      454.79 (   0.00%)      456.70 *  -0.42%*
> > > > > Amean     32      858.84 (   0.00%)      857.74 (   0.13%)
> > > > > Amean     64     1828.72 (   0.00%)     1865.99 *  -2.04%*
> > > > > 
> > > > > Note that the numbers are actually times to complete workload, not
> > > > > traditional dbench throughput numbers so lower is better.
> > > ....
> > > 
> > > > > This should still
> > > > > submit it rather early to provide the latency advantage. Otherwise postpone
> > > > > the flush to the moment we know we are going to flush the iclog to save
> > > > > pointless flushes. But we would have to record whether the flush happened
> > > > > or not in the iclog and it would all get a bit hairy...
> > > > 
> > > > I think we can just set the NEED_FLUSH flag appropriately.
> > > > 
> > > > However, given all this, I'm wondering if the async cache flush was
> > > > really a case of premature optimisation. That is, we don't really
> > > > gain anything by reducing the flush latency of the first iclog write
> > > > wehn we are writing 100-1000 iclogs before the commit record, and it
> > > > can be harmful to some workloads by issuing more flushes than we
> > > > need to.
> > > > 
> > > > So perhaps the right thing to do is just get rid of it and always
> > > > mark the first iclog in a checkpoint as NEED_FLUSH....
> > > 
> > > So I've run some tests on code that does this, and the storage I've
> > > tested it on shows largely no difference in stream CIL commit and
> > > fsync heavy workloads when comparing synv vs as cache flushes. On
> > > set of tests was against high speed NVMe ssds, the other against
> > > old, slower SATA SSDs.
> > > 
> > > Jan, can you run the patch below (against 5.17-rc8) and see what
> > > results you get on your modified dbench test?
> > 
> > Sure, I'll run the test. I forgot to mention that in vanilla upstream kernel
> > I could see the difference in the number of cache flushes caused by the
> > XFS changes but not actual change in dbench numbers (they were still
> > comparable to the bad ones from my test). The XFS change made material
> > difference to dbench performance only together with scheduler / cpuidling /
> > frequency scaling fixes we have in our SLE kernel (I didn't try to pin down
> > which exactly - I guess I can try working around that by using performance
> > cpufreq governor and disabling low cstates so that I can test stock
> > vanilla kernels). Thanks for the patch!
> 
> Yup, so with limiting cstates and performance cpufreq governor I can see
> your patch helps significantly the dbench performance:
> 
>                    5.17-rc8-vanilla       5.17-rc8-patched
> Amean     1        71.22 (   0.00%)       64.94 *   8.81%*
> Amean     2        93.03 (   0.00%)       84.80 *   8.85%*
> Amean     4       150.54 (   0.00%)      137.51 *   8.66%*
> Amean     8       252.53 (   0.00%)      242.24 *   4.08%*
> Amean     16      454.13 (   0.00%)      439.08 *   3.31%*
> Amean     32      835.24 (   0.00%)      829.74 *   0.66%*
> Amean     64     1740.59 (   0.00%)     1686.73 *   3.09%*
> 
> The performance is restored to values before commit bad77c375e8d ("xfs: CIL
> checkpoint flushes caches unconditionally") as well as the number of
> flushes.

OK, good to know, thanks for testing quickly. I'll spin this up into
a proper patch that removes the async flush functionality and
support infrastructure.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Regression in XFS for fsync heavy workload
  2022-03-16  9:54   ` Jan Kara
@ 2022-03-16 23:38     ` Dave Chinner
  2022-03-17 11:56       ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2022-03-16 23:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-xfs, Darrick J. Wong, Dave Chinner

On Wed, Mar 16, 2022 at 10:54:37AM +0100, Jan Kara wrote:
> On Wed 16-03-22 12:06:27, Dave Chinner wrote:
> > On Tue, Mar 15, 2022 at 01:49:43PM +0100, Jan Kara wrote:
> > > Hello,
> > > 
> > > I was tracking down a regression in dbench workload on XFS we have
> > > identified during our performance testing. These are results from one of
> > > our test machine (server with 64GB of RAM, 48 CPUs, SATA SSD for the test
> > > disk):
> > > 
> > > 			       good		       bad
> > > Amean     1        64.29 (   0.00%)       73.11 * -13.70%*
> > > Amean     2        84.71 (   0.00%)       98.05 * -15.75%*
> > > Amean     4       146.97 (   0.00%)      148.29 *  -0.90%*
> > > Amean     8       252.94 (   0.00%)      254.91 *  -0.78%*
> > > Amean     16      454.79 (   0.00%)      456.70 *  -0.42%*
> > > Amean     32      858.84 (   0.00%)      857.74 (   0.13%)
> > > Amean     64     1828.72 (   0.00%)     1865.99 *  -2.04%*
> > > 
> > > Note that the numbers are actually times to complete workload, not
> > > traditional dbench throughput numbers so lower is better.
> > 
> > How does this work? Dbench is a fixed time workload - the only
> > variability from run to run is the time it takes to run the cleanup
> > phase. Is this some hacked around version of dbench?
> 
> Yes, dbench is a fixed time workload but in fact there is the workload file
> that gets executed in a loop. We run a modified version of dbench
> (https://github.com/mcgrof/dbench) which also reports time spent by each
> execution of the workload file (--show-execute-time option), which has much
> better statistical properties than throughput reported by dbench normally
> (which is for example completely ignorant of metadata operations and that
> leads to big fluctuations in reported numbers especially for high client
> counts).

The high client count fluctuations are actually meaningful and very
enlightening if you know why the fluctuations are occurring. :)

e.g. at 280-300 clients on a maximally sized XFS log we run out of
log reservation space and fall off the lockless fast path. At this
point throughput is determined by metadata IO throughput and
transaction reservation latency, not page cache write IO throughput.
IOWs, variations in performance directly reflect the latency impact
of full cycle metadata operations, not just journal fsync
throughput.

At 512 clients, the page cache footprint of dbench is about 10GB.
Hence somewhere around 700-800 clients on a 16GB RAM machine the
workload will no longer fit in the page cache, and so now memory
reclaim and page cache repopulation affect measured throughput.

So, yeah, I tend to run bandwidth measurement up to very high client
counts because it gives much more insight into full subsystem cycle
behaviour than just running the "does fsync scale" aspect that low
client count testing exercises....
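
(Back-of-the-envelope version of that working set estimate, purely
illustrative:)

#include <stdio.h>

int main(void)
{
	double footprint_gb = 10;	/* dbench page cache footprint at 512 clients */
	double per_client_mb = footprint_gb * 1024 / 512;	/* ~20MB per client */
	double ram_gb = 16;

	printf("~%.0fMB/client -> page cache on a %.0fGB machine overflows at ~%.0f clients\n",
	       per_client_mb, ram_gb, ram_gb * 1024 / per_client_mb);
	return 0;
}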

> > When doing this work, I didn't count cache flushes. What I looked at
> > was the number of log forces vs the number of sleeps waiting on log
> > forces vs log writes vs the number of stalls waiting for log writes.
> > These numbers showed improvements across the board, so any increase
> > in overhead from physical cache flushes was not reflected in the
> > throughput increases I was measuring at the "fsync drives log
> > forces" level.
> 
> Thanks for detailed explanation! I'd just note that e.g. for a machine with
> 8 CPUs, 32 GB of Ram and Intel SSD behind a megaraid_sas controller (it is
> some Dell PowerEdge server) we see even larger regressions like:
> 
>                     good                      bad
> Amean 	1	97.93	( 0.00%)	135.67	( -38.54%)
> Amean 	2	147.69	( 0.00%)	194.82	( -31.91%)
> Amean 	4	242.82	( 0.00%)	352.98	( -45.36%)
> Amean 	8	375.36	( 0.00%)	591.03	( -57.45%)
> 
> I didn't investigate on this machine (it was doing some other tests and I
> had another machine in my hands which also showed some, although smaller,
> regression) but now reading your explanations I'm curious why the
> regression grows with number of threads on that machine. Maybe the culprit
> is different there or just the dynamics isn't as we imagine it on that
> storage controller... I guess I'll borrow the machine and check it.

That sounds more like a poor caching implementation in the hardware
RAID controller than anything else.

> > > The reason as far as I understand it is that
> > > xlog_cil_push_work() never actually ends up writing the iclog (I can see
> > > this in the traces) because it is writing just very small amounts (my
> > > debugging shows xlog_cil_push_work() tends to add 300-1000 bytes to iclog,
> > > 4000 bytes is the largest number I've seen) and very frequent fsync(2)
> > > calls from dbench always end up forcing iclog before it gets filled. So the
> > > cache flushes issued by xlog_cil_push_work() are just pointless overhead
> > > for this workload AFAIU.
> > 
> > It's not quite that straight forward.
> > 
> > Keep in mind that the block layer is supposed to merge new flush
> > requests that occur while there is still a flush in progress. Hence
> > this async flush shouldn't cause extra flush requests to physically
> > occur unless you have storage that either ignores flush requests (in
> > which case we don't care because bio_submit() aborts real quick) or
> > is really, really fast, so that cache flush requests complete before
> > we start hitting the block layer merge case or slowing down other IO.
> > If storage is slow and there's any amount of concurrency, then we're
> > going to be waiting on merged flush requests in the block layer, so
> > the impact is fairly well bound there, too.
> >
> > Hence cache flush latency is only going to impact on very
> > low concurrency workloads where any additional wait time directly
> > translates to reduced throughput. That's pretty much what your
> > numbers indicate, too.
> 
> Yes, for higher thread counts I agree flush merging should mitigate the
> impact. But note that there is still some overhead of additional flushes
> because the block layer will merge only with flushes that are queued and
> not yet issued to the device. If there is flush in progress, new flush will
> be queued and will get submitted once the first one completes. It is only
> the third flush that gets merged to the second one.

Yup, and in this workload we'll generally have 4 concurrent CIL push
works being run, so we're over that threshold most of the time on
heavy concurrent fsync workloads.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: Regression in XFS for fsync heavy workload
  2022-03-16 23:38     ` Dave Chinner
@ 2022-03-17 11:56       ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2022-03-17 11:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, linux-xfs, Darrick J. Wong, Dave Chinner

On Thu 17-03-22 10:38:28, Dave Chinner wrote:
> On Wed, Mar 16, 2022 at 10:54:37AM +0100, Jan Kara wrote:
> > On Wed 16-03-22 12:06:27, Dave Chinner wrote:
> > > When doing this work, I didn't count cache flushes. What I looked at
> > > was the number of log forces vs the number of sleeps waiting on log
> > > forces vs log writes vs the number of stalls waiting for log writes.
> > > These numbers showed improvements across the board, so any increase
> > > in overhead from physical cache flushes was not reflected in the
> > > throughput increases I was measuring at the "fsync drives log
> > > forces" level.
> > 
> > Thanks for detailed explanation! I'd just note that e.g. for a machine with
> > 8 CPUs, 32 GB of Ram and Intel SSD behind a megaraid_sas controller (it is
> > some Dell PowerEdge server) we see even larger regressions like:
> > 
> >                     good                      bad
> > Amean 	1	97.93	( 0.00%)	135.67	( -38.54%)
> > Amean 	2	147.69	( 0.00%)	194.82	( -31.91%)
> > Amean 	4	242.82	( 0.00%)	352.98	( -45.36%)
> > Amean 	8	375.36	( 0.00%)	591.03	( -57.45%)
> > 
> > I didn't investigate on this machine (it was doing some other tests and I
> > had another machine in my hands which also showed some, although smaller,
> > regression) but now reading your explanations I'm curious why the
> > regression grows with number of threads on that machine. Maybe the culprit
> > is different there or just the dynamics isn't as we imagine it on that
> > storage controller... I guess I'll borrow the machine and check it.
> 
> That sounds more like a poor caching implementation in the hardware
> RAID controller than anything else.

Likely. I did a run with your patch on this machine now and the original
performance was restored.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
