Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push

From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 2/2] xfs: Throttle commits on delayed background CIL push
Date: Tue, 8 Oct 2019 09:22:40 -0400	[thread overview]
Message-ID: <20191008132240.GA3520@bfoster> (raw)
In-Reply-To: <20191008025157.GE16973@dread.disaster.area>

On Tue, Oct 08, 2019 at 01:51:57PM +1100, Dave Chinner wrote:
> On Fri, Oct 04, 2019 at 07:50:01AM -0400, Brian Foster wrote:
> > On Fri, Oct 04, 2019 at 12:27:55PM +1000, Dave Chinner wrote:
> > > On Thu, Oct 03, 2019 at 10:41:14AM -0400, Brian Foster wrote:
> > > > Hmm, I'm also not sure the lockless reservation algorithm is totally
> > > > immune to increased concurrency in this regard. What prevents multiple
> > > > tasks from racing through xlog_grant_head_check() and blowing past the
> > > > log head, for example?
> > > 
> > > Nothing. Debug kernels even emit a "xlog_verify_grant_tail: space >
> > > BBTOB(tail_blocks)" messages when that happens. It's pretty
> > > difficult to do this in real world conditions, even when there is
> > > lots of concurrency being used.
> > > 
> > 
> > Hm, Ok. Though I've seen that alert enough times that I
> > (unintentionally) ignore it at this point, so it can't be that hard to
> > reproduce. ;) That is usually during fstests however, and not a typical
> > workload that I recall.
> 
> I can't say I've seen it for a long time now - I want to say "years"
> but I may well have simply missed it on the rare occasion it has
> occurred and fstests hasn't captured it. i.e. fstests is supposed to
> capture unusual things like this appearing in dmesg during a
> test....
> 
> > Of course, there's a difference between
> > reproducing the basic condition and taking it to the point where it
> > manifests into a problem.
> 
> *nod*
> 
> > > But here's the rub: it's not actually the end of the world because
> > > the reservation doesn't actually determine how much of the log is
> > > currently being used by running transactions - the reservation is
> > > for a maximal rolling iteration of a permanent transaction, not the
> > > initial transaction will be running. Hence if we overrun
> > > occassionally we don't immediately run out of log space and corrupt
> > > the log.
> > > 
> > 
> > Ok, that much is evident from the amount of time this mechanism has been
> > in place without any notable issues.
> > 
> > > Yes, if none of the rolling transactions complete and they all need
> > > to use their entire reservation, and the tail of the log cannot be
> > > moved forward because it is pinned by one of the transactions that
> > > is running, then we'll likely get a log hang on a regrant on the
> > > write head. But if any of the transactions don't use all of their
> > > reservation, then the overrun gets soaked up by the unused parts of
> > > the transactions that are completed and returned to reservation
> > > head, and nobody even notices taht there was a temporary overrun of
> > > the grant head space.
> > > 
> > 
> > Ok, I didn't expect this to be some catastrophic problem or really a
> > problem with your patch simply based on the lifetime of the code and how
> > the grant heads are actually used. I was going to suggest an assert or
> > something to detect whether batching behavior as a side effect of the
> > commit throttle would ever increase likelihood of this situation, but it
> > looks like the grant verify function somewhat serves that purpose
> > already.
> 
> Yeah - xlog_verify_grant_tail() will the report reservation
> overruns, but the serious log space problems (i.e. head overwritting
> the tail) are detected by xlog_verify_tail_lsn() when we stamp the
> tail_lsn into the current iclog header. That's still done under the
> icloglock and the AIL lock, so the comparison of the tail with the
> current log head is still completely serialised.
> 
> > I'd _prefer_ to see something, at least in DEBUG mode, that indicates
> > the frequency of the fundamental incorrect accounting condition as
> > opposed to just the side effect of blowing the tail (because the latter
> > depends on other difficult to reproduce factors), but I'd have to think
> > about that some more as it would need to balance against normal/expected
> > execution flow. Thanks for the background.
> 
> You can test that just by removing the XLOG_TAIL_WARN flag setting,
> then it will warn on every reservation overrun rather than just the
> first.
> 

That's not quite what I'm after. That just removes the oneshot nature of
the existing check. The current debug check looks for a side effect of
the current algorithm in the form of overrunning the tail. What I would
like to see, somehow or another, is something that provides earlier and
more useful information on the frequency/scale of occurrence where
multiple reservations have been made based on the same grant baseline.

This is not so much an error check so not something that should be an
assert or warning, but rather more of a metric that provides a base for
comparison whenever we might have code changes or workloads that
potentially change this behavior. For example, consider a debug mode
stat or sysfs file that could be used to dump out a counter of "same
value" grant head samples after a benchmark workload or fstests run. The
fact that this value might go up or down is not necessarily a problem.
But that would provide some debug mode data to identify and analyze
potentially unintended side effects like transient spikes in concurrent
grant head checks due to blocking/scheduling changes in log reservation
tasks.

See the appended RFC for a quick idea of what I mean. This is slightly
racy, but I think conservative enough to provide valid values with
respect to already racy reservation implementation. On my slow 4xcpu vm,
I see occasional sample counts of 2 running a -p8 fsstress. If I stick a
smallish delay in xfs_log_reserve(), the frequency of the output
increases and I see occasional bumps to 3, a spike of 8 in one case,
etc. Of course I'd probably factor the atomic calls into DEBUG only
inline functions that can be compiled out on production kernels and
replace the tracepoint with a global counter (exported via stats or
sysfs knob), but this just illustrates the idea. The global counter
could also be replaced with (or kept in addition to) something that
tracks a max concurrency value if that is more useful. Any thoughts on
something like this?

> > > Hence occasional overruns on the reservation head before they start
> > > blocking isn't really a problem in practice because the probability
> > > of all the transaction reservation of all transactions running being
> > > required to make forwards progress is extremely small.
> > > 
> > > Basically, we gave up "perfect reservation space grant accounting"
> > > because performance was extremely important and risk of log hangs as
> > > a result of overruns was considered to be extremely low and worth
> > > taking for the benefits the algorithm provided. This was just a
> > > simple, pragmatic risk based engineering decision.
> > > 
> > 
> > FWIW, the comment for xlog_verify_tail() also suggests the potential for
> > false positives and references a panic tag, which all seems kind of
> > erratic and misleading compared to what you explain here.
> 
> Well, it's fundamentally an unserialised check, so it can race with
> other reservation grants, commits that release grant space and tail
> lsn updates. Hence it's not a 100% reliable debug check.
> 

Right. Yet there is a panic knob...

> It also used to be run at all times, not just under
> XFS_CONFIG_DEBUG=y, which is why it has a panic tag associated with
> it. When we first deployed it, we weren't 100% sure there weren't
> customer workloads that would trip over this and hang the log, so
> we gave ourselves a way of triggering kernel dumps the instant an
> overrun was detected. Hence a site that had log hangs with this
> message in the logs could turn on the panic tag and we'd get a
> kernel dump to analyse...
> 

Ok, makes sense. This kind of speaks to the earlier point around having
more useful data. While this isn't necessarily a problem right now, we
have no real data to tell us whether some particular code change alters
this behavior. If this was enough of a concern when the change was first
put in place to insert a panic hook, then it stands to reason it's
something we should at least be cognizant of when making changes that
could impact its behavior.

> Since then, this code has been relegated to debug code but the panic
> tag still exists. It could be turned back into a ASSERT now, but
> it's still useful the way it is as it means debug kernels don't fall
> over the moment a spurious overrun occurs...
> 

Yeah, ISTM the panic bit could be removed at this point. The warning (as
opposed to an assert) is probably reasonable so long as the check itself
is racy so as to not cause false positive panics with fatal asserts
enabled.

Brian

--- 8< ---

RFC: crude concurrency stat on reserve grant head

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index a2beee9f74da..1d3056176e6e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -175,6 +175,7 @@ xlog_grant_head_init(
 	xlog_assign_grant_head(&head->grant, 1, 0);
 	INIT_LIST_HEAD(&head->waiters);
 	spin_lock_init(&head->lock);
+	atomic64_set(&head->sample_cnt, 0);
 }
 
 STATIC void
@@ -446,6 +447,7 @@ xfs_log_reserve(
 	struct xlog_ticket	*tic;
 	int			need_bytes;
 	int			error = 0;
+	int64_t			sample_cnt;
 
 	ASSERT(client == XFS_TRANSACTION || client == XFS_LOG);
 
@@ -465,13 +467,19 @@ xfs_log_reserve(
 
 	error = xlog_grant_head_check(log, &log->l_reserve_head, tic,
 				      &need_bytes);
+	atomic64_inc(&log->l_reserve_head.sample_cnt);
 	if (error)
 		goto out_error;
 
+	sample_cnt = atomic64_xchg(&log->l_reserve_head.sample_cnt, 0);
 	xlog_grant_add_space(log, &log->l_reserve_head.grant, need_bytes);
 	xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes);
 	trace_xfs_log_reserve_exit(log, tic);
 	xlog_verify_grant_tail(log);
+
+	if (sample_cnt > 1)
+		trace_printk("%d: %lld\n", __LINE__, sample_cnt);
+
 	return 0;
 
 out_error:
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b880c23cb6e4..62e4949f91c4 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -339,6 +339,7 @@ struct xlog_grant_head {
 	spinlock_t		lock ____cacheline_aligned_in_smp;
 	struct list_head	waiters;
 	atomic64_t		grant;
+	atomic64_t		sample_cnt;
 };
 
 /*