From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>, linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: defer online discard submission to a workqueue
Date: Thu, 8 Nov 2018 08:50:02 -0500
Message-ID: <20181108135001.GA2796@bfoster>
In-Reply-To: <20181108003806.GA19305@dastard>

On Thu, Nov 08, 2018 at 11:38:06AM +1100, Dave Chinner wrote:
> On Wed, Nov 07, 2018 at 08:42:24AM -0500, Brian Foster wrote:
> > On Wed, Nov 07, 2018 at 08:18:02AM +1100, Dave Chinner wrote:
> > > On Tue, Nov 06, 2018 at 09:23:11AM -0500, Brian Foster wrote:
> > > > On Mon, Nov 05, 2018 at 01:51:39PM -0800, Christoph Hellwig wrote:
> > > > > On Mon, Nov 05, 2018 at 01:10:21PM -0500, Brian Foster wrote:
> > > > > > When online discard is enabled, discards of busy extents are
> > > > > > submitted asynchronously as a bio chain. bio completion and
> > > > > > resulting busy extent cleanup is deferred to a workqueue. Async
> > > > > > discard submission is intended to avoid blocking log forces on a
> > > > > > full discard sequence which can take a noticeable amount of time in
> > > > > > some cases.
> > > > > > 
> > > > > > We've had reports of this still producing log force stalls with XFS
> > > > > > on VDO,
> > > > > 
> > > > > Please fix this in VDO instead.  We should not work around out of
> > > > > tree code making stupid decisions.
> > > > 
> > > > I assume the "stupid decision" refers to sync discard execution. I'm not
> > > > familiar with the internals of VDO; this is just what I was told.
> > > 
> > > IMO, what VDO does is irrelevant - any call to submit_bio() can
> > > block if the request queue is full. Hence if we've drowned the queue
> > > in discards and the device is slow at discards, then we are going to
> > > block submitting discards.
> > > 
> > > > My
> > > > understanding is that these discards can stack up and take enough time
> > > > that a limit on outstanding discards is required, which, now that I think
> > > > of it, makes me somewhat skeptical of the whole serial execution thing.
> > > > Hitting that outstanding discard request limit is what bubbles up the
> > > > stack and affects XFS by holding up log forces, since new discard
> > > > submissions are presumably blocked on completion of the oldest
> > > > outstanding request.
> > > 
> > > Exactly.
> > > 
> > > > I'm not quite sure what happens in the block layer if that limit were
> > > > lifted. Perhaps it assumes throttling responsibility directly via
> > > > queues/plugs? I'd guess that at minimum we'd end up blocking indirectly
> > > > somewhere (via memory allocation pressure?) anyways, so ISTM that some
> > > > kind of throttling is inevitable in this situation. What am I missing?
> > > 
> > > We still need to throttle discards - they have to play nice with all
> > > the other IO we need to dispatch concurrently.
> > > 
> > > I have two issues with the proposed patch:
> > > 
> > > 1. it puts both discard dispatch and completion processing on the
> > > one work queue, so if the queue is filled with dispatch requests,
> > > IO completion queuing gets blocked. That's not the best thing to be
> > > doing.
> > > 
> > 
> > This is an unbound workqueue with max_active == 0. AIUI, that means we
> > can have something like 256 execution contexts (worker threads?) per
> > cpu.
> 
> .....
>         WQ_MAX_ACTIVE           = 512,    /* I like 512, better ideas? */
>         WQ_MAX_UNBOUND_PER_CPU  = 4,      /* 4 * #cpus for unbound wq */
>         WQ_DFL_ACTIVE           = WQ_MAX_ACTIVE / 2,
> };
> 
> /* unbound wq's aren't per-cpu, scale max_active according to #cpus */
> #define WQ_UNBOUND_MAX_ACTIVE   \
>         max_t(int, WQ_MAX_ACTIVE, num_possible_cpus() * WQ_MAX_UNBOUND_PER_CPU)
> 
> 
> IOWs, unbound queues are not per-cpu and they are execution limited
> to max(512, NR_CPUS * 4) kworker threads.
> 
> The default (for max_active = 0), however, is still WQ_DFL_ACTIVE so
> the total number of active workers for the unbound xfs_discard_wq
> is 256.
> 

Sounds about right..
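
For reference, here's a minimal sketch of where that 256 comes from,
assuming the discard workqueue is still created roughly the way I
remember from xfs_super.c (the init helper name below is made up):

#include <linux/errno.h>
#include <linux/workqueue.h>

/*
 * Sketch: an unbound workqueue created with max_active == 0 falls back
 * to the workqueue default, WQ_DFL_ACTIVE (WQ_MAX_ACTIVE / 2 == 256),
 * so at most 256 discard work items execute concurrently.
 */
struct workqueue_struct *xfs_discard_wq;

static int xfs_init_discard_wq(void)		/* hypothetical helper */
{
	/* max_active == 0 means "use the default", i.e. WQ_DFL_ACTIVE */
	xfs_discard_wq = alloc_workqueue("xfsdiscard", WQ_UNBOUND, 0);
	if (!xfs_discard_wq)
		return -ENOMEM;
	return 0;
}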

> 
> > Given that, plus the batching that occurs in XFS due to delayed
> > logging and discard bio chaining, that seems rather unlikely. Unless I'm
> > misunderstanding the mechanism, I think that means filling the queue as
> > such and blocking discard submission basically consumes one of those
> > contexts.
> 
> Yes, but consider the situation where we've got a slow discard
> device and we're removing a file with millions of extents. We've got
> to issue millions of discard ops in this case. Because dispatch
> queueing is not bound, we're easily going to overflow the discard
> workqueue because the freeing transactions will run far ahead of the
> discard operations. Sooner or later we consume all 256 discard_wq
> worker threads with blocked discard submission.  Then both log IO
> completion and discard IO completion will block on workqueue
> submission and we deadlock because discard completion can't
> run....
> 
> > 
> > Of course, the CIL context structure appears to be technically unbound
> 
> I'm missing something here - What bit of that structure is unbound?
> 

I mean to say that the ctx is not a fixed/limited resource in the
context of a discard submission workqueue. There's no throttling, so we
can essentially discard as many busy extent lists as the user can
generate (which is basically the same point as your million-extent file
example above).
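
To illustrate the lack of throttling I mean, here's a rough sketch of
the submission-side deferral as I read the patch; the struct and helper
names below are invented for illustration, not the actual patch code:

#include <linux/slab.h>
#include <linux/workqueue.h>

extern struct workqueue_struct *xfs_discard_wq;

struct xfs_cil_ctx;			/* real definition: fs/xfs/xfs_log_priv.h */

/* hypothetical container: one submission job per CIL context */
struct xfs_discard_job {
	struct work_struct	work;
	struct xfs_cil_ctx	*ctx;
};

static void xfs_discard_submit_worker(struct work_struct *work)
{
	struct xfs_discard_job *job =
		container_of(work, struct xfs_discard_job, work);

	/* walk the ctx's busy extent list and issue discards; may block */
	kfree(job);
}

/* called from log I/O completion instead of submitting the bio chain there */
static void xfs_defer_discard(struct xfs_cil_ctx *ctx)
{
	struct xfs_discard_job *job;

	job = kmalloc(sizeof(*job), GFP_NOFS);
	if (!job)
		return;			/* sketch: error handling elided */

	job->ctx = ctx;
	INIT_WORK(&job->work, xfs_discard_submit_worker);

	/*
	 * No backpressure here: every checkpoint that frees extents
	 * queues another job, however far ahead of the device we get.
	 */
	queue_work(xfs_discard_wq, &job->work);
}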

> > as well and it's trivial to just add a separate submission workqueue,
> > but I'd like to at least make sure we're on the same page as to the need
> > (so it can be documented clearly as well).
> 
> A separate submission queue doesn't really solve the log IO completion
> blocking problem. Yes, it solves the discard completion deadlock,
> but we eventually end up in the same place on sustained discard
> workloads with submission queuing blocking on a full work queue.
> 

Well, for the purposes of this patch it's fine that we still block on
discard submission. The point is just that we do so in a separate
context from log I/O completion and thus avoid associated log force
stalls.

> Workqueues are not the way to solve unbound queue depth problems.
> That's what kernel threads are for; e.g. this is the reason the
> xfsaild is a kernel thread, not a work queue. The amount of
> writeback work queued on the AIL can be hundreds of thousands of
> objects, way more than a workqueue can handle. This discard problem
> is no different - concurrent dispatch through kworker threads buys
> us nothing - we just fill the request queue from hundreds of threads
> instead of filling it from just one.
> 

Fair, but the point wasn't really to buy any kind of high level discard
submission improvement over the current scheme. The goal was to avoid
log force stalls on a time consuming operation that has no explicit
serialization requirement with respect to log force completion...

> The workqueue approach has other problems, too, like dispatch across
> worker threads means discard is not FIFO scheduled - it's completely
> random as to the order in which discards get fed to the device
> request queue. Hence discards can be starved because whenever the
> worker thread runs to process its queue it finds the device request
> queue already full and blocks again.
> 

... but Ok. I can see that the worst case of creating and blocking
hundreds of worker threads on discard submission kind of creates a mess
even without considering the resulting effect on the discard operations
themselves. There's no real point or need for all of those threads.

> Having a single kernel thread that walks the discard queue on each
> context and then each context in sequence order gives us FIFO
> dispatch of discard requests. It would block only on full request
> queues, giving us a much more predictable log-force-to-completion
> latency. It allows for the possibility of merging discards across
> multiple CIL contexts, to directly control the rate of discard, and
> to skip small discards or even -turn off discard- when the backlog
> gets too great.
> 
> The workqueue approach just doesn't allow anything like this to be
> done because every discard context is kept separate from every other
> context and there is no coordination at all between them.
> 

The high level kthread approach sounds like a nice idea to me. I'm not
sure that it's technically necessary to address this specific problem
though. I.e., it would be nice if we could still separate enhancement
from bug fix.
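
Purely to make sure we're picturing the same shape of thing, here's a
rough sketch of that single-dispatcher model. Every name below is
invented, and it ignores startup/shutdown, locking subtleties, error
handling, etc.:

#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* hypothetical: one discard batch per CIL context, queued in FIFO order */
struct discard_batch {
	struct list_head	list;
	/* ... plus the busy extent list for that checkpoint ... */
};

static LIST_HEAD(discard_fifo);
static DEFINE_SPINLOCK(discard_fifo_lock);
static DECLARE_WAIT_QUEUE_HEAD(discard_wait);

static int xfs_discardd(void *data)		/* invented thread name */
{
	while (!kthread_should_stop()) {
		struct discard_batch *batch = NULL;

		spin_lock(&discard_fifo_lock);
		if (!list_empty(&discard_fifo)) {
			batch = list_first_entry(&discard_fifo,
						 struct discard_batch, list);
			list_del(&batch->list);
		}
		spin_unlock(&discard_fifo_lock);

		if (!batch) {
			wait_event_interruptible(discard_wait,
					!list_empty(&discard_fifo) ||
					kthread_should_stop());
			continue;
		}

		/*
		 * Submit this batch's discards here. Only this thread
		 * ever blocks on a full device request queue, so
		 * dispatch stays FIFO in checkpoint order.
		 */
		kfree(batch);
	}
	return 0;
}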

> > > 2. log forces no longer wait for discards to be dispatched - they
> > > just queue them. This means the mechanism xfs_extent_busy_flush()
> > > uses to dispatch pending discards (synchronous log force) can return
> > > before discards have even been dispatched to disk. Hence we can
> > > expect to see longer wait and tail latencies when busy extents are
> > > encountered by the allocator. Whether this is a problem or not needs
> > > further investigation.
> > > 
> > 
> > Firstly, I think latency is kind of moot in the problematic case. The
> > queue is already drowning in requests that presumably are going to take
> > minutes to complete. In that case, the overhead of kicking a workqueue
> > to do the submission is probably negligible.
> 
> Yes, the overhead of kicking the queue is negligible. That's not the
> problem though. By queuing discards rather than submitting them we
> go from a single FIFO dispatch model (by in-order iclog IO
> completion) to a concurrent, uncoordinated dispatch model.  It's the
> loss of FIFO behaviour because the synchronous log force no longer
> controls dispatch order that leads to unpredictable and long tail
> latencies in dispatch completion, hence causing the problems for the
> processes now waiting on specific extent discard completion rather
> than just the log force.
> 

I think I see what you're getting at here: discard submission via log
I/O completion means that submission is serialized (with respect to
other iclogs) and in log buf order, since that is enforced by the log
callback invocation code. Since the callbacks execute one at a time,
blocking in submission likely means the same, single log I/O completion
execution context ends up iterating over and processing the other
iclogs with pending callbacks once the current submit blockage clears.
Yes?

If I'm following that correctly, couldn't we provide very similar
behavior by using a separate "ordered" workqueue for submission? The
ordered wq sets max_active to 1, which means we still have a single
submitter, yet we still create a queue that disconnects submit from log
force completion. I suppose we could also allocate a separate structure
from xfs_cil_ctx to track pending discard jobs and slim down memory
usage closer to something that might be required for a thread-based
implementation.
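
E.g., something like the following as a minimal sketch of the ordered
wq variant (the wq name and init helper are invented; flags omitted for
brevity):

#include <linux/errno.h>
#include <linux/workqueue.h>

/*
 * Sketch: an ordered workqueue executes at most one work item at a
 * time, in queueing order, so discard submission remains single
 * threaded and FIFO in checkpoint order, while log I/O completion only
 * queues work and never blocks on submission itself.
 */
static struct workqueue_struct *xfs_discard_submit_wq;	/* invented name */

static int xfs_init_discard_submit_wq(void)
{
	xfs_discard_submit_wq =
		alloc_ordered_workqueue("xfs-discard-submit", 0);
	if (!xfs_discard_submit_wq)
		return -ENOMEM;
	return 0;
}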

From there, we can still explore additional enhancements incrementally,
such as pushing some of the busy extent processing/sorting into the
workqueue, eventually converting the wq-based mechanism into a
thread-based one, etc. Thoughts?

Brian

> In some cases they'll get woken faster (don't have to wait for
> discards to be dispatched), but it is equally likely they'll have to
> wait for much, much longer. In essence, the async dispatch by
> workqueue model removes all assumptions we've made about the
> predictability of discard completion latency. FIFO is predictable,
> concurrent async dispatch by workqueue is completely unpredictable.
> 
> If we really want to do fully async dispatch of discards, I think we
> need to use a controllable "single dispatch by kernel thread" model
> like the AIL, not use workqueues and spray the dispatch in an
> uncoordinated, uncontrollable manner across hundreds of kernel
> threads....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
