From: Ming Lei <ming.lei@redhat.com>
To: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>,
	linux-block@vger.kernel.org, John Garry <john.garry@huawei.com>,
	Bart Van Assche <bvanassche@acm.org>,
	Hannes Reinecke <hare@suse.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH 6/6] blk-mq: drain I/O when all CPUs in a hctx are offline
Date: Mon, 25 May 2020 17:20:50 +0800	[thread overview]
Message-ID: <20200525092050.GA802845@T590> (raw)
In-Reply-To: <98028672-be06-fb18-8cb3-b4ae1422caf9@suse.de>

On Fri, May 22, 2020 at 11:25:42AM +0200, Hannes Reinecke wrote:
> On 5/20/20 7:06 PM, Christoph Hellwig wrote:
> > From: Ming Lei <ming.lei@redhat.com>
> > 
> > Most blk-mq drivers depend on managed IRQs' auto-affinity to set up
> > the queue mapping. Thomas mentioned the following point[1]:
> > 
> > "That was the constraint of managed interrupts from the very beginning:
> > 
> >   The driver/subsystem has to quiesce the interrupt line and the associated
> >   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> >   until it's restarted by the core when the CPU is plugged in again."
> > 
> > However, the current blk-mq implementation doesn't quiesce the hw queue
> > before the last CPU in the hctx is shut down.  Even worse,
> > CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so
> > there isn't any chance to quiesce the hctx before shutting down the CPU.
> > 
> > Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
> > where the last CPU goes away, and wait for completion of in-flight
> > requests.  This guarantees that there is no inflight I/O before shutting
> > down the managed IRQ.
> > 
> > Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need
> > to wait for completion of in-flight requests from these drivers to avoid
> > a potential deadlock. This is safe for stacking drivers, as they do not
> > use interrupts at all and their I/O completions are triggered by the
> > underlying devices' I/O completions.
> > 
> > [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > [hch: different retry mechanism, merged two patches, minor cleanups]
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >   block/blk-mq-debugfs.c     |   2 +
> >   block/blk-mq-tag.c         |   8 +++
> >   block/blk-mq.c             | 114 ++++++++++++++++++++++++++++++++++++-
> >   drivers/block/loop.c       |   2 +-
> >   drivers/md/dm-rq.c         |   2 +-
> >   include/linux/blk-mq.h     |  10 ++++
> >   include/linux/cpuhotplug.h |   1 +
> >   7 files changed, 135 insertions(+), 4 deletions(-)
> > 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index 96b7a35c898a7..15df3a36e9fa4 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
> >   	HCTX_STATE_NAME(STOPPED),
> >   	HCTX_STATE_NAME(TAG_ACTIVE),
> >   	HCTX_STATE_NAME(SCHED_RESTART),
> > +	HCTX_STATE_NAME(INACTIVE),
> >   };
> >   #undef HCTX_STATE_NAME
> > @@ -239,6 +240,7 @@ static const char *const hctx_flag_name[] = {
> >   	HCTX_FLAG_NAME(TAG_SHARED),
> >   	HCTX_FLAG_NAME(BLOCKING),
> >   	HCTX_FLAG_NAME(NO_SCHED),
> > +	HCTX_FLAG_NAME(STACKING),
> >   };
> >   #undef HCTX_FLAG_NAME
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index c27c6dfc7d36e..3925d1e55bc8f 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
> >   	sbitmap_finish_wait(bt, ws, &wait);
> >   found_tag:
> > +	/*
> > +	 * Give up this allocation if the hctx is inactive.  The caller will
> > +	 * retry on an active hctx.
> > +	 */
> > +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> > +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> > +		return -1;
> > +	}
> >   	return tag + tag_offset;
> >   }
> I really wish we could have a dedicated NO_TAG value; returning
> -1 for an unsigned int always feels dodgy.
> And we could kill all the various internal definitions in drivers/scsi ...
> 
> Hmm?

Good catch.

> 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 42aee2978464b..672c7e3f61243 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -375,14 +375,32 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
> >   			e->type->ops.limit_depth(data->cmd_flags, data);
> >   	}
> > +retry:
> >   	data->ctx = blk_mq_get_ctx(q);
> >   	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
> >   	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
> >   		blk_mq_tag_busy(data->hctx);
> > +	/*
> > +	 * Waiting allocations only fail because of an inactive hctx.  In that
> > +	 * case just retry the hctx assignment and tag allocation as CPU hotplug
> > +	 * should have migrated us to an online CPU by now.
> > +	 */
> >   	tag = blk_mq_get_tag(data);
> > -	if (tag == BLK_MQ_TAG_FAIL)
> > -		return NULL;
> > +	if (tag == BLK_MQ_TAG_FAIL) {
> > +		if (data->flags & BLK_MQ_REQ_NOWAIT)
> > +			return NULL;
> > +
> > +		/*
> > +		 * All kthreads that can perform I/O should have been moved off
> > +		 * this CPU by the time the CPU hotplug state machine has
> > +		 * shut down a hctx.  But better be sure with an extra sanity
> > +		 * check.
> > +		 */
> > +		if (WARN_ON_ONCE(current->flags & PF_KTHREAD))
> > +			return NULL;
> > +		goto retry;
> > +	}
> >   	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
> >   }
> > @@ -2324,6 +2342,86 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> >   	return -ENOMEM;
> >   }
> > +struct rq_iter_data {
> > +	struct blk_mq_hw_ctx *hctx;
> > +	bool has_rq;
> > +};
> > +
> > +static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
> > +{
> > +	struct rq_iter_data *iter_data = data;
> > +
> > +	if (rq->mq_hctx != iter_data->hctx)
> > +		return true;
> > +	iter_data->has_rq = true;
> > +	return false;
> > +}
> > +
> > +static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	struct blk_mq_tags *tags = hctx->sched_tags ?
> > +			hctx->sched_tags : hctx->tags;
> > +	struct rq_iter_data data = {
> > +		.hctx	= hctx,
> > +	};
> > +
> > +	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> > +	return data.has_rq;
> > +}
> > +
> To my reading this would only work reliably if we run the iteration on one
> of the CPUs in the corresponding mask for the hctx.
> Yet I fail to see any provision for this, either here or in the previous
> patch.
> How do you guarantee that?
> Or is there some implicit mechanism which I've missed?

Yeah, it is guaranteed by memory barriers.

[1] tag allocation path

sbitmap_get()	/* smp_mb() is implied */
test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)

[2] cpu is going to offline

set_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)
smp_mb__after_atomic();
blk_mq_hctx_has_requests()
	blk_mq_all_tag_iter
		bt_tags_for_each
			sbitmap_for_each_set

So setting the tag bit and reading BLK_MQ_S_INACTIVE are ordered, and
setting BLK_MQ_S_INACTIVE and reading the tag bit are ordered too.

Then either the request allocation is retried on another online CPU, or
the request is allocated & drained before the CPU goes offline.


Thanks,
Ming



Thread overview: 31+ messages
2020-05-20 17:06 blk-mq: improvement CPU hotplug (simplified version) v3 Christoph Hellwig
2020-05-20 17:06 ` [PATCH 1/6] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
2020-05-20 18:16   ` Bart Van Assche
2020-05-22  9:11   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 2/6] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
2020-05-20 18:22   ` Bart Van Assche
2020-05-22  9:13   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 3/6] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
2020-05-20 20:10   ` Bart Van Assche
2020-05-20 17:06 ` [PATCH 4/6] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
2020-05-22  9:17   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 5/6] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
2020-05-20 20:24   ` Bart Van Assche
2020-05-27  6:05     ` Christoph Hellwig
2020-05-22  9:18   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 6/6] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
2020-05-22  9:25   ` Hannes Reinecke
2020-05-25  9:20     ` Ming Lei [this message]
2020-05-20 21:46 ` blk-mq: improvement CPU hotplug (simplified version) v3 Bart Van Assche
2020-05-21  2:57   ` Ming Lei
2020-05-21  3:50     ` Bart Van Assche
2020-05-21  4:33       ` Ming Lei
2020-05-21 19:15         ` Bart Van Assche
2020-05-22  2:39           ` Ming Lei
2020-05-22 14:47             ` Keith Busch
2020-05-23  3:05               ` Ming Lei
2020-05-23 15:19             ` Bart Van Assche
2020-05-25  4:09               ` Ming Lei
2020-05-25 15:32                 ` Bart Van Assche
2020-05-25 16:38                   ` Keith Busch
2020-05-26  0:37                   ` Ming Lei
