Linux-Block Archive on lore.kernel.org
From: Hannes Reinecke <hare@suse.de>
To: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org, John Garry <john.garry@huawei.com>,
	Bart Van Assche <bvanassche@acm.org>,
	Hannes Reinecke <hare@suse.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ming Lei <ming.lei@redhat.com>
Subject: Re: [PATCH 6/6] blk-mq: drain I/O when all CPUs in a hctx are offline
Date: Fri, 22 May 2020 11:25:42 +0200
Message-ID: <98028672-be06-fb18-8cb3-b4ae1422caf9@suse.de>
In-Reply-To: <20200520170635.2094101-7-hch@lst.de>

On 5/20/20 7:06 PM, Christoph Hellwig wrote:
> From: Ming Lei <ming.lei@redhat.com>
> 
> Most blk-mq drivers depend on managed IRQs' auto-affinity to set up
> the queue mapping. Thomas mentioned the following point [1]:
> 
> "That was the constraint of managed interrupts from the very beginning:
> 
>   The driver/subsystem has to quiesce the interrupt line and the associated
>   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>   until it's restarted by the core when the CPU is plugged in again."
> 
> However, the current blk-mq implementation doesn't quiesce the hw queue
> before the last CPU in the hctx is shut down.  Even worse,
> CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so
> there isn't any chance to quiesce the hctx before shutting down the CPU.
> 
> Add a new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq
> hctxs where the last CPU goes away, and wait for completion of in-flight
> requests.  This guarantees that there is no in-flight I/O before shutting
> down the managed IRQ.
> 
> Add a BLK_MQ_F_STACKING flag and set it for dm-rq and loop, so we don't
> need to wait for completion of in-flight requests from these drivers and
> thus avoid a potential deadlock.  This is safe for stacking drivers, as
> they do not use interrupts at all and their I/O completions are
> triggered by the underlying devices' I/O completions.
> 
> [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> [hch: different retry mechanism, merged two patches, minor cleanups]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-mq-debugfs.c     |   2 +
>   block/blk-mq-tag.c         |   8 +++
>   block/blk-mq.c             | 114 ++++++++++++++++++++++++++++++++++++-
>   drivers/block/loop.c       |   2 +-
>   drivers/md/dm-rq.c         |   2 +-
>   include/linux/blk-mq.h     |  10 ++++
>   include/linux/cpuhotplug.h |   1 +
>   7 files changed, 135 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 96b7a35c898a7..15df3a36e9fa4 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
>   	HCTX_STATE_NAME(STOPPED),
>   	HCTX_STATE_NAME(TAG_ACTIVE),
>   	HCTX_STATE_NAME(SCHED_RESTART),
> +	HCTX_STATE_NAME(INACTIVE),
>   };
>   #undef HCTX_STATE_NAME
>   
> @@ -239,6 +240,7 @@ static const char *const hctx_flag_name[] = {
>   	HCTX_FLAG_NAME(TAG_SHARED),
>   	HCTX_FLAG_NAME(BLOCKING),
>   	HCTX_FLAG_NAME(NO_SCHED),
> +	HCTX_FLAG_NAME(STACKING),
>   };
>   #undef HCTX_FLAG_NAME
>   
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index c27c6dfc7d36e..3925d1e55bc8f 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   	sbitmap_finish_wait(bt, ws, &wait);
>   
>   found_tag:
> +	/*
> +	 * Give up this allocation if the hctx is inactive.  The caller will
> +	 * retry on an active hctx.
> +	 */
> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> +		return -1;
> +	}
>   	return tag + tag_offset;
>   }
>   
I really wish we had a dedicated NO_TAG value; returning
-1 as an unsigned int always feels dodgy.
And we could kill all the various internal definitions in drivers/scsi ...

Hmm?
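
Something along these lines, perhaps (completely untested sketch; the
names BLK_MQ_NO_TAG and blk_mq_tag_is_valid() are made up here):

#define BLK_MQ_NO_TAG	(-1U)

static inline bool blk_mq_tag_is_valid(unsigned int tag)
{
	return tag != BLK_MQ_NO_TAG;
}

Then the bail-out above would read

	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
		return BLK_MQ_NO_TAG;
	}

and drivers could test against one well-known constant instead of
rolling their own SCSI_NO_TAG-style sentinels.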

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 42aee2978464b..672c7e3f61243 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -375,14 +375,32 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
>   			e->type->ops.limit_depth(data->cmd_flags, data);
>   	}
>   
> +retry:
>   	data->ctx = blk_mq_get_ctx(q);
>   	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
>   	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
>   		blk_mq_tag_busy(data->hctx);
>   
> +	/*
> +	 * Waiting allocations only fail because of an inactive hctx.  In that
> +	 * case just retry the hctx assignment and tag allocation as CPU hotplug
> +	 * should have migrated us to an online CPU by now.
> +	 */
>   	tag = blk_mq_get_tag(data);
> -	if (tag == BLK_MQ_TAG_FAIL)
> -		return NULL;
> +	if (tag == BLK_MQ_TAG_FAIL) {
> +		if (data->flags & BLK_MQ_REQ_NOWAIT)
> +			return NULL;
> +
> +		/*
> +		 * All kthreads that can perform I/O should have been moved off
> +		 * this CPU by the time the CPU hotplug state machine has
> +		 * shut down a hctx.  But better be sure with an extra sanity
> +		 * check.
> +		 */
> +		if (WARN_ON_ONCE(current->flags & PF_KTHREAD))
> +			return NULL;
> +		goto retry;
> +	}
>   	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
>   }
>   
> @@ -2324,6 +2342,86 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>   	return -ENOMEM;
>   }
>   
> +struct rq_iter_data {
> +	struct blk_mq_hw_ctx *hctx;
> +	bool has_rq;
> +};
> +
> +static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
> +{
> +	struct rq_iter_data *iter_data = data;
> +
> +	if (rq->mq_hctx != iter_data->hctx)
> +		return true;
> +	iter_data->has_rq = true;
> +	return false;
> +}
> +
> +static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct blk_mq_tags *tags = hctx->sched_tags ?
> +			hctx->sched_tags : hctx->tags;
> +	struct rq_iter_data data = {
> +		.hctx	= hctx,
> +	};
> +
> +	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> +	return data.has_rq;
> +}
> +
To my reading this would only work reliably if we run the iteration on
one of the CPUs in the corresponding mask for the hctx.
Yet I fail to see any provision for that, either here or in the
previous patch.
How do you guarantee it? Or is there some implicit mechanism which
I've missed?
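
What I would have naively expected is something along these lines
(completely untested sketch; blk_mq_hctx_check_requests() and the
_pinned wrapper are made-up names):

static long blk_mq_hctx_check_requests(void *arg)
{
	return blk_mq_hctx_has_requests(arg);
}

static bool blk_mq_hctx_has_requests_pinned(struct blk_mq_hw_ctx *hctx)
{
	/* run the tag iteration on a CPU that actually serves this hctx */
	unsigned int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);

	if (cpu >= nr_cpu_ids)
		return false;
	return work_on_cpu(cpu, blk_mq_hctx_check_requests, hctx);
}

Or is the idea that the CPUHP_AP_ teardown callback always runs on the
CPU which is going down, so we are on a CPU from the mask by
construction?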

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

Thread overview: 31+ messages
2020-05-20 17:06 blk-mq: improvement CPU hotplug (simplified version) v3 Christoph Hellwig
2020-05-20 17:06 ` [PATCH 1/6] blk-mq: remove the bio argument to ->prepare_request Christoph Hellwig
2020-05-20 18:16   ` Bart Van Assche
2020-05-22  9:11   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 2/6] blk-mq: simplify the blk_mq_get_request calling convention Christoph Hellwig
2020-05-20 18:22   ` Bart Van Assche
2020-05-22  9:13   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 3/6] blk-mq: move more request initialization to blk_mq_rq_ctx_init Christoph Hellwig
2020-05-20 20:10   ` Bart Van Assche
2020-05-20 17:06 ` [PATCH 4/6] blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx Christoph Hellwig
2020-05-22  9:17   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 5/6] blk-mq: add blk_mq_all_tag_iter Christoph Hellwig
2020-05-20 20:24   ` Bart Van Assche
2020-05-27  6:05     ` Christoph Hellwig
2020-05-22  9:18   ` Hannes Reinecke
2020-05-20 17:06 ` [PATCH 6/6] blk-mq: drain I/O when all CPUs in a hctx are offline Christoph Hellwig
2020-05-22  9:25   ` Hannes Reinecke [this message]
2020-05-25  9:20     ` Ming Lei
2020-05-20 21:46 ` blk-mq: improvement CPU hotplug (simplified version) v3 Bart Van Assche
2020-05-21  2:57   ` Ming Lei
2020-05-21  3:50     ` Bart Van Assche
2020-05-21  4:33       ` Ming Lei
2020-05-21 19:15         ` Bart Van Assche
2020-05-22  2:39           ` Ming Lei
2020-05-22 14:47             ` Keith Busch
2020-05-23  3:05               ` Ming Lei
2020-05-23 15:19             ` Bart Van Assche
2020-05-25  4:09               ` Ming Lei
2020-05-25 15:32                 ` Bart Van Assche
2020-05-25 16:38                   ` Keith Busch
2020-05-26  0:37                   ` Ming Lei
