From: Ming Lei <ming.lei@redhat.com>
To: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, John Garry <john.garry@huawei.com>,
	Bart Van Assche <bvanassche@acm.org>,
	Hannes Reinecke <hare@suse.com>, Christoph Hellwig <hch@lst.de>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Re: [PATCH V8 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline
Date: Sat, 25 Apr 2020 11:24:53 +0800	[thread overview]
Message-ID: <20200425032453.GD477579@T590> (raw)
In-Reply-To: <adaaadf2-7b8e-e8a0-0cee-35b170d45c77@suse.de>

On Fri, Apr 24, 2020 at 03:23:08PM +0200, Hannes Reinecke wrote:
> On 4/24/20 12:23 PM, Ming Lei wrote:
> > Most blk-mq drivers depend on managed IRQs' auto-affinity to set up
> > queue mapping. Thomas mentioned the following point[1]:
> > 
> > "
> >   That was the constraint of managed interrupts from the very beginning:
> > 
> >    The driver/subsystem has to quiesce the interrupt line and the associated
> >    queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> >    until it's restarted by the core when the CPU is plugged in again.
> > "
> > 
> > However, the current blk-mq implementation doesn't quiesce the hw queue
> > before the last CPU in the hctx is shut down. Even worse, CPUHP_BLK_MQ_DEAD
> > is a cpuhp state handled after the CPU is down, so there isn't any chance
> > to quiesce the hctx for blk-mq wrt. CPU hotplug.
> > 
> > Add new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE for blk-mq to stop queues
> > and wait for completion of in-flight requests.
> > 
> > We will stop the hw queue and wait for completion of in-flight requests
> > when one hctx is becoming dead in the following patch. This may cause
> > deadlock for some stacking blk-mq drivers, such as dm-rq and loop.
> > 
> > Add the blk-mq flag BLK_MQ_F_NO_MANAGED_IRQ and mark it for dm-rq and
> > loop, so we need not wait for completion of in-flight requests from
> > dm-rq & loop, and the potential deadlock can be avoided.
> > 
> > [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> > 
> > Cc: John Garry <john.garry@huawei.com>
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq-debugfs.c     |  1 +
> >   block/blk-mq.c             | 19 +++++++++++++++++++
> >   drivers/block/loop.c       |  2 +-
> >   drivers/md/dm-rq.c         |  2 +-
> >   include/linux/blk-mq.h     |  3 +++
> >   include/linux/cpuhotplug.h |  1 +
> >   6 files changed, 26 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index b3f2ba483992..8e745826eb86 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -239,6 +239,7 @@ static const char *const hctx_flag_name[] = {
> >   	HCTX_FLAG_NAME(TAG_SHARED),
> >   	HCTX_FLAG_NAME(BLOCKING),
> >   	HCTX_FLAG_NAME(NO_SCHED),
> > +	HCTX_FLAG_NAME(NO_MANAGED_IRQ),
> >   };
> >   #undef HCTX_FLAG_NAME
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 65f0aaed55ff..d432cc74ef78 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -2261,6 +2261,16 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
> >   	return -ENOMEM;
> >   }
> > +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> > +{
> > +	return 0;
> > +}
> > +
> > +static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
> > +{
> > +	return 0;
> > +}
> > +
> >   /*
> >    * 'cpu' is going away. splice any existing rq_list entries from this
> >    * software queue to the hw queue dispatch list, and ensure that it
> > @@ -2297,6 +2307,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
> >   static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
> >   {
> > +	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
> > +		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> > +						    &hctx->cpuhp_online);
> >   	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
> >   					    &hctx->cpuhp_dead);
> >   }
> > @@ -2356,6 +2369,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
> >   {
> >   	hctx->queue_num = hctx_idx;
> > +	if (!(hctx->flags & BLK_MQ_F_NO_MANAGED_IRQ))
> > +		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> > +				&hctx->cpuhp_online);
> >   	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
> >   	hctx->tags = set->tags[hctx_idx];
> > @@ -3610,6 +3626,9 @@ static int __init blk_mq_init(void)
> >   {
> >   	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
> >   				blk_mq_hctx_notify_dead);
> > +	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
> > +				blk_mq_hctx_notify_online,
> > +				blk_mq_hctx_notify_offline);
> >   	return 0;
> >   }
> >   subsys_initcall(blk_mq_init);
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index da693e6a834e..784f2e038b55 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -2037,7 +2037,7 @@ static int loop_add(struct loop_device **l, int i)
> >   	lo->tag_set.queue_depth = 128;
> >   	lo->tag_set.numa_node = NUMA_NO_NODE;
> >   	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
> > -	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> > +	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
> >   	lo->tag_set.driver_data = lo;
> >   	err = blk_mq_alloc_tag_set(&lo->tag_set);
> > diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> > index 3f8577e2c13b..5f1ff70ac029 100644
> > --- a/drivers/md/dm-rq.c
> > +++ b/drivers/md/dm-rq.c
> > @@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
> >   	md->tag_set->ops = &dm_mq_ops;
> >   	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
> >   	md->tag_set->numa_node = md->numa_node_id;
> > -	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
> > +	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_MANAGED_IRQ;
> >   	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
> >   	md->tag_set->driver_data = md;
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index b45148ba3291..f550b5274b8b 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
> >   	 */
> >   	atomic_t		nr_active;
> > +	/** @cpuhp_online: List to store request if CPU is going to die */
> > +	struct hlist_node	cpuhp_online;
> >   	/** @cpuhp_dead: List to store request if some CPU die. */
> >   	struct hlist_node	cpuhp_dead;
> >   	/** @kobj: Kernel object for sysfs. */
> > @@ -391,6 +393,7 @@ struct blk_mq_ops {
> >   enum {
> >   	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
> >   	BLK_MQ_F_TAG_SHARED	= 1 << 1,
> > +	BLK_MQ_F_NO_MANAGED_IRQ	= 1 << 2,
> >   	BLK_MQ_F_BLOCKING	= 1 << 5,
> >   	BLK_MQ_F_NO_SCHED	= 1 << 6,
> >   	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> > diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> > index 77d70b633531..24b3a77810b6 100644
> > --- a/include/linux/cpuhotplug.h
> > +++ b/include/linux/cpuhotplug.h
> > @@ -152,6 +152,7 @@ enum cpuhp_state {
> >   	CPUHP_AP_SMPBOOT_THREADS,
> >   	CPUHP_AP_X86_VDSO_VMA_ONLINE,
> >   	CPUHP_AP_IRQ_AFFINITY_ONLINE,
> > +	CPUHP_AP_BLK_MQ_ONLINE,
> >   	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
> >   	CPUHP_AP_X86_INTEL_EPB_ONLINE,
> >   	CPUHP_AP_PERF_ONLINE,
> > 
> Ho-hum.
> 
> I do agree for the loop and the CPUHP part (not that I'm qualified to judge
> the latter, but anyway).
> For the dm side I'm less certain.
> Thing is, we rarely get hardware interrupts delivered directly to the
> device-mapper device, but rather to the underlying hardware LLD.
> I'm not even quite sure what exactly the implications of managed interrupts
> are for dm; after all, we're using softirqs here, aren't we?
> 
> So for DM I'd rather wait for the I/O on the underlying devices' hctx to
> quiesce, and not kill it ourselves.
> Not sure if the device-mapper framework _can_ do this right now, though.
> Mike?

The problem the patchset tries to address is drivers that use managed
interrupts. When all CPUs mapped to one managed interrupt line are
offline, the IO completion interrupt may never trigger, so an IO timeout
may fire, or IO may hang if no timeout handler is provided.

So any driver which doesn't use managed interrupts can be marked with
BLK_MQ_F_NO_MANAGED_IRQ.

For dm-rq, request completion is always triggered by completion of the
underlying request, so once the underlying request is guaranteed to
complete, the dm-rq request can complete too.


Thanks,
Ming


Thread overview: 81+ messages
2020-04-24 10:23 [PATCH V8 00/11] blk-mq: improvement CPU hotplug Ming Lei
2020-04-24 10:23 ` [PATCH V8 01/11] block: clone nr_integrity_segments and write_hint in blk_rq_prep_clone Ming Lei
2020-04-24 10:32   ` Christoph Hellwig
2020-04-24 12:43   ` Hannes Reinecke
2020-04-24 16:11   ` Martin K. Petersen
2020-04-24 10:23 ` [PATCH V8 02/11] block: add helper for copying request Ming Lei
2020-04-24 10:23   ` Ming Lei
2020-04-24 10:35   ` Christoph Hellwig
2020-04-24 12:43   ` Hannes Reinecke
2020-04-24 16:12   ` Martin K. Petersen
2020-04-24 10:23 ` [PATCH V8 03/11] blk-mq: mark blk_mq_get_driver_tag as static Ming Lei
2020-04-24 12:44   ` Hannes Reinecke
2020-04-24 16:13   ` Martin K. Petersen
2020-04-24 10:23 ` [PATCH V8 04/11] blk-mq: assign rq->tag in blk_mq_get_driver_tag Ming Lei
2020-04-24 10:35   ` Christoph Hellwig
2020-04-24 13:02   ` Hannes Reinecke
2020-04-25  2:54     ` Ming Lei
2020-04-25 18:26       ` Hannes Reinecke
2020-04-24 10:23 ` [PATCH V8 05/11] blk-mq: support rq filter callback when iterating rqs Ming Lei
2020-04-24 13:17   ` Hannes Reinecke
2020-04-25  3:04     ` Ming Lei
2020-04-24 10:23 ` [PATCH V8 06/11] blk-mq: prepare for draining IO when hctx's all CPUs are offline Ming Lei
2020-04-24 13:23   ` Hannes Reinecke
2020-04-25  3:24     ` Ming Lei [this message]
2020-04-24 10:23 ` [PATCH V8 07/11] blk-mq: stop to handle IO and drain IO before hctx becomes inactive Ming Lei
2020-04-24 10:38   ` Christoph Hellwig
2020-04-25  3:17     ` Ming Lei
2020-04-25  8:32       ` Christoph Hellwig
2020-04-25  9:34         ` Ming Lei
2020-04-25  9:53           ` Ming Lei
2020-04-25 15:48             ` Christoph Hellwig
2020-04-26  2:06               ` Ming Lei
2020-04-26  8:19                 ` John Garry
2020-04-27 15:36                 ` Christoph Hellwig
2020-04-28  1:10                   ` Ming Lei
2020-04-27 19:03               ` Paul E. McKenney
2020-04-28  6:54                 ` Christoph Hellwig
2020-04-28 15:58               ` Peter Zijlstra
2020-04-29  2:16                 ` Ming Lei
2020-04-29  8:07                   ` Will Deacon
2020-04-29  9:46                     ` Ming Lei
2020-04-29 12:27                       ` Will Deacon
2020-04-29 13:43                         ` Ming Lei
2020-04-29 17:34                           ` Will Deacon
2020-04-30  0:39                             ` Ming Lei
2020-04-30 11:04                               ` Will Deacon
2020-04-30 14:02                                 ` Ming Lei
2020-05-05 15:46                                   ` Christoph Hellwig
2020-05-06  1:24                                     ` Ming Lei
2020-05-06  7:28                                       ` Will Deacon
2020-05-06  8:07                                         ` Ming Lei
2020-05-06  9:56                                           ` Will Deacon
2020-05-06 10:22                                             ` Ming Lei
2020-04-29 17:46                           ` Paul E. McKenney
2020-04-30  0:43                             ` Ming Lei
2020-04-24 13:27   ` Hannes Reinecke
2020-04-25  3:30     ` Ming Lei
2020-04-24 13:42   ` John Garry
2020-04-25  3:41     ` Ming Lei
2020-04-24 10:23 ` [PATCH V8 08/11] block: add blk_end_flush_machinery Ming Lei
2020-04-24 10:41   ` Christoph Hellwig
2020-04-25  3:44     ` Ming Lei
2020-04-25  8:11       ` Christoph Hellwig
2020-04-25  9:51         ` Ming Lei
2020-04-24 13:47   ` Hannes Reinecke
2020-04-25  3:47     ` Ming Lei
2020-04-24 10:23 ` [PATCH V8 09/11] blk-mq: add blk_mq_hctx_handle_dead_cpu for handling cpu dead Ming Lei
2020-04-24 10:42   ` Christoph Hellwig
2020-04-25  3:48     ` Ming Lei
2020-04-24 13:48   ` Hannes Reinecke
2020-04-24 10:23 ` [PATCH V8 10/11] blk-mq: re-submit IO in case that hctx is inactive Ming Lei
2020-04-24 10:44   ` Christoph Hellwig
2020-04-25  3:52     ` Ming Lei
2020-04-24 13:55   ` Hannes Reinecke
2020-04-25  3:59     ` Ming Lei
2020-04-24 10:23 ` [PATCH V8 11/11] block: deactivate hctx when the hctx is actually inactive Ming Lei
2020-04-24 10:43   ` Christoph Hellwig
2020-04-24 13:56   ` Hannes Reinecke
2020-04-24 15:23 ` [PATCH V8 00/11] blk-mq: improvement CPU hotplug Jens Axboe
2020-04-24 15:40   ` Christoph Hellwig
2020-04-24 15:41     ` Jens Axboe
