* [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
@ 2018-09-13 12:15 Ming Lei
  2018-09-13 12:15 ` [PATCH V3 01/17] blk-mq: allow to pass default queue flags for creating & initializing queue Ming Lei
                   ` (18 more replies)
  0 siblings, 19 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Ming Lei

Hi,

This patchset introduces a per-host admin request queue for submitting
admin requests only, and uses this approach to implement both SCSI
quiesce and runtime PM in one very simple way. It also avoids the
runtime PM deadlock that can occur when the request pool is used up,
such as when too many IO requests are allocated before resuming the device.

The idea is borrowed from NVMe.

In this patchset, admin requests (all requests submitted via __scsi_execute)
are submitted via one per-host admin queue. Each request is still
associated with the same scsi_device as before, and still respects all
of that scsi_device's limits. The admin queue shares host tags with the
other IO queues.

One core idea is that an admin request submitted via this admin queue is
never completed back to the block layer through the associated IO
queue (scsi_device); this is done in the 3rd patch. So once an IO queue
is frozen, it can be observed as really frozen from the block layer's view.
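The core idea above can be modeled in plain C (a userspace sketch with illustrative names, not the kernel code): a request remembers the queue it was allocated from, and completion only ever touches that queue, so a frozen IO queue's in-flight count is unaffected by admin traffic.

```c
#include <assert.h>

/* Userspace model (illustrative, not kernel code): completion always
 * goes back through req->q, the queue the request came from, never
 * through the scsi_device's IO queue. */
struct queue { int frozen; int in_flight; };
struct request { struct queue *q; };

static struct request alloc_request(struct queue *q)
{
	q->in_flight++;
	return (struct request){ .q = q };
}

static void complete_request(struct request *req)
{
	req->q->in_flight--;	/* only the originating queue is touched */
}
```

With an IO queue frozen at zero in-flight requests, allocating and completing an admin request on the admin queue leaves the IO queue's count at zero, which is what makes the freeze observable.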

SCSI quiesce is implemented on top of the admin queue in a very simple
way, see patch 15.

Runtime PM for the legacy path is simplified too (see patch 16), and
device resume is moved into blk_queue_enter().

blk-mq simply follows the legacy path's approach for supporting runtime PM.

The fast IO path is also simplified considerably, see blk_queue_enter().


gitweb:
	https://github.com/ming1/linux/commits/v4.19-rc-next-scsi_admin_queue_v3

Both runtime PM and system suspend have been verified on both the legacy
and blk-mq paths, with no regressions observed when running blktests.

Any comments are welcome!

Thanks,
Ming

V2->V3:
	- add comment on handling the admin queue inside hctx_may_queue() (4/17)
	- improve the runtime suspend helper as suggested by Jianchao (16/17)
	- remove RFC

V1->V2:
	- convert NO_SCHED into an ADMIN flag and don't allocate driver tag
	budget for the admin queue, as pointed out by Jianchao (4/17)
	- fix one issue when running the scsi queue: the admin queue shares
	the IO queue depth when sending a command to the scsi_device (10/17)
	- fix one race between runtime PM and system suspend (16/17)
	- iterate over scheduler tags instead of driver tags when counting
	allocated requests (17/17)


Ming Lei (17):
  blk-mq: allow to pass default queue flags for creating & initializing
    queue
  blk-mq: convert BLK_MQ_F_NO_SCHED into per-queue flag
  block: rename QUEUE_FLAG_NO_SCHED as QUEUE_FLAG_ADMIN
  blk-mq: don't reserve tags for admin queue
  SCSI: try to retrieve request_queue via 'scsi_cmnd' if possible
  SCSI: pass 'scsi_device' instance from 'scsi_request'
  SCSI: prepare for introducing admin queue for legacy path
  SCSI: pass scsi_device to scsi_mq_prep_fn
  SCSI: don't set .queuedata in scsi_mq_alloc_queue()
  SCSI: deal with admin queue busy
  SCSI: track pending admin commands
  SCSI: create admin queue for each host
  SCSI: use the dedicated admin queue to send admin commands
  SCSI: transport_spi: resume a quiesced device
  SCSI: use admin queue to implement queue QUIESCE
  block: simplify runtime PM support
  block: enable runtime PM for blk-mq

 block/blk-core.c                    | 189 ++++++++++++------------
 block/blk-mq-debugfs.c              |   3 +-
 block/blk-mq-tag.c                  |  33 ++++-
 block/blk-mq-tag.h                  |   2 +
 block/blk-mq.c                      |  44 ++++--
 block/elevator.c                    |  28 +---
 drivers/ata/libata-eh.c             |   2 +-
 drivers/block/null_blk_main.c       |   7 +-
 drivers/nvme/host/fc.c              |   4 +-
 drivers/nvme/host/pci.c             |   4 +-
 drivers/nvme/host/rdma.c            |   4 +-
 drivers/nvme/target/loop.c          |   4 +-
 drivers/scsi/hosts.c                |   9 ++
 drivers/scsi/libsas/sas_ata.c       |   2 +-
 drivers/scsi/libsas/sas_scsi_host.c |   2 +-
 drivers/scsi/scsi_error.c           |   2 +-
 drivers/scsi/scsi_lib.c             | 278 ++++++++++++++++++++++++++----------
 drivers/scsi/scsi_priv.h            |   1 +
 drivers/scsi/scsi_scan.c            |   1 +
 drivers/scsi/scsi_sysfs.c           |   1 +
 drivers/scsi/scsi_transport_spi.c   |   3 +
 include/linux/blk-mq.h              |  22 ++-
 include/linux/blkdev.h              |  14 +-
 include/scsi/scsi_device.h          |   5 +-
 include/scsi/scsi_host.h            |   2 +
 include/scsi/scsi_request.h         |   5 +-
 26 files changed, 439 insertions(+), 232 deletions(-)

-- 
2.9.5


* [PATCH V3 01/17] blk-mq: allow to pass default queue flags for creating & initializing queue
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 02/17] blk-mq: convert BLK_MQ_F_NO_SCHED into per-queue flag Ming Lei
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

Prepare for converting BLK_MQ_F_NO_SCHED into a per-queue flag, since
the following patches need this for supporting a per-host admin queue.
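The refactoring pattern this patch uses — rename the exported function to take the new parameter, and keep the old name as a static inline wrapper supplying the default — can be sketched in a standalone C analogue (names and flag values below are illustrative, not the kernel API):

```c
#include <assert.h>

#define DEFAULT_FLAGS 0x1UL	/* stands in for QUEUE_FLAG_MQ_DEFAULT */

/* The new __-prefixed function takes the extra def_flags parameter. */
static unsigned long __init_queue_flags(unsigned long def_flags)
{
	return def_flags;
}

/* The old name becomes a wrapper supplying the default, so all
 * existing callers compile unchanged. */
static inline unsigned long init_queue_flags(void)
{
	return __init_queue_flags(DEFAULT_FLAGS);
}
```

Only callers that need non-default flags (the admin queues in the later patches) switch to the __-prefixed variant.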

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 16 +++++++++-------
 include/linux/blk-mq.h | 19 ++++++++++++++++---
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85a1c1a59c72..d524efc5d1bc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2476,7 +2476,8 @@ void blk_mq_release(struct request_queue *q)
 	free_percpu(q->queue_ctx);
 }
 
-struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
+struct request_queue *__blk_mq_init_queue(struct blk_mq_tag_set *set,
+					  unsigned long def_flags)
 {
 	struct request_queue *uninit_q, *q;
 
@@ -2484,13 +2485,13 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 	if (!uninit_q)
 		return ERR_PTR(-ENOMEM);
 
-	q = blk_mq_init_allocated_queue(set, uninit_q);
+	q = __blk_mq_init_allocated_queue(set, uninit_q, def_flags);
 	if (IS_ERR(q))
 		blk_cleanup_queue(uninit_q);
 
 	return q;
 }
-EXPORT_SYMBOL(blk_mq_init_queue);
+EXPORT_SYMBOL(__blk_mq_init_queue);
 
 static int blk_mq_hw_ctx_size(struct blk_mq_tag_set *tag_set)
 {
@@ -2564,8 +2565,9 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	blk_mq_sysfs_register(q);
 }
 
-struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
-						  struct request_queue *q)
+struct request_queue *__blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
+						    struct request_queue *q,
+						    unsigned long def_flags)
 {
 	/* mark the queue as mq asap */
 	q->mq_ops = set->ops;
@@ -2599,7 +2601,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
 	q->nr_queues = nr_cpu_ids;
 
-	q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
+	q->queue_flags |= def_flags;
 
 	if (!(set->flags & BLK_MQ_F_SG_MERGE))
 		queue_flag_set_unlocked(QUEUE_FLAG_NO_SG_MERGE, q);
@@ -2649,7 +2651,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	q->mq_ops = NULL;
 	return ERR_PTR(-ENOMEM);
 }
-EXPORT_SYMBOL(blk_mq_init_allocated_queue);
+EXPORT_SYMBOL(__blk_mq_init_allocated_queue);
 
 void blk_mq_free_queue(struct request_queue *q)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 1da59c16f637..7f6ecd7b35ce 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -200,9 +200,22 @@ enum {
 	((policy & ((1 << BLK_MQ_F_ALLOC_POLICY_BITS) - 1)) \
 		<< BLK_MQ_F_ALLOC_POLICY_START_BIT)
 
-struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
-struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
-						  struct request_queue *q);
+struct request_queue *__blk_mq_init_queue(struct blk_mq_tag_set *, unsigned long);
+struct request_queue *__blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
+						    struct request_queue *q,
+						    unsigned long def_flags);
+
+static inline struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
+{
+	return __blk_mq_init_queue(set, QUEUE_FLAG_MQ_DEFAULT);
+}
+
+static inline struct request_queue *
+blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, struct request_queue *q)
+{
+	return __blk_mq_init_allocated_queue(set, q, QUEUE_FLAG_MQ_DEFAULT);
+}
+
 int blk_mq_register_dev(struct device *, struct request_queue *);
 void blk_mq_unregister_dev(struct device *, struct request_queue *);
 
-- 
2.9.5


* [PATCH V3 02/17] blk-mq: convert BLK_MQ_F_NO_SCHED into per-queue flag
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
  2018-09-13 12:15 ` [PATCH V3 01/17] blk-mq: allow to pass default queue flags for creating & initializing queue Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 03/17] block: rename QUEUE_FLAG_NO_SCHED as QUEUE_FLAG_ADMIN Ming Lei
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

We need to support an admin queue for the SCSI host. Unlike NVMe,
this support is only logical: the admin queue still has to share the
same tags with the IO queues.

Convert BLK_MQ_F_NO_SCHED into a per-queue flag so that we can support
an admin queue for SCSI.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c        | 2 +-
 block/blk-mq.c                | 2 +-
 block/elevator.c              | 3 +--
 drivers/block/null_blk_main.c | 7 ++++---
 drivers/nvme/host/fc.c        | 4 ++--
 drivers/nvme/host/pci.c       | 4 ++--
 drivers/nvme/host/rdma.c      | 4 ++--
 drivers/nvme/target/loop.c    | 4 ++--
 include/linux/blk-mq.h        | 1 -
 include/linux/blkdev.h        | 5 +++++
 10 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..246c9afb6f5d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -133,6 +133,7 @@ static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(SCSI_PASSTHROUGH),
 	QUEUE_FLAG_NAME(QUIESCED),
 	QUEUE_FLAG_NAME(PREEMPT_ONLY),
+	QUEUE_FLAG_NAME(NO_SCHED),
 };
 #undef QUEUE_FLAG_NAME
 
@@ -246,7 +247,6 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(SG_MERGE),
 	HCTX_FLAG_NAME(BLOCKING),
-	HCTX_FLAG_NAME(NO_SCHED),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d524efc5d1bc..5b56ed306cd9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2633,7 +2633,7 @@ struct request_queue *__blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	blk_mq_add_queue_tag_set(set, q);
 	blk_mq_map_swqueue(q);
 
-	if (!(set->flags & BLK_MQ_F_NO_SCHED)) {
+	if (!blk_queue_no_sched(q)) {
 		int ret;
 
 		ret = elevator_init_mq(q);
diff --git a/block/elevator.c b/block/elevator.c
index 6a06b5d040e5..8fb8754222fa 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1111,8 +1111,7 @@ static int __elevator_change(struct request_queue *q, const char *name)
 
 static inline bool elv_support_iosched(struct request_queue *q)
 {
-	if (q->mq_ops && q->tag_set && (q->tag_set->flags &
-				BLK_MQ_F_NO_SCHED))
+	if (q->mq_ops && blk_queue_no_sched(q))
 		return false;
 	return true;
 }
diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 6127e3ff7b4b..5d9504e65725 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -1617,8 +1617,6 @@ static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set)
 	set->numa_node = nullb ? nullb->dev->home_node : g_home_node;
 	set->cmd_size	= sizeof(struct nullb_cmd);
 	set->flags = BLK_MQ_F_SHOULD_MERGE;
-	if (g_no_sched)
-		set->flags |= BLK_MQ_F_NO_SCHED;
 	set->driver_data = NULL;
 
 	if ((nullb && nullb->dev->blocking) || g_blocking)
@@ -1703,6 +1701,9 @@ static int null_add_dev(struct nullb_device *dev)
 		goto out_free_nullb;
 
 	if (dev->queue_mode == NULL_Q_MQ) {
+		unsigned long q_flags = g_no_sched ?
+			QUEUE_FLAG_MQ_NO_SCHED_DEFAULT : QUEUE_FLAG_MQ_DEFAULT;
+
 		if (shared_tags) {
 			nullb->tag_set = &tag_set;
 			rv = 0;
@@ -1718,7 +1719,7 @@ static int null_add_dev(struct nullb_device *dev)
 			goto out_cleanup_queues;
 
 		nullb->tag_set->timeout = 5 * HZ;
-		nullb->q = blk_mq_init_queue(nullb->tag_set);
+		nullb->q = __blk_mq_init_queue(nullb->tag_set, q_flags);
 		if (IS_ERR(nullb->q)) {
 			rv = -ENOMEM;
 			goto out_cleanup_tags;
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 611e70cae754..7048e1444210 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3034,14 +3034,14 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 	ctrl->admin_tag_set.driver_data = ctrl;
 	ctrl->admin_tag_set.nr_hw_queues = 1;
 	ctrl->admin_tag_set.timeout = ADMIN_TIMEOUT;
-	ctrl->admin_tag_set.flags = BLK_MQ_F_NO_SCHED;
 
 	ret = blk_mq_alloc_tag_set(&ctrl->admin_tag_set);
 	if (ret)
 		goto out_free_queues;
 	ctrl->ctrl.admin_tagset = &ctrl->admin_tag_set;
 
-	ctrl->ctrl.admin_q = blk_mq_init_queue(&ctrl->admin_tag_set);
+	ctrl->ctrl.admin_q = __blk_mq_init_queue(&ctrl->admin_tag_set,
+			QUEUE_FLAG_MQ_NO_SCHED_DEFAULT);
 	if (IS_ERR(ctrl->ctrl.admin_q)) {
 		ret = PTR_ERR(ctrl->ctrl.admin_q);
 		goto out_free_admin_tag_set;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d668682f91df..73a3bd980fc9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1492,14 +1492,14 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
 		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
 		dev->admin_tagset.cmd_size = nvme_pci_cmd_size(dev, false);
-		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
 		dev->admin_tagset.driver_data = dev;
 
 		if (blk_mq_alloc_tag_set(&dev->admin_tagset))
 			return -ENOMEM;
 		dev->ctrl.admin_tagset = &dev->admin_tagset;
 
-		dev->ctrl.admin_q = blk_mq_init_queue(&dev->admin_tagset);
+		dev->ctrl.admin_q = __blk_mq_init_queue(&dev->admin_tagset,
+				QUEUE_FLAG_MQ_NO_SCHED_DEFAULT);
 		if (IS_ERR(dev->ctrl.admin_q)) {
 			blk_mq_free_tag_set(&dev->admin_tagset);
 			return -ENOMEM;
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index dc042017c293..d61c057c0a71 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -692,7 +692,6 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->driver_data = ctrl;
 		set->nr_hw_queues = 1;
 		set->timeout = ADMIN_TIMEOUT;
-		set->flags = BLK_MQ_F_NO_SCHED;
 	} else {
 		set = &ctrl->tag_set;
 		memset(set, 0, sizeof(*set));
@@ -770,7 +769,8 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 			goto out_free_async_qe;
 		}
 
-		ctrl->ctrl.admin_q = blk_mq_init_queue(&ctrl->admin_tag_set);
+		ctrl->ctrl.admin_q = __blk_mq_init_queue(&ctrl->admin_tag_set,
+				QUEUE_FLAG_MQ_NO_SCHED_DEFAULT);
 		if (IS_ERR(ctrl->ctrl.admin_q)) {
 			error = PTR_ERR(ctrl->ctrl.admin_q);
 			goto out_free_tagset;
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 9908082b32c4..c689621c2187 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -368,7 +368,6 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
 	ctrl->admin_tag_set.driver_data = ctrl;
 	ctrl->admin_tag_set.nr_hw_queues = 1;
 	ctrl->admin_tag_set.timeout = ADMIN_TIMEOUT;
-	ctrl->admin_tag_set.flags = BLK_MQ_F_NO_SCHED;
 
 	ctrl->queues[0].ctrl = ctrl;
 	error = nvmet_sq_init(&ctrl->queues[0].nvme_sq);
@@ -381,7 +380,8 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
 		goto out_free_sq;
 	ctrl->ctrl.admin_tagset = &ctrl->admin_tag_set;
 
-	ctrl->ctrl.admin_q = blk_mq_init_queue(&ctrl->admin_tag_set);
+	ctrl->ctrl.admin_q = __blk_mq_init_queue(&ctrl->admin_tag_set,
+			QUEUE_FLAG_MQ_NO_SCHED_DEFAULT);
 	if (IS_ERR(ctrl->ctrl.admin_q)) {
 		error = PTR_ERR(ctrl->ctrl.admin_q);
 		goto out_free_tagset;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7f6ecd7b35ce..afde18ac5b31 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -181,7 +181,6 @@ enum {
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
 	BLK_MQ_F_SG_MERGE	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
-	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d6869e0e2b64..a2b110ec422d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -699,6 +699,7 @@ struct request_queue {
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED    28	/* queue has been quiesced */
 #define QUEUE_FLAG_PREEMPT_ONLY	29	/* only process REQ_PREEMPT requests */
+#define QUEUE_FLAG_NO_SCHED	30	/* no scheduler allowed */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
@@ -708,6 +709,9 @@ struct request_queue {
 				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
 				 (1 << QUEUE_FLAG_POLL))
 
+#define QUEUE_FLAG_MQ_NO_SCHED_DEFAULT	 (QUEUE_FLAG_MQ_DEFAULT |	\
+					  (1 << QUEUE_FLAG_NO_SCHED))
+
 void blk_queue_flag_set(unsigned int flag, struct request_queue *q);
 void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
 bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q);
@@ -739,6 +743,7 @@ bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q);
 #define blk_queue_preempt_only(q)				\
 	test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
 #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
+#define blk_queue_no_sched(q)	test_bit(QUEUE_FLAG_NO_SCHED, &(q)->queue_flags)
 
 extern int blk_set_preempt_only(struct request_queue *q);
 extern void blk_clear_preempt_only(struct request_queue *q);
-- 
2.9.5


* [PATCH V3 03/17] block: rename QUEUE_FLAG_NO_SCHED as QUEUE_FLAG_ADMIN
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
  2018-09-13 12:15 ` [PATCH V3 01/17] blk-mq: allow to pass default queue flags for creating & initializing queue Ming Lei
  2018-09-13 12:15 ` [PATCH V3 02/17] blk-mq: convert BLK_MQ_F_NO_SCHED into per-queue flag Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 04/17] blk-mq: don't reserve tags for admin queue Ming Lei
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

Now all users of QUEUE_FLAG_NO_SCHED are admin queues, and no driver
needs this flag for an IO queue.

So rename it to QUEUE_FLAG_ADMIN, which is more straightforward.
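The composition of the renamed flag can be checked with plain bit arithmetic (a userspace sketch; QUEUE_FLAG_ADMIN's bit number 30 comes from this patch, while the other bit numbers are illustrative stand-ins for blkdev.h):

```c
#include <assert.h>

#define QUEUE_FLAG_SAME_COMP	4	/* illustrative bit number */
#define QUEUE_FLAG_IO_STAT	10	/* illustrative bit number */
#define QUEUE_FLAG_POLL		16	/* illustrative bit number */
#define QUEUE_FLAG_ADMIN	30	/* from this patch */

#define QUEUE_FLAG_MQ_DEFAULT		((1UL << QUEUE_FLAG_IO_STAT) |	\
					 (1UL << QUEUE_FLAG_SAME_COMP) |	\
					 (1UL << QUEUE_FLAG_POLL))
#define QUEUE_FLAG_MQ_ADMIN_DEFAULT	(QUEUE_FLAG_MQ_DEFAULT |	\
					 (1UL << QUEUE_FLAG_ADMIN))

/* Userspace analogue of the blk_queue_admin() test_bit() macro. */
static int queue_is_admin(unsigned long queue_flags)
{
	return !!(queue_flags & (1UL << QUEUE_FLAG_ADMIN));
}
```

A queue created with QUEUE_FLAG_MQ_ADMIN_DEFAULT tests positive, while a default MQ queue does not, which is all the elevator and tag-accounting checks in this series rely on.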

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c        | 2 +-
 block/blk-mq.c                | 2 +-
 block/elevator.c              | 2 +-
 drivers/block/null_blk_main.c | 2 +-
 drivers/nvme/host/fc.c        | 2 +-
 drivers/nvme/host/pci.c       | 2 +-
 drivers/nvme/host/rdma.c      | 2 +-
 drivers/nvme/target/loop.c    | 2 +-
 include/linux/blkdev.h        | 8 ++++----
 9 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 246c9afb6f5d..8df013e9f242 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -133,7 +133,7 @@ static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(SCSI_PASSTHROUGH),
 	QUEUE_FLAG_NAME(QUIESCED),
 	QUEUE_FLAG_NAME(PREEMPT_ONLY),
-	QUEUE_FLAG_NAME(NO_SCHED),
+	QUEUE_FLAG_NAME(ADMIN),
 };
 #undef QUEUE_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5b56ed306cd9..7868daaf6de0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2633,7 +2633,7 @@ struct request_queue *__blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	blk_mq_add_queue_tag_set(set, q);
 	blk_mq_map_swqueue(q);
 
-	if (!blk_queue_no_sched(q)) {
+	if (!blk_queue_admin(q)) {
 		int ret;
 
 		ret = elevator_init_mq(q);
diff --git a/block/elevator.c b/block/elevator.c
index 8fb8754222fa..d6abba76c89e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1111,7 +1111,7 @@ static int __elevator_change(struct request_queue *q, const char *name)
 
 static inline bool elv_support_iosched(struct request_queue *q)
 {
-	if (q->mq_ops && blk_queue_no_sched(q))
+	if (q->mq_ops && blk_queue_admin(q))
 		return false;
 	return true;
 }
diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 5d9504e65725..9fb358007e43 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -1702,7 +1702,7 @@ static int null_add_dev(struct nullb_device *dev)
 
 	if (dev->queue_mode == NULL_Q_MQ) {
 		unsigned long q_flags = g_no_sched ?
-			QUEUE_FLAG_MQ_NO_SCHED_DEFAULT : QUEUE_FLAG_MQ_DEFAULT;
+			QUEUE_FLAG_MQ_ADMIN_DEFAULT : QUEUE_FLAG_MQ_DEFAULT;
 
 		if (shared_tags) {
 			nullb->tag_set = &tag_set;
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 7048e1444210..a920d13c3538 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3041,7 +3041,7 @@ nvme_fc_init_ctrl(struct device *dev, struct nvmf_ctrl_options *opts,
 	ctrl->ctrl.admin_tagset = &ctrl->admin_tag_set;
 
 	ctrl->ctrl.admin_q = __blk_mq_init_queue(&ctrl->admin_tag_set,
-			QUEUE_FLAG_MQ_NO_SCHED_DEFAULT);
+			QUEUE_FLAG_MQ_ADMIN_DEFAULT);
 	if (IS_ERR(ctrl->ctrl.admin_q)) {
 		ret = PTR_ERR(ctrl->ctrl.admin_q);
 		goto out_free_admin_tag_set;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 73a3bd980fc9..10716a00a6b4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1499,7 +1499,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 		dev->ctrl.admin_tagset = &dev->admin_tagset;
 
 		dev->ctrl.admin_q = __blk_mq_init_queue(&dev->admin_tagset,
-				QUEUE_FLAG_MQ_NO_SCHED_DEFAULT);
+				QUEUE_FLAG_MQ_ADMIN_DEFAULT);
 		if (IS_ERR(dev->ctrl.admin_q)) {
 			blk_mq_free_tag_set(&dev->admin_tagset);
 			return -ENOMEM;
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index d61c057c0a71..f901b3dafac5 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -770,7 +770,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 		}
 
 		ctrl->ctrl.admin_q = __blk_mq_init_queue(&ctrl->admin_tag_set,
-				QUEUE_FLAG_MQ_NO_SCHED_DEFAULT);
+				QUEUE_FLAG_MQ_ADMIN_DEFAULT);
 		if (IS_ERR(ctrl->ctrl.admin_q)) {
 			error = PTR_ERR(ctrl->ctrl.admin_q);
 			goto out_free_tagset;
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index c689621c2187..8fca59e6b3c3 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -381,7 +381,7 @@ static int nvme_loop_configure_admin_queue(struct nvme_loop_ctrl *ctrl)
 	ctrl->ctrl.admin_tagset = &ctrl->admin_tag_set;
 
 	ctrl->ctrl.admin_q = __blk_mq_init_queue(&ctrl->admin_tag_set,
-			QUEUE_FLAG_MQ_NO_SCHED_DEFAULT);
+			QUEUE_FLAG_MQ_ADMIN_DEFAULT);
 	if (IS_ERR(ctrl->ctrl.admin_q)) {
 		error = PTR_ERR(ctrl->ctrl.admin_q);
 		goto out_free_tagset;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index a2b110ec422d..2dbc7524a169 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -699,7 +699,7 @@ struct request_queue {
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED    28	/* queue has been quiesced */
 #define QUEUE_FLAG_PREEMPT_ONLY	29	/* only process REQ_PREEMPT requests */
-#define QUEUE_FLAG_NO_SCHED	30	/* no scheduler allowed */
+#define QUEUE_FLAG_ADMIN	30	/* admin queue */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
@@ -709,8 +709,8 @@ struct request_queue {
 				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
 				 (1 << QUEUE_FLAG_POLL))
 
-#define QUEUE_FLAG_MQ_NO_SCHED_DEFAULT	 (QUEUE_FLAG_MQ_DEFAULT |	\
-					  (1 << QUEUE_FLAG_NO_SCHED))
+#define QUEUE_FLAG_MQ_ADMIN_DEFAULT	 (QUEUE_FLAG_MQ_DEFAULT |	\
+					  (1 << QUEUE_FLAG_ADMIN))
 
 void blk_queue_flag_set(unsigned int flag, struct request_queue *q);
 void blk_queue_flag_clear(unsigned int flag, struct request_queue *q);
@@ -743,7 +743,7 @@ bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q);
 #define blk_queue_preempt_only(q)				\
 	test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
 #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
-#define blk_queue_no_sched(q)	test_bit(QUEUE_FLAG_NO_SCHED, &(q)->queue_flags)
+#define blk_queue_admin(q)	test_bit(QUEUE_FLAG_ADMIN, &(q)->queue_flags)
 
 extern int blk_set_preempt_only(struct request_queue *q);
 extern void blk_clear_preempt_only(struct request_queue *q);
-- 
2.9.5


* [PATCH V3 04/17] blk-mq: don't reserve tags for admin queue
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (2 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 03/17] block: rename QUEUE_FLAG_NO_SCHED as QUEUE_FLAG_ADMIN Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 05/17] SCSI: try to retrieve request_queue via 'scsi_cmnd' if possible Ming Lei
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

It isn't necessary to reserve tags for the admin queue since there
usually aren't many in-flight commands on it.

This change won't starve the admin queue either, because each blocked
queue has equal priority to get a new tag when a driver tag is released,
no matter which queue the tag is freed from.

So IO performance won't be affected after the admin queue (which shares
tags with the IO queues) is introduced in the following patches.
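The fairness argument can be modeled with the share computation used by hctx_may_queue() (a userspace sketch of the kernel's roughly max(ceil(depth/users), 4) rule, not the kernel code itself): since the admin queue is never counted in active_queues, the divisor, and hence each IO queue's share, is unchanged by its presence.

```c
#include <assert.h>

/* Sketch of the per-queue tag share computed in hctx_may_queue():
 * approximately the total depth divided by the number of active
 * queues, rounded up, with a floor of 4 tags per queue. */
static unsigned int max_tag_share(unsigned int depth, unsigned int active_queues)
{
	unsigned int share;

	if (active_queues == 0)
		return depth;	/* no contention: whole depth available */
	share = (depth + active_queues - 1) / active_queues;
	return share < 4 ? 4 : share;
}
```

An admin queue excluded from active_queues leaves this computation, and therefore each IO queue's tag budget, exactly as it was before the admin queue existed.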

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 94e1ed667b6e..7b0390f1c764 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -30,7 +30,8 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
 bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
 	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
-	    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+	    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
+	    !blk_queue_admin(hctx->queue))
 		atomic_inc(&hctx->tags->active_queues);
 
 	return true;
@@ -57,7 +58,8 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 	if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 		return;
 
-	atomic_dec(&tags->active_queues);
+	if (!blk_queue_admin(hctx->queue))
+		atomic_dec(&tags->active_queues);
 
 	blk_mq_tag_wakeup_all(tags, false);
 }
@@ -82,6 +84,12 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 	if (bt->sb.depth == 1)
 		return true;
 
+	/*
+	 * No need to handle the admin queue specially here: it isn't
+	 * accounted in tags->active_queues, so the fast path avoids a
+	 * blk_queue_admin() check, with the implicit benefit of limiting
+	 * the number of in-flight admin requests.
+	 */
 	users = atomic_read(&hctx->tags->active_queues);
 	if (!users)
 		return true;
-- 
2.9.5


* [PATCH V3 05/17] SCSI: try to retrieve request_queue via 'scsi_cmnd' if possible
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (3 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 04/17] blk-mq: don't reserve tags for admin queue Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 06/17] SCSI: pass 'scsi_device' instance from 'scsi_request' Ming Lei
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

Prepare for introducing a per-host admin queue.

The most important part is that a request originating from the admin
queue can't be completed back to the IO queue associated with the
scsi_device. In particular, a request may be requeued, timed out, or
completed via block layer helpers, so we should use
'scsi_cmnd->request->q' to retrieve the request queue and pass that to
the block layer helpers, instead of sdev->request_queue.

Fortunately most users of 'scsi_device->request_queue' aren't in the
affected IO paths (requeue, timeout, complete, run queue), so the audit
isn't as difficult as expected.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/ata/libata-eh.c             | 2 +-
 drivers/scsi/libsas/sas_ata.c       | 2 +-
 drivers/scsi/libsas/sas_scsi_host.c | 2 +-
 drivers/scsi/scsi_error.c           | 2 +-
 drivers/scsi/scsi_lib.c             | 6 +++---
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 01306c018398..fbceea6b62a9 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -919,7 +919,7 @@ static void ata_eh_set_pending(struct ata_port *ap, int fastdrain)
 void ata_qc_schedule_eh(struct ata_queued_cmd *qc)
 {
 	struct ata_port *ap = qc->ap;
-	struct request_queue *q = qc->scsicmd->device->request_queue;
+	struct request_queue *q = qc->scsicmd->request->q;
 	unsigned long flags;
 
 	WARN_ON(!ap->ops->error_handler);
diff --git a/drivers/scsi/libsas/sas_ata.c b/drivers/scsi/libsas/sas_ata.c
index 64a958a99f6a..fcf46437b756 100644
--- a/drivers/scsi/libsas/sas_ata.c
+++ b/drivers/scsi/libsas/sas_ata.c
@@ -601,7 +601,7 @@ void sas_ata_task_abort(struct sas_task *task)
 
 	/* Bounce SCSI-initiated commands to the SCSI EH */
 	if (qc->scsicmd) {
-		struct request_queue *q = qc->scsicmd->device->request_queue;
+		struct request_queue *q = qc->scsicmd->request->q;
 		unsigned long flags;
 
 		spin_lock_irqsave(q->queue_lock, flags);
diff --git a/drivers/scsi/libsas/sas_scsi_host.c b/drivers/scsi/libsas/sas_scsi_host.c
index 33229348dcb6..91e192f93ae1 100644
--- a/drivers/scsi/libsas/sas_scsi_host.c
+++ b/drivers/scsi/libsas/sas_scsi_host.c
@@ -933,7 +933,7 @@ void sas_task_abort(struct sas_task *task)
 	if (dev_is_sata(task->dev)) {
 		sas_ata_task_abort(task);
 	} else {
-		struct request_queue *q = sc->device->request_queue;
+		struct request_queue *q = sc->request->q;
 		unsigned long flags;
 
 		spin_lock_irqsave(q->queue_lock, flags);
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index b7a8fdfeb2f4..9f19c80b983c 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1360,7 +1360,7 @@ static int scsi_eh_try_stu(struct scsi_cmnd *scmd)
 		int i, rtn = NEEDS_RETRY;
 
 		for (i = 0; rtn == NEEDS_RETRY && i < 2; i++)
-			rtn = scsi_send_eh_cmnd(scmd, stu_command, 6, scmd->device->request_queue->rq_timeout, 0);
+			rtn = scsi_send_eh_cmnd(scmd, stu_command, 6, scmd->request->q->rq_timeout, 0);
 
 		if (rtn == SUCCESS)
 			return 0;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index eb97d2dd3651..6fb8fd3ccc2c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -168,7 +168,7 @@ static void scsi_mq_requeue_cmd(struct scsi_cmnd *cmd)
 static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, bool unbusy)
 {
 	struct scsi_device *device = cmd->device;
-	struct request_queue *q = device->request_queue;
+	struct request_queue *q = cmd->request->q;
 	unsigned long flags;
 
 	SCSI_LOG_MLQUEUE(1, scmd_printk(KERN_INFO, cmd,
@@ -668,7 +668,7 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
 {
 	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
 	struct scsi_device *sdev = cmd->device;
-	struct request_queue *q = sdev->request_queue;
+	struct request_queue *q = cmd->request->q;
 
 	if (blk_update_request(req, error, bytes))
 		return true;
@@ -1038,7 +1038,7 @@ static int scsi_io_completion_nz_result(struct scsi_cmnd *cmd, int result,
 void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 {
 	int result = cmd->result;
-	struct request_queue *q = cmd->device->request_queue;
+	struct request_queue *q = cmd->request->q;
 	struct request *req = cmd->request;
 	blk_status_t blk_stat = BLK_STS_OK;
 
-- 
2.9.5


* [PATCH V3 06/17] SCSI: pass 'scsi_device' instance from 'scsi_request'
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (4 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 05/17] SCSI: try to retrieve request_queue via 'scsi_cmnd' if possible Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 07/17] SCSI: prepare for introducing admin queue for legacy path Ming Lei
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

This patch prepares for introducing the SCSI per-host admin queue, which
will only be used for queuing admin requests, i.e. those submitted via
__scsi_execute(). Store the 'scsi_device' in 'scsi_request' so that it can
be retrieved in the submit path; the new field shares storage with 'sense'
via a union and is only valid in the submit path.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c     | 2 ++
 include/scsi/scsi_request.h | 5 ++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6fb8fd3ccc2c..2800dfae19cd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -279,6 +279,8 @@ int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	rq->cmd_len = COMMAND_SIZE(cmd[0]);
 	memcpy(rq->cmd, cmd, rq->cmd_len);
 	rq->retries = retries;
+	rq->sdev = sdev;	/* only valid in submit path */
+
 	req->timeout = timeout;
 	req->cmd_flags |= flags;
 	req->rq_flags |= rq_flags | RQF_QUIET;
diff --git a/include/scsi/scsi_request.h b/include/scsi/scsi_request.h
index b06f28c74908..0de6901b48ab 100644
--- a/include/scsi/scsi_request.h
+++ b/include/scsi/scsi_request.h
@@ -14,7 +14,10 @@ struct scsi_request {
 	unsigned int	sense_len;
 	unsigned int	resid_len;	/* residual count */
 	int		retries;
-	void		*sense;
+	union {
+		void		*sense;
+		struct scsi_device *sdev;
+	};
 };
 
 static inline struct scsi_request *scsi_req(struct request *rq)
-- 
2.9.5


* [PATCH V3 07/17] SCSI: prepare for introducing admin queue for legacy path
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (5 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 06/17] SCSI: pass 'scsi_device' instance from 'scsi_request' Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 08/17] SCSI: pass scsi_device to scsi_mq_prep_fn Ming Lei
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

Use scsi_is_admin_queue() and scsi_get_scsi_dev() to retrieve the
'scsi_device' in the legacy path.

The same approach could be used in the SCSI_MQ path too, just not very
efficiently; that will be dealt with in the patch introducing the admin
queue for SCSI_MQ.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 37 +++++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2800dfae19cd..2f541b4fb32b 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -46,6 +46,20 @@ static DEFINE_MUTEX(scsi_sense_cache_mutex);
 
 static void scsi_mq_uninit_cmd(struct scsi_cmnd *cmd);
 
+/* For admin queue, its queuedata is NULL */
+static inline bool scsi_is_admin_queue(struct request_queue *q)
+{
+	return !q->queuedata;
+}
+
+/* This helper can only be used in req prep stage */
+static inline struct scsi_device *scsi_get_scsi_dev(struct request *rq)
+{
+	if (scsi_is_admin_queue(rq->q))
+		return scsi_req(rq)->sdev;
+	return rq->q->queuedata;
+}
+
 static inline struct kmem_cache *
 scsi_select_sense_cache(bool unchecked_isa_dma)
 {
@@ -1426,10 +1440,9 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 }
 
 static int
-scsi_prep_return(struct request_queue *q, struct request *req, int ret)
+scsi_prep_return(struct scsi_device *sdev, struct request_queue *q,
+		struct request *req, int ret)
 {
-	struct scsi_device *sdev = q->queuedata;
-
 	switch (ret) {
 	case BLKPREP_KILL:
 	case BLKPREP_INVALID:
@@ -1461,7 +1474,7 @@ scsi_prep_return(struct request_queue *q, struct request *req, int ret)
 
 static int scsi_prep_fn(struct request_queue *q, struct request *req)
 {
-	struct scsi_device *sdev = q->queuedata;
+	struct scsi_device *sdev = scsi_get_scsi_dev(req);
 	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
 	int ret;
 
@@ -1486,7 +1499,7 @@ static int scsi_prep_fn(struct request_queue *q, struct request *req)
 
 	ret = scsi_setup_cmnd(sdev, req);
 out:
-	return scsi_prep_return(q, req, ret);
+	return scsi_prep_return(sdev, q, req, ret);
 }
 
 static void scsi_unprep_fn(struct request_queue *q, struct request *req)
@@ -1663,6 +1676,9 @@ static int scsi_lld_busy(struct request_queue *q)
 	if (blk_queue_dying(q))
 		return 0;
 
+	if (WARN_ON_ONCE(scsi_is_admin_queue(q)))
+		return 0;
+
 	shost = sdev->host;
 
 	/*
@@ -1866,7 +1882,7 @@ static void scsi_request_fn(struct request_queue *q)
 	__releases(q->queue_lock)
 	__acquires(q->queue_lock)
 {
-	struct scsi_device *sdev = q->queuedata;
+	struct scsi_device *sdev;
 	struct Scsi_Host *shost;
 	struct scsi_cmnd *cmd;
 	struct request *req;
@@ -1875,7 +1891,6 @@ static void scsi_request_fn(struct request_queue *q)
 	 * To start with, we keep looping until the queue is empty, or until
 	 * the host is no longer able to accept any more requests.
 	 */
-	shost = sdev->host;
 	for (;;) {
 		int rtn;
 		/*
@@ -1887,6 +1902,10 @@ static void scsi_request_fn(struct request_queue *q)
 		if (!req)
 			break;
 
+		cmd = blk_mq_rq_to_pdu(req);
+		sdev = cmd->device;
+		shost = sdev->host;
+
 		if (unlikely(!scsi_device_online(sdev))) {
 			sdev_printk(KERN_ERR, sdev,
 				    "rejecting I/O to offline device\n");
@@ -1904,7 +1923,6 @@ static void scsi_request_fn(struct request_queue *q)
 			blk_start_request(req);
 
 		spin_unlock_irq(q->queue_lock);
-		cmd = blk_mq_rq_to_pdu(req);
 		if (cmd != req->special) {
 			printk(KERN_CRIT "impossible request in %s.\n"
 					 "please mail a stack trace to "
@@ -2382,6 +2400,9 @@ struct scsi_device *scsi_device_from_queue(struct request_queue *q)
 {
 	struct scsi_device *sdev = NULL;
 
+	/* admin queue won't be exposed to external users */
+	WARN_ON_ONCE(scsi_is_admin_queue(q));
+
 	if (q->mq_ops) {
 		if (q->mq_ops == &scsi_mq_ops)
 			sdev = q->queuedata;
-- 
2.9.5


* [PATCH V3 08/17] SCSI: pass scsi_device to scsi_mq_prep_fn
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (6 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 07/17] SCSI: prepare for introducing admin queue for legacy path Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 09/17] SCSI: don't set .queuedata in scsi_mq_alloc_queue() Ming Lei
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

This patchset will introduce a per-host admin queue, so it may not be
possible to get the 'scsi_device' via q->queuedata. Pass the 'scsi_device'
to scsi_mq_prep_fn() explicitly instead.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2f541b4fb32b..bc04389de560 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2020,10 +2020,9 @@ static unsigned int scsi_mq_sgl_size(struct Scsi_Host *shost)
 		sizeof(struct scatterlist);
 }
 
-static int scsi_mq_prep_fn(struct request *req)
+static int scsi_mq_prep_fn(struct scsi_device *sdev, struct request *req)
 {
 	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
-	struct scsi_device *sdev = req->q->queuedata;
 	struct Scsi_Host *shost = sdev->host;
 	struct scatterlist *sg;
 
@@ -2119,7 +2118,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 		goto out_dec_target_busy;
 
 	if (!(req->rq_flags & RQF_DONTPREP)) {
-		ret = prep_to_mq(scsi_mq_prep_fn(req));
+		ret = prep_to_mq(scsi_mq_prep_fn(sdev, req));
 		if (ret != BLK_STS_OK)
 			goto out_dec_host_busy;
 		req->rq_flags |= RQF_DONTPREP;
-- 
2.9.5


* [PATCH V3 09/17] SCSI: don't set .queuedata in scsi_mq_alloc_queue()
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (7 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 08/17] SCSI: pass scsi_device to scsi_mq_prep_fn Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 10/17] SCSI: deal with admin queue busy Ming Lei
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

.queuedata is set in scsi_alloc_sdev() for both the non-mq and scsi_mq
paths, so it isn't necessary to set it again in scsi_mq_alloc_queue().

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index bc04389de560..1072b2e303d9 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2354,7 +2354,6 @@ struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev)
 	if (IS_ERR(sdev->request_queue))
 		return NULL;
 
-	sdev->request_queue->queuedata = sdev;
 	__scsi_init_queue(sdev->host, sdev->request_queue);
 	blk_queue_flag_set(QUEUE_FLAG_SCSI_PASSTHROUGH, sdev->request_queue);
 	return sdev->request_queue;
-- 
2.9.5


* [PATCH V3 10/17] SCSI: deal with admin queue busy
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (8 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 09/17] SCSI: don't set .queuedata in scsi_mq_alloc_queue() Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 11/17] SCSI: track pending admin commands Ming Lei
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

When a request originating from the admin queue isn't queued successfully,
deal with it just like a normal request, that is, the admin queue will be
rerun after one request on this host is completed.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c  | 61 ++++++++++++++++++++++++++++++++++--------------
 include/scsi/scsi_host.h |  2 ++
 2 files changed, 46 insertions(+), 17 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 1072b2e303d9..8d129b601cc5 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -475,10 +475,14 @@ static void scsi_starved_list_run(struct Scsi_Host *shost)
 	LIST_HEAD(starved_list);
 	struct scsi_device *sdev;
 	unsigned long flags;
+	bool run_admin;
 
 	spin_lock_irqsave(shost->host_lock, flags);
 	list_splice_init(&shost->starved_list, &starved_list);
 
+	run_admin = shost->run_admin_queue;
+	shost->run_admin_queue = false;
+
 	while (!list_empty(&starved_list)) {
 		struct request_queue *slq;
 
@@ -527,6 +531,10 @@ static void scsi_starved_list_run(struct Scsi_Host *shost)
 	/* put any unprocessed entries back */
 	list_splice(&starved_list, &shost->starved_list);
 	spin_unlock_irqrestore(shost->host_lock, flags);
+
+	/* no need to get queue for admin_q */
+	if (run_admin)
+		scsi_kick_queue(shost->admin_q);
 }
 
 /*
@@ -534,26 +542,30 @@ static void scsi_starved_list_run(struct Scsi_Host *shost)
  *
  * Purpose:    Select a proper request queue to serve next
  *
- * Arguments:  q       - last request's queue
+ * Arguments:	sdev	- the last request's scsi_device
+ * 		q       - last request's queue, which may point to
+ * 			host->admin_q
  *
  * Returns:     Nothing
  *
  * Notes:      The previous command was completely finished, start
  *             a new one if possible.
  */
-static void scsi_run_queue(struct request_queue *q)
+static void scsi_run_queue(struct scsi_device *sdev, struct request_queue *q)
 {
-	struct scsi_device *sdev = q->queuedata;
+	struct Scsi_Host *shost = sdev->host;
 
 	if (scsi_target(sdev)->single_lun)
 		scsi_single_lun_run(sdev);
-	if (!list_empty(&sdev->host->starved_list))
-		scsi_starved_list_run(sdev->host);
 
-	if (q->mq_ops)
-		blk_mq_run_hw_queues(q, false);
-	else
-		blk_run_queue(q);
+	if (!list_empty(&shost->starved_list) || shost->run_admin_queue)
+		scsi_starved_list_run(shost);
+
+	scsi_kick_queue(q);
+
+	/* q may point to host->admin_q */
+	if (sdev->request_queue != q)
+		scsi_kick_queue(sdev->request_queue);
 }
 
 void scsi_requeue_run_queue(struct work_struct *work)
@@ -563,7 +575,7 @@ void scsi_requeue_run_queue(struct work_struct *work)
 
 	sdev = container_of(work, struct scsi_device, requeue_work);
 	q = sdev->request_queue;
-	scsi_run_queue(q);
+	scsi_run_queue(sdev, q);
 }
 
 /*
@@ -597,7 +609,7 @@ static void scsi_requeue_command(struct request_queue *q, struct scsi_cmnd *cmd)
 	blk_requeue_request(q, req);
 	spin_unlock_irqrestore(q->queue_lock, flags);
 
-	scsi_run_queue(q);
+	scsi_run_queue(sdev, q);
 
 	put_device(&sdev->sdev_gendev);
 }
@@ -607,7 +619,7 @@ void scsi_run_host_queues(struct Scsi_Host *shost)
 	struct scsi_device *sdev;
 
 	shost_for_each_device(sdev, shost)
-		scsi_run_queue(sdev->request_queue);
+		scsi_run_queue(sdev, sdev->request_queue);
 }
 
 static void scsi_uninit_cmd(struct scsi_cmnd *cmd)
@@ -715,8 +727,13 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
 
 		__blk_mq_end_request(req, error);
 
+		/*
+		 * the scsi_device is shared between host->admin_q and
+		 * sdev->request_queue
+		 */
 		if (scsi_target(sdev)->single_lun ||
-		    !list_empty(&sdev->host->starved_list))
+		    !list_empty(&sdev->host->starved_list) ||
+		    sdev->host->run_admin_queue || scsi_is_admin_queue(q))
 			kblockd_schedule_work(&sdev->requeue_work);
 		else
 			blk_mq_run_hw_queues(q, true);
@@ -732,7 +749,7 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
 		blk_finish_request(req, error);
 		spin_unlock_irqrestore(q->queue_lock, flags);
 
-		scsi_run_queue(q);
+		scsi_run_queue(sdev, q);
 	}
 
 	put_device(&sdev->sdev_gendev);
@@ -1544,6 +1561,12 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
 	return 1;
 out_dec:
 	atomic_dec(&sdev->device_busy);
+
+	if (unlikely(scsi_is_admin_queue(q))) {
+		spin_lock_irq(sdev->host->host_lock);
+		sdev->host->run_admin_queue = true;
+		spin_unlock_irq(sdev->host->host_lock);
+	}
 	return 0;
 }
 
@@ -1552,7 +1575,7 @@ static inline int scsi_dev_queue_ready(struct request_queue *q,
  * @sdev: scsi device on starget to check.
  */
 static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
-					   struct scsi_device *sdev)
+					   struct scsi_device *sdev, bool admin)
 {
 	struct scsi_target *starget = scsi_target(sdev);
 	unsigned int busy;
@@ -1594,6 +1617,8 @@ static inline int scsi_target_queue_ready(struct Scsi_Host *shost,
 starved:
 	spin_lock_irq(shost->host_lock);
 	list_move_tail(&sdev->starved_entry, &shost->starved_list);
+	if (admin)
+		shost->run_admin_queue = true;
 	spin_unlock_irq(shost->host_lock);
 out_dec:
 	if (starget->can_queue > 0)
@@ -1650,6 +1675,8 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
 	spin_lock_irq(shost->host_lock);
 	if (list_empty(&sdev->starved_entry))
 		list_add_tail(&sdev->starved_entry, &shost->starved_list);
+	if (scsi_is_admin_queue(q))
+		shost->run_admin_queue = true;
 	spin_unlock_irq(shost->host_lock);
 out_dec:
 	scsi_dec_host_busy(shost);
@@ -1949,7 +1976,7 @@ static void scsi_request_fn(struct request_queue *q)
 			goto not_ready;
 		}
 
-		if (!scsi_target_queue_ready(shost, sdev))
+		if (!scsi_target_queue_ready(shost, sdev, scsi_is_admin_queue(q)))
 			goto not_ready;
 
 		if (!scsi_host_queue_ready(q, shost, sdev))
@@ -2112,7 +2139,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 		goto out_put_budget;
 
 	ret = BLK_STS_RESOURCE;
-	if (!scsi_target_queue_ready(shost, sdev))
+	if (!scsi_target_queue_ready(shost, sdev, scsi_is_admin_queue(q)))
 		goto out_put_budget;
 	if (!scsi_host_queue_ready(q, shost, sdev))
 		goto out_dec_target_busy;
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 5ea06d310a25..4896e6003aeb 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -532,6 +532,8 @@ struct Scsi_Host {
 	struct list_head	__devices;
 	struct list_head	__targets;
 	
+	struct request_queue	*admin_q;
+	bool                    run_admin_queue;
 	struct list_head	starved_list;
 
 	spinlock_t		default_lock;
-- 
2.9.5


* [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (9 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 10/17] SCSI: deal with admin queue busy Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-14  3:33   ` jianchao.wang
  2018-09-13 12:15 ` [PATCH V3 12/17] SCSI: create admin queue for each host Ming Lei
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

First, we have to make sure that all pending admin commands to the same
scsi_device are completed before removing that scsi_device.

Second, scsi_internal_device_block() needs this guarantee too.

So introduce a waitqueue and an atomic counter for this purpose.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c    | 6 ++++++
 drivers/scsi/scsi_scan.c   | 1 +
 drivers/scsi/scsi_sysfs.c  | 1 +
 include/scsi/scsi_device.h | 4 ++++
 4 files changed, 12 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8d129b601cc5..4db08458a127 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -299,6 +299,8 @@ int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	req->cmd_flags |= flags;
 	req->rq_flags |= rq_flags | RQF_QUIET;
 
+	atomic_inc(&sdev->nr_admin_pending);
+
 	/*
 	 * head injection *required* here otherwise quiesce won't work
 	 */
@@ -323,6 +325,9 @@ int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
  out:
 	blk_put_request(req);
 
+	atomic_dec(&sdev->nr_admin_pending);
+	wake_up_all(&sdev->admin_wq);
+
 	return ret;
 }
 EXPORT_SYMBOL(__scsi_execute);
@@ -3246,6 +3251,7 @@ static int scsi_internal_device_block(struct scsi_device *sdev)
 		else
 			scsi_wait_for_queuecommand(sdev);
 	}
+	wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending));
 	mutex_unlock(&sdev->state_mutex);
 
 	return err;
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 78ca63dfba4a..b83ad0dc8890 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -243,6 +243,7 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	mutex_init(&sdev->inquiry_mutex);
 	INIT_WORK(&sdev->event_work, scsi_evt_thread);
 	INIT_WORK(&sdev->requeue_work, scsi_requeue_run_queue);
+	init_waitqueue_head(&sdev->admin_wq);
 
 	sdev->sdev_gendev.parent = get_device(&starget->dev);
 	sdev->sdev_target = starget;
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 3aee9464a7bf..8bcb7ecc0c06 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -1393,6 +1393,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
 
 	blk_cleanup_queue(sdev->request_queue);
 	cancel_work_sync(&sdev->requeue_work);
+	wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending));
 
 	if (sdev->host->hostt->slave_destroy)
 		sdev->host->hostt->slave_destroy(sdev);
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 202f4d6a4342..f6820da1dc37 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -227,6 +227,10 @@ struct scsi_device {
 	struct mutex		state_mutex;
 	enum scsi_device_state sdev_state;
 	struct task_struct	*quiesced_by;
+
+	atomic_t		nr_admin_pending;
+	wait_queue_head_t       admin_wq;
+
 	unsigned long		sdev_data[0];
 } __attribute__((aligned(sizeof(unsigned long))));
 
-- 
2.9.5


* [PATCH V3 12/17] SCSI: create admin queue for each host
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (10 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 11/17] SCSI: track pending admin commands Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 13/17] SCSI: use the dedicated admin queue to send admin commands Ming Lei
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

The created admin queue will be used to send internal admin commands, so
the synchronization between admin commands and IO requests can be
simplified; typical examples are system suspend and runtime PM.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/hosts.c     |   9 ++++
 drivers/scsi/scsi_lib.c  | 117 +++++++++++++++++++++++++++++++++++++++++------
 drivers/scsi/scsi_priv.h |   1 +
 3 files changed, 114 insertions(+), 13 deletions(-)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index ea4b0bb0c1cd..7c1f56c85475 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -242,6 +242,9 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 
 	shost->dma_dev = dma_dev;
 
+	if (!scsi_init_admin_queue(shost))
+		goto out_remove_tags;
+
 	/*
 	 * Increase usage count temporarily here so that calling
 	 * scsi_autopm_put_host() will trigger runtime idle if there is
@@ -309,6 +312,9 @@ int scsi_add_host_with_dma(struct Scsi_Host *shost, struct device *dev,
 	pm_runtime_disable(&shost->shost_gendev);
 	pm_runtime_set_suspended(&shost->shost_gendev);
 	pm_runtime_put_noidle(&shost->shost_gendev);
+	blk_cleanup_queue(shost->admin_q);
+	blk_put_queue(shost->admin_q);
+ out_remove_tags:
 	if (shost_use_blk_mq(shost))
 		scsi_mq_destroy_tags(shost);
  fail:
@@ -344,6 +350,9 @@ static void scsi_host_dev_release(struct device *dev)
 		kfree(dev_name(&shost->shost_dev));
 	}
 
+	blk_cleanup_queue(shost->admin_q);
+	blk_put_queue(shost->admin_q);
+
 	if (shost_use_blk_mq(shost)) {
 		if (shost->tag_set.tags)
 			scsi_mq_destroy_tags(shost);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 4db08458a127..87a88094b1eb 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2099,19 +2099,22 @@ static void scsi_mq_done(struct scsi_cmnd *cmd)
 	blk_mq_complete_request(cmd->request);
 }
 
-static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
+static void __scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx,
+		struct scsi_device *sdev)
 {
-	struct request_queue *q = hctx->queue;
-	struct scsi_device *sdev = q->queuedata;
-
 	atomic_dec(&sdev->device_busy);
 	put_device(&sdev->sdev_gendev);
 }
 
-static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
+static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx)
+{
+	__scsi_mq_put_budget(hctx, hctx->queue->queuedata);
+}
+
+static bool __scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx,
+		struct scsi_device *sdev)
 {
 	struct request_queue *q = hctx->queue;
-	struct scsi_device *sdev = q->queuedata;
 
 	if (!get_device(&sdev->sdev_gendev))
 		goto out;
@@ -2128,12 +2131,17 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
 	return false;
 }
 
-static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
-			 const struct blk_mq_queue_data *bd)
+static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
+{
+	return __scsi_mq_get_budget(hctx, hctx->queue->queuedata);
+}
+
+static blk_status_t __scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
+			 const struct blk_mq_queue_data *bd,
+			 struct scsi_device *sdev)
 {
 	struct request *req = bd->rq;
 	struct request_queue *q = req->q;
-	struct scsi_device *sdev = q->queuedata;
 	struct Scsi_Host *shost = sdev->host;
 	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
 	blk_status_t ret;
@@ -2181,7 +2189,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (scsi_target(sdev)->can_queue > 0)
 		atomic_dec(&scsi_target(sdev)->target_busy);
 out_put_budget:
-	scsi_mq_put_budget(hctx);
+	__scsi_mq_put_budget(hctx, sdev);
 	switch (ret) {
 	case BLK_STS_OK:
 		break;
@@ -2203,6 +2211,29 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
+static blk_status_t scsi_admin_queue_rq(struct blk_mq_hw_ctx *hctx,
+			 const struct blk_mq_queue_data *bd)
+{
+	struct scsi_device *sdev = scsi_req(bd->rq)->sdev;
+
+	WARN_ON_ONCE(hctx->queue == sdev->request_queue);
+
+	if (!__scsi_mq_get_budget(hctx, sdev))
+		return BLK_STS_RESOURCE;
+
+	return __scsi_queue_rq(hctx, bd, sdev);
+}
+
+static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
+			 const struct blk_mq_queue_data *bd)
+{
+	struct scsi_device *sdev = hctx->queue->queuedata;
+
+	WARN_ON_ONCE(hctx->queue == sdev->host->admin_q);
+
+	return __scsi_queue_rq(hctx, bd, hctx->queue->queuedata);
+}
+
 static enum blk_eh_timer_return scsi_timeout(struct request *req,
 		bool reserved)
 {
@@ -2335,9 +2366,9 @@ static void scsi_old_exit_rq(struct request_queue *q, struct request *rq)
 			       cmd->sense_buffer);
 }
 
-struct request_queue *scsi_old_alloc_queue(struct scsi_device *sdev)
+static struct request_queue *__scsi_old_alloc_queue(struct Scsi_Host *shost,
+		bool admin)
 {
-	struct Scsi_Host *shost = sdev->host;
 	struct request_queue *q;
 
 	q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE, NULL);
@@ -2356,7 +2387,9 @@ struct request_queue *scsi_old_alloc_queue(struct scsi_device *sdev)
 	}
 
 	__scsi_init_queue(shost, q);
-	blk_queue_flag_set(QUEUE_FLAG_SCSI_PASSTHROUGH, q);
+
+	if (!admin)
+		blk_queue_flag_set(QUEUE_FLAG_SCSI_PASSTHROUGH, q);
 	blk_queue_prep_rq(q, scsi_prep_fn);
 	blk_queue_unprep_rq(q, scsi_unprep_fn);
 	blk_queue_softirq_done(q, scsi_softirq_done);
@@ -2365,6 +2398,23 @@ struct request_queue *scsi_old_alloc_queue(struct scsi_device *sdev)
 	return q;
 }
 
+struct request_queue *scsi_old_alloc_queue(struct scsi_device *sdev)
+{
+	return __scsi_old_alloc_queue(sdev->host, false);
+}
+
+static struct request_queue *scsi_old_alloc_admin_queue(struct Scsi_Host *shost)
+{
+	struct request_queue *q = __scsi_old_alloc_queue(shost, true);
+
+	if (!q)
+		return NULL;
+
+	blk_queue_init_tags(q, shost->cmd_per_lun, shost->bqt,
+			    shost->hostt->tag_alloc_policy);
+	return q;
+}
+
 static const struct blk_mq_ops scsi_mq_ops = {
 	.get_budget	= scsi_mq_get_budget,
 	.put_budget	= scsi_mq_put_budget,
@@ -2380,6 +2430,16 @@ static const struct blk_mq_ops scsi_mq_ops = {
 	.map_queues	= scsi_map_queues,
 };
 
+static const struct blk_mq_ops scsi_mq_admin_ops = {
+	.queue_rq	= scsi_admin_queue_rq,
+	.complete	= scsi_softirq_done,
+	.timeout	= scsi_timeout,
+	.init_request	= scsi_mq_init_request,
+	.exit_request	= scsi_mq_exit_request,
+	.initialize_rq_fn = scsi_initialize_rq,
+	.map_queues	= scsi_map_queues,
+};
+
 struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev)
 {
 	sdev->request_queue = blk_mq_init_queue(&sdev->host->tag_set);
@@ -2391,6 +2451,37 @@ struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev)
 	return sdev->request_queue;
 }
 
+static struct request_queue *scsi_mq_alloc_admin_queue(struct Scsi_Host *shost)
+{
+	struct request_queue *q = __blk_mq_init_queue(&shost->tag_set,
+			QUEUE_FLAG_MQ_ADMIN_DEFAULT);
+	if (IS_ERR(q))
+		return NULL;
+
+	q->mq_ops = &scsi_mq_admin_ops;
+
+	__scsi_init_queue(shost, q);
+
+	return q;
+}
+
+struct request_queue *scsi_init_admin_queue(struct Scsi_Host *shost)
+{
+	struct request_queue *q;
+
+	if (shost_use_blk_mq(shost))
+		q = scsi_mq_alloc_admin_queue(shost);
+	else
+		q = scsi_old_alloc_admin_queue(shost);
+
+	if (!q)
+		return NULL;
+
+	WARN_ON(!blk_get_queue(q));
+	shost->admin_q = q;
+	return q;
+}
+
 int scsi_mq_setup_tags(struct Scsi_Host *shost)
 {
 	unsigned int cmd_size, sgl_size;
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index 99f1db5e467e..0553acbc3f65 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -94,6 +94,7 @@ extern void scsi_run_host_queues(struct Scsi_Host *shost);
 extern void scsi_requeue_run_queue(struct work_struct *work);
 extern struct request_queue *scsi_old_alloc_queue(struct scsi_device *sdev);
 extern struct request_queue *scsi_mq_alloc_queue(struct scsi_device *sdev);
+extern struct request_queue *scsi_init_admin_queue(struct Scsi_Host *shost);
 extern void scsi_start_queue(struct scsi_device *sdev);
 extern int scsi_mq_setup_tags(struct Scsi_Host *shost);
 extern void scsi_mq_destroy_tags(struct Scsi_Host *shost);
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH V3 13/17] SCSI: use the dedicated admin queue to send admin commands
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (11 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 12/17] SCSI: create admin queue for each host Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 14/17] SCSI: transport_spi: resume a quiesced device Ming Lei
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

Now that the per-host dedicated admin queue is ready, use it to send
admin commands instead of the per-device IO queue.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 87a88094b1eb..1e75515cc7ba 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -279,14 +279,14 @@ int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	struct scsi_request *rq;
 	int ret = DRIVER_ERROR << 24;
 
-	req = blk_get_request(sdev->request_queue,
+	req = blk_get_request(sdev->host->admin_q,
 			data_direction == DMA_TO_DEVICE ?
 			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
 
-	if (bufflen &&	blk_rq_map_kern(sdev->request_queue, req,
+	if (bufflen &&	blk_rq_map_kern(req->q, req,
 					buffer, bufflen, GFP_NOIO))
 		goto out;
 
-- 
2.9.5


* [PATCH V3 14/17] SCSI: transport_spi: resume a quiesced device
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (12 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 13/17] SCSI: use the dedicated admin queue to send admin commands Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 15/17] SCSI: use admin queue to implement queue QUIESCE Ming Lei
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

scsi_device_quiesce() now freezes the queue and scsi_device_resume()
unfreezes it, so the two calls must be balanced: call
scsi_device_resume() for the device that was quiesced by
scsi_device_quiesce().

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_transport_spi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/scsi_transport_spi.c b/drivers/scsi/scsi_transport_spi.c
index 40b85b752b79..e4174e8137a8 100644
--- a/drivers/scsi/scsi_transport_spi.c
+++ b/drivers/scsi/scsi_transport_spi.c
@@ -1052,6 +1052,9 @@ spi_dv_device(struct scsi_device *sdev)
 
 	scsi_target_resume(starget);
 
+	/* undo what scsi_device_quiesce() did */
+	scsi_device_resume(sdev);
+
 	spi_initial_dv(starget) = 1;
 
  out_free:
-- 
2.9.5


* [PATCH V3 15/17] SCSI: use admin queue to implement queue QUIESCE
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (13 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 14/17] SCSI: transport_spi: resume a quiesced device Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 16/17] block: simplify runtime PM support Ming Lei
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

All admin commands are now sent via the per-host admin queue, so we can
quiesce a SCSI device simply by freezing its IO queue.

The current SCSI stack also guarantees that any request originating
from the admin queue is never completed back to the block layer via the
associated IO queue; it is always handled by the admin queue itself.

So it is safe to submit admin requests via the admin queue while the
associated IO queue is frozen, and this matches the semantics of the
PREEMPT flag perfectly.

Finally, we can remove the preempt_only approach for supporting SCSI
quiesce, which simplifies the block-layer fast path a lot.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c           | 44 ++------------------------------------------
 block/blk-mq-debugfs.c     |  1 -
 drivers/scsi/scsi_lib.c    | 29 +++--------------------------
 include/linux/blkdev.h     |  6 ------
 include/scsi/scsi_device.h |  1 -
 5 files changed, 5 insertions(+), 76 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 4dbc93f43b38..f51c19b381e3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -421,26 +421,6 @@ void blk_sync_queue(struct request_queue *q)
 EXPORT_SYMBOL(blk_sync_queue);
 
 /**
- * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY
- * @q: request queue pointer
- *
- * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
- * set and 1 if the flag was already set.
- */
-int blk_set_preempt_only(struct request_queue *q)
-{
-	return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
-}
-EXPORT_SYMBOL_GPL(blk_set_preempt_only);
-
-void blk_clear_preempt_only(struct request_queue *q)
-{
-	blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
-	wake_up_all(&q->mq_freeze_wq);
-}
-EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
-
-/**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
  * @q:	The queue to run
  *
@@ -917,27 +897,8 @@ EXPORT_SYMBOL(blk_alloc_queue);
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-	const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
-
 	while (true) {
-		bool success = false;
-
-		rcu_read_lock();
-		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
-			/*
-			 * The code that sets the PREEMPT_ONLY flag is
-			 * responsible for ensuring that that flag is globally
-			 * visible before the queue is unfrozen.
-			 */
-			if (preempt || !blk_queue_preempt_only(q)) {
-				success = true;
-			} else {
-				percpu_ref_put(&q->q_usage_counter);
-			}
-		}
-		rcu_read_unlock();
-
-		if (success)
+		if (percpu_ref_tryget_live(&q->q_usage_counter))
 			return 0;
 
 		if (flags & BLK_MQ_REQ_NOWAIT)
@@ -953,8 +914,7 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		smp_rmb();
 
 		wait_event(q->mq_freeze_wq,
-			   (atomic_read(&q->mq_freeze_depth) == 0 &&
-			    (preempt || !blk_queue_preempt_only(q))) ||
+			   atomic_read(&q->mq_freeze_depth) == 0 ||
 			   blk_queue_dying(q));
 		if (blk_queue_dying(q))
 			return -ENODEV;
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 8df013e9f242..82df43ec322f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -132,7 +132,6 @@ static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(REGISTERED),
 	QUEUE_FLAG_NAME(SCSI_PASSTHROUGH),
 	QUEUE_FLAG_NAME(QUIESCED),
-	QUEUE_FLAG_NAME(PREEMPT_ONLY),
 	QUEUE_FLAG_NAME(ADMIN),
 };
 #undef QUEUE_FLAG_NAME
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 1e75515cc7ba..85fed8d96c8a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -3180,34 +3180,12 @@ static void scsi_wait_for_queuecommand(struct scsi_device *sdev)
 int
 scsi_device_quiesce(struct scsi_device *sdev)
 {
-	struct request_queue *q = sdev->request_queue;
 	int err;
 
-	/*
-	 * It is allowed to call scsi_device_quiesce() multiple times from
-	 * the same context but concurrent scsi_device_quiesce() calls are
-	 * not allowed.
-	 */
-	WARN_ON_ONCE(sdev->quiesced_by && sdev->quiesced_by != current);
-
-	blk_set_preempt_only(q);
-
-	blk_mq_freeze_queue(q);
-	/*
-	 * Ensure that the effect of blk_set_preempt_only() will be visible
-	 * for percpu_ref_tryget() callers that occur after the queue
-	 * unfreeze even if the queue was already frozen before this function
-	 * was called. See also https://lwn.net/Articles/573497/.
-	 */
-	synchronize_rcu();
-	blk_mq_unfreeze_queue(q);
+	blk_mq_freeze_queue(sdev->request_queue);
 
 	mutex_lock(&sdev->state_mutex);
 	err = scsi_device_set_state(sdev, SDEV_QUIESCE);
-	if (err == 0)
-		sdev->quiesced_by = current;
-	else
-		blk_clear_preempt_only(q);
 	mutex_unlock(&sdev->state_mutex);
 
 	return err;
@@ -3230,12 +3208,11 @@ void scsi_device_resume(struct scsi_device *sdev)
 	 * device deleted during suspend)
 	 */
 	mutex_lock(&sdev->state_mutex);
-	WARN_ON_ONCE(!sdev->quiesced_by);
-	sdev->quiesced_by = NULL;
-	blk_clear_preempt_only(sdev->request_queue);
 	if (sdev->sdev_state == SDEV_QUIESCE)
 		scsi_device_set_state(sdev, SDEV_RUNNING);
 	mutex_unlock(&sdev->state_mutex);
+
+	blk_mq_unfreeze_queue(sdev->request_queue);
 }
 EXPORT_SYMBOL(scsi_device_resume);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2dbc7524a169..1bd4f02d11c0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -698,7 +698,6 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED  26	/* queue has been registered to a disk */
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED    28	/* queue has been quiesced */
-#define QUEUE_FLAG_PREEMPT_ONLY	29	/* only process REQ_PREEMPT requests */
 #define QUEUE_FLAG_ADMIN	30	/* admin queue */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
@@ -740,14 +739,9 @@ bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q);
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
 			     REQ_FAILFAST_DRIVER))
 #define blk_queue_quiesced(q)	test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
-#define blk_queue_preempt_only(q)				\
-	test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
 #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
 #define blk_queue_admin(q)	test_bit(QUEUE_FLAG_ADMIN, &(q)->queue_flags)
 
-extern int blk_set_preempt_only(struct request_queue *q);
-extern void blk_clear_preempt_only(struct request_queue *q);
-
 static inline int queue_in_flight(struct request_queue *q)
 {
 	return q->in_flight[0] + q->in_flight[1];
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index f6820da1dc37..666b58799cec 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -226,7 +226,6 @@ struct scsi_device {
 	unsigned char		access_state;
 	struct mutex		state_mutex;
 	enum scsi_device_state sdev_state;
-	struct task_struct	*quiesced_by;
 
 	atomic_t		nr_admin_pending;
 	wait_queue_head_t       admin_wq;
-- 
2.9.5


* [PATCH V3 16/17] block: simplify runtime PM support
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (14 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 15/17] SCSI: use admin queue to implement queue QUIESCE Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-13 12:15 ` [PATCH V3 17/17] block: enable runtime PM for blk-mq Ming Lei
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

This patch simplifies runtime PM support via the following approach:

1) resume the device in blk_queue_enter() if it is runtime-suspended
or runtime-suspending

2) freeze the queue in blk_pre_runtime_suspend()

3) unfreeze the queue in blk_post_runtime_resume()

4) remove the check on RQF_PM because we now require an out-of-band PM
request to resume the device

5) introduce blk_unfreeze_queue_lock() and blk_freeze_queue_lock()
so that both runtime PM and system PM can use them to freeze/unfreeze
the queue without a freeze & unfreeze mismatch

Then we can remove blk_pm_allow_request(), and more importantly this
approach can be applied to the blk-mq path too.

Finally, the IO queue associated with the scsi_device is kept
runtime-resumed in __scsi_execute() when sending non-PM requests, which
makes sure that the LUN is active for handling non-PM RQF_PREEMPT
requests.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c        | 106 ++++++++++++++++++++++++++++--------------------
 block/blk-mq.c          |  22 ++++++++++
 block/elevator.c        |  25 ------------
 drivers/scsi/scsi_lib.c |  14 +++++--
 include/linux/blk-mq.h  |   2 +
 include/linux/blkdev.h  |   3 ++
 6 files changed, 101 insertions(+), 71 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f51c19b381e3..07c5243c51ec 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -890,6 +890,28 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
+#ifdef CONFIG_PM
+static void blk_resume_queue(struct request_queue *q)
+{
+	int rpm_status;
+
+	if (!q->dev)
+		return;
+
+	spin_lock_irq(q->queue_lock);
+	rpm_status = q->rpm_status;
+	spin_unlock_irq(q->queue_lock);
+
+	/* PM request needs to be dealt with out of band */
+	if (rpm_status == RPM_SUSPENDED || rpm_status == RPM_SUSPENDING)
+		pm_runtime_resume(q->dev);
+}
+#else
+static void blk_resume_queue(struct request_queue *q)
+{
+}
+#endif
+
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
@@ -913,11 +935,20 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		 */
 		smp_rmb();
 
+		blk_resume_queue(q);
+
 		wait_event(q->mq_freeze_wq,
 			   atomic_read(&q->mq_freeze_depth) == 0 ||
 			   blk_queue_dying(q));
 		if (blk_queue_dying(q))
 			return -ENODEV;
+
+		/*
+		 * This allocation may be blocked via queue freezing before
+		 * the queue is suspended, so we have to resume queue again
+		 * after waking up.
+		 */
+		blk_resume_queue(q);
 	}
 }
 
@@ -1023,6 +1054,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
 	q->bypass_depth = 1;
 	queue_flag_set_unlocked(QUEUE_FLAG_BYPASS, q);
 
+	mutex_init(&q->freeze_lock);
 	init_waitqueue_head(&q->mq_freeze_wq);
 
 	/*
@@ -1470,6 +1502,23 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 	return ERR_PTR(-ENOMEM);
 }
 
+#ifdef CONFIG_PM
+static void blk_pm_add_request(struct request_queue *q)
+{
+	if (q->dev)
+		q->nr_pending++;
+}
+static void blk_pm_put_request(struct request_queue *q)
+{
+	if (q->dev && !--q->nr_pending)
+		pm_runtime_mark_last_busy(q->dev);
+}
+#else
+static inline void blk_pm_put_request(struct request_queue *q) {}
+static inline void blk_pm_add_request(struct request_queue *q){}
+#endif
+
+
 /**
  * get_request - get a free request
  * @q: request_queue to allocate request from
@@ -1498,16 +1547,19 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 
 	rl = blk_get_rl(q, bio);	/* transferred to @rq on success */
 retry:
+	blk_pm_add_request(q);
 	rq = __get_request(rl, op, bio, flags, gfp);
 	if (!IS_ERR(rq))
 		return rq;
 
 	if (op & REQ_NOWAIT) {
+		blk_pm_put_request(q);
 		blk_put_rl(rl);
 		return ERR_PTR(-EAGAIN);
 	}
 
 	if ((flags & BLK_MQ_REQ_NOWAIT) || unlikely(blk_queue_dying(q))) {
+		blk_pm_put_request(q);
 		blk_put_rl(rl);
 		return rq;
 	}
@@ -1518,6 +1570,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 
 	trace_block_sleeprq(q, bio, op);
 
+	blk_pm_put_request(q);
 	spin_unlock_irq(q->queue_lock);
 	io_schedule();
 
@@ -1686,16 +1739,6 @@ void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part)
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
-#ifdef CONFIG_PM
-static void blk_pm_put_request(struct request *rq)
-{
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
-		pm_runtime_mark_last_busy(rq->q->dev);
-}
-#else
-static inline void blk_pm_put_request(struct request *rq) {}
-#endif
-
 void __blk_put_request(struct request_queue *q, struct request *req)
 {
 	req_flags_t rq_flags = req->rq_flags;
@@ -1711,7 +1754,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 	lockdep_assert_held(q->queue_lock);
 
 	blk_req_zone_write_unlock(req);
-	blk_pm_put_request(req);
+	blk_pm_put_request(q);
 
 	elv_completed_request(q, req);
 
@@ -2712,30 +2755,6 @@ void blk_account_io_done(struct request *req, u64 now)
 	}
 }
 
-#ifdef CONFIG_PM
-/*
- * Don't process normal requests when queue is suspended
- * or in the process of suspending/resuming
- */
-static bool blk_pm_allow_request(struct request *rq)
-{
-	switch (rq->q->rpm_status) {
-	case RPM_RESUMING:
-	case RPM_SUSPENDING:
-		return rq->rq_flags & RQF_PM;
-	case RPM_SUSPENDED:
-		return false;
-	default:
-		return true;
-	}
-}
-#else
-static bool blk_pm_allow_request(struct request *rq)
-{
-	return true;
-}
-#endif
-
 void blk_account_io_start(struct request *rq, bool new_io)
 {
 	struct hd_struct *part;
@@ -2780,13 +2799,8 @@ static struct request *elv_next_request(struct request_queue *q)
 	WARN_ON_ONCE(q->mq_ops);
 
 	while (1) {
-		list_for_each_entry(rq, &q->queue_head, queuelist) {
-			if (blk_pm_allow_request(rq))
-				return rq;
-
-			if (rq->rq_flags & RQF_SOFTBARRIER)
-				break;
-		}
+		list_for_each_entry(rq, &q->queue_head, queuelist)
+			return rq;
 
 		/*
 		 * Flush request is running and flush request isn't queueable
@@ -3790,6 +3804,10 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 		q->rpm_status = RPM_SUSPENDING;
 	}
 	spin_unlock_irq(q->queue_lock);
+
+	if (!ret)
+		blk_freeze_queue_lock(q);
+
 	return ret;
 }
 EXPORT_SYMBOL(blk_pre_runtime_suspend);
@@ -3867,13 +3885,15 @@ void blk_post_runtime_resume(struct request_queue *q, int err)
 	spin_lock_irq(q->queue_lock);
 	if (!err) {
 		q->rpm_status = RPM_ACTIVE;
-		__blk_run_queue(q);
 		pm_runtime_mark_last_busy(q->dev);
 		pm_request_autosuspend(q->dev);
 	} else {
 		q->rpm_status = RPM_SUSPENDED;
 	}
 	spin_unlock_irq(q->queue_lock);
+
+	if (!err)
+		blk_unfreeze_queue_lock(q);
 }
 EXPORT_SYMBOL(blk_post_runtime_resume);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7868daaf6de0..c119336d0561 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -204,6 +204,28 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
+void blk_unfreeze_queue_lock(struct request_queue *q)
+{
+	mutex_lock(&q->freeze_lock);
+	if (q->q_frozen) {
+		blk_mq_unfreeze_queue(q);
+		q->q_frozen = false;
+	}
+	mutex_unlock(&q->freeze_lock);
+}
+EXPORT_SYMBOL(blk_unfreeze_queue_lock);
+
+void blk_freeze_queue_lock(struct request_queue *q)
+{
+	mutex_lock(&q->freeze_lock);
+	if (!q->q_frozen) {
+		blk_mq_freeze_queue(q);
+		q->q_frozen = true;
+	}
+	mutex_unlock(&q->freeze_lock);
+}
+EXPORT_SYMBOL(blk_freeze_queue_lock);
+
 /*
  * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
  * mpt3sas driver such that this function can be removed.
diff --git a/block/elevator.c b/block/elevator.c
index d6abba76c89e..2e4d347ad726 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -557,27 +557,6 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
 		e->type->ops.sq.elevator_bio_merged_fn(q, rq, bio);
 }
 
-#ifdef CONFIG_PM
-static void blk_pm_requeue_request(struct request *rq)
-{
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM))
-		rq->q->nr_pending--;
-}
-
-static void blk_pm_add_request(struct request_queue *q, struct request *rq)
-{
-	if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
-	    (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
-		pm_request_resume(q->dev);
-}
-#else
-static inline void blk_pm_requeue_request(struct request *rq) {}
-static inline void blk_pm_add_request(struct request_queue *q,
-				      struct request *rq)
-{
-}
-#endif
-
 void elv_requeue_request(struct request_queue *q, struct request *rq)
 {
 	/*
@@ -592,8 +571,6 @@ void elv_requeue_request(struct request_queue *q, struct request *rq)
 
 	rq->rq_flags &= ~RQF_STARTED;
 
-	blk_pm_requeue_request(rq);
-
 	__elv_add_request(q, rq, ELEVATOR_INSERT_REQUEUE);
 }
 
@@ -620,8 +597,6 @@ void __elv_add_request(struct request_queue *q, struct request *rq, int where)
 {
 	trace_block_rq_insert(q, rq);
 
-	blk_pm_add_request(q, rq);
-
 	rq->q = q;
 
 	if (rq->rq_flags & RQF_SOFTBARRIER) {
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 85fed8d96c8a..84d1c450b858 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -278,12 +278,17 @@ int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	struct request *req;
 	struct scsi_request *rq;
 	int ret = DRIVER_ERROR << 24;
+	bool pm_rq = rq_flags & RQF_PM;
+
+	if (!pm_rq)
+		scsi_autopm_get_device(sdev);
 
 	req = blk_get_request(sdev->host->admin_q,
 			data_direction == DMA_TO_DEVICE ?
 			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
 	if (IS_ERR(req))
-		return ret;
+		goto fail;
+
 	rq = scsi_req(req);
 
 	if (bufflen &&	blk_rq_map_kern(req->q, req,
@@ -327,6 +332,9 @@ int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 
 	atomic_dec(&sdev->nr_admin_pending);
 	wake_up_all(&sdev->admin_wq);
+ fail:
+	if (!pm_rq)
+		scsi_autopm_put_device(sdev);
 
 	return ret;
 }
@@ -3182,7 +3190,7 @@ scsi_device_quiesce(struct scsi_device *sdev)
 {
 	int err;
 
-	blk_mq_freeze_queue(sdev->request_queue);
+	blk_freeze_queue_lock(sdev->request_queue);
 
 	mutex_lock(&sdev->state_mutex);
 	err = scsi_device_set_state(sdev, SDEV_QUIESCE);
@@ -3212,7 +3220,7 @@ void scsi_device_resume(struct scsi_device *sdev)
 		scsi_device_set_state(sdev, SDEV_RUNNING);
 	mutex_unlock(&sdev->state_mutex);
 
-	blk_mq_unfreeze_queue(sdev->request_queue);
+	blk_unfreeze_queue_lock(sdev->request_queue);
 }
 EXPORT_SYMBOL(scsi_device_resume);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index afde18ac5b31..00970a0b4b06 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -295,6 +295,8 @@ void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
+void blk_freeze_queue_lock(struct request_queue *q);
+void blk_unfreeze_queue_lock(struct request_queue *q);
 
 int blk_mq_map_queues(struct blk_mq_tag_set *set);
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1bd4f02d11c0..4b2abdccec1f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -635,6 +635,9 @@ struct request_queue {
 	int			bypass_depth;
 	atomic_t		mq_freeze_depth;
 
+	bool			q_frozen;
+	struct mutex		freeze_lock;
+
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
 	struct bsg_class_device bsg_dev;
-- 
2.9.5


* [PATCH V3 17/17] block: enable runtime PM for blk-mq
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (15 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 16/17] block: simplify runtime PM support Ming Lei
@ 2018-09-13 12:15 ` Ming Lei
  2018-09-14 10:33   ` kbuild test robot
  2018-09-15  1:13   ` kbuild test robot
  2018-09-14  7:27 ` [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM jianchao.wang
  2018-09-17  6:34 ` Hannes Reinecke
  18 siblings, 2 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-13 12:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Jianchao Wang, Hannes Reinecke,
	Johannes Thumshirn, Adrian Hunter, James E.J. Bottomley,
	Martin K. Petersen, linux-scsi

Now blk-mq can borrow the runtime PM approach from the legacy path, so
simply enable it. The only differences from legacy are:

1) blk_mq_queue_sched_tag_busy_iter() is introduced for checking
whether the queue is idle, instead of maintaining a counter.

2) we have to iterate over scheduler tags to count how many requests
have entered the queue, because the hw tags don't cover requests that
are allocated but not yet dispatched.

Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "James E.J. Bottomley" <jejb@linux.vnet.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c   | 39 ++++++++++++++++++++++++++++++++++-----
 block/blk-mq-tag.c | 21 +++++++++++++++++++--
 block/blk-mq-tag.h |  2 ++
 block/blk-mq.c     |  4 ++++
 4 files changed, 59 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 07c5243c51ec..1aac76ae8c52 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3755,11 +3755,8 @@ EXPORT_SYMBOL(blk_finish_plug);
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-	/* Don't enable runtime PM for blk-mq until it is ready */
-	if (q->mq_ops) {
-		pm_runtime_disable(dev);
+	if (WARN_ON_ONCE(blk_queue_admin(q)))
 		return;
-	}
 
 	q->dev = dev;
 	q->rpm_status = RPM_ACTIVE;
@@ -3768,6 +3765,23 @@ void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 }
 EXPORT_SYMBOL(blk_pm_runtime_init);
 
+static void blk_mq_pm_count_req(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	unsigned long *cnt = priv;
+
+	(*cnt)++;
+}
+
+static bool blk_mq_pm_queue_busy(struct request_queue *q)
+{
+	unsigned long cnt = 0;
+
+	blk_mq_queue_sched_tag_busy_iter(q, blk_mq_pm_count_req, &cnt);
+
+	return cnt > 0;
+}
+
 /**
  * blk_pre_runtime_suspend - Pre runtime suspend check
  * @q: the queue of the device
@@ -3792,12 +3806,20 @@ EXPORT_SYMBOL(blk_pm_runtime_init);
 int blk_pre_runtime_suspend(struct request_queue *q)
 {
 	int ret = 0;
+	bool busy = true;
+	unsigned long last_busy;
 
 	if (!q->dev)
 		return ret;
 
+	last_busy = READ_ONCE(q->dev->power.last_busy);
+
+	if (q->mq_ops)
+		busy = blk_mq_pm_queue_busy(q);
+
 	spin_lock_irq(q->queue_lock);
-	if (q->nr_pending) {
+	busy = q->mq_ops ? busy : !!q->nr_pending;
+	if (busy) {
 		ret = -EBUSY;
 		pm_runtime_mark_last_busy(q->dev);
 	} else {
@@ -3805,6 +3827,13 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	}
 	spin_unlock_irq(q->queue_lock);
 
+	/*
+	 * Any new IO during this window will prevent the current suspend
+	 * from going on
+	 */
+	if (unlikely(last_busy != READ_ONCE(q->dev->power.last_busy)))
+		ret = -EBUSY;
+
 	if (!ret)
 		blk_freeze_queue_lock(q);
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7b0390f1c764..70e76dd035c1 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -322,8 +322,8 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
-void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
-		void *priv)
+static void __blk_mq_queue_tag_busy_iter(struct request_queue *q,
+		busy_iter_fn *fn, void *priv, bool sched_tag)
 {
 	struct blk_mq_hw_ctx *hctx;
 	int i;
@@ -344,6 +344,9 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	queue_for_each_hw_ctx(q, hctx, i) {
 		struct blk_mq_tags *tags = hctx->tags;
 
+		if (sched_tag && hctx->sched_tags)
+			tags = hctx->sched_tags;
+
 		/*
 		 * If not software queues are currently mapped to this
 		 * hardware queue, there's nothing to check
@@ -358,6 +361,20 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	rcu_read_unlock();
 }
 
+void blk_mq_queue_tag_busy_iter(struct request_queue *q,
+		busy_iter_fn *fn, void *priv)
+{
+
+	__blk_mq_queue_tag_busy_iter(q, fn, priv, false);
+}
+
+void blk_mq_queue_sched_tag_busy_iter(struct request_queue *q,
+		busy_iter_fn *fn, void *priv)
+{
+
+	__blk_mq_queue_tag_busy_iter(q, fn, priv, true);
+}
+
 static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
 		    bool round_robin, int node)
 {
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..5513c3eeab00 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -35,6 +35,8 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
+void blk_mq_queue_sched_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
+		void *priv);
 
 static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
 						 struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c119336d0561..1f6bc927804e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -25,6 +25,7 @@
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
 #include <linux/prefetch.h>
+#include <linux/pm_runtime.h>
 
 #include <trace/events/block.h>
 
@@ -503,6 +504,9 @@ static void __blk_mq_free_request(struct request *rq)
 		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
+
+	if (q->dev)
+		pm_runtime_mark_last_busy(q->dev);
 }
 
 void blk_mq_free_request(struct request *rq)
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-13 12:15 ` [PATCH V3 11/17] SCSI: track pending admin commands Ming Lei
@ 2018-09-14  3:33   ` jianchao.wang
  2018-09-14 11:33     ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: jianchao.wang @ 2018-09-14  3:33 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe
  Cc: linux-block, Alan Stern, Christoph Hellwig, Bart Van Assche,
	Hannes Reinecke, Johannes Thumshirn, Adrian Hunter,
	James E.J. Bottomley, Martin K. Petersen, linux-scsi

Hi Ming

On 09/13/2018 08:15 PM, Ming Lei wrote:
>  EXPORT_SYMBOL(__scsi_execute);
> @@ -3246,6 +3251,7 @@ static int scsi_internal_device_block(struct scsi_device *sdev)
>  		else
>  			scsi_wait_for_queuecommand(sdev);
>  	}
> +	wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending));
>  	mutex_unlock(&sdev->state_mutex);
>  
>  	return err;
...
> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> index 3aee9464a7bf..8bcb7ecc0c06 100644
> --- a/drivers/scsi/scsi_sysfs.c
> +++ b/drivers/scsi/scsi_sysfs.c
> @@ -1393,6 +1393,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
>  
>  	blk_cleanup_queue(sdev->request_queue);
>  	cancel_work_sync(&sdev->requeue_work);
> +	wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))

Waiting on nr_admin_pending can drain the ongoing scsi_request_fn/scsi_queue_rq
calls, but I'm afraid it cannot stop new ones coming in, such as the ones that
have passed the sdev state check but have not yet executed the
atomic_inc(&sdev->nr_admin_pending).



Thanks
Jianchao

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (16 preceding siblings ...)
  2018-09-13 12:15 ` [PATCH V3 17/17] block: enable runtime PM for blk-mq Ming Lei
@ 2018-09-14  7:27 ` jianchao.wang
  2018-09-16 13:09   ` Ming Lei
  2018-09-17  6:34 ` Hannes Reinecke
  18 siblings, 1 reply; 41+ messages in thread
From: jianchao.wang @ 2018-09-14  7:27 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block

Hi Ming

On 09/13/2018 08:15 PM, Ming Lei wrote:
> This patchset introduces per-host admin request queue for submitting
> admin request only, and uses this approach to implement both SCSI
> quiesce and runtime PM in one very simply way. Also runtime PM deadlock
> can be avoided in case that request pool is used up, such as when too
> many IO requests are allocated before resuming device

To be honest, after looking into the SCSI part of this patchset, I really
doubt whether it is worth introducing this per-scsi-host adminq.
Too much complexity is introduced into SCSI. Compared with this, the current
preempt-only feature looks much simpler.

If you just don't like the preempt-only check in the hot path, we could
introduce a new interface similar to blk_queue_enter for the non-hot path.

With your patch percpu-refcount: relax limit on percpu_ref_reinit()

(totally no test)
diff --git a/block/blk-core.c b/block/blk-core.c
index 4dbc93f..dd7f007 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -427,14 +427,24 @@ EXPORT_SYMBOL(blk_sync_queue);
  * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
  * set and 1 if the flag was already set.
  */
-int blk_set_preempt_only(struct request_queue *q)
+int blk_set_preempt_only(struct request_queue *q, bool drain)
 {
-    return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+    if (blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q))
+        return 1;
+    percpu_ref_kill(&q->q_usage_counter);
+    synchronize_sched();
+
+    if (drain)
+        wait_event(q->mq_freeze_wq,
+                percpu_ref_is_zero(&q->q_usage_counter));
+
+    return 0;
 }
 EXPORT_SYMBOL_GPL(blk_set_preempt_only);
 
 void blk_clear_preempt_only(struct request_queue *q)
 {
+    percpu_ref_reinit(&q->q_usage_counter);
     blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
     wake_up_all(&q->mq_freeze_wq);
 }
@@ -913,31 +923,13 @@ EXPORT_SYMBOL(blk_alloc_queue);
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
- * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ * @flags: BLK_MQ_REQ_NOWAIT
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-    const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
-
     while (true) {
-        bool success = false;
-
-        rcu_read_lock();
-        if (percpu_ref_tryget_live(&q->q_usage_counter)) {
-            /*
-             * The code that sets the PREEMPT_ONLY flag is
-             * responsible for ensuring that that flag is globally
-             * visible before the queue is unfrozen.
-             */
-            if (preempt || !blk_queue_preempt_only(q)) {
-                success = true;
-            } else {
-                percpu_ref_put(&q->q_usage_counter);
-            }
-        }
-        rcu_read_unlock();
 
-        if (success)
+        if (percpu_ref_tryget_live(&q->q_usage_counter))
             return 0;
 
         if (flags & BLK_MQ_REQ_NOWAIT)
@@ -954,7 +946,44 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 
         wait_event(q->mq_freeze_wq,
                (atomic_read(&q->mq_freeze_depth) == 0 &&
-                (preempt || !blk_queue_preempt_only(q))) ||
+               !blk_queue_preempt_only(q)) ||
+               blk_queue_dying(q));
+        if (blk_queue_dying(q))
+            return -ENODEV;
+    }
+}
+
+/*
+ * When set PREEMPT_ONLY mode, q->q_usage_counter is killed.
+ * If only PREEMPT_ONLY mode, go on.
+ * If queue frozen, wait.
+ */
+int blk_queue_preempt_enter(struct request_queue *q,
+    blk_mq_req_flags_t flags)
+{
+    while (true) {
+
+        if (percpu_ref_tryget_live(&q->q_usage_counter))
+            return 0;
+
+        smp_rmb();
+
+        /*
+         * If queue is only in PREEMPT_ONLY mode, get the ref
+         * directly. The one who set PREEMPT_ONLY mode is responsible
+         * to wake up the waiters on mq_freeze_wq.
+         */
+        if (!atomic_read(&q->mq_freeze_depth) &&
+                blk_queue_preempt_only(q)) {
+            percpu_ref_get_many(&q->q_usage_counter, 1);
+            return 0;
+        }
+
+        if (flags & BLK_MQ_REQ_NOWAIT)
+            return -EBUSY;
+
+        wait_event(q->mq_freeze_wq,
+               (atomic_read(&q->mq_freeze_depth) == 0 ||
                blk_queue_dying(q));
         if (blk_queue_dying(q))
             return -ENODEV;
@@ -1587,7 +1616,7 @@ static struct request *blk_old_get_request(struct request_queue *q,
     /* create ioc upfront */
     create_io_context(gfp_mask, q->node);
 
-    ret = blk_queue_enter(q, flags);
+    ret = blk_queue_preempt_enter(q, flags);
     if (ret)
         return ERR_PTR(ret);
     spin_lock_irq(q->queue_lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 85a1c1a..3aea78f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -403,7 +403,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
     struct request *rq;
     int ret;
 
-    ret = blk_queue_enter(q, flags);
+    ret = blk_queue_preempt_enter(q, flags);
     if (ret)
         return ERR_PTR(ret);
 
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index eb97d2d..c338c3a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -3046,17 +3046,7 @@ scsi_device_quiesce(struct scsi_device *sdev)
      */
     WARN_ON_ONCE(sdev->quiesced_by && sdev->quiesced_by != current);
 
-    blk_set_preempt_only(q);
-
-    blk_mq_freeze_queue(q);
-    /*
-     * Ensure that the effect of blk_set_preempt_only() will be visible
-     * for percpu_ref_tryget() callers that occur after the queue
-     * unfreeze even if the queue was already frozen before this function
-     * was called. See also https://lwn.net/Articles/573497/.
-     */
-    synchronize_rcu();
-    blk_mq_unfreeze_queue(q);
+    blk_set_preempt_only(q, true);
 
     mutex_lock(&sdev->state_mutex);
     err = scsi_device_set_state(sdev, SDEV_QUIESCE);

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH V3 17/17] block: enable runtime PM for blk-mq
  2018-09-13 12:15 ` [PATCH V3 17/17] block: enable runtime PM for blk-mq Ming Lei
@ 2018-09-14 10:33   ` kbuild test robot
  2018-09-15  1:13   ` kbuild test robot
  1 sibling, 0 replies; 41+ messages in thread
From: kbuild test robot @ 2018-09-14 10:33 UTC (permalink / raw)
  To: Ming Lei
  Cc: kbuild-all, Jens Axboe, linux-block, Ming Lei, Alan Stern,
	Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Hannes Reinecke, Johannes Thumshirn, Adrian Hunter,
	James E.J. Bottomley, Martin K. Petersen, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 1846 bytes --]

Hi Ming,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on mkp-scsi/for-next]
[also build test ERROR on v4.19-rc3 next-20180913]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Ming-Lei/blk-mq-allow-to-pass-default-queue-flags-for-creating-initializing-queue/20180914-162946
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next
config: x86_64-randconfig-x003-201836 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   block/blk-mq.c: In function '__blk_mq_free_request':
>> block/blk-mq.c:508:7: error: 'struct request_queue' has no member named 'dev'
     if (q->dev)
          ^~
   block/blk-mq.c:509:30: error: 'struct request_queue' has no member named 'dev'
      pm_runtime_mark_last_busy(q->dev);
                                 ^~

vim +508 block/blk-mq.c

   493	
   494	static void __blk_mq_free_request(struct request *rq)
   495	{
   496		struct request_queue *q = rq->q;
   497		struct blk_mq_ctx *ctx = rq->mq_ctx;
   498		struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
   499		const int sched_tag = rq->internal_tag;
   500	
   501		if (rq->tag != -1)
   502			blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
   503		if (sched_tag != -1)
   504			blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
   505		blk_mq_sched_restart(hctx);
   506		blk_queue_exit(q);
   507	
 > 508		if (q->dev)
   509			pm_runtime_mark_last_busy(q->dev);
   510	}
   511	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31955 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-14  3:33   ` jianchao.wang
@ 2018-09-14 11:33     ` Ming Lei
  2018-09-17  2:46       ` jianchao.wang
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2018-09-14 11:33 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, linux-block, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke, Johannes Thumshirn,
	Adrian Hunter, James Bottomley, Martin K. Petersen,
	Linux SCSI List

On Fri, Sep 14, 2018 at 11:40 AM jianchao.wang
<jianchao.w.wang@oracle.com> wrote:
>
> Hi Ming
>
> On 09/13/2018 08:15 PM, Ming Lei wrote:
> >  EXPORT_SYMBOL(__scsi_execute);
> > @@ -3246,6 +3251,7 @@ static int scsi_internal_device_block(struct scsi_device *sdev)
> >               else
> >                       scsi_wait_for_queuecommand(sdev);
> >       }
> > +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending));
> >       mutex_unlock(&sdev->state_mutex);
> >
> >       return err;
> ...
> > diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> > index 3aee9464a7bf..8bcb7ecc0c06 100644
> > --- a/drivers/scsi/scsi_sysfs.c
> > +++ b/drivers/scsi/scsi_sysfs.c
> > @@ -1393,6 +1393,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
> >
> >       blk_cleanup_queue(sdev->request_queue);
> >       cancel_work_sync(&sdev->requeue_work);
> > +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
>
> Waiting on nr_admin_pending can drain the ongoing scsi_request_fn/scsi_queue_rq
> calls, but I'm afraid it cannot stop new ones coming in, such as the ones that
> have passed the sdev state check but have not yet executed the
> atomic_inc(&sdev->nr_admin_pending).
>

The .nr_admin_pending counter is introduced for draining queued
admin requests to this scsi device.

Actually, new requests have already been prevented from entering
scsi_queue_rq(); please see the two callers of wait_event(sdev->admin_wq,
!atomic_read(&sdev->nr_admin_pending)).

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V3 17/17] block: enable runtime PM for blk-mq
  2018-09-13 12:15 ` [PATCH V3 17/17] block: enable runtime PM for blk-mq Ming Lei
  2018-09-14 10:33   ` kbuild test robot
@ 2018-09-15  1:13   ` kbuild test robot
  1 sibling, 0 replies; 41+ messages in thread
From: kbuild test robot @ 2018-09-15  1:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: kbuild-all, Jens Axboe, linux-block, Ming Lei, Alan Stern,
	Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Hannes Reinecke, Johannes Thumshirn, Adrian Hunter,
	James E.J. Bottomley, Martin K. Petersen, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 3014 bytes --]

Hi Ming,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on mkp-scsi/for-next]
[also build test WARNING on v4.19-rc3 next-20180913]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Ming-Lei/blk-mq-allow-to-pass-default-queue-flags-for-creating-initializing-queue/20180914-162946
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next
config: x86_64-randconfig-u0-09150658 (attached as .config)
compiler: gcc-5 (Debian 5.5.0-3) 5.4.1 20171010
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/linux/export.h:45:0,
                    from include/linux/linkage.h:7,
                    from include/linux/kernel.h:7,
                    from block/blk-mq.c:7:
   block/blk-mq.c: In function '__blk_mq_free_request':
   block/blk-mq.c:508:7: error: 'struct request_queue' has no member named 'dev'
     if (q->dev)
          ^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^
>> block/blk-mq.c:508:2: note: in expansion of macro 'if'
     if (q->dev)
     ^
   block/blk-mq.c:508:7: error: 'struct request_queue' has no member named 'dev'
     if (q->dev)
          ^
   include/linux/compiler.h:58:42: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                             ^
>> block/blk-mq.c:508:2: note: in expansion of macro 'if'
     if (q->dev)
     ^
   block/blk-mq.c:508:7: error: 'struct request_queue' has no member named 'dev'
     if (q->dev)
          ^
   include/linux/compiler.h:69:16: note: in definition of macro '__trace_if'
      ______r = !!(cond);     \
                   ^
>> block/blk-mq.c:508:2: note: in expansion of macro 'if'
     if (q->dev)
     ^
   block/blk-mq.c:509:30: error: 'struct request_queue' has no member named 'dev'
      pm_runtime_mark_last_busy(q->dev);
                                 ^

vim +/if +508 block/blk-mq.c

   493	
   494	static void __blk_mq_free_request(struct request *rq)
   495	{
   496		struct request_queue *q = rq->q;
   497		struct blk_mq_ctx *ctx = rq->mq_ctx;
   498		struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
   499		const int sched_tag = rq->internal_tag;
   500	
   501		if (rq->tag != -1)
   502			blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
   503		if (sched_tag != -1)
   504			blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
   505		blk_mq_sched_restart(hctx);
   506		blk_queue_exit(q);
   507	
 > 508		if (q->dev)
   509			pm_runtime_mark_last_busy(q->dev);
   510	}
   511	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31649 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
  2018-09-14  7:27 ` [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM jianchao.wang
@ 2018-09-16 13:09   ` Ming Lei
  2018-09-17  2:25     ` jianchao.wang
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2018-09-16 13:09 UTC (permalink / raw)
  To: jianchao.wang; +Cc: Jens Axboe, linux-block

On Fri, Sep 14, 2018 at 03:27:44PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 09/13/2018 08:15 PM, Ming Lei wrote:
> > This patchset introduces per-host admin request queue for submitting
> > admin request only, and uses this approach to implement both SCSI
> > quiesce and runtime PM in one very simply way. Also runtime PM deadlock
> > can be avoided in case that request pool is used up, such as when too
> > many IO requests are allocated before resuming device
> 
> To be honest, after looking into the SCSI part of this patchset, I really
> doubt whether it is worth introducing this per-scsi-host adminq.
> Too much complexity is introduced into SCSI. Compared with this, the current

Can you comment on the individual patches about which part is complicated?
I will explain them to you one by one.

> preempt-only feature looks much simpler.
> 
> If you just don't like the preempt-only check in the hot path, we could
> introduce a new interface similar to blk_queue_enter for the non-hot path.
> 
> With your patch percpu-refcount: relax limit on percpu_ref_reinit()

Seems this way may be an improvement on Bart's V6.

> 
> (totally no test)
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4dbc93f..dd7f007 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -427,14 +427,24 @@ EXPORT_SYMBOL(blk_sync_queue);
>   * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
>   * set and 1 if the flag was already set.
>   */
> -int blk_set_preempt_only(struct request_queue *q)
> +int blk_set_preempt_only(struct request_queue *q, bool drain)
>  {
> -    return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
> +    if (blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q))
> +        return 1;
> +    percpu_ref_kill(&q->q_usage_counter);
> +    synchronize_sched();
> +
> +    if (drain)
> +        wait_event(q->mq_freeze_wq,
> +                percpu_ref_is_zero(&q->q_usage_counter));
> +
> +    return 0;
>  }
>  EXPORT_SYMBOL_GPL(blk_set_preempt_only);

It is easy to cause a race between the normal freeze interface and the
above one. That may be a new complexity of the preempt-only approach,
because there can be more than one path that kills/reinits .q_usage_counter.

>  
>  void blk_clear_preempt_only(struct request_queue *q)
>  {
> +    percpu_ref_reinit(&q->q_usage_counter);
>      blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
>      wake_up_all(&q->mq_freeze_wq);
>  }
> @@ -913,31 +923,13 @@ EXPORT_SYMBOL(blk_alloc_queue);
>  /**
>   * blk_queue_enter() - try to increase q->q_usage_counter
>   * @q: request queue pointer
> - * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
> + * @flags: BLK_MQ_REQ_NOWAIT
>   */
>  int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>  {
> -    const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
> -
>      while (true) {
> -        bool success = false;
> -
> -        rcu_read_lock();
> -        if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> -            /*
> -             * The code that sets the PREEMPT_ONLY flag is
> -             * responsible for ensuring that that flag is globally
> -             * visible before the queue is unfrozen.
> -             */
> -            if (preempt || !blk_queue_preempt_only(q)) {
> -                success = true;
> -            } else {
> -                percpu_ref_put(&q->q_usage_counter);
> -            }
> -        }
> -        rcu_read_unlock();
>  
> -        if (success)
> +        if (percpu_ref_tryget_live(&q->q_usage_counter))
>              return 0;
>  
>          if (flags & BLK_MQ_REQ_NOWAIT)
> @@ -954,7 +946,44 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>  
>          wait_event(q->mq_freeze_wq,
>                 (atomic_read(&q->mq_freeze_depth) == 0 &&
> -                (preempt || !blk_queue_preempt_only(q))) ||
> +               !blk_queue_preempt_only(q)) ||
> +               blk_queue_dying(q));
> +        if (blk_queue_dying(q))
> +            return -ENODEV;
> +    }
> +}

The big question is how you will implement runtime suspend with this
approach? New IO has to terminate the runtime suspend.

> +
> +/*
> + * When set PREEMPT_ONLY mode, q->q_usage_counter is killed.
> + * If only PREEMPT_ONLY mode, go on.
> + * If queue frozen, wait.
> + */
> +int blk_queue_preempt_enter(struct request_queue *q,
> +    blk_mq_req_flags_t flags)
> +{
> +    while (true) {
> +
> +        if (percpu_ref_tryget_live(&q->q_usage_counter))
> +            return 0;
> +
> +        smp_rmb();
> +
> +        /*
> +         * If queue is only in PREEMPT_ONLY mode, get the ref
> +         * directly. The one who set PREEMPT_ONLY mode is responsible
> +         * to wake up the waiters on mq_freeze_wq.
> +         */
> +        if (!atomic_read(&q->mq_freeze_depth) &&
> +                blk_queue_preempt_only(q)) {
> +            percpu_ref_get_many(&q->q_usage_counter, 1);
> +            return 0;
> +        }

This way will delay runtime PM or system suspend until the queue is unfrozen,
which isn't reasonable.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
  2018-09-16 13:09   ` Ming Lei
@ 2018-09-17  2:25     ` jianchao.wang
  2018-09-17 12:07       ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: jianchao.wang @ 2018-09-17  2:25 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block

Hi Ming

Thanks for your kindly response.

On 09/16/2018 09:09 PM, Ming Lei wrote:
> On Fri, Sep 14, 2018 at 03:27:44PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 09/13/2018 08:15 PM, Ming Lei wrote:
>>> This patchset introduces per-host admin request queue for submitting
>>> admin request only, and uses this approach to implement both SCSI
>>> quiesce and runtime PM in one very simply way. Also runtime PM deadlock
>>> can be avoided in case that request pool is used up, such as when too
>>> many IO requests are allocated before resuming device
>>
>> To be honest, after looking into the SCSI part of this patchset, I really
>> doubt whether it is worth introducing this per-scsi-host adminq.
>> Too much complexity is introduced into SCSI. Compared with this, the current
> 
> Can you comment on the individual patches about which part is complicated?
> I will explain them to you one by one.

I have read through the scsi part of your patch many times. :)

After this patchset, we can control the IO request_queues separately, which
is more convenient.

But what we pay for it is extra complexity _spread_ over the scsi-layer code.
In particular, at both the submit and the complete side we have to handle the
fact that a request to a scsi device may come either from its own IO
request_queue or from the per-host adminq.

We have handled those cases in the patchset, but it looks really tedious,
and it is also risky for the currently stable scsi mid-layer code.

I really think we should keep the complexity confined to a very small area.

> 
>> preempt-only feature look much simpler.
>>
>> If you just don't like the preempt-only checking in the hot path, we may introduce
>> a new interface similar with blk_queue_enter for the non hot path.
>>
>> With your patch percpu-refcount: relax limit on percpu_ref_reinit()
> 
> Seems this way may be one improvement on Bart's V6.
> 
>>
>> (totally no test)
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 4dbc93f..dd7f007 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -427,14 +427,24 @@ EXPORT_SYMBOL(blk_sync_queue);
>>   * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
>>   * set and 1 if the flag was already set.
>>   */
>> -int blk_set_preempt_only(struct request_queue *q)
>> +int blk_set_preempt_only(struct request_queue *q, bool drain)
>>  {
>> -    return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
>> +    if (blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q))
>> +        return 1;
>> +    percpu_ref_kill(&q->q_usage_counter);
>> +    synchronize_sched();
>> +
>> +    if (drain)
>> +        wait_event(q->mq_freeze_wq,
>> +                percpu_ref_is_zero(&q->q_usage_counter));
>> +
>> +    return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(blk_set_preempt_only);
> 
> It is easy to cause a race between the normal freeze interface and the
> above one. That may be a new complexity of the preempt-only approach,
> because there can be more than one path that kills/reinits .q_usage_counter.

Yes, indeed.

> 
>>  
>>  void blk_clear_preempt_only(struct request_queue *q)
>>  {
>> +    percpu_ref_reinit(&q->q_usage_counter);
>>      blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
>>      wake_up_all(&q->mq_freeze_wq);
>>  }
>> @@ -913,31 +923,13 @@ EXPORT_SYMBOL(blk_alloc_queue);
>>  /**
>>   * blk_queue_enter() - try to increase q->q_usage_counter
>>   * @q: request queue pointer
>> - * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
>> + * @flags: BLK_MQ_REQ_NOWAIT
>>   */
>>  int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>>  {
>> -    const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
>> -
>>      while (true) {
>> -        bool success = false;
>> -
>> -        rcu_read_lock();
>> -        if (percpu_ref_tryget_live(&q->q_usage_counter)) {
>> -            /*
>> -             * The code that sets the PREEMPT_ONLY flag is
>> -             * responsible for ensuring that that flag is globally
>> -             * visible before the queue is unfrozen.
>> -             */
>> -            if (preempt || !blk_queue_preempt_only(q)) {
>> -                success = true;
>> -            } else {
>> -                percpu_ref_put(&q->q_usage_counter);
>> -            }
>> -        }
>> -        rcu_read_unlock();
>>  
>> -        if (success)
>> +        if (percpu_ref_tryget_live(&q->q_usage_counter))
>>              return 0;
>>  
>>          if (flags & BLK_MQ_REQ_NOWAIT)
>> @@ -954,7 +946,44 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>>  
>>          wait_event(q->mq_freeze_wq,
>>                 (atomic_read(&q->mq_freeze_depth) == 0 &&
>> -                (preempt || !blk_queue_preempt_only(q))) ||
>> +               !blk_queue_preempt_only(q)) ||
>> +               blk_queue_dying(q));
>> +        if (blk_queue_dying(q))
>> +            return -ENODEV;
>> +    }
>> +}
> 
> The big question is how you will implement runtime suspend with this
> approach? New IO has to terminate the runtime suspend.

Some checking could be done when percpu_ref_tryget_live fails.
If SUSPENDED, try to trigger resume.
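
That check could look roughly like the sketch below, a userspace Python model of the idea (the names tryget_live/trigger_resume and the RPM states are illustrative stand-ins, not the actual kernel API):

```python
# Hypothetical model: on enter, if the ref is dead because the queue is
# runtime-suspended, kick a resume and retry instead of failing.

RPM_ACTIVE, RPM_SUSPENDED = 0, 1

class Queue:
    def __init__(self):
        self.rpm_status = RPM_ACTIVE
        self.killed = False      # models percpu_ref_kill() having run
        self.usage = 0

    def tryget_live(self):
        # percpu_ref_tryget_live() fails once the ref is killed
        if self.killed:
            return False
        self.usage += 1
        return True

    def trigger_resume(self):
        # stand-in for triggering runtime resume: the resume path
        # reinits the ref and clears the suspended state
        self.rpm_status = RPM_ACTIVE
        self.killed = False

def queue_enter(q, max_loops=10):
    """Retry loop: on tryget failure, resume a suspended queue first."""
    for _ in range(max_loops):
        if q.tryget_live():
            return 0
        if q.rpm_status == RPM_SUSPENDED:
            q.trigger_resume()   # new IO must terminate runtime suspend
            continue
        return -1                # dying/frozen for some other reason
    return -1
```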

> 
>> +
>> +/*
>> + * When set PREEMPT_ONLY mode, q->q_usage_counter is killed.
>> + * If only PREEMPT_ONLY mode, go on.
>> + * If queue frozen, wait.
>> + */
>> +int blk_queue_preempt_enter(struct request_queue *q,
>> +    blk_mq_req_flags_t flags)
>> +{
>> +    while (true) {
>> +
>> +        if (percpu_ref_tryget_live(&q->q_usage_counter))
>> +            return 0;
>> +
>> +        smp_rmb();
>> +
>> +        /*
>> +         * If queue is only in PREEMPT_ONLY mode, get the ref
>> +         * directly. The one who set PREEMPT_ONLY mode is responsible
>> +         * to wake up the waiters on mq_freeze_wq.
>> +         */
>> +        if (!atomic_read(&q->mq_freeze_depth) &&
>> +                blk_queue_preempt_only(q)) {
>> +            percpu_ref_get_many(&q->q_usage_counter, 1);
>> +            return 0;
>> +        }
> 
> This way will delay runtime pm or system suspend until the queue is unfrozen,
> but it isn't reasonable.

This interface is for __scsi_execute only; before we call into this function, we should
have gotten the device resumed synchronously.


Thanks
Jianchao
> 
> Thanks,
> Ming
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-14 11:33     ` Ming Lei
@ 2018-09-17  2:46       ` jianchao.wang
  2018-09-17 11:35         ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: jianchao.wang @ 2018-09-17  2:46 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke, Johannes Thumshirn,
	Adrian Hunter, James Bottomley, Martin K. Petersen,
	Linux SCSI List

Hi Ming

On 09/14/2018 07:33 PM, Ming Lei wrote:
> On Fri, Sep 14, 2018 at 11:40 AM jianchao.wang
> <jianchao.w.wang@oracle.com> wrote:
>>
>> Hi Ming
>>
>> On 09/13/2018 08:15 PM, Ming Lei wrote:
>>>  EXPORT_SYMBOL(__scsi_execute);
>>> @@ -3246,6 +3251,7 @@ static int scsi_internal_device_block(struct scsi_device *sdev)
>>>               else
>>>                       scsi_wait_for_queuecommand(sdev);
>>>       }
>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending));
>>>       mutex_unlock(&sdev->state_mutex);
>>>
>>>       return err;
>> ...
>>> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
>>> index 3aee9464a7bf..8bcb7ecc0c06 100644
>>> --- a/drivers/scsi/scsi_sysfs.c
>>> +++ b/drivers/scsi/scsi_sysfs.c
>>> @@ -1393,6 +1393,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
>>>
>>>       blk_cleanup_queue(sdev->request_queue);
>>>       cancel_work_sync(&sdev->requeue_work);
>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
>>
>> This nr_admin_pending could drain the ongoing scsi_request_fn/scsi_queue_rq,
>> but I'm afraid it cannot stop new ones coming in, such as the ones that have passed
>> the sdev state checking and have not crossed the atomic_inc(&sdev->nr_admin_pending).
>>
> 
> The counter of .nr_admin_pending is introduced for draining queued
> admin requests to this scsi device.
> 
> Actually new requests have been prevented from entering scsi_queue_rq(),
> please see the two callers of wait_event(sdev->admin_wq,
> !atomic_read(&sdev->nr_admin_pending)).
> 
For example

_scsi_execute
...
                                          scsi_internal_device_block
                                            scsi_internal_device_block_nowait
                                            blk_mq_quiesce_queue
                                            wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
  atomic_inc(&sdev->nr_admin_pending);

  blk_execute_rq(...)

  atomic_dec(&sdev->nr_admin_pending);
  wake_up_all(&sdev->admin_wq);

Or do you mean that scsi_queue_rq -> scsi_prep_state_check could gate them out?
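
The window in that interleaving can be replayed deterministically; the sketch below (Python as pseudocode, all names illustrative) follows the steps above: the submitter passes the state check, the blocker then quiesces and sees the counter at zero, and the increment lands only afterwards:

```python
class SDev:
    def __init__(self):
        self.state = "RUNNING"
        self.nr_admin_pending = 0

def check_state(sdev):
    # the sdev state check in _scsi_execute, done before the inc
    return sdev.state == "RUNNING"

def enter(sdev):
    # atomic_inc(&sdev->nr_admin_pending), possibly much later
    sdev.nr_admin_pending += 1

def block_and_drain(sdev):
    # scsi_internal_device_block_nowait() plus the wait_event() drain
    sdev.state = "BLOCKED"
    return sdev.nr_admin_pending == 0   # drain condition observed

# Replay the interleaving from the example above:
sdev = SDev()
passed = check_state(sdev)        # submitter: state still RUNNING
drained = block_and_drain(sdev)   # blocker: counter reads 0
enter(sdev)                       # submitter: inc after the "drain"

# drained is True, yet one admin request is now in flight --
# exactly the escape described above.
```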

Thanks
Jianchao
> Thanks,
> Ming Lei
> 


* Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
  2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
                   ` (17 preceding siblings ...)
  2018-09-14  7:27 ` [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM jianchao.wang
@ 2018-09-17  6:34 ` Hannes Reinecke
  2018-09-17 11:55   ` Ming Lei
  18 siblings, 1 reply; 41+ messages in thread
From: Hannes Reinecke @ 2018-09-17  6:34 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe; +Cc: linux-block

On 09/13/2018 02:15 PM, Ming Lei wrote:
> Hi,
> 
> This patchset introduces per-host admin request queue for submitting
> admin request only, and uses this approach to implement both SCSI
> quiesce and runtime PM in one very simply way. Also runtime PM deadlock
> can be avoided in case that request pool is used up, such as when too
> many IO requests are allocated before resuming device.
> 
> The idea is borrowed from NVMe.
> 
> In this patchset, admin request(all requests submitted via __scsi_execute) will
> be submitted via one per-host admin queue, and the request is still
> associated with the same scsi_device as before, and respects this
> scsi_device's all kinds of limits too. Admin queue shares host tags with
> other IO queues.
> 
> One core idea is that for any admin request submitted from this admin queue,
> this request won't be called back to block layer via the associated IO
> queue(scsi_device). And this is done in the 3rd patch. So once IO queue
> is frozen, it can be observed as really frozen from block layer view.
> 
> SCSI quiesce is implemented by admin queue in very simple way, see patch
> 15.
> 
> Also runtime PM for legacy path is simplified too, see patch 16, and device
> resume is moved to blk_queue_enter().
> 
> blk-mq simply follows legacy's approach for supporting runtime PM.
> 
> Also the fast IO path is simplified much, see blk_queue_enter().
> 
[ .. ]
> 
Please don't do this.
Having an admin queue makes sense for NVMe (where it's even part of the
spec). But for SCSI it's just an additional logical construct which
doesn't correlate to anything we have in the lower layers.
And all of this just to handle PM requests properly.

At ALPSS we've discussed this issue and came up with a different
proposal: Allocate a PM request before _suspending_. Then we trivially
have that request available when resuming, and are sure that nothing can
block the request.
Far simpler, and doesn't require an entirely new infrastructure.
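
As a rough sketch of that proposal (illustrative names only, not an actual kernel interface): reserve the request when suspending, so the resume path never has to allocate from a possibly exhausted pool:

```python
# Model of "allocate a PM request before suspending": the resume
# request is set aside at suspend time, so resume cannot block on
# request-pool exhaustion caused by queued normal IO.

class RequestPool:
    def __init__(self, size):
        self.free = size

    def alloc(self):
        if self.free == 0:
            return None          # pool exhausted -> would block
        self.free -= 1
        return object()

class Device:
    def __init__(self, pool):
        self.pool = pool
        self.pm_rq = None
        self.suspended = False

    def suspend(self):
        # allocate the resume request *before* suspending
        self.pm_rq = self.pool.alloc()
        if self.pm_rq is None:
            return False         # cannot suspend safely
        self.suspended = True
        return True

    def resume(self):
        # the reserved request is guaranteed to be available here,
        # even if normal IO has meanwhile drained the pool
        rq = self.pm_rq
        self.pm_rq = None
        self.suspended = False
        return rq is not None
```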

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-17  2:46       ` jianchao.wang
@ 2018-09-17 11:35         ` Ming Lei
  2018-09-18  1:22           ` jianchao.wang
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2018-09-17 11:35 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, linux-block, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke, Johannes Thumshirn,
	Adrian Hunter, James Bottomley, Martin K. Petersen,
	Linux SCSI List

On Mon, Sep 17, 2018 at 10:46:34AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 09/14/2018 07:33 PM, Ming Lei wrote:
> > On Fri, Sep 14, 2018 at 11:40 AM jianchao.wang
> > <jianchao.w.wang@oracle.com> wrote:
> >>
> >> Hi Ming
> >>
> >> On 09/13/2018 08:15 PM, Ming Lei wrote:
> >>>  EXPORT_SYMBOL(__scsi_execute);
> >>> @@ -3246,6 +3251,7 @@ static int scsi_internal_device_block(struct scsi_device *sdev)
> >>>               else
> >>>                       scsi_wait_for_queuecommand(sdev);
> >>>       }
> >>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending));
> >>>       mutex_unlock(&sdev->state_mutex);
> >>>
> >>>       return err;
> >> ...
> >>> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> >>> index 3aee9464a7bf..8bcb7ecc0c06 100644
> >>> --- a/drivers/scsi/scsi_sysfs.c
> >>> +++ b/drivers/scsi/scsi_sysfs.c
> >>> @@ -1393,6 +1393,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
> >>>
> >>>       blk_cleanup_queue(sdev->request_queue);
> >>>       cancel_work_sync(&sdev->requeue_work);
> >>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
> >>
> >> This nr_admin_pending could drain the ongoing scsi_request_fn/scsi_queue_rq,
> >> but I'm afraid it cannot stop new ones coming in, such as the ones that have passed
> >> the sdev state checking and have not crossed the atomic_inc(&sdev->nr_admin_pending).
> >>
> > 
> > The counter of .nr_admin_pending is introduced for draining queued
> > admin requests to this scsi device.
> > 
> > Actually new requests have been prevented from entering scsi_queue_rq(),
> > please see the two callers of wait_event(sdev->admin_wq,
> > !atomic_read(&sdev->nr_admin_pending)).
> > 
> For example
> 
> _scsi_execute
> ...
>                                           scsi_internal_device_block
>                                             scsi_internal_device_block_nowait
>                                             blk_mq_quiesce_queue
>                                             wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
>   atomic_inc(&sdev->nr_admin_pending);
> 
>   blk_execute_rq(...)
> 
>   atomic_dec(&sdev->nr_admin_pending);
>   wake_up_all(&sdev->admin_wq);
> 
> Or do you mean the scsi_queue_rq -> scsi_prep_state_check could gate out of ?

I see; then this issue can be fixed simply by moving atomic_inc/dec(&sdev->nr_admin_pending)
and the related wake_up_all(&sdev->admin_wq) into scsi_admin_queue_rq().

Thanks,
Ming


* Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
  2018-09-17  6:34 ` Hannes Reinecke
@ 2018-09-17 11:55   ` Ming Lei
  0 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-17 11:55 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Jens Axboe, linux-block

On Mon, Sep 17, 2018 at 08:34:09AM +0200, Hannes Reinecke wrote:
> On 09/13/2018 02:15 PM, Ming Lei wrote:
> > Hi,
> > 
> > This patchset introduces per-host admin request queue for submitting
> > admin request only, and uses this approach to implement both SCSI
> > quiesce and runtime PM in one very simply way. Also runtime PM deadlock
> > can be avoided in case that request pool is used up, such as when too
> > many IO requests are allocated before resuming device.
> > 
> > The idea is borrowed from NVMe.
> > 
> > In this patchset, admin request(all requests submitted via __scsi_execute) will
> > be submitted via one per-host admin queue, and the request is still
> > associated with the same scsi_device as before, and respects this
> > scsi_device's all kinds of limits too. Admin queue shares host tags with
> > other IO queues.
> > 
> > One core idea is that for any admin request submitted from this admin queue,
> > this request won't be called back to block layer via the associated IO
> > queue(scsi_device). And this is done in the 3rd patch. So once IO queue
> > is frozen, it can be observed as really frozen from block layer view.
> > 
> > SCSI quiesce is implemented by admin queue in very simple way, see patch
> > 15.
> > 
> > Also runtime PM for legacy path is simplified too, see patch 16, and device
> > resume is moved to blk_queue_enter().
> > 
> > blk-mq simply follows legacy's approach for supporting runtime PM.
> > 
> > Also the fast IO path is simplified much, see blk_queue_enter().
> > 
> [ .. ]
> > 
> Please don't do this.
> Having an admin queue makes sense for NVMe (where it's even part of the
> spec). But for SCSI it's just an additional logical construct with
> doesn't correlate to anything we have in the lower layers.

It is an abstraction in concept or software, and there may or may not be a real hw
admin queue behind it. What matters is that PM or admin requests are handled
differently from normal IO in reality.

>
> And all of this just to handle PM requests properly.

The PM request can be handled easily, and one big improvement is that this
approach simplifies the IO path once admin requests are separated from normal IO handling.

> 
> At ALPSS we've discussed this issue and came up with a different
> proposal: Allocate a PM request before _suspending_. Then we trivially
> have that request available when resuming, and are sure that nothing can
> block the request.

It seems this way only avoids the deadlock during resume, which is
just a small part of the problem. The bigger part is supporting runtime PM
in an easy/simple way.

Still better to talk with code.

> Far simpler, and doesn't require an entirely new infrastructure.

It is just a new admin queue; I am not sure it can be called new infrastructure, :-)


Thanks,
Ming


* Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
  2018-09-17  2:25     ` jianchao.wang
@ 2018-09-17 12:07       ` Ming Lei
  2018-09-18  1:17         ` jianchao.wang
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2018-09-17 12:07 UTC (permalink / raw)
  To: jianchao.wang; +Cc: Jens Axboe, linux-block

Hi,

On Mon, Sep 17, 2018 at 10:25:54AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> Thanks for your kindly response.
> 
> On 09/16/2018 09:09 PM, Ming Lei wrote:
> > On Fri, Sep 14, 2018 at 03:27:44PM +0800, jianchao.wang wrote:
> >> Hi Ming
> >>
> >> On 09/13/2018 08:15 PM, Ming Lei wrote:
> >>> This patchset introduces per-host admin request queue for submitting
> >>> admin request only, and uses this approach to implement both SCSI
> >>> quiesce and runtime PM in one very simply way. Also runtime PM deadlock
> >>> can be avoided in case that request pool is used up, such as when too
> >>> many IO requests are allocated before resuming device
> >>
> >> To be honest, after look in to the SCSI part of this patchset, I really
> >> suspect whether it is worth to introduce this per scsi-host adminq.
> >> Too much complexity is introduced into SCSI. Compared with this, the current
> > 
> > Can you comment on individual patch about which part is complicated?
> > I will explain them to you only by one.
> 
> I have read through the scsi part of your patch many times. :)
> 
> After this patchset, we could control the IO request_queues separately. This is
> more convenient.
> 
> But what we have to pay is the relative complexity _spread_ over scsi-layer code.
> Especially, we have to handle that a request to a scsi device could be from its
> own IO request_queue or the per-host adminq at both submit and complete side.
> 
> We have handled those cases in the patchset, but that looks really boring.
> And also it is risky in current stable scsi mid-layer code.

Again, please comment on the individual patches about the complexity. Then we
will see if it is really complicated. It is quite difficult to follow by
just commenting on the cover letter.

> 
> I really think we should control the complexity in a very small range.
> 
> > 
> >> preempt-only feature look much simpler.
> >>
> >> If you just don't like the preempt-only checking in the hot path, we may introduce
> >> a new interface similar with blk_queue_enter for the non hot path.
> >>
> >> With your patch percpu-refcount: relax limit on percpu_ref_reinit()
> > 
> > Seems this way may be one improvement on Bart's V6.
> > 
> >>
> >> (totally no test)
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index 4dbc93f..dd7f007 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -427,14 +427,24 @@ EXPORT_SYMBOL(blk_sync_queue);
> >>   * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
> >>   * set and 1 if the flag was already set.
> >>   */
> >> -int blk_set_preempt_only(struct request_queue *q)
> >> +int blk_set_preempt_only(struct request_queue *q, bool drain)
> >>  {
> >> -    return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
> >> +    if (blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q))
> >> +        return 1;
> >> +    percpu_ref_kill(&q->q_usage_counter);
> >> +    synchronize_sched();
> >> +
> >> +    if (drain)
> >> +        wait_event(q->mq_freeze_wq,
> >> +                percpu_ref_is_zero(&q->q_usage_counter));
> >> +
> >> +    return 0;
> >>  }
> >>  EXPORT_SYMBOL_GPL(blk_set_preempt_only);
> > 
> > It is easy to cause race between the normal freeze interface and the
> > above one. That may be one new complexity of the preempt only approach,
> > because there can be more than one path to kill/reinit .q_usage_counter.
> 
> Yes, indeed.
> 
> > 
> >>  
> >>  void blk_clear_preempt_only(struct request_queue *q)
> >>  {
> >> +    percpu_ref_reinit(&q->q_usage_counter);
> >>      blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
> >>      wake_up_all(&q->mq_freeze_wq);
> >>  }
> >> @@ -913,31 +923,13 @@ EXPORT_SYMBOL(blk_alloc_queue);
> >>  /**
> >>   * blk_queue_enter() - try to increase q->q_usage_counter
> >>   * @q: request queue pointer
> >> - * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
> >> + * @flags: BLK_MQ_REQ_NOWAIT
> >>   */
> >>  int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> >>  {
> >> -    const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
> >> -
> >>      while (true) {
> >> -        bool success = false;
> >> -
> >> -        rcu_read_lock();
> >> -        if (percpu_ref_tryget_live(&q->q_usage_counter)) {
> >> -            /*
> >> -             * The code that sets the PREEMPT_ONLY flag is
> >> -             * responsible for ensuring that that flag is globally
> >> -             * visible before the queue is unfrozen.
> >> -             */
> >> -            if (preempt || !blk_queue_preempt_only(q)) {
> >> -                success = true;
> >> -            } else {
> >> -                percpu_ref_put(&q->q_usage_counter);
> >> -            }
> >> -        }
> >> -        rcu_read_unlock();
> >>  
> >> -        if (success)
> >> +        if (percpu_ref_tryget_live(&q->q_usage_counter))
> >>              return 0;
> >>  
> >>          if (flags & BLK_MQ_REQ_NOWAIT)
> >> @@ -954,7 +946,44 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
> >>  
> >>          wait_event(q->mq_freeze_wq,
> >>                 (atomic_read(&q->mq_freeze_depth) == 0 &&
> >> -                (preempt || !blk_queue_preempt_only(q))) ||
> >> +               !blk_queue_preempt_only(q)) ||
> >> +               blk_queue_dying(q));
> >> +        if (blk_queue_dying(q))
> >> +            return -ENODEV;
> >> +    }
> >> +}
> > 
> > The big question is how you will implement runtime suspend with this
> > approach? New IO has to terminate the runtime suspend.
> 
> Some checking could be done when percpu_ref_tryget_live fails.
> If SUSPENDED, try to trigger resume.
> 
> > 
> >> +
> >> +/*
> >> + * When set PREEMPT_ONLY mode, q->q_usage_counter is killed.
> >> + * If only PREEMPT_ONLY mode, go on.
> >> + * If queue frozen, wait.
> >> + */
> >> +int blk_queue_preempt_enter(struct request_queue *q,
> >> +    blk_mq_req_flags_t flags)
> >> +{
> >> +    while (true) {
> >> +
> >> +        if (percpu_ref_tryget_live(&q->q_usage_counter))
> >> +            return 0;
> >> +
> >> +        smp_rmb();
> >> +
> >> +        /*
> >> +         * If queue is only in PREEMPT_ONLY mode, get the ref
> >> +         * directly. The one who set PREEMPT_ONLY mode is responsible
> >> +         * to wake up the waiters on mq_freeze_wq.
> >> +         */
> >> +        if (!atomic_read(&q->mq_freeze_depth) &&
> >> +                blk_queue_preempt_only(q)) {
> >> +            percpu_ref_get_many(&q->q_usage_counter, 1);
> >> +            return 0;
> >> +        }
> > 
> > This way will delay runtime pm or system suspend until the queue is unfrozen,
> > but it isn't reasonable.
> 
> This interface is for the __scsi_execute only, before we call into function, we should have
> get the device resumed synchronously.

I mean when the queue is frozen, it is reasonable to runtime suspend the queue. However,
blk_queue_preempt_enter() still waits for the queue to become unfrozen first.

Thanks,
Ming


* Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
  2018-09-17 12:07       ` Ming Lei
@ 2018-09-18  1:17         ` jianchao.wang
  2018-09-18  7:42           ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: jianchao.wang @ 2018-09-18  1:17 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block

Hi Ming

On 09/17/2018 08:07 PM, Ming Lei wrote:
>>> This way will delay runtime pm or system suspend until the queue is unfrozen,
>>> but it isn't reasonable.
>> This interface is for the __scsi_execute only, before we call into function, we should have
>> get the device resumed synchronously.
> I mean when the queue is frozen, it is reasonable to runtime suspend the queue. However,
> blk_queue_preempt_enter() is still waiting for queue becoming unfreezing first.

We don't freeze the queue, but set preempt-only mode when suspending the queue. :)

Thanks
Jianchao


* Re: [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-17 11:35         ` Ming Lei
@ 2018-09-18  1:22           ` jianchao.wang
  2018-09-18  7:39             ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: jianchao.wang @ 2018-09-18  1:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke, Johannes Thumshirn,
	Adrian Hunter, James Bottomley, Martin K. Petersen,
	Linux SCSI List

Hi Ming

On 09/17/2018 07:35 PM, Ming Lei wrote:
> On Mon, Sep 17, 2018 at 10:46:34AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 09/14/2018 07:33 PM, Ming Lei wrote:
>>> On Fri, Sep 14, 2018 at 11:40 AM jianchao.wang
>>> <jianchao.w.wang@oracle.com> wrote:
>>>>
>>>> Hi Ming
>>>>
>>>> On 09/13/2018 08:15 PM, Ming Lei wrote:
>>>>>  EXPORT_SYMBOL(__scsi_execute);
>>>>> @@ -3246,6 +3251,7 @@ static int scsi_internal_device_block(struct scsi_device *sdev)
>>>>>               else
>>>>>                       scsi_wait_for_queuecommand(sdev);
>>>>>       }
>>>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending));
>>>>>       mutex_unlock(&sdev->state_mutex);
>>>>>
>>>>>       return err;
>>>> ...
>>>>> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
>>>>> index 3aee9464a7bf..8bcb7ecc0c06 100644
>>>>> --- a/drivers/scsi/scsi_sysfs.c
>>>>> +++ b/drivers/scsi/scsi_sysfs.c
>>>>> @@ -1393,6 +1393,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
>>>>>
>>>>>       blk_cleanup_queue(sdev->request_queue);
>>>>>       cancel_work_sync(&sdev->requeue_work);
>>>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
>>>>
>>>> This nr_admin_pending could drain the ongoing scsi_request_fn/scsi_queue_rq,
>>>> but I'm afraid it cannot stop new ones coming in, such as the ones that have passed
>>>> the sdev state checking and have not crossed the atomic_inc(&sdev->nr_admin_pending).
>>>>
>>>
>>> The counter of .nr_admin_pending is introduced for draining queued
>>> admin requests to this scsi device.
>>>
>>> Actually new requests have been prevented from entering scsi_queue_rq(),
>>> please see the two callers of wait_event(sdev->admin_wq,
>>> !atomic_read(&sdev->nr_admin_pending)).
>>>
>> For example
>>
>> _scsi_execute
>> ...
>>                                           scsi_internal_device_block
>>                                             scsi_internal_device_block_nowait
>>                                             blk_mq_quiesce_queue
>>                                             wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
>>   atomic_inc(&sdev->nr_admin_pending);
>>
>>   blk_execute_rq(...)
>>
>>   atomic_dec(&sdev->nr_admin_pending);
>>   wake_up_all(&sdev->admin_wq);
>>
>> Or do you mean the scsi_queue_rq -> scsi_prep_state_check could gate out of ?
> 
> I got it, then this issue can be fixed simply by moving atomic_inc/dec(&sdev->nr_admin_pending)
> and related wake_up_all(&sdev->admin_wq) into scsi_admin_queue_rq().
> 

I don't think so. It is a similar scenario.

I guess a state check is needed after atomic_inc(&sdev->nr_admin_pending), like:

 _scsi_execute
 ...
                                           scsi_internal_device_block
                                             scsi_internal_device_block_nowait
                                             blk_mq_quiesce_queue
                                             wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
   atomic_inc(&sdev->nr_admin_pending);
   if state checking fails
     goto done

   blk_execute_rq(...)

   atomic_dec(&sdev->nr_admin_pending);
   wake_up_all(&sdev->admin_wq);

Thanks
Jianchao


* Re: [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-18  1:22           ` jianchao.wang
@ 2018-09-18  7:39             ` Ming Lei
  2018-09-18  7:51               ` jianchao.wang
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2018-09-18  7:39 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, linux-block, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke, Johannes Thumshirn,
	Adrian Hunter, James Bottomley, Martin K. Petersen,
	Linux SCSI List

On Tue, Sep 18, 2018 at 09:22:32AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 09/17/2018 07:35 PM, Ming Lei wrote:
> > On Mon, Sep 17, 2018 at 10:46:34AM +0800, jianchao.wang wrote:
> >> Hi Ming
> >>
> >> On 09/14/2018 07:33 PM, Ming Lei wrote:
> >>> On Fri, Sep 14, 2018 at 11:40 AM jianchao.wang
> >>> <jianchao.w.wang@oracle.com> wrote:
> >>>>
> >>>> Hi Ming
> >>>>
> >>>> On 09/13/2018 08:15 PM, Ming Lei wrote:
> >>>>>  EXPORT_SYMBOL(__scsi_execute);
> >>>>> @@ -3246,6 +3251,7 @@ static int scsi_internal_device_block(struct scsi_device *sdev)
> >>>>>               else
> >>>>>                       scsi_wait_for_queuecommand(sdev);
> >>>>>       }
> >>>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending));
> >>>>>       mutex_unlock(&sdev->state_mutex);
> >>>>>
> >>>>>       return err;
> >>>> ...
> >>>>> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> >>>>> index 3aee9464a7bf..8bcb7ecc0c06 100644
> >>>>> --- a/drivers/scsi/scsi_sysfs.c
> >>>>> +++ b/drivers/scsi/scsi_sysfs.c
> >>>>> @@ -1393,6 +1393,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
> >>>>>
> >>>>>       blk_cleanup_queue(sdev->request_queue);
> >>>>>       cancel_work_sync(&sdev->requeue_work);
> >>>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
> >>>>
> >>>> This nr_admin_pending could drain the ongoing scsi_request_fn/scsi_queue_rq,
> >>>> but I'm afraid it cannot stop new ones coming in, such as the ones that have passed
> >>>> the sdev state checking and have not crossed the atomic_inc(&sdev->nr_admin_pending).
> >>>>
> >>>
> >>> The counter of .nr_admin_pending is introduced for draining queued
> >>> admin requests to this scsi device.
> >>>
> >>> Actually new requests have been prevented from entering scsi_queue_rq(),
> >>> please see the two callers of wait_event(sdev->admin_wq,
> >>> !atomic_read(&sdev->nr_admin_pending)).
> >>>
> >> For example
> >>
> >> _scsi_execute
> >> ...
> >>                                           scsi_internal_device_block
> >>                                             scsi_internal_device_block_nowait
> >>                                             blk_mq_quiesce_queue
> >>                                             wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
> >>   atomic_inc(&sdev->nr_admin_pending);
> >>
> >>   blk_execute_rq(...)
> >>
> >>   atomic_dec(&sdev->nr_admin_pending);
> >>   wake_up_all(&sdev->admin_wq);
> >>
> >> Or do you mean the scsi_queue_rq -> scsi_prep_state_check could gate out of ?
> > 
> > I got it, then this issue can be fixed simply by moving atomic_inc/dec(&sdev->nr_admin_pending)
> > and related wake_up_all(&sdev->admin_wq) into scsi_admin_queue_rq().
> > 
> 
> I don't think so. It is a similar scenario.
> 
> I guess a state checking is needed after atomic_inc(&sdev->nr_admin_pending), like:
> 
>  _scsi_execute
>  ...
>                                            scsi_internal_device_block
>                                              scsi_internal_device_block_nowait
>                                              blk_mq_quiesce_queue
>                                              wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
>    atomic_inc(&sdev->nr_admin_pending);
>    if state checking fails
>      goto done

The check will be done in scsi_admin_queue_rq().

> 
>    blk_execute_rq(...)
> 
>    atomic_dec(&sdev->nr_admin_pending);
>    wake_up_all(&sdev->admin_wq);

I guess you may misunderstand the purpose of .nr_admin_pending, which is
for draining requests to .queue_rq(). So it is enough to just move the
inc/dec of .nr_admin_pending into scsi_admin_queue_rq(), right?
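
That reordering can be sketched as follows (a userspace Python model; the quiesce and dispatch functions are stand-ins, not the real kernel code): the counter brackets only the dispatch, and since quiesce stops new dispatches from entering, the wait_event() drain condition is then sound:

```python
class SDev:
    def __init__(self):
        self.quiesced = False
        self.nr_admin_pending = 0
        self.dispatched = 0

def admin_queue_rq(sdev):
    # blk_mq_quiesce_queue() guarantees no new calls enter here,
    # so the inc/dec pair now covers exactly the in-flight dispatch
    if sdev.quiesced:
        return False
    sdev.nr_admin_pending += 1
    try:
        sdev.dispatched += 1     # the actual ->queuecommand work
    finally:
        sdev.nr_admin_pending -= 1
    return True

def block_and_drain(sdev):
    sdev.quiesced = True                  # blk_mq_quiesce_queue()
    return sdev.nr_admin_pending == 0     # wait_event() condition
```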

Thanks,
Ming


* Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
  2018-09-18  1:17         ` jianchao.wang
@ 2018-09-18  7:42           ` Ming Lei
  2018-09-18  7:53             ` jianchao.wang
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2018-09-18  7:42 UTC (permalink / raw)
  To: jianchao.wang; +Cc: Jens Axboe, linux-block

On Tue, Sep 18, 2018 at 09:17:12AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 09/17/2018 08:07 PM, Ming Lei wrote:
> >>> This way will delay runtime pm or system suspend until the queue is unfrozen,
> >>> but it isn't reasonable.
> >> This interface is for the __scsi_execute only, before we call into function, we should have
> >> get the device resumed synchronously.
> > I mean when the queue is frozen, it is reasonable to runtime suspend the queue. However,
> > blk_queue_preempt_enter() is still waiting for queue becoming unfreezing first.
> 
> We don't freeze the queue, but set preempt-only mode when suspend the queue. :)

But the queue can be frozen by other paths. Even though it is frozen, the queue
should be allowed to suspend too.

Thanks,
Ming


* Re: [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-18  7:39             ` Ming Lei
@ 2018-09-18  7:51               ` jianchao.wang
  2018-09-18  7:55                 ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: jianchao.wang @ 2018-09-18  7:51 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke, Johannes Thumshirn,
	Adrian Hunter, James Bottomley, Martin K. Petersen,
	Linux SCSI List



On 09/18/2018 03:39 PM, Ming Lei wrote:
> On Tue, Sep 18, 2018 at 09:22:32AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 09/17/2018 07:35 PM, Ming Lei wrote:
>>> On Mon, Sep 17, 2018 at 10:46:34AM +0800, jianchao.wang wrote:
>>>> Hi Ming
>>>>
>>>> On 09/14/2018 07:33 PM, Ming Lei wrote:
>>>>> On Fri, Sep 14, 2018 at 11:40 AM jianchao.wang
>>>>> <jianchao.w.wang@oracle.com> wrote:
>>>>>>
>>>>>> Hi Ming
>>>>>>
>>>>>> On 09/13/2018 08:15 PM, Ming Lei wrote:
>>>>>>>  EXPORT_SYMBOL(__scsi_execute);
>>>>>>> @@ -3246,6 +3251,7 @@ static int scsi_internal_device_block(struct scsi_device *sdev)
>>>>>>>               else
>>>>>>>                       scsi_wait_for_queuecommand(sdev);
>>>>>>>       }
>>>>>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending));
>>>>>>>       mutex_unlock(&sdev->state_mutex);
>>>>>>>
>>>>>>>       return err;
>>>>>> ...
>>>>>>> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
>>>>>>> index 3aee9464a7bf..8bcb7ecc0c06 100644
>>>>>>> --- a/drivers/scsi/scsi_sysfs.c
>>>>>>> +++ b/drivers/scsi/scsi_sysfs.c
>>>>>>> @@ -1393,6 +1393,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
>>>>>>>
>>>>>>>       blk_cleanup_queue(sdev->request_queue);
>>>>>>>       cancel_work_sync(&sdev->requeue_work);
>>>>>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
>>>>>>
>>>>>> This nr_admin_pending could drain the ongoing scsi_request_fn/scsi_queue_rq,
>>>>>> but I'm afraid it cannot stop new ones coming in, such as the ones that have passed
>>>>>> the sdev state checking and have not crossed the atomic_inc(&sdev->nr_admin_pending).
>>>>>>
>>>>>
>>>>> The counter of .nr_admin_pending is introduced for draining queued
>>>>> admin requests to this scsi device.
>>>>>
>>>>> Actually new requests have been prevented from entering scsi_queue_rq(),
>>>>> please see the two callers of wait_event(sdev->admin_wq,
>>>>> !atomic_read(&sdev->nr_admin_pending)).
>>>>>
>>>> For example
>>>>
>>>> _scsi_execute
>>>> ...
>>>>                                           scsi_internal_device_block
>>>>                                             scsi_internal_device_block_nowait
>>>>                                             blk_mq_quiesce_queue
>>>>                                             wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
>>>>   &sdev->nr_admin_pending;
>>>>
>>>>   blk_execute_rq(...)
>>>>
>>>>   atomic_dec(&sdev->nr_admin_pending);
>>>>   wake_up_all(&sdev->admin_wq);
>>>>
>>>> Or do you mean the scsi_queue_rq -> scsi_prep_state_check could gate out of ?
>>>
>>> I got it, then this issue can be fixed simply by moving atomic_inc/dec(&sdev->nr_admin_pending)
>>> and related wake_up_all(&sdev->admin_wq) into scsi_admin_queue_rq().
>>>
>>
>> I don't think so. It is a similar scenario.
>>
>> I guess a state checking is needed after atomic_inc(&sdev->nr_admin_pending), like:
>>
>>  _scsi_execute
>>  ...
>>                                            scsi_internal_device_block
>>                                              scsi_internal_device_block_nowait
>>                                              blk_mq_quiesce_queue
>>                                              wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
>>    atomic_inc(&sdev->nr_admin_pending);
>>    if state checking fails
>>      goto done
> 
> The check will be done in scsi_admin_queue_rq().
> 
>>
>>    blk_execute_rq(...)
>>
>>    atomic_dec(&sdev->nr_admin_pending);
>>    wake_up_all(&sdev->admin_wq);
> 
> I guess you may misunderstand the purpose of .nr_admin_pending, which is
> for draining requests to .queue_rq(). So it is enough to just move the
> inc/dec of .nr_admin_pending into scsi_admin_queue_rq(), right?
Yes.

But I just think of how to assign with queue quiesce.
The existence of per-host adminq seems to break it.

Thanks
Jianchao


* Re: [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM
  2018-09-18  7:42           ` Ming Lei
@ 2018-09-18  7:53             ` jianchao.wang
  0 siblings, 0 replies; 41+ messages in thread
From: jianchao.wang @ 2018-09-18  7:53 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block

Hi Ming

On 09/18/2018 03:42 PM, Ming Lei wrote:
> On Tue, Sep 18, 2018 at 09:17:12AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 09/17/2018 08:07 PM, Ming Lei wrote:
>>>>> This way will delay runtime pm or system suspend until the queue is unfrozen,
>>>>> but it isn't reasonable.
>>>> This interface is for the __scsi_execute only, before we call into function, we should have
>>>> get the device resumed synchronously.
>>> I mean when the queue is frozen, it is reasonable to runtime suspend the queue. However,
>>> blk_queue_preempt_enter() is still waiting for queue becoming unfreezing first.
>>
>> We don't freeze the queue, but set preempt-only mode when suspend the queue. :)
> 
> But the queue can be frozen by other paths. Even though it is frozen, the queue
> should be allowed to suspend too.
> 

Yes, the race between freeze and preempt-only is a tricky issue.

Thanks
Jianchao


* Re: [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-18  7:51               ` jianchao.wang
@ 2018-09-18  7:55                 ` Ming Lei
  2018-09-18  8:25                   ` jianchao.wang
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2018-09-18  7:55 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, linux-block, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke, Johannes Thumshirn,
	Adrian Hunter, James Bottomley, Martin K. Petersen,
	Linux SCSI List

On Tue, Sep 18, 2018 at 03:51:10PM +0800, jianchao.wang wrote:
> 
> 
> On 09/18/2018 03:39 PM, Ming Lei wrote:
> > On Tue, Sep 18, 2018 at 09:22:32AM +0800, jianchao.wang wrote:
> >> Hi Ming
> >>
> >> On 09/17/2018 07:35 PM, Ming Lei wrote:
> >>> On Mon, Sep 17, 2018 at 10:46:34AM +0800, jianchao.wang wrote:
> >>>> Hi Ming
> >>>>
> >>>> On 09/14/2018 07:33 PM, Ming Lei wrote:
> >>>>> On Fri, Sep 14, 2018 at 11:40 AM jianchao.wang
> >>>>> <jianchao.w.wang@oracle.com> wrote:
> >>>>>>
> >>>>>> Hi Ming
> >>>>>>
> >>>>>> On 09/13/2018 08:15 PM, Ming Lei wrote:
> >>>>>>>  EXPORT_SYMBOL(__scsi_execute);
> >>>>>>> @@ -3246,6 +3251,7 @@ static int scsi_internal_device_block(struct scsi_device *sdev)
> >>>>>>>               else
> >>>>>>>                       scsi_wait_for_queuecommand(sdev);
> >>>>>>>       }
> >>>>>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending));
> >>>>>>>       mutex_unlock(&sdev->state_mutex);
> >>>>>>>
> >>>>>>>       return err;
> >>>>>> ...
> >>>>>>> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> >>>>>>> index 3aee9464a7bf..8bcb7ecc0c06 100644
> >>>>>>> --- a/drivers/scsi/scsi_sysfs.c
> >>>>>>> +++ b/drivers/scsi/scsi_sysfs.c
> >>>>>>> @@ -1393,6 +1393,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
> >>>>>>>
> >>>>>>>       blk_cleanup_queue(sdev->request_queue);
> >>>>>>>       cancel_work_sync(&sdev->requeue_work);
> >>>>>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
> >>>>>>
> >>>>>> This nr_admin_pending could drain the ongoing scsi_request_fn/scsi_queue_rq,
> >>>>>> but I'm afraid it cannot stop new ones coming in, such as the ones that have passed
> >>>>>> the sdev state checking and have not crossed the atomic_inc(&sdev->nr_admin_pending).
> >>>>>>
> >>>>>
> >>>>> The counter of .nr_admin_pending is introduced for draining queued
> >>>>> admin requests to this scsi device.
> >>>>>
> >>>>> Actually new requests have been prevented from entering scsi_queue_rq(),
> >>>>> please see the two callers of wait_event(sdev->admin_wq,
> >>>>> !atomic_read(&sdev->nr_admin_pending)).
> >>>>>
> >>>> For example
> >>>>
> >>>> _scsi_execute
> >>>> ...
> >>>>                                           scsi_internal_device_block
> >>>>                                             scsi_internal_device_block_nowait
> >>>>                                             blk_mq_quiesce_queue
> >>>>                                             wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
> >>>>   &sdev->nr_admin_pending;
> >>>>
> >>>>   blk_execute_rq(...)
> >>>>
> >>>>   atomic_dec(&sdev->nr_admin_pending);
> >>>>   wake_up_all(&sdev->admin_wq);
> >>>>
> >>>> Or do you mean the scsi_queue_rq -> scsi_prep_state_check could gate out of ?
> >>>
> >>> I got it, then this issue can be fixed simply by moving atomic_inc/dec(&sdev->nr_admin_pending)
> >>> and related wake_up_all(&sdev->admin_wq) into scsi_admin_queue_rq().
> >>>
> >>
> >> I don't think so. It is a similar scenario.
> >>
> >> I guess a state checking is needed after atomic_inc(&sdev->nr_admin_pending), like:
> >>
> >>  _scsi_execute
> >>  ...
> >>                                            scsi_internal_device_block
> >>                                              scsi_internal_device_block_nowait
> >>                                              blk_mq_quiesce_queue
> >>                                              wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
> >>    atomic_inc(&sdev->nr_admin_pending);
> >>    if state checking fails
> >>      goto done
> > 
> > The check will be done in scsi_admin_queue_rq().
> > 
> >>
> >>    blk_execute_rq(...)
> >>
> >>    atomic_dec(&sdev->nr_admin_pending);
> >>    wake_up_all(&sdev->admin_wq);
> > 
> > I guess you may misunderstand the purpose of .nr_admin_pending, which is
> > for draining requests to .queue_rq(). So it is enough to just move the
> > inc/dec of .nr_admin_pending into scsi_admin_queue_rq(), right?
> Yes.

Thanks for your confirmation.

> 
> But I just think of how to assign with queue quiesce.
> The existence of per-host adminq seems to break it.

The per-host adminq won't be frozen or quiesced at all. Could you explain
your concern about 'assign with queue quiesce' in a bit more detail?

Thanks,
Ming


* Re: [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-18  7:55                 ` Ming Lei
@ 2018-09-18  8:25                   ` jianchao.wang
  2018-09-18 12:15                     ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: jianchao.wang @ 2018-09-18  8:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke, Johannes Thumshirn,
	Adrian Hunter, James Bottomley, Martin K. Petersen,
	Linux SCSI List



On 09/18/2018 03:55 PM, Ming Lei wrote:
> On Tue, Sep 18, 2018 at 03:51:10PM +0800, jianchao.wang wrote:
>>
>>
>> On 09/18/2018 03:39 PM, Ming Lei wrote:
>>> On Tue, Sep 18, 2018 at 09:22:32AM +0800, jianchao.wang wrote:
>>>> Hi Ming
>>>>
>>>> On 09/17/2018 07:35 PM, Ming Lei wrote:
>>>>> On Mon, Sep 17, 2018 at 10:46:34AM +0800, jianchao.wang wrote:
>>>>>> Hi Ming
>>>>>>
>>>>>> On 09/14/2018 07:33 PM, Ming Lei wrote:
>>>>>>> On Fri, Sep 14, 2018 at 11:40 AM jianchao.wang
>>>>>>> <jianchao.w.wang@oracle.com> wrote:
>>>>>>>>
>>>>>>>> Hi Ming
>>>>>>>>
>>>>>>>> On 09/13/2018 08:15 PM, Ming Lei wrote:
>>>>>>>>>  EXPORT_SYMBOL(__scsi_execute);
>>>>>>>>> @@ -3246,6 +3251,7 @@ static int scsi_internal_device_block(struct scsi_device *sdev)
>>>>>>>>>               else
>>>>>>>>>                       scsi_wait_for_queuecommand(sdev);
>>>>>>>>>       }
>>>>>>>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending));
>>>>>>>>>       mutex_unlock(&sdev->state_mutex);
>>>>>>>>>
>>>>>>>>>       return err;
>>>>>>>> ...
>>>>>>>>> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
>>>>>>>>> index 3aee9464a7bf..8bcb7ecc0c06 100644
>>>>>>>>> --- a/drivers/scsi/scsi_sysfs.c
>>>>>>>>> +++ b/drivers/scsi/scsi_sysfs.c
>>>>>>>>> @@ -1393,6 +1393,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
>>>>>>>>>
>>>>>>>>>       blk_cleanup_queue(sdev->request_queue);
>>>>>>>>>       cancel_work_sync(&sdev->requeue_work);
>>>>>>>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
>>>>>>>>
>>>>>>>> This nr_admin_pending could drain the ongoing scsi_request_fn/scsi_queue_rq,
>>>>>>>> but I'm afraid it cannot stop new ones coming in, such as the ones that have passed
>>>>>>>> the sdev state checking and have not crossed the atomic_inc(&sdev->nr_admin_pending).
>>>>>>>>
>>>>>>>
>>>>>>> The counter of .nr_admin_pending is introduced for draining queued
>>>>>>> admin requests to this scsi device.
>>>>>>>
>>>>>>> Actually new requests have been prevented from entering scsi_queue_rq(),
>>>>>>> please see the two callers of wait_event(sdev->admin_wq,
>>>>>>> !atomic_read(&sdev->nr_admin_pending)).
>>>>>>>
>>>>>> For example
>>>>>>
>>>>>> _scsi_execute
>>>>>> ...
>>>>>>                                           scsi_internal_device_block
>>>>>>                                             scsi_internal_device_block_nowait
>>>>>>                                             blk_mq_quiesce_queue
>>>>>>                                             wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
>>>>>>   &sdev->nr_admin_pending;
>>>>>>
>>>>>>   blk_execute_rq(...)
>>>>>>
>>>>>>   atomic_dec(&sdev->nr_admin_pending);
>>>>>>   wake_up_all(&sdev->admin_wq);
>>>>>>
>>>>>> Or do you mean the scsi_queue_rq -> scsi_prep_state_check could gate out of ?
>>>>>
>>>>> I got it, then this issue can be fixed simply by moving atomic_inc/dec(&sdev->nr_admin_pending)
>>>>> and related wake_up_all(&sdev->admin_wq) into scsi_admin_queue_rq().
>>>>>
>>>>
>>>> I don't think so. It is a similar scenario.
>>>>
>>>> I guess a state checking is needed after atomic_inc(&sdev->nr_admin_pending), like:
>>>>
>>>>  _scsi_execute
>>>>  ...
>>>>                                            scsi_internal_device_block
>>>>                                              scsi_internal_device_block_nowait
>>>>                                              blk_mq_quiesce_queue
>>>>                                              wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
>>>>    atomic_inc(&sdev->nr_admin_pending);
>>>>    if state checking fails
>>>>      goto done
>>>
>>> The check will be done in scsi_admin_queue_rq().
>>>
>>>>
>>>>    blk_execute_rq(...)
>>>>
>>>>    atomic_dec(&sdev->nr_admin_pending);
>>>>    wake_up_all(&sdev->admin_wq);
>>>
>>> I guess you may misunderstand the purpose of .nr_admin_pending, which is
>>> for draining requests to .queue_rq(). So it is enough to just move the
>>> inc/dec of .nr_admin_pending into scsi_admin_queue_rq(), right?
>> Yes.
> 
> Thanks for your confirmation.
> 
>>
>> But I just think of how to assign with queue quiesce.
>> The existence of per-host adminq seems to break it.
> 
> The per-host adminq won't be frozen or quiesced at all,

This is the place which makes me confused.

Before this patchset, when scsi_internal_device_block quiesces the request_queue,
neither normal IO nor admin requests can be processed any more.

Given the per-host adminq, we could say every scsi device has two request_queues,
or rather that two request_queues send requests to the same scsi_device:
 - the normal IO request_queue
 - the shared per-host adminq

We could quiesce the normal IO request_queue, but what about the per-host adminq?

> could you explain
> your concern in a bit detail about 'assign with queue quiesce'? 
>

scsi_queue_rq->scsi_prep_state_check can stop requests from going any further,
but scsi_queue_rq itself is still invoked.


Thanks
Jianchao


* Re: [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-18  8:25                   ` jianchao.wang
@ 2018-09-18 12:15                     ` Ming Lei
  2018-09-19  3:52                       ` jianchao.wang
  0 siblings, 1 reply; 41+ messages in thread
From: Ming Lei @ 2018-09-18 12:15 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, linux-block, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke, Johannes Thumshirn,
	Adrian Hunter, James Bottomley, Martin K. Petersen,
	Linux SCSI List

On Tue, Sep 18, 2018 at 04:25:33PM +0800, jianchao.wang wrote:
> 
> 
> On 09/18/2018 03:55 PM, Ming Lei wrote:
> > On Tue, Sep 18, 2018 at 03:51:10PM +0800, jianchao.wang wrote:
> >>
> >>
> >> On 09/18/2018 03:39 PM, Ming Lei wrote:
> >>> On Tue, Sep 18, 2018 at 09:22:32AM +0800, jianchao.wang wrote:
> >>>> Hi Ming
> >>>>
> >>>> On 09/17/2018 07:35 PM, Ming Lei wrote:
> >>>>> On Mon, Sep 17, 2018 at 10:46:34AM +0800, jianchao.wang wrote:
> >>>>>> Hi Ming
> >>>>>>
> >>>>>> On 09/14/2018 07:33 PM, Ming Lei wrote:
> >>>>>>> On Fri, Sep 14, 2018 at 11:40 AM jianchao.wang
> >>>>>>> <jianchao.w.wang@oracle.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Ming
> >>>>>>>>
> >>>>>>>> On 09/13/2018 08:15 PM, Ming Lei wrote:
> >>>>>>>>>  EXPORT_SYMBOL(__scsi_execute);
> >>>>>>>>> @@ -3246,6 +3251,7 @@ static int scsi_internal_device_block(struct scsi_device *sdev)
> >>>>>>>>>               else
> >>>>>>>>>                       scsi_wait_for_queuecommand(sdev);
> >>>>>>>>>       }
> >>>>>>>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending));
> >>>>>>>>>       mutex_unlock(&sdev->state_mutex);
> >>>>>>>>>
> >>>>>>>>>       return err;
> >>>>>>>> ...
> >>>>>>>>> diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
> >>>>>>>>> index 3aee9464a7bf..8bcb7ecc0c06 100644
> >>>>>>>>> --- a/drivers/scsi/scsi_sysfs.c
> >>>>>>>>> +++ b/drivers/scsi/scsi_sysfs.c
> >>>>>>>>> @@ -1393,6 +1393,7 @@ void __scsi_remove_device(struct scsi_device *sdev)
> >>>>>>>>>
> >>>>>>>>>       blk_cleanup_queue(sdev->request_queue);
> >>>>>>>>>       cancel_work_sync(&sdev->requeue_work);
> >>>>>>>>> +     wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
> >>>>>>>>
> >>>>>>>> This nr_admin_pending could drain the ongoing scsi_request_fn/scsi_queue_rq,
> >>>>>>>> but I'm afraid it cannot stop new ones coming in, such as the ones that have passed
> >>>>>>>> the sdev state checking and have not crossed the atomic_inc(&sdev->nr_admin_pending).
> >>>>>>>>
> >>>>>>>
> >>>>>>> The counter of .nr_admin_pending is introduced for draining queued
> >>>>>>> admin requests to this scsi device.
> >>>>>>>
> >>>>>>> Actually new requests have been prevented from entering scsi_queue_rq(),
> >>>>>>> please see the two callers of wait_event(sdev->admin_wq,
> >>>>>>> !atomic_read(&sdev->nr_admin_pending)).
> >>>>>>>
> >>>>>> For example
> >>>>>>
> >>>>>> _scsi_execute
> >>>>>> ...
> >>>>>>                                           scsi_internal_device_block
> >>>>>>                                             scsi_internal_device_block_nowait
> >>>>>>                                             blk_mq_quiesce_queue
> >>>>>>                                             wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
> >>>>>>   &sdev->nr_admin_pending;
> >>>>>>
> >>>>>>   blk_execute_rq(...)
> >>>>>>
> >>>>>>   atomic_dec(&sdev->nr_admin_pending);
> >>>>>>   wake_up_all(&sdev->admin_wq);
> >>>>>>
> >>>>>> Or do you mean the scsi_queue_rq -> scsi_prep_state_check could gate out of ?
> >>>>>
> >>>>> I got it, then this issue can be fixed simply by moving atomic_inc/dec(&sdev->nr_admin_pending)
> >>>>> and related wake_up_all(&sdev->admin_wq) into scsi_admin_queue_rq().
> >>>>>
> >>>>
> >>>> I don't think so. It is a similar scenario.
> >>>>
> >>>> I guess a state checking is needed after atomic_inc(&sdev->nr_admin_pending), like:
> >>>>
> >>>>  _scsi_execute
> >>>>  ...
> >>>>                                            scsi_internal_device_block
> >>>>                                              scsi_internal_device_block_nowait
> >>>>                                              blk_mq_quiesce_queue
> >>>>                                              wait_event(sdev->admin_wq, !atomic_read(&sdev->nr_admin_pending))
> >>>>    atomic_inc(&sdev->nr_admin_pending);
> >>>>    if state checking fails
> >>>>      goto done
> >>>
> >>> The check will be done in scsi_admin_queue_rq().
> >>>
> >>>>
> >>>>    blk_execute_rq(...)
> >>>>
> >>>>    atomic_dec(&sdev->nr_admin_pending);
> >>>>    wake_up_all(&sdev->admin_wq);
> >>>
> >>> I guess you may misunderstand the purpose of .nr_admin_pending, which is
> >>> for draining requests to .queue_rq(). So it is enough to just move the
> >>> inc/dec of .nr_admin_pending into scsi_admin_queue_rq(), right?
> >> Yes.
> > 
> > Thanks for your confirmation.
> > 
> >>
> >> But I just think of how to assign with queue quiesce.
> >> The existence of per-host adminq seems to break it.
> > 
> > The per-host adminq won't be frozen or quiesced at all,
> 
> This is the place which makes me confused.
> 
> Before this patchset, when scsi_internal_device_block quiesces the request_queue,
> Both normal IO or admin requests cannot be processed any more.
> 
> Given the per-adminq, we could say every scsi device has two request_queues, or
> there could be two request_queues sending requests to the a scsi_device:
>  - normal IO request_queue
>  - shared per-host adminq
> 
> We could quiesce the normal IO request_queue, but what's about the per-host adminq ?

Admin requests can be quiesced at the scsi_device level, just like normal IO
requests, because the scsi_device instance is the same as for normal IO.

But at the blk-mq level, this admin queue won't be quiesced or frozen at all.

> 
> > could you explain
> > your concern in a bit detail about 'assign with queue quiesce'? 
> >
> 
> The scsi_queue_rq->scsi_prep_state_check could stop requests entering further, but
> the scsi_queue_rq is still invoked.

As I mentioned, there isn't any change at the SCSI level in how admin requests
are handled, because the scsi_device is shared, and so is the scsi_host.

Thanks,
Ming


* Re: [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-18 12:15                     ` Ming Lei
@ 2018-09-19  3:52                       ` jianchao.wang
  2018-09-19  8:07                         ` Ming Lei
  0 siblings, 1 reply; 41+ messages in thread
From: jianchao.wang @ 2018-09-19  3:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Jens Axboe, linux-block, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke, Johannes Thumshirn,
	Adrian Hunter, James Bottomley, Martin K. Petersen,
	Linux SCSI List

Hi Ming

On 09/18/2018 08:15 PM, Ming Lei wrote:
> The adminq request can be quisced in the scsi_device level, just as normal IO
> request, because the scsi_device instance is same with normal IO.

Yes, scsi_queue_rq->scsi_prep_state_check could gate them out.

> 
> But in blk-mq level, this admin queue won't be quiesced or frozen at all.

But scsi_queue_rq/scsi_request_fn will still be invoked, right?

This is different from the original implementation.

Thanks
Jianchao


* Re: [PATCH V3 11/17] SCSI: track pending admin commands
  2018-09-19  3:52                       ` jianchao.wang
@ 2018-09-19  8:07                         ` Ming Lei
  0 siblings, 0 replies; 41+ messages in thread
From: Ming Lei @ 2018-09-19  8:07 UTC (permalink / raw)
  To: jianchao.wang
  Cc: Ming Lei, Jens Axboe, linux-block, Alan Stern, Christoph Hellwig,
	Bart Van Assche, Hannes Reinecke, Johannes Thumshirn,
	Adrian Hunter, James Bottomley, Martin K. Petersen,
	Linux SCSI List

On Wed, Sep 19, 2018 at 11:52:21AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 09/18/2018 08:15 PM, Ming Lei wrote:
> > The adminq request can be quisced in the scsi_device level, just as normal IO
> > request, because the scsi_device instance is same with normal IO.
> 
> Yes, the scsi_queue_rq->scsi_prep_state_check could gate things out of.
> 
> > 
> > But in blk-mq level, this admin queue won't be quiesced or frozen at all.
> 
> But the scsi_queue_rq/scsi_request_fn will be invoked, right ?
> 
> This is different from the original implementation.

For the legacy path, as you mentioned, scsi_prep_fn() will do the gating, so
scsi_request_fn() won't be called.

For blk-mq, scsi_queue_rq() is always called, which is a bit different, but the
result is fine given that admin requests are submitted much less frequently.

Thanks,
Ming


end of thread, other threads:[~2018-09-19  8:07 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-13 12:15 [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM Ming Lei
2018-09-13 12:15 ` [PATCH V3 01/17] blk-mq: allow to pass default queue flags for creating & initializing queue Ming Lei
2018-09-13 12:15 ` [PATCH V3 02/17] blk-mq: convert BLK_MQ_F_NO_SCHED into per-queue flag Ming Lei
2018-09-13 12:15 ` [PATCH V3 03/17] block: rename QUEUE_FLAG_NO_SCHED as QUEUE_FLAG_ADMIN Ming Lei
2018-09-13 12:15 ` [PATCH V3 04/17] blk-mq: don't reserve tags for admin queue Ming Lei
2018-09-13 12:15 ` [PATCH V3 05/17] SCSI: try to retrieve request_queue via 'scsi_cmnd' if possible Ming Lei
2018-09-13 12:15 ` [PATCH V3 06/17] SCSI: pass 'scsi_device' instance from 'scsi_request' Ming Lei
2018-09-13 12:15 ` [PATCH V3 07/17] SCSI: prepare for introducing admin queue for legacy path Ming Lei
2018-09-13 12:15 ` [PATCH V3 08/17] SCSI: pass scsi_device to scsi_mq_prep_fn Ming Lei
2018-09-13 12:15 ` [PATCH V3 09/17] SCSI: don't set .queuedata in scsi_mq_alloc_queue() Ming Lei
2018-09-13 12:15 ` [PATCH V3 10/17] SCSI: deal with admin queue busy Ming Lei
2018-09-13 12:15 ` [PATCH V3 11/17] SCSI: track pending admin commands Ming Lei
2018-09-14  3:33   ` jianchao.wang
2018-09-14 11:33     ` Ming Lei
2018-09-17  2:46       ` jianchao.wang
2018-09-17 11:35         ` Ming Lei
2018-09-18  1:22           ` jianchao.wang
2018-09-18  7:39             ` Ming Lei
2018-09-18  7:51               ` jianchao.wang
2018-09-18  7:55                 ` Ming Lei
2018-09-18  8:25                   ` jianchao.wang
2018-09-18 12:15                     ` Ming Lei
2018-09-19  3:52                       ` jianchao.wang
2018-09-19  8:07                         ` Ming Lei
2018-09-13 12:15 ` [PATCH V3 12/17] SCSI: create admin queue for each host Ming Lei
2018-09-13 12:15 ` [PATCH V3 13/17] SCSI: use the dedicated admin queue to send admin commands Ming Lei
2018-09-13 12:15 ` [PATCH V3 14/17] SCSI: transport_spi: resume a quiesced device Ming Lei
2018-09-13 12:15 ` [PATCH V3 15/17] SCSI: use admin queue to implement queue QUIESCE Ming Lei
2018-09-13 12:15 ` [PATCH V3 16/17] block: simplify runtime PM support Ming Lei
2018-09-13 12:15 ` [PATCH V3 17/17] block: enable runtime PM for blk-mq Ming Lei
2018-09-14 10:33   ` kbuild test robot
2018-09-15  1:13   ` kbuild test robot
2018-09-14  7:27 ` [PATCH V3 00/17] SCSI: introduce per-host admin queue & enable runtime PM jianchao.wang
2018-09-16 13:09   ` Ming Lei
2018-09-17  2:25     ` jianchao.wang
2018-09-17 12:07       ` Ming Lei
2018-09-18  1:17         ` jianchao.wang
2018-09-18  7:42           ` Ming Lei
2018-09-18  7:53             ` jianchao.wang
2018-09-17  6:34 ` Hannes Reinecke
2018-09-17 11:55   ` Ming Lei
