* [PATCH v4 00/10] blk-mq: Enable runtime power management
From: Bart Van Assche @ 2018-08-04  0:03 UTC
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche

Hello Jens,

This patch series not only enables runtime power management for blk-mq but
also fixes a starvation issue in the power management code for the legacy
block layer. Please consider this patch series for kernel v4.19.

Thanks,

Bart.

Changes compared to v3:
- Avoid adverse interactions between system-wide suspend/resume and runtime
  power management by changing the PREEMPT_ONLY flag into a counter.
- Give RQF_PREEMPT back its original meaning, namely that it is only set for
  ide_preempt requests.
- Remove the flag BLK_MQ_REQ_PREEMPT.
- Remove the pm_request_resume() call.

Changes compared to v2:
- Fixed the build for CONFIG_BLOCK=n.
- Added a patch that introduces percpu_ref_read() in the percpu-refcount code.
- Added a patch that makes it easier to detect missing pm_runtime_get*() calls.
- Addressed Jianchao's feedback including the comment about runtime overhead
  of switching a per-cpu counter to atomic mode.

Changes compared to v1:
- Moved the runtime power management code into a separate file.
- Addressed Ming's feedback.

Bart Van Assche (10):
  block: Change the preempt-only flag into a counter
  block, scsi: Give RQF_PREEMPT back its original meaning
  block, ide: Remove flag BLK_MQ_REQ_PREEMPT
  block: Move power management code into a new source file
  block: Serialize queue freezing and blk_pre_runtime_suspend()
  percpu-refcount: Introduce percpu_ref_read()
  block, scsi: Rework runtime power management
  block: Remove blk_pm_requeue_request()
  blk-mq: Insert blk_pm_{add,put}_request() calls
  blk-mq: Enable support for runtime power management

 block/Kconfig                   |   5 +
 block/Makefile                  |   1 +
 block/blk-core.c                | 281 ++++----------------------------
 block/blk-mq-debugfs.c          |  11 +-
 block/blk-mq-sched.c            |  13 +-
 block/blk-mq.c                  |  13 +-
 block/blk-pm.c                  | 245 ++++++++++++++++++++++++++++
 block/blk-pm.h                  |  37 +++++
 block/elevator.c                |  24 +--
 drivers/ide/ide-pm.c            |   3 +-
 drivers/scsi/scsi_lib.c         |  36 ++--
 drivers/scsi/scsi_pm.c          |   1 +
 drivers/scsi/sd.c               |   1 +
 drivers/scsi/sr.c               |   1 +
 include/linux/blk-mq.h          |   4 +-
 include/linux/blk-pm.h          |  30 ++++
 include/linux/blkdev.h          |  45 ++---
 include/linux/percpu-refcount.h |   2 +
 lib/percpu-refcount.c           |  29 ++++
 19 files changed, 453 insertions(+), 329 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

-- 
2.18.0

* [PATCH v4 01/10] block: Change the preempt-only flag into a counter
From: Bart Van Assche @ 2018-08-04  0:03 UTC
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Johannes Thumshirn, Alan Stern

Rename "preempt-only" into "pm-only" because the primary purpose of
this mode is power management. Since the power management core may
but does not have to resume a runtime suspended device before
performing system-wide suspend and since a later patch will set
"pm-only" mode as long as a block device is runtime suspended, make
it possible to set "pm-only" mode from more than one context.
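
Since blk_set_pm_only() and blk_clear_pm_only() now manipulate a
counter, calls from independent contexts nest. A hypothetical call
sequence (illustration only, not part of this patch):

    blk_set_pm_only(q);     /* e.g. SCSI quiesce: counter 0 -> 1 */
    blk_set_pm_only(q);     /* e.g. runtime suspend: counter 1 -> 2 */

    blk_clear_pm_only(q);   /* counter 2 -> 1: queue stays in pm-only mode */
    blk_clear_pm_only(q);   /* counter 1 -> 0: mq_freeze_wq waiters are woken */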

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
---
 block/blk-core.c        | 35 ++++++++++++++++++-----------------
 block/blk-mq-debugfs.c  | 10 +++++++++-
 drivers/scsi/scsi_lib.c |  8 ++++----
 include/linux/blkdev.h  | 14 +++++++++-----
 4 files changed, 40 insertions(+), 27 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f9ad73d8573c..b83798c03ca8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -421,24 +421,25 @@ void blk_sync_queue(struct request_queue *q)
 EXPORT_SYMBOL(blk_sync_queue);
 
 /**
- * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY
+ * blk_set_pm_only - increment pm_only counter
  * @q: request queue pointer
- *
- * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
- * set and 1 if the flag was already set.
  */
-int blk_set_preempt_only(struct request_queue *q)
+void blk_set_pm_only(struct request_queue *q)
 {
-	return blk_queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+	atomic_inc(&q->pm_only);
 }
-EXPORT_SYMBOL_GPL(blk_set_preempt_only);
+EXPORT_SYMBOL_GPL(blk_set_pm_only);
 
-void blk_clear_preempt_only(struct request_queue *q)
+void blk_clear_pm_only(struct request_queue *q)
 {
-	blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
-	wake_up_all(&q->mq_freeze_wq);
+	int pm_only;
+
+	pm_only = atomic_dec_return(&q->pm_only);
+	WARN_ON_ONCE(pm_only < 0);
+	if (pm_only == 0)
+		wake_up_all(&q->mq_freeze_wq);
 }
-EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
+EXPORT_SYMBOL_GPL(blk_clear_pm_only);
 
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
@@ -911,7 +912,7 @@ EXPORT_SYMBOL(blk_alloc_queue);
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-	const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+	const bool pm = flags & BLK_MQ_REQ_PREEMPT;
 
 	while (true) {
 		bool success = false;
@@ -919,11 +920,11 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		rcu_read_lock();
 		if (percpu_ref_tryget_live(&q->q_usage_counter)) {
 			/*
-			 * The code that sets the PREEMPT_ONLY flag is
-			 * responsible for ensuring that that flag is globally
-			 * visible before the queue is unfrozen.
+			 * The code that increments the pm_only counter is
+			 * responsible for ensuring that that counter is
+			 * globally visible before the queue is unfrozen.
 			 */
-			if (preempt || !blk_queue_preempt_only(q)) {
+			if (pm || !blk_queue_pm_only(q)) {
 				success = true;
 			} else {
 				percpu_ref_put(&q->q_usage_counter);
@@ -948,7 +949,7 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 
 		wait_event(q->mq_freeze_wq,
 			   (atomic_read(&q->mq_freeze_depth) == 0 &&
-			    (preempt || !blk_queue_preempt_only(q))) ||
+			    (pm || !blk_queue_pm_only(q))) ||
 			   blk_queue_dying(q));
 		if (blk_queue_dying(q))
 			return -ENODEV;
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..a5ea86835fcb 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -102,6 +102,14 @@ static int blk_flags_show(struct seq_file *m, const unsigned long flags,
 	return 0;
 }
 
+static int queue_pm_only_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+
+	seq_printf(m, "%d\n", atomic_read(&q->pm_only));
+	return 0;
+}
+
 #define QUEUE_FLAG_NAME(name) [QUEUE_FLAG_##name] = #name
 static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(QUEUED),
@@ -132,7 +140,6 @@ static const char *const blk_queue_flag_name[] = {
 	QUEUE_FLAG_NAME(REGISTERED),
 	QUEUE_FLAG_NAME(SCSI_PASSTHROUGH),
 	QUEUE_FLAG_NAME(QUIESCED),
-	QUEUE_FLAG_NAME(PREEMPT_ONLY),
 };
 #undef QUEUE_FLAG_NAME
 
@@ -209,6 +216,7 @@ static ssize_t queue_write_hint_store(void *data, const char __user *buf,
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
 	{ "poll_stat", 0400, queue_poll_stat_show },
 	{ "requeue_list", 0400, .seq_ops = &queue_requeue_list_seq_ops },
+	{ "pm_only", 0600, queue_pm_only_show, NULL },
 	{ "state", 0600, queue_state_show, queue_state_write },
 	{ "write_hints", 0600, queue_write_hint_show, queue_write_hint_store },
 	{ "zone_wlock", 0400, queue_zone_wlock_show, NULL },
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cb9a166fa0c..007bb2bc817d 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2996,11 +2996,11 @@ scsi_device_quiesce(struct scsi_device *sdev)
 	 */
 	WARN_ON_ONCE(sdev->quiesced_by && sdev->quiesced_by != current);
 
-	blk_set_preempt_only(q);
+	blk_set_pm_only(q);
 
 	blk_mq_freeze_queue(q);
 	/*
-	 * Ensure that the effect of blk_set_preempt_only() will be visible
+	 * Ensure that the effect of blk_set_pm_only() will be visible
 	 * for percpu_ref_tryget() callers that occur after the queue
 	 * unfreeze even if the queue was already frozen before this function
 	 * was called. See also https://lwn.net/Articles/573497/.
@@ -3013,7 +3013,7 @@ scsi_device_quiesce(struct scsi_device *sdev)
 	if (err == 0)
 		sdev->quiesced_by = current;
 	else
-		blk_clear_preempt_only(q);
+		blk_clear_pm_only(q);
 	mutex_unlock(&sdev->state_mutex);
 
 	return err;
@@ -3038,7 +3038,7 @@ void scsi_device_resume(struct scsi_device *sdev)
 	mutex_lock(&sdev->state_mutex);
 	WARN_ON_ONCE(!sdev->quiesced_by);
 	sdev->quiesced_by = NULL;
-	blk_clear_preempt_only(sdev->request_queue);
+	blk_clear_pm_only(sdev->request_queue);
 	if (sdev->sdev_state == SDEV_QUIESCE)
 		scsi_device_set_state(sdev, SDEV_RUNNING);
 	mutex_unlock(&sdev->state_mutex);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 050d599f5ea9..1a08edbddf9f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -506,6 +506,12 @@ struct request_queue {
 	 * various queue flags, see QUEUE_* below
 	 */
 	unsigned long		queue_flags;
+	/*
+	 * Number of contexts that have called blk_set_pm_only(). If this
+	 * counter is above zero then only RQF_PM and RQF_PREEMPT requests are
+	 * processed.
+	 */
+	atomic_t		pm_only;
 
 	/*
 	 * ida allocated id for this queue.  Used to index queues from
@@ -700,7 +706,6 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED  26	/* queue has been registered to a disk */
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED    28	/* queue has been quiesced */
-#define QUEUE_FLAG_PREEMPT_ONLY	29	/* only process REQ_PREEMPT requests */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
@@ -738,12 +743,11 @@ bool blk_queue_flag_test_and_clear(unsigned int flag, struct request_queue *q);
 	((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
 			     REQ_FAILFAST_DRIVER))
 #define blk_queue_quiesced(q)	test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
-#define blk_queue_preempt_only(q)				\
-	test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
+#define blk_queue_pm_only(q)	atomic_read(&(q)->pm_only)
 #define blk_queue_fua(q)	test_bit(QUEUE_FLAG_FUA, &(q)->queue_flags)
 
-extern int blk_set_preempt_only(struct request_queue *q);
-extern void blk_clear_preempt_only(struct request_queue *q);
+extern void blk_set_pm_only(struct request_queue *q);
+extern void blk_clear_pm_only(struct request_queue *q);
 
 static inline int queue_in_flight(struct request_queue *q)
 {
-- 
2.18.0

* [PATCH v4 02/10] block, scsi: Give RQF_PREEMPT back its original meaning
From: Bart Van Assche @ 2018-08-04  0:03 UTC
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Martin K . Petersen, Jianchao Wang, Ming Lei, Alan Stern,
	Johannes Thumshirn

In kernel v2.6.18 RQF_PREEMPT was only set for IDE preempt requests.
Later on the SCSI core was modified such that RQF_PREEMPT was set
for all requests submitted by __scsi_execute(), including power
management requests. RQF_PREEMPT requests are the only requests
processed in the SDEV_QUIESCE state. Instead of setting RQF_PREEMPT
for all requests submitted by __scsi_execute(), only set RQF_PM for
power management requests. Modify blk_get_request() such that it
blocks in pm-only mode on non-RQF_PM requests. Leave out the code
from scsi_prep_state_check() that is no longer reachable.
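
For illustration, a hypothetical caller after this change (this
sketch is not part of the patch); BLK_MQ_REQ_PM and RQF_PM are the
flags introduced or reused below:

    /* Regular command: blocks while the queue is in pm-only mode. */
    rq = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN, 0);

    /* Power management command: BLK_MQ_REQ_PM lets the allocation
     * through in pm-only mode and RQF_PM marks the request as a PM
     * request for scsi_prep_state_check(). */
    rq = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN,
                         BLK_MQ_REQ_PM);
    rq->rq_flags |= RQF_PM;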

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-core.c        |  9 +++++----
 drivers/scsi/scsi_lib.c | 28 ++++++++++++----------------
 include/linux/blk-mq.h  |  2 ++
 include/linux/blkdev.h  |  6 ++----
 4 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b83798c03ca8..3378fe478e67 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -908,11 +908,11 @@ EXPORT_SYMBOL(blk_alloc_queue);
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
- * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ * @flags: BLK_MQ_REQ_NOWAIT, BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_PM
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
-	const bool pm = flags & BLK_MQ_REQ_PREEMPT;
+	const bool pm = flags & BLK_MQ_REQ_PM;
 
 	while (true) {
 		bool success = false;
@@ -1570,7 +1570,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	goto retry;
 }
 
-/* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
+/* flags: BLK_MQ_REQ_PREEMPT, BLK_MQ_REQ_PM and/or BLK_MQ_REQ_NOWAIT. */
 static struct request *blk_old_get_request(struct request_queue *q,
 				unsigned int op, blk_mq_req_flags_t flags)
 {
@@ -1613,7 +1613,8 @@ struct request *blk_get_request(struct request_queue *q, unsigned int op,
 	struct request *req;
 
 	WARN_ON_ONCE(op & REQ_NOWAIT);
-	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
+	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT |
+			       BLK_MQ_REQ_PM));
 
 	if (q->mq_ops) {
 		req = blk_mq_alloc_request(q, op, flags);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 007bb2bc817d..481158ab5081 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -248,8 +248,8 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
  * @sshdr:	optional decoded sense header
  * @timeout:	request timeout in seconds
  * @retries:	number of times to retry request
- * @flags:	flags for ->cmd_flags
- * @rq_flags:	flags for ->rq_flags
+ * @flags:	REQ_* flags for ->cmd_flags
+ * @rq_flags:	RQF_* flags for ->rq_flags
  * @resid:	optional residual length
  *
  * Returns the scsi_cmnd result field if a command was executed, or a negative
@@ -267,7 +267,8 @@ int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 
 	req = blk_get_request(sdev->request_queue,
 			data_direction == DMA_TO_DEVICE ?
-			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
+			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN,
+			rq_flags & RQF_PM ? BLK_MQ_REQ_PM : 0);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
@@ -1353,20 +1354,15 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 			ret = BLKPREP_DEFER;
 			break;
 		case SDEV_QUIESCE:
-			/*
-			 * If the devices is blocked we defer normal commands.
-			 */
-			if (req && !(req->rq_flags & RQF_PREEMPT))
-				ret = BLKPREP_DEFER;
+			WARN_ON_ONCE(!(req->rq_flags & RQF_PM));
 			break;
-		default:
-			/*
-			 * For any other not fully online state we only allow
-			 * special commands.  In particular any user initiated
-			 * command is not allowed.
-			 */
-			if (req && !(req->rq_flags & RQF_PREEMPT))
-				ret = BLKPREP_KILL;
+		case SDEV_CANCEL:
+			ret = BLKPREP_KILL;
+			break;
+		case SDEV_CREATED:
+		case SDEV_RUNNING:
+			/* This code is never reached */
+			WARN_ONCE(true, "sdev state = %d\n", sdev->sdev_state);
 			break;
 		}
 	}
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 1da59c16f637..f15c1de51f5e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -223,6 +223,8 @@ enum {
 	BLK_MQ_REQ_INTERNAL	= (__force blk_mq_req_flags_t)(1 << 2),
 	/* set RQF_PREEMPT */
 	BLK_MQ_REQ_PREEMPT	= (__force blk_mq_req_flags_t)(1 << 3),
+	/* for power management requests */
+	BLK_MQ_REQ_PM		= (__force blk_mq_req_flags_t)(1 << 4),
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1a08edbddf9f..1b3fe78872dc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -99,8 +99,7 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_MQ_INFLIGHT		((__force req_flags_t)(1 << 6))
 /* don't call prep for this one */
 #define RQF_DONTPREP		((__force req_flags_t)(1 << 7))
-/* set for "ide_preempt" requests and also for requests for which the SCSI
-   "quiesce" state must be ignored. */
+/* set for "ide_preempt" requests */
 #define RQF_PREEMPT		((__force req_flags_t)(1 << 8))
 /* contains copies of user pages */
 #define RQF_COPY_USER		((__force req_flags_t)(1 << 9))
@@ -508,8 +507,7 @@ struct request_queue {
 	unsigned long		queue_flags;
 	/*
 	 * Number of contexts that have called blk_set_pm_only(). If this
-	 * counter is above zero then only RQF_PM and RQF_PREEMPT requests are
-	 * processed.
+	 * counter is above zero then only RQF_PM requests are processed.
 	 */
 	atomic_t		pm_only;
 
-- 
2.18.0

* [PATCH v4 03/10] block, ide: Remove flag BLK_MQ_REQ_PREEMPT
From: Bart Van Assche @ 2018-08-04  0:03 UTC
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	David S . Miller, Jianchao Wang, Ming Lei, Alan Stern,
	Johannes Thumshirn

Since blk_get_request() no longer needs to know whether or not
RQF_PREEMPT will be set, remove the flag BLK_MQ_REQ_PREEMPT. This
patch does not change any functionality. See also commit 039c635f4e66
("ide, scsi: Tell the block layer at request allocation time about
preempt requests").

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-core.c       | 9 +++------
 block/blk-mq.c         | 2 --
 drivers/ide/ide-pm.c   | 3 ++-
 include/linux/blk-mq.h | 4 +---
 4 files changed, 6 insertions(+), 12 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3378fe478e67..3a5aa56eab9d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -908,7 +908,7 @@ EXPORT_SYMBOL(blk_alloc_queue);
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
- * @flags: BLK_MQ_REQ_NOWAIT, BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_PM
+ * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PM
  */
 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
@@ -1431,8 +1431,6 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 	blk_rq_set_rl(rq, rl);
 	rq->cmd_flags = op;
 	rq->rq_flags = rq_flags;
-	if (flags & BLK_MQ_REQ_PREEMPT)
-		rq->rq_flags |= RQF_PREEMPT;
 
 	/* init elvpriv */
 	if (rq_flags & RQF_ELVPRIV) {
@@ -1570,7 +1568,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	goto retry;
 }
 
-/* flags: BLK_MQ_REQ_PREEMPT, BLK_MQ_REQ_PM and/or BLK_MQ_REQ_NOWAIT. */
+/* flags: BLK_MQ_REQ_PM and/or BLK_MQ_REQ_NOWAIT. */
 static struct request *blk_old_get_request(struct request_queue *q,
 				unsigned int op, blk_mq_req_flags_t flags)
 {
@@ -1613,8 +1611,7 @@ struct request *blk_get_request(struct request_queue *q, unsigned int op,
 	struct request *req;
 
 	WARN_ON_ONCE(op & REQ_NOWAIT);
-	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT |
-			       BLK_MQ_REQ_PM));
+	WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PM));
 
 	if (q->mq_ops) {
 		req = blk_mq_alloc_request(q, op, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c92ce06fd565..8b23ae34d949 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -300,8 +300,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->rq_flags = rq_flags;
 	rq->cpu = -1;
 	rq->cmd_flags = op;
-	if (data->flags & BLK_MQ_REQ_PREEMPT)
-		rq->rq_flags |= RQF_PREEMPT;
 	if (blk_queue_io_stat(data->q))
 		rq->rq_flags |= RQF_IO_STAT;
 	INIT_LIST_HEAD(&rq->queuelist);
diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 59217aa1d1fb..10d10e01c0f4 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -90,8 +90,9 @@ int generic_ide_resume(struct device *dev)
 	}
 
 	memset(&rqpm, 0, sizeof(rqpm));
-	rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, BLK_MQ_REQ_PREEMPT);
+	rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, 0);
 	ide_req(rq)->type = ATA_PRIV_PM_RESUME;
+	rq->rq_flags |= RQF_PREEMPT;
 	rq->special = &rqpm;
 	rqpm.pm_step = IDE_PM_START_RESUME;
 	rqpm.pm_state = PM_EVENT_ON;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f15c1de51f5e..59f99b0019bb 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -221,10 +221,8 @@ enum {
 	BLK_MQ_REQ_RESERVED	= (__force blk_mq_req_flags_t)(1 << 1),
 	/* allocate internal/sched tag */
 	BLK_MQ_REQ_INTERNAL	= (__force blk_mq_req_flags_t)(1 << 2),
-	/* set RQF_PREEMPT */
-	BLK_MQ_REQ_PREEMPT	= (__force blk_mq_req_flags_t)(1 << 3),
 	/* for power management requests */
-	BLK_MQ_REQ_PM		= (__force blk_mq_req_flags_t)(1 << 4),
+	BLK_MQ_REQ_PM		= (__force blk_mq_req_flags_t)(1 << 3),
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-- 
2.18.0

* [PATCH v4 04/10] block: Move power management code into a new source file
From: Bart Van Assche @ 2018-08-04  0:03 UTC
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Johannes Thumshirn, Alan Stern

Move the code for runtime power management from blk-core.c into the
new source file blk-pm.c. Move the corresponding declarations from
<linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out
the declarations of the functions that are not used in that mode.
This patch not only reduces the number of #ifdefs in the block layer
core code but also reduces the size of the header file
<linux/blkdev.h>, and hence should help to reduce the build time of
the Linux kernel when CONFIG_PM is not defined.
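
A hypothetical driver-side sketch (not part of this patch) of how
the new header is consumed; the function name mydrv_enable_rpm() and
the 5000 ms delay are made up, the calls it wraps are real:

    #include <linux/blk-pm.h>
    #include <linux/pm_runtime.h>

    static void mydrv_enable_rpm(struct request_queue *q,
                                 struct device *dev)
    {
            blk_pm_runtime_init(q, dev);
            /* Override the -1 default set by blk_pm_runtime_init()
             * so that autosuspend can actually happen. */
            pm_runtime_set_autosuspend_delay(dev, 5000);
            pm_runtime_allow(dev);
    }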

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
---
 block/Kconfig          |   5 ++
 block/Makefile         |   1 +
 block/blk-core.c       | 196 +----------------------------------------
 block/blk-pm.c         | 188 +++++++++++++++++++++++++++++++++++++++
 block/blk-pm.h         |  43 +++++++++
 block/elevator.c       |  22 +----
 drivers/scsi/scsi_pm.c |   1 +
 drivers/scsi/sd.c      |   1 +
 drivers/scsi/sr.c      |   1 +
 include/linux/blk-pm.h |  24 +++++
 include/linux/blkdev.h |  23 -----
 11 files changed, 266 insertions(+), 239 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

diff --git a/block/Kconfig b/block/Kconfig
index 1f2469a0123c..e213d90a5e64 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -228,4 +228,9 @@ config BLK_MQ_RDMA
 	depends on BLOCK && INFINIBAND
 	default y
 
+config BLK_PM
+	bool
+	depends on BLOCK && PM
+	default y
+
 source block/Kconfig.iosched
diff --git a/block/Makefile b/block/Makefile
index 572b33f32c07..27eac600474f 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
 obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
 obj-$(CONFIG_BLK_SED_OPAL)	+= sed-opal.o
+obj-$(CONFIG_BLK_PM)		+= blk-pm.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 3a5aa56eab9d..03cff7445dee 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -42,6 +42,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
+#include "blk-pm.h"
 #include "blk-rq-qos.h"
 
 #ifdef CONFIG_DEBUG_FS
@@ -1720,16 +1721,6 @@ void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part)
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
-#ifdef CONFIG_PM
-static void blk_pm_put_request(struct request *rq)
-{
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
-		pm_runtime_mark_last_busy(rq->q->dev);
-}
-#else
-static inline void blk_pm_put_request(struct request *rq) {}
-#endif
-
 void __blk_put_request(struct request_queue *q, struct request *req)
 {
 	req_flags_t rq_flags = req->rq_flags;
@@ -3745,191 +3736,6 @@ void blk_finish_plug(struct blk_plug *plug)
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
-#ifdef CONFIG_PM
-/**
- * blk_pm_runtime_init - Block layer runtime PM initialization routine
- * @q: the queue of the device
- * @dev: the device the queue belongs to
- *
- * Description:
- *    Initialize runtime-PM-related fields for @q and start auto suspend for
- *    @dev. Drivers that want to take advantage of request-based runtime PM
- *    should call this function after @dev has been initialized, and its
- *    request queue @q has been allocated, and runtime PM for it can not happen
- *    yet(either due to disabled/forbidden or its usage_count > 0). In most
- *    cases, driver should call this function before any I/O has taken place.
- *
- *    This function takes care of setting up using auto suspend for the device,
- *    the autosuspend delay is set to -1 to make runtime suspend impossible
- *    until an updated value is either set by user or by driver. Drivers do
- *    not need to touch other autosuspend settings.
- *
- *    The block layer runtime PM is request based, so only works for drivers
- *    that use request as their IO unit instead of those directly use bio's.
- */
-void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
-{
-	/* Don't enable runtime PM for blk-mq until it is ready */
-	if (q->mq_ops) {
-		pm_runtime_disable(dev);
-		return;
-	}
-
-	q->dev = dev;
-	q->rpm_status = RPM_ACTIVE;
-	pm_runtime_set_autosuspend_delay(q->dev, -1);
-	pm_runtime_use_autosuspend(q->dev);
-}
-EXPORT_SYMBOL(blk_pm_runtime_init);
-
-/**
- * blk_pre_runtime_suspend - Pre runtime suspend check
- * @q: the queue of the device
- *
- * Description:
- *    This function will check if runtime suspend is allowed for the device
- *    by examining if there are any requests pending in the queue. If there
- *    are requests pending, the device can not be runtime suspended; otherwise,
- *    the queue's status will be updated to SUSPENDING and the driver can
- *    proceed to suspend the device.
- *
- *    For the not allowed case, we mark last busy for the device so that
- *    runtime PM core will try to autosuspend it some time later.
- *
- *    This function should be called near the start of the device's
- *    runtime_suspend callback.
- *
- * Return:
- *    0		- OK to runtime suspend the device
- *    -EBUSY	- Device should not be runtime suspended
- */
-int blk_pre_runtime_suspend(struct request_queue *q)
-{
-	int ret = 0;
-
-	if (!q->dev)
-		return ret;
-
-	spin_lock_irq(q->queue_lock);
-	if (q->nr_pending) {
-		ret = -EBUSY;
-		pm_runtime_mark_last_busy(q->dev);
-	} else {
-		q->rpm_status = RPM_SUSPENDING;
-	}
-	spin_unlock_irq(q->queue_lock);
-	return ret;
-}
-EXPORT_SYMBOL(blk_pre_runtime_suspend);
-
-/**
- * blk_post_runtime_suspend - Post runtime suspend processing
- * @q: the queue of the device
- * @err: return value of the device's runtime_suspend function
- *
- * Description:
- *    Update the queue's runtime status according to the return value of the
- *    device's runtime suspend function and mark last busy for the device so
- *    that PM core will try to auto suspend the device at a later time.
- *
- *    This function should be called near the end of the device's
- *    runtime_suspend callback.
- */
-void blk_post_runtime_suspend(struct request_queue *q, int err)
-{
-	if (!q->dev)
-		return;
-
-	spin_lock_irq(q->queue_lock);
-	if (!err) {
-		q->rpm_status = RPM_SUSPENDED;
-	} else {
-		q->rpm_status = RPM_ACTIVE;
-		pm_runtime_mark_last_busy(q->dev);
-	}
-	spin_unlock_irq(q->queue_lock);
-}
-EXPORT_SYMBOL(blk_post_runtime_suspend);
-
-/**
- * blk_pre_runtime_resume - Pre runtime resume processing
- * @q: the queue of the device
- *
- * Description:
- *    Update the queue's runtime status to RESUMING in preparation for the
- *    runtime resume of the device.
- *
- *    This function should be called near the start of the device's
- *    runtime_resume callback.
- */
-void blk_pre_runtime_resume(struct request_queue *q)
-{
-	if (!q->dev)
-		return;
-
-	spin_lock_irq(q->queue_lock);
-	q->rpm_status = RPM_RESUMING;
-	spin_unlock_irq(q->queue_lock);
-}
-EXPORT_SYMBOL(blk_pre_runtime_resume);
-
-/**
- * blk_post_runtime_resume - Post runtime resume processing
- * @q: the queue of the device
- * @err: return value of the device's runtime_resume function
- *
- * Description:
- *    Update the queue's runtime status according to the return value of the
- *    device's runtime_resume function. If it is successfully resumed, process
- *    the requests that are queued into the device's queue when it is resuming
- *    and then mark last busy and initiate autosuspend for it.
- *
- *    This function should be called near the end of the device's
- *    runtime_resume callback.
- */
-void blk_post_runtime_resume(struct request_queue *q, int err)
-{
-	if (!q->dev)
-		return;
-
-	spin_lock_irq(q->queue_lock);
-	if (!err) {
-		q->rpm_status = RPM_ACTIVE;
-		__blk_run_queue(q);
-		pm_runtime_mark_last_busy(q->dev);
-		pm_request_autosuspend(q->dev);
-	} else {
-		q->rpm_status = RPM_SUSPENDED;
-	}
-	spin_unlock_irq(q->queue_lock);
-}
-EXPORT_SYMBOL(blk_post_runtime_resume);
-
-/**
- * blk_set_runtime_active - Force runtime status of the queue to be active
- * @q: the queue of the device
- *
- * If the device is left runtime suspended during system suspend the resume
- * hook typically resumes the device and corrects runtime status
- * accordingly. However, that does not affect the queue runtime PM status
- * which is still "suspended". This prevents processing requests from the
- * queue.
- *
- * This function can be used in driver's resume hook to correct queue
- * runtime PM status and re-enable peeking requests from the queue. It
- * should be called before first request is added to the queue.
- */
-void blk_set_runtime_active(struct request_queue *q)
-{
-	spin_lock_irq(q->queue_lock);
-	q->rpm_status = RPM_ACTIVE;
-	pm_runtime_mark_last_busy(q->dev);
-	pm_request_autosuspend(q->dev);
-	spin_unlock_irq(q->queue_lock);
-}
-EXPORT_SYMBOL(blk_set_runtime_active);
-#endif
-
 int __init blk_dev_init(void)
 {
 	BUILD_BUG_ON(REQ_OP_LAST >= (1 << REQ_OP_BITS));
diff --git a/block/blk-pm.c b/block/blk-pm.c
new file mode 100644
index 000000000000..9b636960d285
--- /dev/null
+++ b/block/blk-pm.c
@@ -0,0 +1,188 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/blk-pm.h>
+#include <linux/blkdev.h>
+#include <linux/pm_runtime.h>
+
+/**
+ * blk_pm_runtime_init - Block layer runtime PM initialization routine
+ * @q: the queue of the device
+ * @dev: the device the queue belongs to
+ *
+ * Description:
+ *    Initialize runtime-PM-related fields for @q and start auto suspend for
+ *    @dev. Drivers that want to take advantage of request-based runtime PM
+ *    should call this function after @dev has been initialized, and its
+ *    request queue @q has been allocated, and runtime PM for it can not happen
+ *    yet(either due to disabled/forbidden or its usage_count > 0). In most
+ *    cases, driver should call this function before any I/O has taken place.
+ *
+ *    This function takes care of setting up using auto suspend for the device,
+ *    the autosuspend delay is set to -1 to make runtime suspend impossible
+ *    until an updated value is either set by user or by driver. Drivers do
+ *    not need to touch other autosuspend settings.
+ *
+ *    The block layer runtime PM is request based, so only works for drivers
+ *    that use request as their IO unit instead of those directly use bio's.
+ */
+void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
+{
+	/* Don't enable runtime PM for blk-mq until it is ready */
+	if (q->mq_ops) {
+		pm_runtime_disable(dev);
+		return;
+	}
+
+	q->dev = dev;
+	q->rpm_status = RPM_ACTIVE;
+	pm_runtime_set_autosuspend_delay(q->dev, -1);
+	pm_runtime_use_autosuspend(q->dev);
+}
+EXPORT_SYMBOL(blk_pm_runtime_init);
+
+/**
+ * blk_pre_runtime_suspend - Pre runtime suspend check
+ * @q: the queue of the device
+ *
+ * Description:
+ *    This function will check if runtime suspend is allowed for the device
+ *    by examining if there are any requests pending in the queue. If there
+ *    are requests pending, the device can not be runtime suspended; otherwise,
+ *    the queue's status will be updated to SUSPENDING and the driver can
+ *    proceed to suspend the device.
+ *
+ *    For the not allowed case, we mark last busy for the device so that
+ *    runtime PM core will try to autosuspend it some time later.
+ *
+ *    This function should be called near the start of the device's
+ *    runtime_suspend callback.
+ *
+ * Return:
+ *    0		- OK to runtime suspend the device
+ *    -EBUSY	- Device should not be runtime suspended
+ */
+int blk_pre_runtime_suspend(struct request_queue *q)
+{
+	int ret = 0;
+
+	if (!q->dev)
+		return ret;
+
+	spin_lock_irq(q->queue_lock);
+	if (q->nr_pending) {
+		ret = -EBUSY;
+		pm_runtime_mark_last_busy(q->dev);
+	} else {
+		q->rpm_status = RPM_SUSPENDING;
+	}
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+EXPORT_SYMBOL(blk_pre_runtime_suspend);
+
+/**
+ * blk_post_runtime_suspend - Post runtime suspend processing
+ * @q: the queue of the device
+ * @err: return value of the device's runtime_suspend function
+ *
+ * Description:
+ *    Update the queue's runtime status according to the return value of the
+ *    device's runtime suspend function and mark last busy for the device so
+ *    that PM core will try to auto suspend the device at a later time.
+ *
+ *    This function should be called near the end of the device's
+ *    runtime_suspend callback.
+ */
+void blk_post_runtime_suspend(struct request_queue *q, int err)
+{
+	if (!q->dev)
+		return;
+
+	spin_lock_irq(q->queue_lock);
+	if (!err) {
+		q->rpm_status = RPM_SUSPENDED;
+	} else {
+		q->rpm_status = RPM_ACTIVE;
+		pm_runtime_mark_last_busy(q->dev);
+	}
+	spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL(blk_post_runtime_suspend);
+
+/**
+ * blk_pre_runtime_resume - Pre runtime resume processing
+ * @q: the queue of the device
+ *
+ * Description:
+ *    Update the queue's runtime status to RESUMING in preparation for the
+ *    runtime resume of the device.
+ *
+ *    This function should be called near the start of the device's
+ *    runtime_resume callback.
+ */
+void blk_pre_runtime_resume(struct request_queue *q)
+{
+	if (!q->dev)
+		return;
+
+	spin_lock_irq(q->queue_lock);
+	q->rpm_status = RPM_RESUMING;
+	spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL(blk_pre_runtime_resume);
+
+/**
+ * blk_post_runtime_resume - Post runtime resume processing
+ * @q: the queue of the device
+ * @err: return value of the device's runtime_resume function
+ *
+ * Description:
+ *    Update the queue's runtime status according to the return value of the
+ *    device's runtime_resume function. If it is successfully resumed, process
+ *    the requests that are queued into the device's queue when it is resuming
+ *    and then mark last busy and initiate autosuspend for it.
+ *
+ *    This function should be called near the end of the device's
+ *    runtime_resume callback.
+ */
+void blk_post_runtime_resume(struct request_queue *q, int err)
+{
+	if (!q->dev)
+		return;
+
+	spin_lock_irq(q->queue_lock);
+	if (!err) {
+		q->rpm_status = RPM_ACTIVE;
+		__blk_run_queue(q);
+		pm_runtime_mark_last_busy(q->dev);
+		pm_request_autosuspend(q->dev);
+	} else {
+		q->rpm_status = RPM_SUSPENDED;
+	}
+	spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL(blk_post_runtime_resume);
+
+/**
+ * blk_set_runtime_active - Force runtime status of the queue to be active
+ * @q: the queue of the device
+ *
+ * If the device is left runtime suspended during system suspend the resume
+ * hook typically resumes the device and corrects runtime status
+ * accordingly. However, that does not affect the queue runtime PM status
+ * which is still "suspended". This prevents processing requests from the
+ * queue.
+ *
+ * This function can be used in driver's resume hook to correct queue
+ * runtime PM status and re-enable peeking requests from the queue. It
+ * should be called before first request is added to the queue.
+ */
+void blk_set_runtime_active(struct request_queue *q)
+{
+	spin_lock_irq(q->queue_lock);
+	q->rpm_status = RPM_ACTIVE;
+	pm_runtime_mark_last_busy(q->dev);
+	pm_request_autosuspend(q->dev);
+	spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL(blk_set_runtime_active);
diff --git a/block/blk-pm.h b/block/blk-pm.h
new file mode 100644
index 000000000000..1ffc8ef203ec
--- /dev/null
+++ b/block/blk-pm.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _BLOCK_BLK_PM_H_
+#define _BLOCK_BLK_PM_H_
+
+#include <linux/pm_runtime.h>
+
+#ifdef CONFIG_PM
+static inline void blk_pm_requeue_request(struct request *rq)
+{
+	if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+		rq->q->nr_pending--;
+}
+
+static inline void blk_pm_add_request(struct request_queue *q,
+				      struct request *rq)
+{
+	if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
+	    (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
+		pm_request_resume(q->dev);
+}
+
+static inline void blk_pm_put_request(struct request *rq)
+{
+	if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
+		pm_runtime_mark_last_busy(rq->q->dev);
+}
+#else
+static inline void blk_pm_requeue_request(struct request *rq)
+{
+}
+
+static inline void blk_pm_add_request(struct request_queue *q,
+				      struct request *rq)
+{
+}
+
+static inline void blk_pm_put_request(struct request *rq)
+{
+}
+#endif
+
+#endif /* _BLOCK_BLK_PM_H_ */
diff --git a/block/elevator.c b/block/elevator.c
index fa828b5bfd4b..4c15f0240c99 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -41,6 +41,7 @@
 
 #include "blk.h"
 #include "blk-mq-sched.h"
+#include "blk-pm.h"
 #include "blk-wbt.h"
 
 static DEFINE_SPINLOCK(elv_list_lock);
@@ -557,27 +558,6 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
 		e->type->ops.sq.elevator_bio_merged_fn(q, rq, bio);
 }
 
-#ifdef CONFIG_PM
-static void blk_pm_requeue_request(struct request *rq)
-{
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM))
-		rq->q->nr_pending--;
-}
-
-static void blk_pm_add_request(struct request_queue *q, struct request *rq)
-{
-	if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
-	    (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
-		pm_request_resume(q->dev);
-}
-#else
-static inline void blk_pm_requeue_request(struct request *rq) {}
-static inline void blk_pm_add_request(struct request_queue *q,
-				      struct request *rq)
-{
-}
-#endif
-
 void elv_requeue_request(struct request_queue *q, struct request *rq)
 {
 	/*
diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index b44c1bb687a2..a2b4179bfdf7 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -8,6 +8,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/export.h>
 #include <linux/async.h>
+#include <linux/blk-pm.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_device.h>
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bbebdc3769b0..69ab459abb98 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -45,6 +45,7 @@
 #include <linux/init.h>
 #include <linux/blkdev.h>
 #include <linux/blkpg.h>
+#include <linux/blk-pm.h>
 #include <linux/delay.h>
 #include <linux/mutex.h>
 #include <linux/string_helpers.h>
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 3f3cb72e0c0c..de4413e66eca 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -43,6 +43,7 @@
 #include <linux/interrupt.h>
 #include <linux/init.h>
 #include <linux/blkdev.h>
+#include <linux/blk-pm.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include <linux/pm_runtime.h>
diff --git a/include/linux/blk-pm.h b/include/linux/blk-pm.h
new file mode 100644
index 000000000000..b80c65aba249
--- /dev/null
+++ b/include/linux/blk-pm.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _BLK_PM_H_
+#define _BLK_PM_H_
+
+struct device;
+struct request_queue;
+
+/*
+ * block layer runtime pm functions
+ */
+#ifdef CONFIG_PM
+extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
+extern int blk_pre_runtime_suspend(struct request_queue *q);
+extern void blk_post_runtime_suspend(struct request_queue *q, int err);
+extern void blk_pre_runtime_resume(struct request_queue *q);
+extern void blk_post_runtime_resume(struct request_queue *q, int err);
+extern void blk_set_runtime_active(struct request_queue *q);
+#else
+static inline void blk_pm_runtime_init(struct request_queue *q,
+				       struct device *dev) {}
+#endif
+
+#endif /* _BLK_PM_H_ */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1b3fe78872dc..2ef38739d645 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1284,29 +1284,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
 extern void blk_put_queue(struct request_queue *);
 extern void blk_set_queue_dying(struct request_queue *);
 
-/*
- * block layer runtime pm functions
- */
-#ifdef CONFIG_PM
-extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
-extern int blk_pre_runtime_suspend(struct request_queue *q);
-extern void blk_post_runtime_suspend(struct request_queue *q, int err);
-extern void blk_pre_runtime_resume(struct request_queue *q);
-extern void blk_post_runtime_resume(struct request_queue *q, int err);
-extern void blk_set_runtime_active(struct request_queue *q);
-#else
-static inline void blk_pm_runtime_init(struct request_queue *q,
-	struct device *dev) {}
-static inline int blk_pre_runtime_suspend(struct request_queue *q)
-{
-	return -ENOSYS;
-}
-static inline void blk_post_runtime_suspend(struct request_queue *q, int err) {}
-static inline void blk_pre_runtime_resume(struct request_queue *q) {}
-static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
-static inline void blk_set_runtime_active(struct request_queue *q) {}
-#endif
-
 /*
  * blk_plug permits building a queue of related requests by holding the I/O
  * fragments for a short period. This allows merging of sequential requests
-- 
2.18.0

* [PATCH v4 05/10] block: Serialize queue freezing and blk_pre_runtime_suspend()
From: Bart Van Assche @ 2018-08-04  0:03 UTC
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Johannes Thumshirn, Alan Stern

Serialize these operations because a later patch will add code into
blk_pre_runtime_suspend() that should not run concurrently with
queue freezing or unfreezing.
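
The lock added below may be taken recursively by the same task,
which is what allows a queue freeze inside the runtime-suspend path.
A hypothetical call sequence (illustration only, not part of this
patch):

    blk_pm_runtime_lock(q);     /* rpm_owner = current, nesting 1 */
    blk_pm_runtime_lock(q);     /* same owner: no deadlock, nesting 2 */
    blk_pm_runtime_unlock(q);   /* nesting 1 */
    blk_pm_runtime_unlock(q);   /* nesting 0: owner cleared, waiters woken */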

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
---
 block/blk-core.c       |  5 +++++
 block/blk-mq.c         |  3 +++
 block/blk-pm.c         | 44 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/blk-pm.h |  6 ++++++
 include/linux/blkdev.h |  5 +++++
 5 files changed, 63 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 03cff7445dee..59382c758155 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/blkdev.h>
 #include <linux/blk-mq.h>
+#include <linux/blk-pm.h>
 #include <linux/highmem.h>
 #include <linux/mm.h>
 #include <linux/kernel_stat.h>
@@ -696,6 +697,7 @@ void blk_set_queue_dying(struct request_queue *q)
 	 * prevent I/O from crossing blk_queue_enter().
 	 */
 	blk_freeze_queue_start(q);
+	blk_pm_runtime_unlock(q);
 
 	if (q->mq_ops)
 		blk_mq_wake_waiters(q);
@@ -756,6 +758,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	 * prevent that q->request_fn() gets invoked after draining finished.
 	 */
 	blk_freeze_queue(q);
+	blk_pm_runtime_unlock(q);
 	spin_lock_irq(lock);
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);
@@ -1045,6 +1048,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	mutex_init(&q->blk_trace_mutex);
 #endif
+	blk_pm_init(q);
+
 	mutex_init(&q->sysfs_lock);
 	spin_lock_init(&q->__queue_lock);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8b23ae34d949..b1882a3a5216 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -9,6 +9,7 @@
 #include <linux/backing-dev.h>
 #include <linux/bio.h>
 #include <linux/blkdev.h>
+#include <linux/blk-pm.h>
 #include <linux/kmemleak.h>
 #include <linux/mm.h>
 #include <linux/init.h>
@@ -138,6 +139,7 @@ void blk_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
+	blk_pm_runtime_lock(q);
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
@@ -201,6 +203,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 		percpu_ref_reinit(&q->q_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}
+	blk_pm_runtime_unlock(q);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 9b636960d285..2a4632d0be4b 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -3,6 +3,45 @@
 #include <linux/blk-pm.h>
 #include <linux/blkdev.h>
 #include <linux/pm_runtime.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+
+/*
+ * Initialize the request queue members used by blk_pm_runtime_lock() and
+ * blk_pm_runtime_unlock().
+ */
+void blk_pm_init(struct request_queue *q)
+{
+	spin_lock_init(&q->rpm_lock);
+	init_waitqueue_head(&q->rpm_wq);
+	q->rpm_owner = NULL;
+	q->rpm_nesting_level = 0;
+}
+
+void blk_pm_runtime_lock(struct request_queue *q)
+{
+	might_sleep();
+
+	spin_lock(&q->rpm_lock);
+	wait_event_exclusive_cmd(q->rpm_wq,
+			q->rpm_owner == NULL || q->rpm_owner == current,
+			spin_unlock(&q->rpm_lock), spin_lock(&q->rpm_lock));
+	if (q->rpm_owner == NULL)
+		q->rpm_owner = current;
+	q->rpm_nesting_level++;
+	spin_unlock(&q->rpm_lock);
+}
+
+void blk_pm_runtime_unlock(struct request_queue *q)
+{
+	spin_lock(&q->rpm_lock);
+	WARN_ON_ONCE(q->rpm_nesting_level <= 0);
+	if (--q->rpm_nesting_level == 0) {
+		q->rpm_owner = NULL;
+		wake_up(&q->rpm_wq);
+	}
+	spin_unlock(&q->rpm_lock);
+}
 
 /**
  * blk_pm_runtime_init - Block layer runtime PM initialization routine
@@ -68,6 +107,8 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	if (!q->dev)
 		return ret;
 
+	blk_pm_runtime_lock(q);
+
 	spin_lock_irq(q->queue_lock);
 	if (q->nr_pending) {
 		ret = -EBUSY;
@@ -76,6 +117,9 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 		q->rpm_status = RPM_SUSPENDING;
 	}
 	spin_unlock_irq(q->queue_lock);
+
+	blk_pm_runtime_unlock(q);
+
 	return ret;
 }
 EXPORT_SYMBOL(blk_pre_runtime_suspend);
diff --git a/include/linux/blk-pm.h b/include/linux/blk-pm.h
index b80c65aba249..aafcc7877e53 100644
--- a/include/linux/blk-pm.h
+++ b/include/linux/blk-pm.h
@@ -10,6 +10,9 @@ struct request_queue;
  * block layer runtime pm functions
  */
 #ifdef CONFIG_PM
+extern void blk_pm_init(struct request_queue *q);
+extern void blk_pm_runtime_lock(struct request_queue *q);
+extern void blk_pm_runtime_unlock(struct request_queue *q);
 extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
 extern int blk_pre_runtime_suspend(struct request_queue *q);
 extern void blk_post_runtime_suspend(struct request_queue *q, int err);
@@ -17,6 +20,9 @@ extern void blk_pre_runtime_resume(struct request_queue *q);
 extern void blk_post_runtime_resume(struct request_queue *q, int err);
 extern void blk_set_runtime_active(struct request_queue *q);
 #else
+static inline void blk_pm_init(struct request_queue *q) {}
+static inline void blk_pm_runtime_lock(struct request_queue *q) {}
+static inline void blk_pm_runtime_unlock(struct request_queue *q) {}
 static inline void blk_pm_runtime_init(struct request_queue *q,
 				       struct device *dev) {}
 #endif
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2ef38739d645..72d569218231 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -548,6 +548,11 @@ struct request_queue {
 	struct device		*dev;
 	int			rpm_status;
 	unsigned int		nr_pending;
+	wait_queue_head_t	rpm_wq;
+	/* rpm_lock protects rpm_owner and rpm_nesting_level */
+	spinlock_t		rpm_lock;
+	struct task_struct	*rpm_owner;
+	int			rpm_nesting_level;
 #endif
 
 	/*
-- 
2.18.0

* [PATCH v4 06/10] percpu-refcount: Introduce percpu_ref_read()
From: Bart Van Assche @ 2018-08-04  0:03 UTC
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Tejun Heo,
	Jianchao Wang, Ming Lei, Alan Stern, Johannes Thumshirn

Introduce a function that allows reading the value of a percpu_ref
counter. This function will be used in the next patch to check
whether such a counter has the value one.
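
A hedged usage sketch (hypothetical caller, not part of this patch):

    /* A freshly initialized percpu_ref counts as one reference, so a
     * return value of 1 means no other context holds the ref. The
     * value is approximate while the ref is in percpu mode, so any
     * caller must tolerate stale reads. */
    if (percpu_ref_read(&q->q_usage_counter) == 1)
            pr_debug("request queue appears to be idle\n");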

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 include/linux/percpu-refcount.h |  2 ++
 lib/percpu-refcount.c           | 29 +++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 009cdf3d65b6..5707289ba828 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -331,4 +331,6 @@ static inline bool percpu_ref_is_zero(struct percpu_ref *ref)
 	return !atomic_long_read(&ref->count);
 }
 
+unsigned long percpu_ref_read(struct percpu_ref *ref);
+
 #endif
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9f96fa7bc000..c0b9fc8efa6b 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -369,3 +369,32 @@ void percpu_ref_reinit(struct percpu_ref *ref)
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_reinit);
+
+/**
+ * percpu_ref_read - read a percpu refcount
+ * @ref: percpu_ref to test
+ *
+ * This function is safe to call as long as @ref is between init and exit. It
+ * is the responsibility of the caller to handle changes of @ref concurrently
+ * with this function. If this function is called while @ref is in per-cpu
+ * mode the returned value may be incorrect if e.g. percpu_ref_get() is called
+ * from one CPU and percpu_ref_put() is called from another CPU.
+ */
+unsigned long percpu_ref_read(struct percpu_ref *ref)
+{
+	unsigned long __percpu *percpu_count;
+	unsigned long sum = 0;
+	int cpu;
+
+	rcu_read_lock_sched();
+	if (__ref_is_percpu(ref, &percpu_count)) {
+		for_each_possible_cpu(cpu)
+			sum += *per_cpu_ptr(percpu_count, cpu);
+	}
+	rcu_read_unlock_sched();
+	sum += atomic_long_read(&ref->count);
+	sum &= ~PERCPU_COUNT_BIAS;
+
+	return sum;
+}
+EXPORT_SYMBOL_GPL(percpu_ref_read);
-- 
2.18.0

* [PATCH v4 07/10] block, scsi: Rework runtime power management
From: Bart Van Assche @ 2018-08-04  0:03 UTC
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Martin K . Petersen, Jianchao Wang, Ming Lei, Alan Stern,
	Johannes Thumshirn

Instead of allowing requests that are not power management requests
to enter the queue while it is runtime suspended (RPM_SUSPENDED),
make the blk_get_request() caller block. This change fixes a
starvation issue: it is now guaranteed that power management
requests will be executed no matter how many blk_get_request()
callers are waiting. Instead of maintaining the q->nr_pending
counter, rely on q->q_usage_counter. Call pm_runtime_mark_last_busy()
every time a request finishes instead of only when the queue depth
drops to zero.
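
Condensed from the blk-pm.c hunk below, the new idle check at
suspend time boils down to:

    blk_set_pm_only(q);
    ret = -EBUSY;
    if (percpu_ref_read(&q->q_usage_counter) == 1) {
            /* Make the pm-only state visible to concurrent
             * blk_queue_enter() calls before trusting a re-read. */
            synchronize_rcu();
            if (percpu_ref_read(&q->q_usage_counter) == 1)
                    ret = 0;
    }
    if (ret)
            blk_clear_pm_only(q);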

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-core.c       | 37 ++++++++-----------------------------
 block/blk-mq-debugfs.c |  1 -
 block/blk-pm.c         | 29 ++++++++++++++++++++++++-----
 block/blk-pm.h         | 14 ++++++++------
 include/linux/blkdev.h |  1 -
 5 files changed, 40 insertions(+), 42 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 59382c758155..1e208e134ca9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2736,30 +2736,6 @@ void blk_account_io_done(struct request *req, u64 now)
 	}
 }
 
-#ifdef CONFIG_PM
-/*
- * Don't process normal requests when queue is suspended
- * or in the process of suspending/resuming
- */
-static bool blk_pm_allow_request(struct request *rq)
-{
-	switch (rq->q->rpm_status) {
-	case RPM_RESUMING:
-	case RPM_SUSPENDING:
-		return rq->rq_flags & RQF_PM;
-	case RPM_SUSPENDED:
-		return false;
-	default:
-		return true;
-	}
-}
-#else
-static bool blk_pm_allow_request(struct request *rq)
-{
-	return true;
-}
-#endif
-
 void blk_account_io_start(struct request *rq, bool new_io)
 {
 	struct hd_struct *part;
@@ -2805,11 +2781,14 @@ static struct request *elv_next_request(struct request_queue *q)
 
 	while (1) {
 		list_for_each_entry(rq, &q->queue_head, queuelist) {
-			if (blk_pm_allow_request(rq))
-				return rq;
-
-			if (rq->rq_flags & RQF_SOFTBARRIER)
-				break;
+#ifdef CONFIG_PM
+			/*
+			 * If a request gets queued in state RPM_SUSPENDED
+			 * then that's a kernel bug.
+			 */
+			WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
+#endif
+			return rq;
 		}
 
 		/*
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index a5ea86835fcb..7d74d53dc098 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -332,7 +332,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(ELVPRIV),
 	RQF_NAME(IO_STAT),
 	RQF_NAME(ALLOCED),
-	RQF_NAME(PM),
 	RQF_NAME(HASHED),
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 2a4632d0be4b..0708185ead61 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -107,17 +107,31 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	if (!q->dev)
 		return ret;
 
+	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
+
 	blk_pm_runtime_lock(q);
+	/*
+	 * Switch to preempt-only mode before calling synchronize_rcu() such
+	 * that later blk_queue_enter() calls see the preempt-only state. See
+	 * also http://lwn.net/Articles/573497/.
+	 */
+	blk_set_pm_only(q);
+	ret = -EBUSY;
+	if (percpu_ref_read(&q->q_usage_counter) == 1) {
+		synchronize_rcu();
+		if (percpu_ref_read(&q->q_usage_counter) == 1)
+			ret = 0;
+	}
 
 	spin_lock_irq(q->queue_lock);
-	if (q->nr_pending) {
-		ret = -EBUSY;
+	if (ret < 0)
 		pm_runtime_mark_last_busy(q->dev);
-	} else {
+	else
 		q->rpm_status = RPM_SUSPENDING;
-	}
 	spin_unlock_irq(q->queue_lock);
 
+	if (ret)
+		blk_clear_pm_only(q);
 	blk_pm_runtime_unlock(q);
 
 	return ret;
@@ -150,6 +164,9 @@ void blk_post_runtime_suspend(struct request_queue *q, int err)
 		pm_runtime_mark_last_busy(q->dev);
 	}
 	spin_unlock_irq(q->queue_lock);
+
+	if (err)
+		blk_clear_pm_only(q);
 }
 EXPORT_SYMBOL(blk_post_runtime_suspend);
 
@@ -197,13 +214,15 @@ void blk_post_runtime_resume(struct request_queue *q, int err)
 	spin_lock_irq(q->queue_lock);
 	if (!err) {
 		q->rpm_status = RPM_ACTIVE;
-		__blk_run_queue(q);
 		pm_runtime_mark_last_busy(q->dev);
 		pm_request_autosuspend(q->dev);
 	} else {
 		q->rpm_status = RPM_SUSPENDED;
 	}
 	spin_unlock_irq(q->queue_lock);
+
+	if (!err)
+		blk_clear_pm_only(q);
 }
 EXPORT_SYMBOL(blk_post_runtime_resume);
 
diff --git a/block/blk-pm.h b/block/blk-pm.h
index 1ffc8ef203ec..8d250e545e42 100644
--- a/block/blk-pm.h
+++ b/block/blk-pm.h
@@ -8,21 +8,23 @@
 #ifdef CONFIG_PM
 static inline void blk_pm_requeue_request(struct request *rq)
 {
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM))
-		rq->q->nr_pending--;
 }
 
 static inline void blk_pm_add_request(struct request_queue *q,
 				      struct request *rq)
 {
-	if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
-	    (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
-		pm_request_resume(q->dev);
+	if (!q->dev || (rq->rq_flags & RQF_PREEMPT))
+		return;
+	/*
+	 * If the below warning is triggered then that means that the caller
+	 * did not use pm_runtime_get*().
+	 */
+	WARN_ON_ONCE(atomic_read(&q->dev->power.usage_count) == 0);
 }
 
 static inline void blk_pm_put_request(struct request *rq)
 {
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
+	if (rq->q->dev && !(rq->rq_flags & RQF_PREEMPT))
 		pm_runtime_mark_last_busy(rq->q->dev);
 }
 #else
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 72d569218231..9f859f32757c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -547,7 +547,6 @@ struct request_queue {
 #ifdef CONFIG_PM
 	struct device		*dev;
 	int			rpm_status;
-	unsigned int		nr_pending;
 	wait_queue_head_t	rpm_wq;
 	/* rpm_lock protects rpm_owner and rpm_nesting_level */
 	spinlock_t		rpm_lock;
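
For reference, a simplified sketch (not part of this patch; flag and helper
names follow earlier patches in this series, and NOWAIT handling is omitted)
of the blk_queue_enter() behavior that makes non-PM callers block:

	int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
	{
		const bool pm = flags & BLK_MQ_REQ_PM;

		while (true) {
			bool success = false;

			rcu_read_lock();
			if (percpu_ref_tryget_live(&q->q_usage_counter)) {
				/* Only PM requests may pass in pm-only mode. */
				if (pm || !blk_queue_pm_only(q))
					success = true;
				else
					percpu_ref_put(&q->q_usage_counter);
			}
			rcu_read_unlock();

			if (success)
				return 0;

			wait_event(q->mq_freeze_wq,
				   (atomic_read(&q->mq_freeze_depth) == 0 &&
				    (pm || !blk_queue_pm_only(q))) ||
				   blk_queue_dying(q));
			if (blk_queue_dying(q))
				return -ENODEV;
		}
	}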
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v4 08/10] block: Remove blk_pm_requeue_request()
  2018-08-04  0:03 [PATCH v4 00/10] blk-mq: Enable runtime power management Bart Van Assche
                   ` (6 preceding siblings ...)
  2018-08-04  0:03 ` [PATCH v4 07/10] block, scsi: Rework runtime power management Bart Van Assche
@ 2018-08-04  0:03 ` Bart Van Assche
  2018-08-04  0:03 ` [PATCH v4 09/10] blk-mq: Insert blk_pm_{add,put}_request() calls Bart Van Assche
  2018-08-04  0:03 ` [PATCH v4 10/10] blk-mq: Enable support for runtime power management Bart Van Assche
  9 siblings, 0 replies; 27+ messages in thread
From: Bart Van Assche @ 2018-08-04  0:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Alan Stern, Johannes Thumshirn

Since this function is now empty, remove it.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-pm.h   | 8 --------
 block/elevator.c | 2 --
 2 files changed, 10 deletions(-)

diff --git a/block/blk-pm.h b/block/blk-pm.h
index 8d250e545e42..0bfd0d3b5356 100644
--- a/block/blk-pm.h
+++ b/block/blk-pm.h
@@ -6,10 +6,6 @@
 #include <linux/pm_runtime.h>
 
 #ifdef CONFIG_PM
-static inline void blk_pm_requeue_request(struct request *rq)
-{
-}
-
 static inline void blk_pm_add_request(struct request_queue *q,
 				      struct request *rq)
 {
@@ -28,10 +24,6 @@ static inline void blk_pm_put_request(struct request *rq)
 		pm_runtime_mark_last_busy(rq->q->dev);
 }
 #else
-static inline void blk_pm_requeue_request(struct request *rq)
-{
-}
-
 static inline void blk_pm_add_request(struct request_queue *q,
 				      struct request *rq)
 {
diff --git a/block/elevator.c b/block/elevator.c
index 4c15f0240c99..fd7b12bb32a3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -572,8 +572,6 @@ void elv_requeue_request(struct request_queue *q, struct request *rq)
 
 	rq->rq_flags &= ~RQF_STARTED;
 
-	blk_pm_requeue_request(rq);
-
 	__elv_add_request(q, rq, ELEVATOR_INSERT_REQUEUE);
 }
 
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v4 09/10] blk-mq: Insert blk_pm_{add,put}_request() calls
  2018-08-04  0:03 [PATCH v4 00/10] blk-mq: Enable runtime power management Bart Van Assche
                   ` (7 preceding siblings ...)
  2018-08-04  0:03 ` [PATCH v4 08/10] block: Remove blk_pm_requeue_request() Bart Van Assche
@ 2018-08-04  0:03 ` Bart Van Assche
  2018-08-04  0:03 ` [PATCH v4 10/10] blk-mq: Enable support for runtime power management Bart Van Assche
  9 siblings, 0 replies; 27+ messages in thread
From: Bart Van Assche @ 2018-08-04  0:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Alan Stern, Johannes Thumshirn

Make sure that blk_pm_add_request() is called exactly once before a
request is added to a software queue or to the scheduler, and also
before .queue_rq() is called directly. Call blk_pm_put_request()
after a request has finished.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-mq-sched.c | 13 +++++++++++--
 block/blk-mq.c       |  8 ++++++++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index cf9c66c6d35a..d87839b31d56 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -14,6 +14,7 @@
 #include "blk-mq-debugfs.h"
 #include "blk-mq-sched.h"
 #include "blk-mq-tag.h"
+#include "blk-pm.h"
 #include "blk-wbt.h"
 
 void blk_mq_sched_free_hctx_data(struct request_queue *q,
@@ -349,6 +350,8 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 {
 	/* dispatch flush rq directly */
 	if (rq->rq_flags & RQF_FLUSH_SEQ) {
+		blk_pm_add_request(rq->q, rq);
+
 		spin_lock(&hctx->lock);
 		list_add(&rq->queuelist, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
@@ -380,6 +383,8 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 	if (blk_mq_sched_bypass_insert(hctx, !!e, rq))
 		goto run;
 
+	blk_pm_add_request(q, rq);
+
 	if (e && e->type->ops.mq.insert_requests) {
 		LIST_HEAD(list);
 
@@ -402,10 +407,14 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
 {
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
 	struct elevator_queue *e = hctx->queue->elevator;
+	struct request *rq;
+
+	if (e && e->type->ops.mq.insert_requests) {
+		list_for_each_entry(rq, list, queuelist)
+			blk_pm_add_request(q, rq);
 
-	if (e && e->type->ops.mq.insert_requests)
 		e->type->ops.mq.insert_requests(hctx, list, false);
-	else {
+	} else {
 		/*
 		 * try to issue requests directly if the hw queue isn't
 		 * busy in case of 'none' scheduler, and this way may save
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b1882a3a5216..74a575a32dda 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -36,6 +36,7 @@
 #include "blk-mq-tag.h"
 #include "blk-stat.h"
 #include "blk-mq-sched.h"
+#include "blk-pm.h"
 #include "blk-rq-qos.h"
 
 static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
@@ -474,6 +475,8 @@ static void __blk_mq_free_request(struct request *rq)
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
 	const int sched_tag = rq->internal_tag;
 
+	blk_pm_put_request(rq);
+
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -1563,6 +1566,8 @@ void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
 
+	blk_pm_add_request(rq->q, rq);
+
 	spin_lock(&hctx->lock);
 	list_add_tail(&rq->queuelist, &hctx->dispatch);
 	spin_unlock(&hctx->lock);
@@ -1584,6 +1589,7 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	list_for_each_entry(rq, list, queuelist) {
 		BUG_ON(rq->mq_ctx != ctx);
 		trace_block_rq_insert(hctx->queue, rq);
+		blk_pm_add_request(rq->q, rq);
 	}
 
 	spin_lock(&ctx->lock);
@@ -1680,6 +1686,8 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 	blk_qc_t new_cookie;
 	blk_status_t ret;
 
+	blk_pm_add_request(q, rq);
+
 	new_cookie = request_to_qc_t(hctx, rq);
 
 	/*
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v4 10/10] blk-mq: Enable support for runtime power management
  2018-08-04  0:03 [PATCH v4 00/10] blk-mq: Enable runtime power management Bart Van Assche
                   ` (8 preceding siblings ...)
  2018-08-04  0:03 ` [PATCH v4 09/10] blk-mq: Insert blk_pm_{add,put}_request() calls Bart Van Assche
@ 2018-08-04  0:03 ` Bart Van Assche
  9 siblings, 0 replies; 27+ messages in thread
From: Bart Van Assche @ 2018-08-04  0:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Alan Stern, Johannes Thumshirn

Now that the blk-mq core processes power management requests
(marked with RQF_PREEMPT) in states other than RPM_ACTIVE, enable
runtime power management for blk-mq.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-pm.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/block/blk-pm.c b/block/blk-pm.c
index 0708185ead61..583cfc04e869 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -66,12 +66,6 @@ void blk_pm_runtime_unlock(struct request_queue *q)
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-	/* Don't enable runtime PM for blk-mq until it is ready */
-	if (q->mq_ops) {
-		pm_runtime_disable(dev);
-		return;
-	}
-
 	q->dev = dev;
 	q->rpm_status = RPM_ACTIVE;
 	pm_runtime_set_autosuspend_delay(q->dev, -1);
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 02/10] block, scsi: Give RQF_PREEMPT back its original meaning
  2018-08-04  0:03 ` [PATCH v4 02/10] block, scsi: Give RQF_PREEMPT back its original meaning Bart Van Assche
@ 2018-08-04  1:16   ` Ming Lei
  2018-08-06 14:15     ` Bart Van Assche
  2018-08-06  6:27   ` Hannes Reinecke
  1 sibling, 1 reply; 27+ messages in thread
From: Ming Lei @ 2018-08-04  1:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Martin K . Petersen,
	Jianchao Wang, Ming Lei, Alan Stern, Johannes Thumshirn

On Sat, Aug 4, 2018 at 8:03 AM, Bart Van Assche <bart.vanassche@wdc.com> wrote:
> In kernel v2.6.18 RQF_PREEMPT was only set for IDE preempt requests.
> Later on the SCSI core was modified such that the RQF_PREEMPT flag
> was set for all requests submitted by __scsi_execute(), including
> power management requests. RQF_PREEMPT requests are the only requests
> processed in the SDEV_QUIESCE state. Instead of setting RQF_PREEMPT
> for all requests submitted by __scsi_execute(), only set RQF_PM for
> power management requests. Modify blk_get_request() such that it

This looks like a big change: since this patch only allows PM commands,
instead of all admin commands, to be handled when the device is in the
quiesce state, we should be careful with it.

> blocks in pm-only mode on non-RQF_PM requests. Remove the code from
> scsi_prep_state_check() that is no longer reachable.
>
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  block/blk-core.c        |  9 +++++----
>  drivers/scsi/scsi_lib.c | 28 ++++++++++++----------------
>  include/linux/blk-mq.h  |  2 ++
>  include/linux/blkdev.h  |  6 ++----
>  4 files changed, 21 insertions(+), 24 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b83798c03ca8..3378fe478e67 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -908,11 +908,11 @@ EXPORT_SYMBOL(blk_alloc_queue);
>  /**
>   * blk_queue_enter() - try to increase q->q_usage_counter
>   * @q: request queue pointer
> - * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
> + * @flags: BLK_MQ_REQ_NOWAIT, BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_PM
>   */
>  int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
>  {
> -       const bool pm = flags & BLK_MQ_REQ_PREEMPT;
> +       const bool pm = flags & BLK_MQ_REQ_PM;
>
>         while (true) {
>                 bool success = false;
> @@ -1570,7 +1570,7 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
>         goto retry;
>  }
>
> -/* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
> +/* flags: BLK_MQ_REQ_PREEMPT, BLK_MQ_REQ_PM and/or BLK_MQ_REQ_NOWAIT. */
>  static struct request *blk_old_get_request(struct request_queue *q,
>                                 unsigned int op, blk_mq_req_flags_t flags)
>  {
> @@ -1613,7 +1613,8 @@ struct request *blk_get_request(struct request_queue *q, unsigned int op,
>         struct request *req;
>
>         WARN_ON_ONCE(op & REQ_NOWAIT);
> -       WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
> +       WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT |
> +                              BLK_MQ_REQ_PM));
>
>         if (q->mq_ops) {
>                 req = blk_mq_alloc_request(q, op, flags);
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 007bb2bc817d..481158ab5081 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -248,8 +248,8 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
>   * @sshdr:     optional decoded sense header
>   * @timeout:   request timeout in seconds
>   * @retries:   number of times to retry request
> - * @flags:     flags for ->cmd_flags
> - * @rq_flags:  flags for ->rq_flags
> + * @flags:     REQ_* flags for ->cmd_flags
> + * @rq_flags:  RQF_* flags for ->rq_flags
>   * @resid:     optional residual length
>   *
>   * Returns the scsi_cmnd result field if a command was executed, or a negative
> @@ -267,7 +267,8 @@ int __scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
>
>         req = blk_get_request(sdev->request_queue,
>                         data_direction == DMA_TO_DEVICE ?
> -                       REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
> +                       REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN,
> +                       rq_flags & RQF_PM ? BLK_MQ_REQ_PM : 0);

The above change will cause a regression on the SPI transport; please see
spi_dv_device(), where a non-PM command is sent to the device via
scsi_execute() after the device has been put into QUIESCE.


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 07/10] block, scsi: Rework runtime power management
  2018-08-04  0:03 ` [PATCH v4 07/10] block, scsi: Rework runtime power management Bart Van Assche
@ 2018-08-04 10:01   ` Ming Lei
  2018-08-06 15:20     ` Bart Van Assche
  2018-08-06  2:46   ` jianchao.wang
  1 sibling, 1 reply; 27+ messages in thread
From: Ming Lei @ 2018-08-04 10:01 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Martin K . Petersen,
	Jianchao Wang, Ming Lei, Alan Stern, Johannes Thumshirn,
	Paul E. McKenney

On Sat, Aug 4, 2018 at 8:03 AM, Bart Van Assche <bart.vanassche@wdc.com> wrote:
> Instead of allowing requests that are not power management requests
> to enter the queue in runtime suspended status (RPM_SUSPENDED), make
> the blk_get_request() caller block. This change fixes a starvation
> issue: it is now guaranteed that power management requests will be
> executed no matter how many blk_get_request() callers are waiting.
> Instead of maintaining the q->nr_pending counter, rely on
> q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
> request finishes instead of only if the queue depth drops to zero.
>
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  block/blk-core.c       | 37 ++++++++-----------------------------
>  block/blk-mq-debugfs.c |  1 -
>  block/blk-pm.c         | 29 ++++++++++++++++++++++++-----
>  block/blk-pm.h         | 14 ++++++++------
>  include/linux/blkdev.h |  1 -
>  5 files changed, 40 insertions(+), 42 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 59382c758155..1e208e134ca9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2736,30 +2736,6 @@ void blk_account_io_done(struct request *req, u64 now)
>         }
>  }
>
> -#ifdef CONFIG_PM
> -/*
> - * Don't process normal requests when queue is suspended
> - * or in the process of suspending/resuming
> - */
> -static bool blk_pm_allow_request(struct request *rq)
> -{
> -       switch (rq->q->rpm_status) {
> -       case RPM_RESUMING:
> -       case RPM_SUSPENDING:
> -               return rq->rq_flags & RQF_PM;
> -       case RPM_SUSPENDED:
> -               return false;
> -       default:
> -               return true;
> -       }
> -}
> -#else
> -static bool blk_pm_allow_request(struct request *rq)
> -{
> -       return true;
> -}
> -#endif
> -
>  void blk_account_io_start(struct request *rq, bool new_io)
>  {
>         struct hd_struct *part;
> @@ -2805,11 +2781,14 @@ static struct request *elv_next_request(struct request_queue *q)
>
>         while (1) {
>                 list_for_each_entry(rq, &q->queue_head, queuelist) {
> -                       if (blk_pm_allow_request(rq))
> -                               return rq;
> -
> -                       if (rq->rq_flags & RQF_SOFTBARRIER)
> -                               break;
> +#ifdef CONFIG_PM
> +                       /*
> +                        * If a request gets queued in state RPM_SUSPENDED
> +                        * then that's a kernel bug.
> +                        */
> +                       WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
> +#endif
> +                       return rq;
>                 }
>
>                 /*
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index a5ea86835fcb..7d74d53dc098 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -332,7 +332,6 @@ static const char *const rqf_name[] = {
>         RQF_NAME(ELVPRIV),
>         RQF_NAME(IO_STAT),
>         RQF_NAME(ALLOCED),
> -       RQF_NAME(PM),
>         RQF_NAME(HASHED),
>         RQF_NAME(STATS),
>         RQF_NAME(SPECIAL_PAYLOAD),
> diff --git a/block/blk-pm.c b/block/blk-pm.c
> index 2a4632d0be4b..0708185ead61 100644
> --- a/block/blk-pm.c
> +++ b/block/blk-pm.c
> @@ -107,17 +107,31 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>         if (!q->dev)
>                 return ret;
>
> +       WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
> +
>         blk_pm_runtime_lock(q);
> +       /*
> +        * Switch to preempt-only mode before calling synchronize_rcu() such
> +        * that later blk_queue_enter() calls see the preempt-only state. See
> +        * also http://lwn.net/Articles/573497/.
> +        */
> +       blk_set_pm_only(q);
> +       ret = -EBUSY;
> +       if (percpu_ref_read(&q->q_usage_counter) == 1) {
> +               synchronize_rcu();
> +               if (percpu_ref_read(&q->q_usage_counter) == 1)
> +                       ret = 0;
> +       }

The above code looks too tricky, and the following issue might exist in case
the WRITEs done by blk_queue_enter() are reordered:

1) suppose there is one in-flight request that has not completed yet

2) the observer is the percpu_ref_read() that follows synchronize_rcu()

3) inside blk_queue_enter(), the WRITEs in percpu_ref_tryget_live() and
percpu_ref_put() can be reordered from the viewpoint of the observer in 2)

4) then percpu_ref_read() may return 1 even though the only in-flight
request has not completed

5) suspend still moves on, and data loss may be triggered

Cc Paul, and Alan has been CCed already, so we have several memory-model
experts here, :-); hope they can help to clarify whether this reordering
can happen.
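
Schematically (an illustration of the scenario above, not code from the
series; the CPU assignment is hypothetical):

	/*
	 * CPU0 (submitter)                   CPU1 (suspend path)
	 * percpu_ref_tryget_live():
	 *   inc of CPU0's percpu slot        blk_set_pm_only(q)
	 *                                    first percpu_ref_read()
	 * request dispatched, in flight      synchronize_rcu()
	 *                                    second percpu_ref_read()
	 *
	 * If CPU0's increment is not yet visible to CPU1's summation loop,
	 * both reads can return 1 (only the base reference) even though a
	 * request is still in flight, which is step 4) above.
	 */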

Thanks,
Ming

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 05/10] block: Serialize queue freezing and blk_pre_runtime_suspend()
  2018-08-04  0:03 ` [PATCH v4 05/10] block: Serialize queue freezing and blk_pre_runtime_suspend() Bart Van Assche
@ 2018-08-04 10:23   ` Ming Lei
  2018-08-06 14:19     ` Bart Van Assche
  0 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2018-08-04 10:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jianchao Wang,
	Ming Lei, Johannes Thumshirn, Alan Stern

On Sat, Aug 4, 2018 at 8:03 AM, Bart Van Assche <bart.vanassche@wdc.com> wrote:
> Serialize these operations because a later patch will add code into
> blk_pre_runtime_suspend() that should not run concurrently with queue
> freezing nor unfreezing.
>
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> ---
>  block/blk-core.c       |  5 +++++
>  block/blk-mq.c         |  3 +++
>  block/blk-pm.c         | 44 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/blk-pm.h |  6 ++++++
>  include/linux/blkdev.h |  5 +++++
>  5 files changed, 63 insertions(+)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 03cff7445dee..59382c758155 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -17,6 +17,7 @@
>  #include <linux/bio.h>
>  #include <linux/blkdev.h>
>  #include <linux/blk-mq.h>
> +#include <linux/blk-pm.h>
>  #include <linux/highmem.h>
>  #include <linux/mm.h>
>  #include <linux/kernel_stat.h>
> @@ -696,6 +697,7 @@ void blk_set_queue_dying(struct request_queue *q)
>          * prevent I/O from crossing blk_queue_enter().
>          */
>         blk_freeze_queue_start(q);
> +       blk_pm_runtime_unlock(q);
>
>         if (q->mq_ops)
>                 blk_mq_wake_waiters(q);
> @@ -756,6 +758,7 @@ void blk_cleanup_queue(struct request_queue *q)
>          * prevent that q->request_fn() gets invoked after draining finished.
>          */
>         blk_freeze_queue(q);
> +       blk_pm_runtime_unlock(q);
>         spin_lock_irq(lock);
>         queue_flag_set(QUEUE_FLAG_DEAD, q);
>         spin_unlock_irq(lock);
> @@ -1045,6 +1048,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>         mutex_init(&q->blk_trace_mutex);
>  #endif
> +       blk_pm_init(q);
> +
>         mutex_init(&q->sysfs_lock);
>         spin_lock_init(&q->__queue_lock);
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8b23ae34d949..b1882a3a5216 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -9,6 +9,7 @@
>  #include <linux/backing-dev.h>
>  #include <linux/bio.h>
>  #include <linux/blkdev.h>
> +#include <linux/blk-pm.h>
>  #include <linux/kmemleak.h>
>  #include <linux/mm.h>
>  #include <linux/init.h>
> @@ -138,6 +139,7 @@ void blk_freeze_queue_start(struct request_queue *q)
>  {
>         int freeze_depth;
>
> +       blk_pm_runtime_lock(q);
>         freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
>         if (freeze_depth == 1) {
>                 percpu_ref_kill(&q->q_usage_counter);
> @@ -201,6 +203,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
>                 percpu_ref_reinit(&q->q_usage_counter);
>                 wake_up_all(&q->mq_freeze_wq);
>         }
> +       blk_pm_runtime_unlock(q);
>  }

From the user's view, it isn't reasonable to prevent runtime suspend from
happening during queue freeze. The period can be a bit long, and it would be
a perfect opportunity to suspend the device since no I/O is possible at all.


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 07/10] block, scsi: Rework runtime power management
  2018-08-04  0:03 ` [PATCH v4 07/10] block, scsi: Rework runtime power management Bart Van Assche
  2018-08-04 10:01   ` Ming Lei
@ 2018-08-06  2:46   ` jianchao.wang
  2018-08-06 15:07     ` Bart Van Assche
  1 sibling, 1 reply; 27+ messages in thread
From: jianchao.wang @ 2018-08-06  2:46 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Martin K . Petersen, Ming Lei,
	Alan Stern, Johannes Thumshirn

Hi Bart

On 08/04/2018 08:03 AM, Bart Van Assche wrote:
> Instead of allowing requests that are not power management requests
> to enter the queue in runtime suspended status (RPM_SUSPENDED), make
> the blk_get_request() caller block. 

Who will resume the runtime suspended device?

In the original blk-legacy code, when a request is added, blk_pm_add_request()
calls pm_request_resume(), which does that. The request is then issued after
the queue has been resumed successfully.

After this patch, non-PM requests are blocked and pm_request_resume() has been
removed from blk_pm_add_request(), so who resumes the runtime-suspended queue
and device?

Thanks
Jianchao

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 01/10] block: Change the preempt-only flag into a counter
  2018-08-04  0:03 ` [PATCH v4 01/10] block: Change the preempt-only flag into a counter Bart Van Assche
@ 2018-08-06  6:25   ` Hannes Reinecke
  2018-09-14 12:53   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Hannes Reinecke @ 2018-08-06  6:25 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jianchao Wang, Ming Lei,
	Johannes Thumshirn, Alan Stern

On 08/04/2018 02:03 AM, Bart Van Assche wrote:
> Rename "preempt-only" into "pm-only" because the primary purpose of
> this mode is power management. Since the power management core may
> but does not have to resume a runtime suspended device before
> performing system-wide suspend and since a later patch will set
> "pm-only" mode as long as a block device is runtime suspended, make
> it possible to set "pm-only" mode from more than one context.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> ---
>  block/blk-core.c        | 35 ++++++++++++++++++-----------------
>  block/blk-mq-debugfs.c  | 10 +++++++++-
>  drivers/scsi/scsi_lib.c |  8 ++++----
>  include/linux/blkdev.h  | 14 +++++++++-----
>  4 files changed, 40 insertions(+), 27 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.com>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 02/10] block, scsi: Give RQF_PREEMPT back its original meaning
  2018-08-04  0:03 ` [PATCH v4 02/10] block, scsi: Give RQF_PREEMPT back its original meaning Bart Van Assche
  2018-08-04  1:16   ` Ming Lei
@ 2018-08-06  6:27   ` Hannes Reinecke
  2018-08-06 13:54     ` Bart Van Assche
  1 sibling, 1 reply; 27+ messages in thread
From: Hannes Reinecke @ 2018-08-06  6:27 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Martin K . Petersen,
	Jianchao Wang, Ming Lei, Alan Stern, Johannes Thumshirn

On 08/04/2018 02:03 AM, Bart Van Assche wrote:
> In kernel v2.6.18 RQF_PREEMPT was only set for IDE preempt requests.
> Later on the SCSI core was modified such that the RQF_PREEMPT flag
> was set for all requests submitted by __scsi_execute(), including
> power management requests. RQF_PREEMPT requests are the only requests
> processed in the SDEV_QUIESCE state. Instead of setting RQF_PREEMPT
> for all requests submitted by __scsi_execute(), only set RQF_PM for
> power management requests. Modify blk_get_request() such that it
> blocks in pm-only mode on non-RQF_PM requests. Remove the code from
> scsi_prep_state_check() that is no longer reachable.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  block/blk-core.c        |  9 +++++----
>  drivers/scsi/scsi_lib.c | 28 ++++++++++++----------------
>  include/linux/blk-mq.h  |  2 ++
>  include/linux/blkdev.h  |  6 ++----
>  4 files changed, 21 insertions(+), 24 deletions(-)
> 
I am not sure this is going to work for SCSI parallel; we're using the
QUIESCE state there to do domain validation, and all commands there are
most definitely not PM requests.
Can you please validate your patches with e.g. aic7xxx and SCSI parallel
disks?

Thanks.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 02/10] block, scsi: Give RQF_PREEMPT back its original meaning
  2018-08-06  6:27   ` Hannes Reinecke
@ 2018-08-06 13:54     ` Bart Van Assche
  2018-09-14 12:58       ` hch
  0 siblings, 1 reply; 27+ messages in thread
From: Bart Van Assche @ 2018-08-06 13:54 UTC (permalink / raw)
  To: hare, axboe
  Cc: hch, jthumshirn, linux-block, stern, martin.petersen,
	jianchao.w.wang, ming.lei

On Mon, 2018-08-06 at 08:27 +0200, Hannes Reinecke wrote:
> I am not sure this is going to work for SCSI parallel; we're using the
> QUIESCE state there to do domain validation, and all commands there are
> most definitely not PM requests.
> Can you please validate your patches with e.g. aic7xxx and SCSI parallel
> disks?

Hello Hannes,

How about using the RQF_PM flag for SCSI parallel requests instead of
RQF_PREEMPT? That change would also avoid giving RQF_PREEMPT a double
meaning. Anyway, I will see whether I can drop that change from this series
and submit it separately.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 02/10] block, scsi: Give RQF_PREEMPT back its original meaning
  2018-08-04  1:16   ` Ming Lei
@ 2018-08-06 14:15     ` Bart Van Assche
  0 siblings, 0 replies; 27+ messages in thread
From: Bart Van Assche @ 2018-08-06 14:15 UTC (permalink / raw)
  To: tom.leiming
  Cc: jthumshirn, linux-block, hch, martin.petersen, ming.lei, axboe,
	stern, jianchao.w.wang

On Sat, 2018-08-04 at 09:16 +0800, Ming Lei wrote:
> The above change will cause regression on SPI transport, please see
> spi_dv_device(), where non-PM command is sent to device via
> scsi_execute() after the device is put to QUIESCE.

Hello Ming,

I will drop this patch for now and revisit this change later.

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 05/10] block: Serialize queue freezing and blk_pre_runtime_suspend()
  2018-08-04 10:23   ` Ming Lei
@ 2018-08-06 14:19     ` Bart Van Assche
  0 siblings, 0 replies; 27+ messages in thread
From: Bart Van Assche @ 2018-08-06 14:19 UTC (permalink / raw)
  To: tom.leiming
  Cc: hch, ming.lei, linux-block, jthumshirn, stern, axboe, jianchao.w.wang

On Sat, 2018-08-04 at 18:23 +0800, Ming Lei wrote:
> From the user's view, it isn't reasonable to prevent runtime suspend from
> happening during queue freeze. The period can be a bit long, and it would be
> a perfect opportunity to suspend the device since no I/O is possible at all.

Hello Ming,

I will look into reducing the scope of the code that is protected by
blk_pm_runtime_lock() / blk_pm_runtime_unlock() to the code in
blk_freeze_queue_start() under "if (freeze_depth == 1)".

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 07/10] block, scsi: Rework runtime power management
  2018-08-06  2:46   ` jianchao.wang
@ 2018-08-06 15:07     ` Bart Van Assche
  0 siblings, 0 replies; 27+ messages in thread
From: Bart Van Assche @ 2018-08-06 15:07 UTC (permalink / raw)
  To: jianchao.w.wang, axboe
  Cc: hch, linux-block, jthumshirn, martin.petersen, ming.lei, stern

On Mon, 2018-08-06 at 10:46 +0800, jianchao.wang wrote:
> On 08/04/2018 08:03 AM, Bart Van Assche wrote:
> > Instead of allowing requests that are not power management requests
> > to enter the queue in runtime suspended status (RPM_SUSPENDED), make
> > the blk_get_request() caller block.
> 
> Who will resume the runtime suspended device?
> 
> In the original blk-legacy code, when a request is added, blk_pm_add_request()
> calls pm_request_resume(), which does that. The request is then issued after
> the queue has been resumed successfully.
> 
> After this patch, non-PM requests are blocked and pm_request_resume() has been
> removed from blk_pm_add_request(), so who resumes the runtime-suspended queue
> and device?

Hello Jianchao,

I will double check that all contexts from which requests are submitted
protect submission of non-RQF_PM requests with scsi_autopm_get_device() /
scsi_autopm_put_device(). scsi_autopm_get_device() resumes suspended devices
before returning.

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 07/10] block, scsi: Rework runtime power management
  2018-08-04 10:01   ` Ming Lei
@ 2018-08-06 15:20     ` Bart Van Assche
  2018-08-06 22:30       ` Ming Lei
  0 siblings, 1 reply; 27+ messages in thread
From: Bart Van Assche @ 2018-08-06 15:20 UTC (permalink / raw)
  To: tom.leiming
  Cc: jthumshirn, linux-block, hch, martin.petersen, ming.lei, axboe,
	paulmck, stern, jianchao.w.wang

On Sat, 2018-08-04 at 18:01 +0800, Ming Lei wrote:
> On Sat, Aug 4, 2018 at 8:03 AM, Bart Van Assche <bart.vanassche@wdc.com> wrote:
> >
> > diff --git a/block/blk-pm.c b/block/blk-pm.c
> > index 2a4632d0be4b..0708185ead61 100644
> > --- a/block/blk-pm.c
> > +++ b/block/blk-pm.c
> > @@ -107,17 +107,31 @@ int blk_pre_runtime_suspend(struct request_queue *q)
> >         if (!q->dev)
> >                 return ret;
> >
> > +       WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
> > +
> >         blk_pm_runtime_lock(q);
> > +       /*
> > +        * Switch to preempt-only mode before calling synchronize_rcu() such
> > +        * that later blk_queue_enter() calls see the preempt-only state. See
> > +        * also http://lwn.net/Articles/573497/.
> > +        */
> > +       blk_set_pm_only(q);
> > +       ret = -EBUSY;
> > +       if (percpu_ref_read(&q->q_usage_counter) == 1) {
> > +               synchronize_rcu();
> > +               if (percpu_ref_read(&q->q_usage_counter) == 1)
> > +                       ret = 0;
> > +       }
>
> The above code looks too tricky, and the following issue might exist in case
> the WRITEs done by blk_queue_enter() are reordered:
>
> 1) suppose there is one in-flight request that has not completed yet
>
> 2) the observer is the percpu_ref_read() that follows synchronize_rcu()
>
> 3) inside blk_queue_enter(), the WRITEs in percpu_ref_tryget_live() and
> percpu_ref_put() can be reordered from the viewpoint of the observer in 2)
>
> 4) then percpu_ref_read() may return 1 even though the only in-flight
> request has not completed
>
> 5) suspend still moves on, and data loss may be triggered
>
> Cc Paul, and Alan has been CCed already, so we have several memory-model
> experts here, :-); hope they can help to clarify whether this reordering
> can happen.

Hello Ming,

There are two cases:
(a) The percpu_ref_tryget_live() call in (3) is triggered by a runtime power
    state transition (RPM_*).
(b) The percpu_ref_tryget_live() call in (3) is not triggered by a runtime
    power state transition.

In case (b) the driver that submits the request should protect the request
with scsi_autopm_get_device() / scsi_autopm_put_device() or by any other
mechanism that prevents runtime power state transitions. Since
scsi_autopm_get_device() is called before blk_queue_enter(), since
scsi_autopm_get_device() only returns after it has resumed a device and since
scsi_autopm_get_device() prevents runtime suspend, the function
blk_pre_runtime_suspend() won't be called concurrently with blk_queue_enter()
in case (b).

Since the power management core also serializes power state transitions for
case (a), the race described above won't happen: blk_pre_runtime_suspend()
will have finished before any RQF_PM requests are submitted.
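
For concreteness, a minimal sketch of that submitter-side pattern (error
handling trimmed; the request setup is left out and the variable names are
illustrative):

	int ret;
	struct request *rq;

	ret = scsi_autopm_get_device(sdev);	/* resumes sdev, blocks runtime suspend */
	if (ret)
		return ret;
	rq = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN, 0);
	if (!IS_ERR(rq)) {
		/* ... set up and execute the request ... */
		blk_put_request(rq);
	}
	scsi_autopm_put_device(sdev);		/* allow runtime suspend again */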

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 07/10] block, scsi: Rework runtime power management
  2018-08-06 15:20     ` Bart Van Assche
@ 2018-08-06 22:30       ` Ming Lei
  0 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2018-08-06 22:30 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jthumshirn, linux-block, hch, martin.petersen, ming.lei, axboe,
	paulmck, stern, jianchao.w.wang

On Mon, Aug 6, 2018 at 11:20 PM, Bart Van Assche <Bart.VanAssche@wdc.com> wrote:
> On Sat, 2018-08-04 at 18:01 +0800, Ming Lei wrote:
>> On Sat, Aug 4, 2018 at 8:03 AM, Bart Van Assche <bart.vanassche@wdc.com> wrote:
>> >
>> > diff --git a/block/blk-pm.c b/block/blk-pm.c
>> > index 2a4632d0be4b..0708185ead61 100644
>> > --- a/block/blk-pm.c
>> > +++ b/block/blk-pm.c
>> > @@ -107,17 +107,31 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>> >         if (!q->dev)
>> >                 return ret;
>> >
>> > +       WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
>> > +
>> >         blk_pm_runtime_lock(q);
>> > +       /*
>> > +        * Switch to preempt-only mode before calling synchronize_rcu() such
>> > +        * that later blk_queue_enter() calls see the preempt-only state. See
>> > +        * also http://lwn.net/Articles/573497/.
>> > +        */
>> > +       blk_set_pm_only(q);
>> > +       ret = -EBUSY;
>> > +       if (percpu_ref_read(&q->q_usage_counter) == 1) {
>> > +               synchronize_rcu();
>> > +               if (percpu_ref_read(&q->q_usage_counter) == 1)
>> > +                       ret = 0;
>> > +       }
>>
>> The above code looks too tricky, and the following issue might exist in case
>> the WRITEs done by blk_queue_enter() are reordered:
>>
>> 1) suppose there is one in-flight request that has not completed yet
>>
>> 2) the observer is the percpu_ref_read() that follows synchronize_rcu()
>>
>> 3) inside blk_queue_enter(), the WRITEs in percpu_ref_tryget_live() and
>> percpu_ref_put() can be reordered from the viewpoint of the observer in 2)
>>
>> 4) then percpu_ref_read() may return 1 even though the only in-flight
>> request has not completed
>>
>> 5) suspend still moves on, and data loss may be triggered
>>
>> Cc Paul, and Alan has been CCed already, so we have several memory-model
>> experts here, :-); hope they can help to clarify whether this reordering
>> can happen.
>
> Hello Ming,
>
> There are two cases:
> (a) The percpu_ref_tryget_live() call in (3) is triggered by a runtime power
>     state transition (RPM_*).
> (b) The percpu_ref_tryget_live() call in (3) is not triggered by a runtime
>     power state transition.
>
> In case (b) the driver that submits the request should protect the request
> with scsi_autopm_get_device() / scsi_autopm_put_device() or by any other
> mechanism that prevents runtime power state transitions. Since
> scsi_autopm_get_device() is called before blk_queue_enter(), since
> scsi_autopm_get_device() only returns after it has resumed a device and since
> scsi_autopm_get_device() prevents runtime suspend, the function
> blk_pre_runtime_suspend() won't be called concurrently with blk_queue_enter()
> in case (b).

For the normal I/O path, blk_queue_enter() is called inside
generic_make_request(), which usually runs from the filesystem. Do you mean
that filesystem code will call scsi_autopm_get_device()? :-)


Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 01/10] block: Change the preempt-only flag into a counter
  2018-08-04  0:03 ` [PATCH v4 01/10] block: Change the preempt-only flag into a counter Bart Van Assche
  2018-08-06  6:25   ` Hannes Reinecke
@ 2018-09-14 12:53   ` Christoph Hellwig
  1 sibling, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2018-09-14 12:53 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jianchao Wang,
	Ming Lei, Johannes Thumshirn, Alan Stern

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 02/10] block, scsi: Give RQF_PREEMPT back its original meaning
  2018-08-06 13:54     ` Bart Van Assche
@ 2018-09-14 12:58       ` hch
  0 siblings, 0 replies; 27+ messages in thread
From: hch @ 2018-09-14 12:58 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hare, axboe, hch, jthumshirn, linux-block, stern,
	martin.petersen, jianchao.w.wang, ming.lei

On Mon, Aug 06, 2018 at 01:54:40PM +0000, Bart Van Assche wrote:
> On Mon, 2018-08-06 at 08:27 +0200, Hannes Reinecke wrote:
> > I am not sure this is going to work for SCSI parallel; we're using the
> > QUIESCE state there to do domain validation, and all commands there are
> > most definitely not PM requests.
> > Can you please validate your patches with eg aic7xxx and SCSI parallel
> > disks?
> 
> Hello Hannes,
> 
> How about using the RQF_PM flag for SCSI parallel requests instead of
> RQF_PREEMPT? That change will also avoid that RQF_PREEMPT has a double meaning.
> Anyway, I will see whether I can drop that change from this series and submit
> that change seperately.

One thing I'd like to do is split BLK_MQ_REQ_PREEMPT and RQF_PREEMPT, that
is, not set RQF_PREEMPT automatically when BLK_MQ_REQ_PREEMPT is passed to
blk_get_request(), but let the caller set it manually. After that,
RQF_PREEMPT isn't used by the block layer itself at all and can be moved
into IDE/SCSI-specific flags that we can use as we want.
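
For example (a sketch of that idea; the call site is hypothetical):

	req = blk_get_request(q, REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
	if (IS_ERR(req))
		return PTR_ERR(req);
	/* The caller, not blk_get_request(), decides to set the flag. */
	req->rq_flags |= RQF_PREEMPT;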

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 04/10] block: Move power management code into a new source file
  2018-08-04  0:03 ` [PATCH v4 04/10] block: Move power management code into a new source file Bart Van Assche
@ 2018-09-14 13:01   ` Christoph Hellwig
  2018-09-14 14:10     ` Alan Stern
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2018-09-14 13:01 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jianchao Wang,
	Ming Lei, Johannes Thumshirn, Alan Stern

>  11 files changed, 266 insertions(+), 239 deletions(-)
>  create mode 100644 block/blk-pm.c
>  create mode 100644 block/blk-pm.h
>  create mode 100644 include/linux/blk-pm.h
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index 1f2469a0123c..e213d90a5e64 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -228,4 +228,9 @@ config BLK_MQ_RDMA
>  	depends on BLOCK && INFINIBAND
>  	default y
>  
> +config BLK_PM
> +	bool
> +	depends on BLOCK && PM
> +	default y

This could be shortened to

config BLK_PM
	def_bool BLOCK && PM

Also, a lot of this code is only used by SCSI. I really wonder if we should
move it into SCSI instead, and if not, at least make it a selectable option
that only SCSI selects.
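
Something like this, perhaps (a sketch of the select-based variant; the SCSI
hunk is illustrative):

	# block/Kconfig
	config BLK_PM
		bool

	# drivers/scsi/Kconfig
	config SCSI
		tristate "SCSI device support"
		select BLK_PM if PM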

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v4 04/10] block: Move power management code into a new source file
  2018-09-14 13:01   ` Christoph Hellwig
@ 2018-09-14 14:10     ` Alan Stern
  0 siblings, 0 replies; 27+ messages in thread
From: Alan Stern @ 2018-09-14 14:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Jens Axboe, linux-block, Jianchao Wang,
	Ming Lei, Johannes Thumshirn

On Fri, 14 Sep 2018, Christoph Hellwig wrote:

> >  11 files changed, 266 insertions(+), 239 deletions(-)
> >  create mode 100644 block/blk-pm.c
> >  create mode 100644 block/blk-pm.h
> >  create mode 100644 include/linux/blk-pm.h
> > 
> > diff --git a/block/Kconfig b/block/Kconfig
> > index 1f2469a0123c..e213d90a5e64 100644
> > --- a/block/Kconfig
> > +++ b/block/Kconfig
> > @@ -228,4 +228,9 @@ config BLK_MQ_RDMA
> >  	depends on BLOCK && INFINIBAND
> >  	default y
> >  
> > +config BLK_PM
> > +	bool
> > +	depends on BLOCK && PM
> > +	default y
> 
> This could be shortened to
> 
> config BLK_PM
> 	def_bool BLOCK && PM
> 
> Also a lot of this code is only used by scsi.  I really wonder if
> we should move it into scsi instead, and if not at least make it
> a selectable option that only scsi selects.

Although SCSI is currently the only user of block-layer runtime PM, 
there's no reason other subsystems or drivers couldn't decide to use it 
too.  In theory, anything whose activity is controlled by block 
requests could benefit from it.

Making it selectable sounds like a reasonable approach.

Alan Stern

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread

Thread overview: 27+ messages
2018-08-04  0:03 [PATCH v4 00/10] blk-mq: Enable runtime power management Bart Van Assche
2018-08-04  0:03 ` [PATCH v4 01/10] block: Change the preempt-only flag into a counter Bart Van Assche
2018-08-06  6:25   ` Hannes Reinecke
2018-09-14 12:53   ` Christoph Hellwig
2018-08-04  0:03 ` [PATCH v4 02/10] block, scsi: Give RQF_PREEMPT back its original meaning Bart Van Assche
2018-08-04  1:16   ` Ming Lei
2018-08-06 14:15     ` Bart Van Assche
2018-08-06  6:27   ` Hannes Reinecke
2018-08-06 13:54     ` Bart Van Assche
2018-09-14 12:58       ` hch
2018-08-04  0:03 ` [PATCH v4 03/10] block, ide: Remove flag BLK_MQ_REQ_PREEMPT Bart Van Assche
2018-08-04  0:03 ` [PATCH v4 04/10] block: Move power management code into a new source file Bart Van Assche
2018-09-14 13:01   ` Christoph Hellwig
2018-09-14 14:10     ` Alan Stern
2018-08-04  0:03 ` [PATCH v4 05/10] block: Serialize queue freezing and blk_pre_runtime_suspend() Bart Van Assche
2018-08-04 10:23   ` Ming Lei
2018-08-06 14:19     ` Bart Van Assche
2018-08-04  0:03 ` [PATCH v4 06/10] percpu-refcount: Introduce percpu_ref_read() Bart Van Assche
2018-08-04  0:03 ` [PATCH v4 07/10] block, scsi: Rework runtime power management Bart Van Assche
2018-08-04 10:01   ` Ming Lei
2018-08-06 15:20     ` Bart Van Assche
2018-08-06 22:30       ` Ming Lei
2018-08-06  2:46   ` jianchao.wang
2018-08-06 15:07     ` Bart Van Assche
2018-08-04  0:03 ` [PATCH v4 08/10] block: Remove blk_pm_requeue_request() Bart Van Assche
2018-08-04  0:03 ` [PATCH v4 09/10] blk-mq: Insert blk_pm_{add,put}_request() calls Bart Van Assche
2018-08-04  0:03 ` [PATCH v4 10/10] blk-mq: Enable support for runtime power management Bart Van Assche
