* [PATCH v3 0/9] blk-mq: Enable runtime power management
@ 2018-08-02 18:29 Bart Van Assche
  2018-08-02 18:29 ` [PATCH v3 1/9] block: Fix a comment in a header file Bart Van Assche
                   ` (9 more replies)
  0 siblings, 10 replies; 23+ messages in thread
From: Bart Van Assche @ 2018-08-02 18:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche

Hello Jens,

This patch series not only enables runtime power management for blk-mq but
also fixes a starvation issue in the power management code for the legacy
block layer. Please consider this patch series for kernel v4.19.

Thanks,

Bart.

Changes compared to v2:
- Fixed the build for CONFIG_BLOCK=n.
- Added a patch that introduces percpu_ref_read() in the percpu-counter code.
- Added a patch that makes it easier to detect missing pm_runtime_get*() calls.
- Addressed Jianchao's feedback including the comment about runtime overhead
  of switching a per-cpu counter to atomic mode.

Changes compared to v1:
- Moved the runtime power management code into a separate file.
- Addressed Ming's feedback.

Bart Van Assche (9):
  block: Fix a comment in a header file
  block: Move power management functions into new source files
  block: Serialize queue freezing and blk_pre_runtime_suspend()
  percpu-refcount: Introduce percpu_ref_read()
  block, scsi: Rework runtime power management
  block: Warn if pm_runtime_get*() has not been called
  block: Remove blk_pm_requeue_request()
  blk-mq: Insert blk_pm_{add,put}_request() calls
  blk-mq: Enable support for runtime power management

 block/Kconfig                   |   5 +
 block/Makefile                  |   1 +
 block/blk-core.c                | 246 +++----------------------------
 block/blk-mq-debugfs.c          |   1 -
 block/blk-mq-sched.c            |  13 +-
 block/blk-mq.c                  |  11 ++
 block/blk-pm.c                  | 253 ++++++++++++++++++++++++++++++++
 block/blk-pm.h                  |  35 +++++
 block/elevator.c                |  24 +--
 drivers/scsi/scsi_pm.c          |   1 +
 drivers/scsi/sd.c               |   5 +-
 drivers/scsi/sr.c               |   1 +
 drivers/scsi/ufs/ufshcd.c       |  10 +-
 include/linux/blk-pm.h          |  30 ++++
 include/linux/blkdev.h          |  37 +----
 include/linux/percpu-refcount.h |   2 +
 lib/percpu-refcount.c           |  29 ++++
 17 files changed, 420 insertions(+), 284 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

-- 
2.18.0

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v3 1/9] block: Fix a comment in a header file
  2018-08-02 18:29 [PATCH v3 0/9] blk-mq: Enable runtime power management Bart Van Assche
@ 2018-08-02 18:29 ` Bart Van Assche
  2018-08-02 18:29 ` [PATCH v3 2/9] block: Move power management functions into new source files Bart Van Assche
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Bart Van Assche @ 2018-08-02 18:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Alan Stern

Since the REQ_PREEMPT flag has been renamed to RQF_PREEMPT, update the
comment that refers to that flag.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
---
 include/linux/blkdev.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 050d599f5ea9..9467c21ec040 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -700,7 +700,7 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED  26	/* queue has been registered to a disk */
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27	/* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED    28	/* queue has been quiesced */
-#define QUEUE_FLAG_PREEMPT_ONLY	29	/* only process REQ_PREEMPT requests */
+#define QUEUE_FLAG_PREEMPT_ONLY	29	/* only process RQF_PREEMPT requests */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
-- 
2.18.0


* [PATCH v3 2/9] block: Move power management functions into new source files
  2018-08-02 18:29 [PATCH v3 0/9] blk-mq: Enable runtime power management Bart Van Assche
  2018-08-02 18:29 ` [PATCH v3 1/9] block: Fix a comment in a header file Bart Van Assche
@ 2018-08-02 18:29 ` Bart Van Assche
  2018-08-02 18:29 ` [PATCH v3 3/9] block: Serialize queue freezing and blk_pre_runtime_suspend() Bart Van Assche
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Bart Van Assche @ 2018-08-02 18:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Johannes Thumshirn, Alan Stern

Move the runtime power management code from blk-core.c into the new
source file blk-pm.c, and move the declarations of the moved functions
from <linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave
out the declarations of the functions that are not used in that mode.
This patch not only reduces the number of #ifdefs in the block layer
core code but also shrinks the header file <linux/blkdev.h>, which
should help reduce the build time of the Linux kernel when CONFIG_PM
is not defined.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
---
 block/Kconfig          |   5 ++
 block/Makefile         |   1 +
 block/blk-core.c       | 194 +----------------------------------------
 block/blk-pm.c         | 186 +++++++++++++++++++++++++++++++++++++++
 block/blk-pm.h         |  43 +++++++++
 block/elevator.c       |  22 +----
 drivers/scsi/scsi_pm.c |   1 +
 drivers/scsi/sd.c      |   1 +
 drivers/scsi/sr.c      |   1 +
 include/linux/blk-pm.h |  24 +++++
 include/linux/blkdev.h |  23 -----
 11 files changed, 264 insertions(+), 237 deletions(-)
 create mode 100644 block/blk-pm.c
 create mode 100644 block/blk-pm.h
 create mode 100644 include/linux/blk-pm.h

diff --git a/block/Kconfig b/block/Kconfig
index 1f2469a0123c..e213d90a5e64 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -228,4 +228,9 @@ config BLK_MQ_RDMA
 	depends on BLOCK && INFINIBAND
 	default y
 
+config BLK_PM
+	bool
+	depends on BLOCK && PM
+	default y
+
 source block/Kconfig.iosched
diff --git a/block/Makefile b/block/Makefile
index 572b33f32c07..27eac600474f 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -37,3 +37,4 @@ obj-$(CONFIG_BLK_WBT)		+= blk-wbt.o
 obj-$(CONFIG_BLK_DEBUG_FS)	+= blk-mq-debugfs.o
 obj-$(CONFIG_BLK_DEBUG_FS_ZONED)+= blk-mq-debugfs-zoned.o
 obj-$(CONFIG_BLK_SED_OPAL)	+= sed-opal.o
+obj-$(CONFIG_BLK_PM)		+= blk-pm.o
diff --git a/block/blk-core.c b/block/blk-core.c
index 23cd1b7770e7..236071a487a2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -42,6 +42,7 @@
 #include "blk.h"
 #include "blk-mq.h"
 #include "blk-mq-sched.h"
+#include "blk-pm.h"
 #include "blk-rq-qos.h"
 
 #ifdef CONFIG_DEBUG_FS
@@ -1721,16 +1722,6 @@ void part_round_stats(struct request_queue *q, int cpu, struct hd_struct *part)
 }
 EXPORT_SYMBOL_GPL(part_round_stats);
 
-#ifdef CONFIG_PM
-static void blk_pm_put_request(struct request *rq)
-{
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
-		pm_runtime_mark_last_busy(rq->q->dev);
-}
-#else
-static inline void blk_pm_put_request(struct request *rq) {}
-#endif
-
 void __blk_put_request(struct request_queue *q, struct request *req)
 {
 	req_flags_t rq_flags = req->rq_flags;
@@ -3746,189 +3737,6 @@ void blk_finish_plug(struct blk_plug *plug)
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
-#ifdef CONFIG_PM
-/**
- * blk_pm_runtime_init - Block layer runtime PM initialization routine
- * @q: the queue of the device
- * @dev: the device the queue belongs to
- *
- * Description:
- *    Initialize runtime-PM-related fields for @q and start auto suspend for
- *    @dev. Drivers that want to take advantage of request-based runtime PM
- *    should call this function after @dev has been initialized, and its
- *    request queue @q has been allocated, and runtime PM for it can not happen
- *    yet(either due to disabled/forbidden or its usage_count > 0). In most
- *    cases, driver should call this function before any I/O has taken place.
- *
- *    This function takes care of setting up using auto suspend for the device,
- *    the autosuspend delay is set to -1 to make runtime suspend impossible
- *    until an updated value is either set by user or by driver. Drivers do
- *    not need to touch other autosuspend settings.
- *
- *    The block layer runtime PM is request based, so only works for drivers
- *    that use request as their IO unit instead of those directly use bio's.
- */
-void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
-{
-	/* not support for RQF_PM and ->rpm_status in blk-mq yet */
-	if (q->mq_ops)
-		return;
-
-	q->dev = dev;
-	q->rpm_status = RPM_ACTIVE;
-	pm_runtime_set_autosuspend_delay(q->dev, -1);
-	pm_runtime_use_autosuspend(q->dev);
-}
-EXPORT_SYMBOL(blk_pm_runtime_init);
-
-/**
- * blk_pre_runtime_suspend - Pre runtime suspend check
- * @q: the queue of the device
- *
- * Description:
- *    This function will check if runtime suspend is allowed for the device
- *    by examining if there are any requests pending in the queue. If there
- *    are requests pending, the device can not be runtime suspended; otherwise,
- *    the queue's status will be updated to SUSPENDING and the driver can
- *    proceed to suspend the device.
- *
- *    For the not allowed case, we mark last busy for the device so that
- *    runtime PM core will try to autosuspend it some time later.
- *
- *    This function should be called near the start of the device's
- *    runtime_suspend callback.
- *
- * Return:
- *    0		- OK to runtime suspend the device
- *    -EBUSY	- Device should not be runtime suspended
- */
-int blk_pre_runtime_suspend(struct request_queue *q)
-{
-	int ret = 0;
-
-	if (!q->dev)
-		return ret;
-
-	spin_lock_irq(q->queue_lock);
-	if (q->nr_pending) {
-		ret = -EBUSY;
-		pm_runtime_mark_last_busy(q->dev);
-	} else {
-		q->rpm_status = RPM_SUSPENDING;
-	}
-	spin_unlock_irq(q->queue_lock);
-	return ret;
-}
-EXPORT_SYMBOL(blk_pre_runtime_suspend);
-
-/**
- * blk_post_runtime_suspend - Post runtime suspend processing
- * @q: the queue of the device
- * @err: return value of the device's runtime_suspend function
- *
- * Description:
- *    Update the queue's runtime status according to the return value of the
- *    device's runtime suspend function and mark last busy for the device so
- *    that PM core will try to auto suspend the device at a later time.
- *
- *    This function should be called near the end of the device's
- *    runtime_suspend callback.
- */
-void blk_post_runtime_suspend(struct request_queue *q, int err)
-{
-	if (!q->dev)
-		return;
-
-	spin_lock_irq(q->queue_lock);
-	if (!err) {
-		q->rpm_status = RPM_SUSPENDED;
-	} else {
-		q->rpm_status = RPM_ACTIVE;
-		pm_runtime_mark_last_busy(q->dev);
-	}
-	spin_unlock_irq(q->queue_lock);
-}
-EXPORT_SYMBOL(blk_post_runtime_suspend);
-
-/**
- * blk_pre_runtime_resume - Pre runtime resume processing
- * @q: the queue of the device
- *
- * Description:
- *    Update the queue's runtime status to RESUMING in preparation for the
- *    runtime resume of the device.
- *
- *    This function should be called near the start of the device's
- *    runtime_resume callback.
- */
-void blk_pre_runtime_resume(struct request_queue *q)
-{
-	if (!q->dev)
-		return;
-
-	spin_lock_irq(q->queue_lock);
-	q->rpm_status = RPM_RESUMING;
-	spin_unlock_irq(q->queue_lock);
-}
-EXPORT_SYMBOL(blk_pre_runtime_resume);
-
-/**
- * blk_post_runtime_resume - Post runtime resume processing
- * @q: the queue of the device
- * @err: return value of the device's runtime_resume function
- *
- * Description:
- *    Update the queue's runtime status according to the return value of the
- *    device's runtime_resume function. If it is successfully resumed, process
- *    the requests that are queued into the device's queue when it is resuming
- *    and then mark last busy and initiate autosuspend for it.
- *
- *    This function should be called near the end of the device's
- *    runtime_resume callback.
- */
-void blk_post_runtime_resume(struct request_queue *q, int err)
-{
-	if (!q->dev)
-		return;
-
-	spin_lock_irq(q->queue_lock);
-	if (!err) {
-		q->rpm_status = RPM_ACTIVE;
-		__blk_run_queue(q);
-		pm_runtime_mark_last_busy(q->dev);
-		pm_request_autosuspend(q->dev);
-	} else {
-		q->rpm_status = RPM_SUSPENDED;
-	}
-	spin_unlock_irq(q->queue_lock);
-}
-EXPORT_SYMBOL(blk_post_runtime_resume);
-
-/**
- * blk_set_runtime_active - Force runtime status of the queue to be active
- * @q: the queue of the device
- *
- * If the device is left runtime suspended during system suspend the resume
- * hook typically resumes the device and corrects runtime status
- * accordingly. However, that does not affect the queue runtime PM status
- * which is still "suspended". This prevents processing requests from the
- * queue.
- *
- * This function can be used in driver's resume hook to correct queue
- * runtime PM status and re-enable peeking requests from the queue. It
- * should be called before first request is added to the queue.
- */
-void blk_set_runtime_active(struct request_queue *q)
-{
-	spin_lock_irq(q->queue_lock);
-	q->rpm_status = RPM_ACTIVE;
-	pm_runtime_mark_last_busy(q->dev);
-	pm_request_autosuspend(q->dev);
-	spin_unlock_irq(q->queue_lock);
-}
-EXPORT_SYMBOL(blk_set_runtime_active);
-#endif
-
 int __init blk_dev_init(void)
 {
 	BUILD_BUG_ON(REQ_OP_LAST >= (1 << REQ_OP_BITS));
diff --git a/block/blk-pm.c b/block/blk-pm.c
new file mode 100644
index 000000000000..08d7222d4757
--- /dev/null
+++ b/block/blk-pm.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/blk-pm.h>
+#include <linux/blkdev.h>
+#include <linux/pm_runtime.h>
+
+/**
+ * blk_pm_runtime_init - Block layer runtime PM initialization routine
+ * @q: the queue of the device
+ * @dev: the device the queue belongs to
+ *
+ * Description:
+ *    Initialize runtime-PM-related fields for @q and start auto suspend for
+ *    @dev. Drivers that want to take advantage of request-based runtime PM
+ *    should call this function after @dev has been initialized, and its
+ *    request queue @q has been allocated, and runtime PM for it can not happen
+ *    yet(either due to disabled/forbidden or its usage_count > 0). In most
+ *    cases, driver should call this function before any I/O has taken place.
+ *
+ *    This function takes care of setting up using auto suspend for the device,
+ *    the autosuspend delay is set to -1 to make runtime suspend impossible
+ *    until an updated value is either set by user or by driver. Drivers do
+ *    not need to touch other autosuspend settings.
+ *
+ *    The block layer runtime PM is request based, so only works for drivers
+ *    that use request as their IO unit instead of those directly use bio's.
+ */
+void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
+{
+	/* not support for RQF_PM and ->rpm_status in blk-mq yet */
+	if (q->mq_ops)
+		return;
+
+	q->dev = dev;
+	q->rpm_status = RPM_ACTIVE;
+	pm_runtime_set_autosuspend_delay(q->dev, -1);
+	pm_runtime_use_autosuspend(q->dev);
+}
+EXPORT_SYMBOL(blk_pm_runtime_init);
+
+/**
+ * blk_pre_runtime_suspend - Pre runtime suspend check
+ * @q: the queue of the device
+ *
+ * Description:
+ *    This function will check if runtime suspend is allowed for the device
+ *    by examining if there are any requests pending in the queue. If there
+ *    are requests pending, the device can not be runtime suspended; otherwise,
+ *    the queue's status will be updated to SUSPENDING and the driver can
+ *    proceed to suspend the device.
+ *
+ *    For the not allowed case, we mark last busy for the device so that
+ *    runtime PM core will try to autosuspend it some time later.
+ *
+ *    This function should be called near the start of the device's
+ *    runtime_suspend callback.
+ *
+ * Return:
+ *    0		- OK to runtime suspend the device
+ *    -EBUSY	- Device should not be runtime suspended
+ */
+int blk_pre_runtime_suspend(struct request_queue *q)
+{
+	int ret = 0;
+
+	if (!q->dev)
+		return ret;
+
+	spin_lock_irq(q->queue_lock);
+	if (q->nr_pending) {
+		ret = -EBUSY;
+		pm_runtime_mark_last_busy(q->dev);
+	} else {
+		q->rpm_status = RPM_SUSPENDING;
+	}
+	spin_unlock_irq(q->queue_lock);
+	return ret;
+}
+EXPORT_SYMBOL(blk_pre_runtime_suspend);
+
+/**
+ * blk_post_runtime_suspend - Post runtime suspend processing
+ * @q: the queue of the device
+ * @err: return value of the device's runtime_suspend function
+ *
+ * Description:
+ *    Update the queue's runtime status according to the return value of the
+ *    device's runtime suspend function and mark last busy for the device so
+ *    that PM core will try to auto suspend the device at a later time.
+ *
+ *    This function should be called near the end of the device's
+ *    runtime_suspend callback.
+ */
+void blk_post_runtime_suspend(struct request_queue *q, int err)
+{
+	if (!q->dev)
+		return;
+
+	spin_lock_irq(q->queue_lock);
+	if (!err) {
+		q->rpm_status = RPM_SUSPENDED;
+	} else {
+		q->rpm_status = RPM_ACTIVE;
+		pm_runtime_mark_last_busy(q->dev);
+	}
+	spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL(blk_post_runtime_suspend);
+
+/**
+ * blk_pre_runtime_resume - Pre runtime resume processing
+ * @q: the queue of the device
+ *
+ * Description:
+ *    Update the queue's runtime status to RESUMING in preparation for the
+ *    runtime resume of the device.
+ *
+ *    This function should be called near the start of the device's
+ *    runtime_resume callback.
+ */
+void blk_pre_runtime_resume(struct request_queue *q)
+{
+	if (!q->dev)
+		return;
+
+	spin_lock_irq(q->queue_lock);
+	q->rpm_status = RPM_RESUMING;
+	spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL(blk_pre_runtime_resume);
+
+/**
+ * blk_post_runtime_resume - Post runtime resume processing
+ * @q: the queue of the device
+ * @err: return value of the device's runtime_resume function
+ *
+ * Description:
+ *    Update the queue's runtime status according to the return value of the
+ *    device's runtime_resume function. If it is successfully resumed, process
+ *    the requests that are queued into the device's queue when it is resuming
+ *    and then mark last busy and initiate autosuspend for it.
+ *
+ *    This function should be called near the end of the device's
+ *    runtime_resume callback.
+ */
+void blk_post_runtime_resume(struct request_queue *q, int err)
+{
+	if (!q->dev)
+		return;
+
+	spin_lock_irq(q->queue_lock);
+	if (!err) {
+		q->rpm_status = RPM_ACTIVE;
+		__blk_run_queue(q);
+		pm_runtime_mark_last_busy(q->dev);
+		pm_request_autosuspend(q->dev);
+	} else {
+		q->rpm_status = RPM_SUSPENDED;
+	}
+	spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL(blk_post_runtime_resume);
+
+/**
+ * blk_set_runtime_active - Force runtime status of the queue to be active
+ * @q: the queue of the device
+ *
+ * If the device is left runtime suspended during system suspend the resume
+ * hook typically resumes the device and corrects runtime status
+ * accordingly. However, that does not affect the queue runtime PM status
+ * which is still "suspended". This prevents processing requests from the
+ * queue.
+ *
+ * This function can be used in driver's resume hook to correct queue
+ * runtime PM status and re-enable peeking requests from the queue. It
+ * should be called before first request is added to the queue.
+ */
+void blk_set_runtime_active(struct request_queue *q)
+{
+	spin_lock_irq(q->queue_lock);
+	q->rpm_status = RPM_ACTIVE;
+	pm_runtime_mark_last_busy(q->dev);
+	pm_request_autosuspend(q->dev);
+	spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL(blk_set_runtime_active);
diff --git a/block/blk-pm.h b/block/blk-pm.h
new file mode 100644
index 000000000000..1ffc8ef203ec
--- /dev/null
+++ b/block/blk-pm.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _BLOCK_BLK_PM_H_
+#define _BLOCK_BLK_PM_H_
+
+#include <linux/pm_runtime.h>
+
+#ifdef CONFIG_PM
+static inline void blk_pm_requeue_request(struct request *rq)
+{
+	if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+		rq->q->nr_pending--;
+}
+
+static inline void blk_pm_add_request(struct request_queue *q,
+				      struct request *rq)
+{
+	if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
+	    (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
+		pm_request_resume(q->dev);
+}
+
+static inline void blk_pm_put_request(struct request *rq)
+{
+	if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
+		pm_runtime_mark_last_busy(rq->q->dev);
+}
+#else
+static inline void blk_pm_requeue_request(struct request *rq)
+{
+}
+
+static inline void blk_pm_add_request(struct request_queue *q,
+				      struct request *rq)
+{
+}
+
+static inline void blk_pm_put_request(struct request *rq)
+{
+}
+#endif
+
+#endif /* _BLOCK_BLK_PM_H_ */
diff --git a/block/elevator.c b/block/elevator.c
index fa828b5bfd4b..4c15f0240c99 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -41,6 +41,7 @@
 
 #include "blk.h"
 #include "blk-mq-sched.h"
+#include "blk-pm.h"
 #include "blk-wbt.h"
 
 static DEFINE_SPINLOCK(elv_list_lock);
@@ -557,27 +558,6 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
 		e->type->ops.sq.elevator_bio_merged_fn(q, rq, bio);
 }
 
-#ifdef CONFIG_PM
-static void blk_pm_requeue_request(struct request *rq)
-{
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM))
-		rq->q->nr_pending--;
-}
-
-static void blk_pm_add_request(struct request_queue *q, struct request *rq)
-{
-	if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
-	    (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
-		pm_request_resume(q->dev);
-}
-#else
-static inline void blk_pm_requeue_request(struct request *rq) {}
-static inline void blk_pm_add_request(struct request_queue *q,
-				      struct request *rq)
-{
-}
-#endif
-
 void elv_requeue_request(struct request_queue *q, struct request *rq)
 {
 	/*
diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index b44c1bb687a2..a2b4179bfdf7 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -8,6 +8,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/export.h>
 #include <linux/async.h>
+#include <linux/blk-pm.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_device.h>
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bbebdc3769b0..69ab459abb98 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -45,6 +45,7 @@
 #include <linux/init.h>
 #include <linux/blkdev.h>
 #include <linux/blkpg.h>
+#include <linux/blk-pm.h>
 #include <linux/delay.h>
 #include <linux/mutex.h>
 #include <linux/string_helpers.h>
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 3f3cb72e0c0c..de4413e66eca 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -43,6 +43,7 @@
 #include <linux/interrupt.h>
 #include <linux/init.h>
 #include <linux/blkdev.h>
+#include <linux/blk-pm.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
 #include <linux/pm_runtime.h>
diff --git a/include/linux/blk-pm.h b/include/linux/blk-pm.h
new file mode 100644
index 000000000000..b80c65aba249
--- /dev/null
+++ b/include/linux/blk-pm.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _BLK_PM_H_
+#define _BLK_PM_H_
+
+struct device;
+struct request_queue;
+
+/*
+ * block layer runtime pm functions
+ */
+#ifdef CONFIG_PM
+extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
+extern int blk_pre_runtime_suspend(struct request_queue *q);
+extern void blk_post_runtime_suspend(struct request_queue *q, int err);
+extern void blk_pre_runtime_resume(struct request_queue *q);
+extern void blk_post_runtime_resume(struct request_queue *q, int err);
+extern void blk_set_runtime_active(struct request_queue *q);
+#else
+static inline void blk_pm_runtime_init(struct request_queue *q,
+				       struct device *dev) {}
+#endif
+
+#endif /* _BLK_PM_H_ */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9467c21ec040..dc24e9e2bea1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1282,29 +1282,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
 extern void blk_put_queue(struct request_queue *);
 extern void blk_set_queue_dying(struct request_queue *);
 
-/*
- * block layer runtime pm functions
- */
-#ifdef CONFIG_PM
-extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
-extern int blk_pre_runtime_suspend(struct request_queue *q);
-extern void blk_post_runtime_suspend(struct request_queue *q, int err);
-extern void blk_pre_runtime_resume(struct request_queue *q);
-extern void blk_post_runtime_resume(struct request_queue *q, int err);
-extern void blk_set_runtime_active(struct request_queue *q);
-#else
-static inline void blk_pm_runtime_init(struct request_queue *q,
-	struct device *dev) {}
-static inline int blk_pre_runtime_suspend(struct request_queue *q)
-{
-	return -ENOSYS;
-}
-static inline void blk_post_runtime_suspend(struct request_queue *q, int err) {}
-static inline void blk_pre_runtime_resume(struct request_queue *q) {}
-static inline void blk_post_runtime_resume(struct request_queue *q, int err) {}
-static inline void blk_set_runtime_active(struct request_queue *q) {}
-#endif
-
 /*
  * blk_plug permits building a queue of related requests by holding the I/O
  * fragments for a short period. This allows merging of sequential requests
-- 
2.18.0


* [PATCH v3 3/9] block: Serialize queue freezing and blk_pre_runtime_suspend()
  2018-08-02 18:29 [PATCH v3 0/9] blk-mq: Enable runtime power management Bart Van Assche
  2018-08-02 18:29 ` [PATCH v3 1/9] block: Fix a comment in a header file Bart Van Assche
  2018-08-02 18:29 ` [PATCH v3 2/9] block: Move power management functions into new source files Bart Van Assche
@ 2018-08-02 18:29 ` Bart Van Assche
  2018-08-02 18:29 ` [PATCH v3 4/9] percpu-refcount: Introduce percpu_ref_read() Bart Van Assche
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Bart Van Assche @ 2018-08-02 18:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Johannes Thumshirn, Alan Stern

Serialize these operations because a later patch will add code into
blk_pre_runtime_suspend() that must not run concurrently with queue
freezing or unfreezing.
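The serialization introduced below is an owner-recursive lock: the task
that holds it may re-acquire it (the nesting level is tracked), while
other tasks wait. A rough userspace model of that scheme, using pthreads
instead of the kernel's spinlock and wait queue, might look as follows.
The names (rpm_lock_acquire(), rpm_lock_release()) are illustrative and
are not the kernel API added by this patch:

```c
#include <assert.h>
#include <pthread.h>

/* Userspace sketch of the owner-recursive lock that
 * blk_pm_runtime_lock()/blk_pm_runtime_unlock() implement. */
struct rpm_lock {
	pthread_mutex_t mutex;	/* models q->rpm_lock */
	pthread_cond_t wq;	/* models q->rpm_wq */
	pthread_t owner;	/* models q->rpm_owner */
	int has_owner;		/* models q->rpm_owner != NULL */
	int nesting_level;	/* models q->rpm_nesting_level */
};

void rpm_lock_init(struct rpm_lock *l)
{
	pthread_mutex_init(&l->mutex, NULL);
	pthread_cond_init(&l->wq, NULL);
	l->has_owner = 0;
	l->nesting_level = 0;
}

void rpm_lock_acquire(struct rpm_lock *l)
{
	pthread_mutex_lock(&l->mutex);
	/* Wait until the lock is free or already owned by this thread. */
	while (l->has_owner && !pthread_equal(l->owner, pthread_self()))
		pthread_cond_wait(&l->wq, &l->mutex);
	l->owner = pthread_self();
	l->has_owner = 1;
	l->nesting_level++;
	pthread_mutex_unlock(&l->mutex);
}

void rpm_lock_release(struct rpm_lock *l)
{
	pthread_mutex_lock(&l->mutex);
	assert(l->nesting_level > 0);
	/* Only the outermost unlock releases ownership and wakes waiters. */
	if (--l->nesting_level == 0) {
		l->has_owner = 0;
		pthread_cond_signal(&l->wq);
	}
	pthread_mutex_unlock(&l->mutex);
}
```

Because the lock is recursive for its owner, blk_pre_runtime_suspend()
can take it even if the same task already froze the queue, which is why
blk_set_queue_dying() and blk_cleanup_queue() can pair a
blk_freeze_queue_start()/blk_freeze_queue() call with a
blk_pm_runtime_unlock() call in the hunks below.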

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Alan Stern <stern@rowland.harvard.edu>
---
 block/blk-core.c       |  5 +++++
 block/blk-mq.c         |  3 +++
 block/blk-pm.c         | 44 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/blk-pm.h |  6 ++++++
 include/linux/blkdev.h |  5 +++++
 5 files changed, 63 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 236071a487a2..7ca76cf26870 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/blkdev.h>
 #include <linux/blk-mq.h>
+#include <linux/blk-pm.h>
 #include <linux/highmem.h>
 #include <linux/mm.h>
 #include <linux/kernel_stat.h>
@@ -695,6 +696,7 @@ void blk_set_queue_dying(struct request_queue *q)
 	 * prevent I/O from crossing blk_queue_enter().
 	 */
 	blk_freeze_queue_start(q);
+	blk_pm_runtime_unlock(q);
 
 	if (q->mq_ops)
 		blk_mq_wake_waiters(q);
@@ -755,6 +757,7 @@ void blk_cleanup_queue(struct request_queue *q)
 	 * prevent that q->request_fn() gets invoked after draining finished.
 	 */
 	blk_freeze_queue(q);
+	blk_pm_runtime_unlock(q);
 	spin_lock_irq(lock);
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);
@@ -1044,6 +1047,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id,
 #ifdef CONFIG_BLK_DEV_IO_TRACE
 	mutex_init(&q->blk_trace_mutex);
 #endif
+	blk_pm_init(q);
+
 	mutex_init(&q->sysfs_lock);
 	spin_lock_init(&q->__queue_lock);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c92ce06fd565..8d845872ea02 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -9,6 +9,7 @@
 #include <linux/backing-dev.h>
 #include <linux/bio.h>
 #include <linux/blkdev.h>
+#include <linux/blk-pm.h>
 #include <linux/kmemleak.h>
 #include <linux/mm.h>
 #include <linux/init.h>
@@ -138,6 +139,7 @@ void blk_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
+	blk_pm_runtime_lock(q);
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
@@ -201,6 +203,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 		percpu_ref_reinit(&q->q_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}
+	blk_pm_runtime_unlock(q);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 08d7222d4757..524e19b67380 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -3,6 +3,45 @@
 #include <linux/blk-pm.h>
 #include <linux/blkdev.h>
 #include <linux/pm_runtime.h>
+#include <linux/spinlock.h>
+#include <linux/wait.h>
+
+/*
+ * Initialize the request queue members used by blk_pm_runtime_lock() and
+ * blk_pm_runtime_unlock().
+ */
+void blk_pm_init(struct request_queue *q)
+{
+	spin_lock_init(&q->rpm_lock);
+	init_waitqueue_head(&q->rpm_wq);
+	q->rpm_owner = NULL;
+	q->rpm_nesting_level = 0;
+}
+
+void blk_pm_runtime_lock(struct request_queue *q)
+{
+	might_sleep();
+
+	spin_lock(&q->rpm_lock);
+	wait_event_exclusive_cmd(q->rpm_wq,
+			q->rpm_owner == NULL || q->rpm_owner == current,
+			spin_unlock(&q->rpm_lock), spin_lock(&q->rpm_lock));
+	if (q->rpm_owner == NULL)
+		q->rpm_owner = current;
+	q->rpm_nesting_level++;
+	spin_unlock(&q->rpm_lock);
+}
+
+void blk_pm_runtime_unlock(struct request_queue *q)
+{
+	spin_lock(&q->rpm_lock);
+	WARN_ON_ONCE(q->rpm_nesting_level <= 0);
+	if (--q->rpm_nesting_level == 0) {
+		q->rpm_owner = NULL;
+		wake_up(&q->rpm_wq);
+	}
+	spin_unlock(&q->rpm_lock);
+}
 
 /**
  * blk_pm_runtime_init - Block layer runtime PM initialization routine
@@ -66,6 +105,8 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	if (!q->dev)
 		return ret;
 
+	blk_pm_runtime_lock(q);
+
 	spin_lock_irq(q->queue_lock);
 	if (q->nr_pending) {
 		ret = -EBUSY;
@@ -74,6 +115,9 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 		q->rpm_status = RPM_SUSPENDING;
 	}
 	spin_unlock_irq(q->queue_lock);
+
+	blk_pm_runtime_unlock(q);
+
 	return ret;
 }
 EXPORT_SYMBOL(blk_pre_runtime_suspend);
diff --git a/include/linux/blk-pm.h b/include/linux/blk-pm.h
index b80c65aba249..aafcc7877e53 100644
--- a/include/linux/blk-pm.h
+++ b/include/linux/blk-pm.h
@@ -10,6 +10,9 @@ struct request_queue;
  * block layer runtime pm functions
  */
 #ifdef CONFIG_PM
+extern void blk_pm_init(struct request_queue *q);
+extern void blk_pm_runtime_lock(struct request_queue *q);
+extern void blk_pm_runtime_unlock(struct request_queue *q);
 extern void blk_pm_runtime_init(struct request_queue *q, struct device *dev);
 extern int blk_pre_runtime_suspend(struct request_queue *q);
 extern void blk_post_runtime_suspend(struct request_queue *q, int err);
@@ -17,6 +20,9 @@ extern void blk_pre_runtime_resume(struct request_queue *q);
 extern void blk_post_runtime_resume(struct request_queue *q, int err);
 extern void blk_set_runtime_active(struct request_queue *q);
 #else
+static inline void blk_pm_init(struct request_queue *q) {}
+static inline void blk_pm_runtime_lock(struct request_queue *q) {}
+static inline void blk_pm_runtime_unlock(struct request_queue *q) {}
 static inline void blk_pm_runtime_init(struct request_queue *q,
 				       struct device *dev) {}
 #endif
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index dc24e9e2bea1..74a2f9f5718b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -544,6 +544,11 @@ struct request_queue {
 	struct device		*dev;
 	int			rpm_status;
 	unsigned int		nr_pending;
+	wait_queue_head_t	rpm_wq;
+	/* rpm_lock protects rpm_owner and rpm_nesting_level */
+	spinlock_t		rpm_lock;
+	struct task_struct	*rpm_owner;
+	int			rpm_nesting_level;
 #endif
 
 	/*
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 4/9] percpu-refcount: Introduce percpu_ref_read()
  2018-08-02 18:29 [PATCH v3 0/9] blk-mq: Enable runtime power management Bart Van Assche
                   ` (2 preceding siblings ...)
  2018-08-02 18:29 ` [PATCH v3 3/9] block: Serialize queue freezing and blk_pre_runtime_suspend() Bart Van Assche
@ 2018-08-02 18:29 ` Bart Van Assche
  2018-08-02 19:06   ` Tejun Heo
  2018-08-02 18:29 ` [PATCH v3 5/9] block, scsi: Rework runtime power management Bart Van Assche
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-08-02 18:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Tejun Heo,
	Jianchao Wang, Ming Lei, Alan Stern, Johannes Thumshirn

Introduce a function that allows reading the value of a per-cpu counter.
This function will be used in the next patch to check whether a per-cpu
counter in atomic mode has the value one.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 include/linux/percpu-refcount.h |  2 ++
 lib/percpu-refcount.c           | 29 +++++++++++++++++++++++++++++
 2 files changed, 31 insertions(+)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 009cdf3d65b6..5707289ba828 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -331,4 +331,6 @@ static inline bool percpu_ref_is_zero(struct percpu_ref *ref)
 	return !atomic_long_read(&ref->count);
 }
 
+unsigned long percpu_ref_read(struct percpu_ref *ref);
+
 #endif
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 9f96fa7bc000..c0b9fc8efa6b 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -369,3 +369,32 @@ void percpu_ref_reinit(struct percpu_ref *ref)
 	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_reinit);
+
+/**
+ * percpu_ref_read - read a percpu refcount
+ * @ref: percpu_ref to test
+ *
+ * This function is safe to call as long as @ref is between init and exit. It
+ * is the responsibility of the caller to handle changes of @ref concurrently
+ * with this function. If this function is called while @ref is in per-cpu
+ * mode the returned value may be incorrect if e.g. percpu_ref_get() is called
+ * from one CPU and percpu_ref_put() is called from another CPU.
+ */
+unsigned long percpu_ref_read(struct percpu_ref *ref)
+{
+	unsigned long __percpu *percpu_count;
+	unsigned long sum = 0;
+	int cpu;
+
+	rcu_read_lock_sched();
+	if (__ref_is_percpu(ref, &percpu_count)) {
+		for_each_possible_cpu(cpu)
+			sum += *per_cpu_ptr(percpu_count, cpu);
+	}
+	rcu_read_unlock_sched();
+	sum += atomic_long_read(&ref->count);
+	sum &= ~PERCPU_COUNT_BIAS;
+
+	return sum;
+}
+EXPORT_SYMBOL_GPL(percpu_ref_read);
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 5/9] block, scsi: Rework runtime power management
  2018-08-02 18:29 [PATCH v3 0/9] blk-mq: Enable runtime power management Bart Van Assche
                   ` (3 preceding siblings ...)
  2018-08-02 18:29 ` [PATCH v3 4/9] percpu-refcount: Introduce percpu_ref_read() Bart Van Assche
@ 2018-08-02 18:29 ` Bart Van Assche
  2018-08-02 23:50   ` Ming Lei
  2018-08-02 18:29 ` [PATCH v3 6/9] block: Warn if pm_runtime_get*() has not been called Bart Van Assche
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-08-02 18:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Martin K . Petersen, Jianchao Wang, Ming Lei, Alan Stern,
	Johannes Thumshirn

Instead of allowing requests that are not power management requests
to enter the queue in runtime suspended status (RPM_SUSPENDED), make
the blk_get_request() caller block. This change fixes a starvation
issue: it is now guaranteed that power management requests will be
executed no matter how many blk_get_request() callers are waiting.
Instead of maintaining the q->nr_pending counter, rely on
q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
request finishes instead of only if the queue depth drops to zero.
Use RQF_PREEMPT to mark power management requests instead of RQF_PM.
This is safe because the power management core serializes system-wide
suspend/resume and runtime power management state changes.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-core.c          | 47 +++++++++++++++------------------------
 block/blk-mq-debugfs.c    |  1 -
 block/blk-pm.c            | 37 +++++++++++++++++++++++++-----
 block/blk-pm.h            |  6 ++---
 drivers/scsi/sd.c         |  4 ++--
 drivers/scsi/ufs/ufshcd.c | 10 ++++-----
 include/linux/blkdev.h    |  7 ++----
 7 files changed, 61 insertions(+), 51 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7ca76cf26870..9ba6e42a4b18 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -690,6 +690,16 @@ void blk_set_queue_dying(struct request_queue *q)
 {
 	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
 
+#ifdef CONFIG_PM
+	/*
+	 * Avoid that runtime power management tries to modify the state of
+	 * q->q_usage_counter after that counter has been transitioned to the
+	 * "dead" state.
+	 */
+	if (q->dev)
+		pm_runtime_disable(q->dev);
+#endif
+
 	/*
 	 * When queue DYING flag is set, we need to block new req
 	 * entering queue, so we call blk_freeze_queue_start() to
@@ -2737,30 +2747,6 @@ void blk_account_io_done(struct request *req, u64 now)
 	}
 }
 
-#ifdef CONFIG_PM
-/*
- * Don't process normal requests when queue is suspended
- * or in the process of suspending/resuming
- */
-static bool blk_pm_allow_request(struct request *rq)
-{
-	switch (rq->q->rpm_status) {
-	case RPM_RESUMING:
-	case RPM_SUSPENDING:
-		return rq->rq_flags & RQF_PM;
-	case RPM_SUSPENDED:
-		return false;
-	default:
-		return true;
-	}
-}
-#else
-static bool blk_pm_allow_request(struct request *rq)
-{
-	return true;
-}
-#endif
-
 void blk_account_io_start(struct request *rq, bool new_io)
 {
 	struct hd_struct *part;
@@ -2806,11 +2792,14 @@ static struct request *elv_next_request(struct request_queue *q)
 
 	while (1) {
 		list_for_each_entry(rq, &q->queue_head, queuelist) {
-			if (blk_pm_allow_request(rq))
-				return rq;
-
-			if (rq->rq_flags & RQF_SOFTBARRIER)
-				break;
+#ifdef CONFIG_PM
+			/*
+			 * If a request gets queued in state RPM_SUSPENDED
+			 * then that's a kernel bug.
+			 */
+			WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
+#endif
+			return rq;
 		}
 
 		/*
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index cb1e6cf7ac48..994bdd41feb2 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -324,7 +324,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(ELVPRIV),
 	RQF_NAME(IO_STAT),
 	RQF_NAME(ALLOCED),
-	RQF_NAME(PM),
 	RQF_NAME(HASHED),
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
diff --git a/block/blk-pm.c b/block/blk-pm.c
index 524e19b67380..c33c881f64df 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -77,6 +77,13 @@ void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 }
 EXPORT_SYMBOL(blk_pm_runtime_init);
 
+static void blk_unprepare_runtime_suspend(struct request_queue *q)
+{
+	blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+	/* Because QUEUE_FLAG_PREEMPT_ONLY has been cleared. */
+	wake_up_all(&q->mq_freeze_wq);
+}
+
 /**
  * blk_pre_runtime_suspend - Pre runtime suspend check
  * @q: the queue of the device
@@ -105,17 +112,32 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	if (!q->dev)
 		return ret;
 
+	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
+
 	blk_pm_runtime_lock(q);
+	/*
+	 * Set the PREEMPT_ONLY flag before switching the queue usage counter
+	 * to atomic mode such that blk_queue_enter() sees the two writes in
+	 * the right order. See also http://lwn.net/Articles/573497/.
+	 */
+	blk_set_preempt_only(q);
+	ret = -EBUSY;
+	if (percpu_ref_read(&q->q_usage_counter) == 1) {
+		percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
+		if (percpu_ref_read(&q->q_usage_counter) == 1)
+			ret = 0;
+	}
 
 	spin_lock_irq(q->queue_lock);
-	if (q->nr_pending) {
-		ret = -EBUSY;
+	if (ret < 0)
 		pm_runtime_mark_last_busy(q->dev);
-	} else {
+	else
 		q->rpm_status = RPM_SUSPENDING;
-	}
 	spin_unlock_irq(q->queue_lock);
 
+	percpu_ref_switch_to_percpu(&q->q_usage_counter);
+	if (ret)
+		blk_unprepare_runtime_suspend(q);
 	blk_pm_runtime_unlock(q);
 
 	return ret;
@@ -148,6 +170,9 @@ void blk_post_runtime_suspend(struct request_queue *q, int err)
 		pm_runtime_mark_last_busy(q->dev);
 	}
 	spin_unlock_irq(q->queue_lock);
+
+	if (err)
+		blk_unprepare_runtime_suspend(q);
 }
 EXPORT_SYMBOL(blk_post_runtime_suspend);
 
@@ -195,13 +220,15 @@ void blk_post_runtime_resume(struct request_queue *q, int err)
 	spin_lock_irq(q->queue_lock);
 	if (!err) {
 		q->rpm_status = RPM_ACTIVE;
-		__blk_run_queue(q);
 		pm_runtime_mark_last_busy(q->dev);
 		pm_request_autosuspend(q->dev);
 	} else {
 		q->rpm_status = RPM_SUSPENDED;
 	}
 	spin_unlock_irq(q->queue_lock);
+
+	if (!err)
+		blk_unprepare_runtime_suspend(q);
 }
 EXPORT_SYMBOL(blk_post_runtime_resume);
 
diff --git a/block/blk-pm.h b/block/blk-pm.h
index 1ffc8ef203ec..226fe34c0df9 100644
--- a/block/blk-pm.h
+++ b/block/blk-pm.h
@@ -8,21 +8,19 @@
 #ifdef CONFIG_PM
 static inline void blk_pm_requeue_request(struct request *rq)
 {
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM))
-		rq->q->nr_pending--;
 }
 
 static inline void blk_pm_add_request(struct request_queue *q,
 				      struct request *rq)
 {
-	if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
+	if (q->dev && !(rq->rq_flags & RQF_PREEMPT) &&
 	    (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
 		pm_request_resume(q->dev);
 }
 
 static inline void blk_pm_put_request(struct request *rq)
 {
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
+	if (rq->q->dev && !(rq->rq_flags & RQF_PREEMPT))
 		pm_runtime_mark_last_busy(rq->q->dev);
 }
 #else
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 69ab459abb98..95d83af37c3e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1628,7 +1628,7 @@ static int sd_sync_cache(struct scsi_disk *sdkp, struct scsi_sense_hdr *sshdr)
 		 * flush everything.
 		 */
 		res = scsi_execute(sdp, cmd, DMA_NONE, NULL, 0, NULL, sshdr,
-				timeout, SD_MAX_RETRIES, 0, RQF_PM, NULL);
+				timeout, SD_MAX_RETRIES, 0, RQF_PREEMPT, NULL);
 		if (res == 0)
 			break;
 	}
@@ -3490,7 +3490,7 @@ static int sd_start_stop_device(struct scsi_disk *sdkp, int start)
 		return -ENODEV;
 
 	res = scsi_execute(sdp, cmd, DMA_NONE, NULL, 0, NULL, &sshdr,
-			SD_TIMEOUT, SD_MAX_RETRIES, 0, RQF_PM, NULL);
+			   SD_TIMEOUT, SD_MAX_RETRIES, 0, RQF_PREEMPT, NULL);
 	if (res) {
 		sd_print_result(sdkp, "Start/Stop Unit failed", res);
 		if (driver_byte(res) & DRIVER_SENSE)
diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 397081d320b1..4a16d6e90e65 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -7219,7 +7219,7 @@ ufshcd_send_request_sense(struct ufs_hba *hba, struct scsi_device *sdp)
 
 	ret = scsi_execute(sdp, cmd, DMA_FROM_DEVICE, buffer,
 			UFSHCD_REQ_SENSE_SIZE, NULL, NULL,
-			msecs_to_jiffies(1000), 3, 0, RQF_PM, NULL);
+			msecs_to_jiffies(1000), 3, 0, RQF_PREEMPT, NULL);
 	if (ret)
 		pr_err("%s: failed with err %d\n", __func__, ret);
 
@@ -7280,12 +7280,12 @@ static int ufshcd_set_dev_pwr_mode(struct ufs_hba *hba,
 	cmd[4] = pwr_mode << 4;
 
 	/*
-	 * Current function would be generally called from the power management
-	 * callbacks hence set the RQF_PM flag so that it doesn't resume the
-	 * already suspended childs.
+	 * Current function would be generally called from the power
+	 * management callbacks hence set the RQF_PREEMPT flag so that it
+	 * doesn't resume the already suspended children.
 	 */
 	ret = scsi_execute(sdp, cmd, DMA_NONE, NULL, 0, NULL, &sshdr,
-			START_STOP_TIMEOUT, 0, 0, RQF_PM, NULL);
+			   START_STOP_TIMEOUT, 0, 0, RQF_PREEMPT, NULL);
 	if (ret) {
 		sdev_printk(KERN_WARNING, sdp,
 			    "START_STOP failed for power mode: %d, result %x\n",
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 74a2f9f5718b..ae1b3804e5e4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -99,8 +99,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_MQ_INFLIGHT		((__force req_flags_t)(1 << 6))
 /* don't call prep for this one */
 #define RQF_DONTPREP		((__force req_flags_t)(1 << 7))
-/* set for "ide_preempt" requests and also for requests for which the SCSI
-   "quiesce" state must be ignored. */
+/* set for requests that must be processed even if QUEUE_FLAG_PREEMPT_ONLY has
+   been set, e.g. power management requests and "ide_preempt" requests. */
 #define RQF_PREEMPT		((__force req_flags_t)(1 << 8))
 /* contains copies of user pages */
 #define RQF_COPY_USER		((__force req_flags_t)(1 << 9))
@@ -114,8 +114,6 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_IO_STAT		((__force req_flags_t)(1 << 13))
 /* request came from our alloc pool */
 #define RQF_ALLOCED		((__force req_flags_t)(1 << 14))
-/* runtime pm request */
-#define RQF_PM			((__force req_flags_t)(1 << 15))
 /* on IO scheduler merge hash */
 #define RQF_HASHED		((__force req_flags_t)(1 << 16))
 /* IO stats tracking on */
@@ -543,7 +541,6 @@ struct request_queue {
 #ifdef CONFIG_PM
 	struct device		*dev;
 	int			rpm_status;
-	unsigned int		nr_pending;
 	wait_queue_head_t	rpm_wq;
 	/* rpm_lock protects rpm_owner and rpm_nesting_level */
 	spinlock_t		rpm_lock;
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 6/9] block: Warn if pm_runtime_get*() has not been called
  2018-08-02 18:29 [PATCH v3 0/9] blk-mq: Enable runtime power management Bart Van Assche
                   ` (4 preceding siblings ...)
  2018-08-02 18:29 ` [PATCH v3 5/9] block, scsi: Rework runtime power management Bart Van Assche
@ 2018-08-02 18:29 ` Bart Van Assche
  2018-08-02 18:29 ` [PATCH v3 7/9] block: Remove blk_pm_requeue_request() Bart Van Assche
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Bart Van Assche @ 2018-08-02 18:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Alan Stern, Johannes Thumshirn

Issue a warning if a pm_runtime_get*() call is missing. This makes it
easier to determine from which code path such a call should have been
made.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-pm.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/blk-pm.h b/block/blk-pm.h
index 226fe34c0df9..3978a3021958 100644
--- a/block/blk-pm.h
+++ b/block/blk-pm.h
@@ -13,8 +13,10 @@ static inline void blk_pm_requeue_request(struct request *rq)
 static inline void blk_pm_add_request(struct request_queue *q,
 				      struct request *rq)
 {
-	if (q->dev && !(rq->rq_flags & RQF_PREEMPT) &&
-	    (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
+	if (!q->dev || (rq->rq_flags & RQF_PREEMPT))
+		return;
+	WARN_ON_ONCE(atomic_read(&q->dev->power.usage_count) == 0);
+	if (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING)
 		pm_request_resume(q->dev);
 }
 
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 7/9] block: Remove blk_pm_requeue_request()
  2018-08-02 18:29 [PATCH v3 0/9] blk-mq: Enable runtime power management Bart Van Assche
                   ` (5 preceding siblings ...)
  2018-08-02 18:29 ` [PATCH v3 6/9] block: Warn if pm_runtime_get*() has not been called Bart Van Assche
@ 2018-08-02 18:29 ` Bart Van Assche
  2018-08-02 18:29 ` [PATCH v3 8/9] blk-mq: Insert blk_pm_{add,put}_request() calls Bart Van Assche
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Bart Van Assche @ 2018-08-02 18:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Alan Stern, Johannes Thumshirn

Since this function is now empty, remove it.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-pm.h   | 8 --------
 block/elevator.c | 2 --
 2 files changed, 10 deletions(-)

diff --git a/block/blk-pm.h b/block/blk-pm.h
index 3978a3021958..d9077bd7ba6d 100644
--- a/block/blk-pm.h
+++ b/block/blk-pm.h
@@ -6,10 +6,6 @@
 #include <linux/pm_runtime.h>
 
 #ifdef CONFIG_PM
-static inline void blk_pm_requeue_request(struct request *rq)
-{
-}
-
 static inline void blk_pm_add_request(struct request_queue *q,
 				      struct request *rq)
 {
@@ -26,10 +22,6 @@ static inline void blk_pm_put_request(struct request *rq)
 		pm_runtime_mark_last_busy(rq->q->dev);
 }
 #else
-static inline void blk_pm_requeue_request(struct request *rq)
-{
-}
-
 static inline void blk_pm_add_request(struct request_queue *q,
 				      struct request *rq)
 {
diff --git a/block/elevator.c b/block/elevator.c
index 4c15f0240c99..fd7b12bb32a3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -572,8 +572,6 @@ void elv_requeue_request(struct request_queue *q, struct request *rq)
 
 	rq->rq_flags &= ~RQF_STARTED;
 
-	blk_pm_requeue_request(rq);
-
 	__elv_add_request(q, rq, ELEVATOR_INSERT_REQUEUE);
 }
 
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 8/9] blk-mq: Insert blk_pm_{add,put}_request() calls
  2018-08-02 18:29 [PATCH v3 0/9] blk-mq: Enable runtime power management Bart Van Assche
                   ` (6 preceding siblings ...)
  2018-08-02 18:29 ` [PATCH v3 7/9] block: Remove blk_pm_requeue_request() Bart Van Assche
@ 2018-08-02 18:29 ` Bart Van Assche
  2018-08-02 23:53   ` Ming Lei
  2018-08-02 18:29 ` [PATCH v3 9/9] blk-mq: Enable support for runtime power management Bart Van Assche
  2018-08-02 21:25 ` [PATCH v3 0/9] blk-mq: Enable " Jens Axboe
  9 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-08-02 18:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Alan Stern, Johannes Thumshirn

Make sure that blk_pm_add_request() is called exactly once before a
request is added to a software queue or to the scheduler, and also
before .queue_rq() is called directly. Call blk_pm_put_request()
after a request has finished.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-mq-sched.c | 13 +++++++++++--
 block/blk-mq.c       |  8 ++++++++
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index cf9c66c6d35a..d87839b31d56 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -14,6 +14,7 @@
 #include "blk-mq-debugfs.h"
 #include "blk-mq-sched.h"
 #include "blk-mq-tag.h"
+#include "blk-pm.h"
 #include "blk-wbt.h"
 
 void blk_mq_sched_free_hctx_data(struct request_queue *q,
@@ -349,6 +350,8 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
 {
 	/* dispatch flush rq directly */
 	if (rq->rq_flags & RQF_FLUSH_SEQ) {
+		blk_pm_add_request(rq->q, rq);
+
 		spin_lock(&hctx->lock);
 		list_add(&rq->queuelist, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
@@ -380,6 +383,8 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
 	if (blk_mq_sched_bypass_insert(hctx, !!e, rq))
 		goto run;
 
+	blk_pm_add_request(q, rq);
+
 	if (e && e->type->ops.mq.insert_requests) {
 		LIST_HEAD(list);
 
@@ -402,10 +407,14 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
 {
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
 	struct elevator_queue *e = hctx->queue->elevator;
+	struct request *rq;
+
+	if (e && e->type->ops.mq.insert_requests) {
+		list_for_each_entry(rq, list, queuelist)
+			blk_pm_add_request(q, rq);
 
-	if (e && e->type->ops.mq.insert_requests)
 		e->type->ops.mq.insert_requests(hctx, list, false);
-	else {
+	} else {
 		/*
 		 * try to issue requests directly if the hw queue isn't
 		 * busy in case of 'none' scheduler, and this way may save
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8d845872ea02..592d81c37b07 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -36,6 +36,7 @@
 #include "blk-mq-tag.h"
 #include "blk-stat.h"
 #include "blk-mq-sched.h"
+#include "blk-pm.h"
 #include "blk-rq-qos.h"
 
 static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
@@ -476,6 +477,8 @@ static void __blk_mq_free_request(struct request *rq)
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
 	const int sched_tag = rq->internal_tag;
 
+	blk_pm_put_request(rq);
+
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -1565,6 +1568,8 @@ void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
 
+	blk_pm_add_request(rq->q, rq);
+
 	spin_lock(&hctx->lock);
 	list_add_tail(&rq->queuelist, &hctx->dispatch);
 	spin_unlock(&hctx->lock);
@@ -1586,6 +1591,7 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
 	list_for_each_entry(rq, list, queuelist) {
 		BUG_ON(rq->mq_ctx != ctx);
 		trace_block_rq_insert(hctx->queue, rq);
+		blk_pm_add_request(rq->q, rq);
 	}
 
 	spin_lock(&ctx->lock);
@@ -1682,6 +1688,8 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 	blk_qc_t new_cookie;
 	blk_status_t ret;
 
+	blk_pm_add_request(q, rq);
+
 	new_cookie = request_to_qc_t(hctx, rq);
 
 	/*
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 9/9] blk-mq: Enable support for runtime power management
  2018-08-02 18:29 [PATCH v3 0/9] blk-mq: Enable runtime power management Bart Van Assche
                   ` (7 preceding siblings ...)
  2018-08-02 18:29 ` [PATCH v3 8/9] blk-mq: Insert blk_pm_{add,put}_request() calls Bart Van Assche
@ 2018-08-02 18:29 ` Bart Van Assche
  2018-08-02 21:25 ` [PATCH v3 0/9] blk-mq: Enable " Jens Axboe
  9 siblings, 0 replies; 23+ messages in thread
From: Bart Van Assche @ 2018-08-02 18:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Jianchao Wang,
	Ming Lei, Alan Stern, Johannes Thumshirn

Now that the blk-mq core processes power management requests
(marked with RQF_PREEMPT) in states other than RPM_ACTIVE, enable
runtime power management for blk-mq.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
---
 block/blk-pm.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/block/blk-pm.c b/block/blk-pm.c
index c33c881f64df..d719cae766cf 100644
--- a/block/blk-pm.c
+++ b/block/blk-pm.c
@@ -66,10 +66,6 @@ void blk_pm_runtime_unlock(struct request_queue *q)
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-	/* not support for RQF_PM and ->rpm_status in blk-mq yet */
-	if (q->mq_ops)
-		return;
-
 	q->dev = dev;
 	q->rpm_status = RPM_ACTIVE;
 	pm_runtime_set_autosuspend_delay(q->dev, -1);
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 4/9] percpu-refcount: Introduce percpu_ref_read()
  2018-08-02 18:29 ` [PATCH v3 4/9] percpu-refcount: Introduce percpu_ref_read() Bart Van Assche
@ 2018-08-02 19:06   ` Tejun Heo
  2018-08-02 20:04     ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: Tejun Heo @ 2018-08-02 19:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jianchao Wang,
	Ming Lei, Alan Stern, Johannes Thumshirn

Hello,

On Thu, Aug 02, 2018 at 11:29:39AM -0700, Bart Van Assche wrote:
> Introduce a function that allows to read the value of a per-cpu counter.
> This function will be used in the next patch to check whether a per-cpu
> counter in atomic mode has the value one.

I'm not a big fan of exposing this.  If you need to test for atomic &&
1, I'd much prefer a helper specifically testing for that.  But can
you please explain a bit why you need this?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 4/9] percpu-refcount: Introduce percpu_ref_read()
  2018-08-02 19:06   ` Tejun Heo
@ 2018-08-02 20:04     ` Bart Van Assche
  2018-08-06 17:18       ` Tejun Heo
  0 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-08-02 20:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jianchao Wang,
	Ming Lei, Alan Stern, Johannes Thumshirn

On 08/02/18 12:06, Tejun Heo wrote:
> On Thu, Aug 02, 2018 at 11:29:39AM -0700, Bart Van Assche wrote:
>> Introduce a function that allows to read the value of a per-cpu counter.
>> This function will be used in the next patch to check whether a per-cpu
>> counter in atomic mode has the value one.
> 
> I'm not a big fan of exposing this.  If you need to test for atomic &&
> 1, I'd much prefer a helper specifically testing for that.  But can
> you please explain a bit why you need this?
Hello Tejun,

As you probably know, one of the long-term goals for the block layer
is to switch to blk-mq and to drop the legacy block layer. Hence this
patch series, which adds support for runtime power management to
blk-mq because today that feature is missing from blk-mq. The
approach is the same as for the legacy block layer: if the
autosuspend timer expires and no requests are in flight, suspend the
block device. So we need a mechanism to track whether or not any
requests are in flight. One possible approach is to check the value
of q_usage_counter. percpu_ref_is_zero() could only be used to check
q_usage_counter if that counter were switched to atomic mode first
and if the initial reference were dropped too. I want to avoid the
overhead of that switch to atomic mode whenever possible. Hence the
proposal to introduce the percpu_ref_read() function. If the value
returned by that function is larger than one then we know that
requests are in flight and hence that the switch to atomic mode can
be skipped.

Proposals for alternative approaches are welcome.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/9] blk-mq: Enable runtime power management
  2018-08-02 18:29 [PATCH v3 0/9] blk-mq: Enable runtime power management Bart Van Assche
                   ` (8 preceding siblings ...)
  2018-08-02 18:29 ` [PATCH v3 9/9] blk-mq: Enable support for runtime power management Bart Van Assche
@ 2018-08-02 21:25 ` Jens Axboe
  9 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2018-08-02 21:25 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, Christoph Hellwig

On 8/2/18 12:29 PM, Bart Van Assche wrote:
> Hello Jens,
> 
> This patch series not only enables runtime power management for blk-mq but
> also fixes a starvation issue in the power management code for the legacy
> block layer. Please consider this patch series for kernel v4.19.

Looks good to me, but would be nice to get Tejun onboard with the percpu
change. We'll need his ack to get this applied.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/9] block, scsi: Rework runtime power management
  2018-08-02 18:29 ` [PATCH v3 5/9] block, scsi: Rework runtime power management Bart Van Assche
@ 2018-08-02 23:50   ` Ming Lei
  2018-08-03 16:16     ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2018-08-02 23:50 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Martin K . Petersen,
	Jianchao Wang, Alan Stern, Johannes Thumshirn

On Thu, Aug 02, 2018 at 11:29:40AM -0700, Bart Van Assche wrote:
> Instead of allowing requests that are not power management requests
> to enter the queue in runtime suspended status (RPM_SUSPENDED), make
> the blk_get_request() caller block. This change fixes a starvation
> issue: it is now guaranteed that power management requests will be
> executed no matter how many blk_get_request() callers are waiting.
> Instead of maintaining the q->nr_pending counter, rely on
> q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
> request finishes instead of only if the queue depth drops to zero.
> Use RQF_PREEMPT to mark power management requests instead of RQF_PM.
> This is safe because the power management core serializes system-wide
> suspend/resume and runtime power management state changes.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  block/blk-core.c          | 47 +++++++++++++++------------------------
>  block/blk-mq-debugfs.c    |  1 -
>  block/blk-pm.c            | 37 +++++++++++++++++++++++++-----
>  block/blk-pm.h            |  6 ++---
>  drivers/scsi/sd.c         |  4 ++--
>  drivers/scsi/ufs/ufshcd.c | 10 ++++-----
>  include/linux/blkdev.h    |  7 ++----
>  7 files changed, 61 insertions(+), 51 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7ca76cf26870..9ba6e42a4b18 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -690,6 +690,16 @@ void blk_set_queue_dying(struct request_queue *q)
>  {
>  	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
>  
> +#ifdef CONFIG_PM
> +	/*
> +	 * Avoid that runtime power management tries to modify the state of
> +	 * q->q_usage_counter after that counter has been transitioned to the
> +	 * "dead" state.
> +	 */
> +	if (q->dev)
> +		pm_runtime_disable(q->dev);
> +#endif
> +
>  	/*
>  	 * When queue DYING flag is set, we need to block new req
>  	 * entering queue, so we call blk_freeze_queue_start() to
> @@ -2737,30 +2747,6 @@ void blk_account_io_done(struct request *req, u64 now)
>  	}
>  }
>  
> -#ifdef CONFIG_PM
> -/*
> - * Don't process normal requests when queue is suspended
> - * or in the process of suspending/resuming
> - */
> -static bool blk_pm_allow_request(struct request *rq)
> -{
> -	switch (rq->q->rpm_status) {
> -	case RPM_RESUMING:
> -	case RPM_SUSPENDING:
> -		return rq->rq_flags & RQF_PM;
> -	case RPM_SUSPENDED:
> -		return false;
> -	default:
> -		return true;
> -	}
> -}
> -#else
> -static bool blk_pm_allow_request(struct request *rq)
> -{
> -	return true;
> -}
> -#endif
> -
>  void blk_account_io_start(struct request *rq, bool new_io)
>  {
>  	struct hd_struct *part;
> @@ -2806,11 +2792,14 @@ static struct request *elv_next_request(struct request_queue *q)
>  
>  	while (1) {
>  		list_for_each_entry(rq, &q->queue_head, queuelist) {
> -			if (blk_pm_allow_request(rq))
> -				return rq;
> -
> -			if (rq->rq_flags & RQF_SOFTBARRIER)
> -				break;
> +#ifdef CONFIG_PM
> +			/*
> +			 * If a request gets queued in state RPM_SUSPENDED
> +			 * then that's a kernel bug.
> +			 */
> +			WARN_ON_ONCE(q->rpm_status == RPM_SUSPENDED);
> +#endif
> +			return rq;
>  		}
>  
>  		/*
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index cb1e6cf7ac48..994bdd41feb2 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -324,7 +324,6 @@ static const char *const rqf_name[] = {
>  	RQF_NAME(ELVPRIV),
>  	RQF_NAME(IO_STAT),
>  	RQF_NAME(ALLOCED),
> -	RQF_NAME(PM),
>  	RQF_NAME(HASHED),
>  	RQF_NAME(STATS),
>  	RQF_NAME(SPECIAL_PAYLOAD),
> diff --git a/block/blk-pm.c b/block/blk-pm.c
> index 524e19b67380..c33c881f64df 100644
> --- a/block/blk-pm.c
> +++ b/block/blk-pm.c
> @@ -77,6 +77,13 @@ void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
>  }
>  EXPORT_SYMBOL(blk_pm_runtime_init);
>  
> +static void blk_unprepare_runtime_suspend(struct request_queue *q)
> +{
> +	blk_queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
> +	/* Because QUEUE_FLAG_PREEMPT_ONLY has been cleared. */
> +	wake_up_all(&q->mq_freeze_wq);
> +}
> +
>  /**
>   * blk_pre_runtime_suspend - Pre runtime suspend check
>   * @q: the queue of the device
> @@ -105,17 +112,32 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>  	if (!q->dev)
>  		return ret;
>  
> +	WARN_ON_ONCE(q->rpm_status != RPM_ACTIVE);
> +
>  	blk_pm_runtime_lock(q);
> +	/*
> +	 * Set the PREEMPT_ONLY flag before switching the queue usage counter
> +	 * to atomic mode such that blk_queue_enter() sees the two writes in
> +	 * the right order. See also http://lwn.net/Articles/573497/.
> +	 */
> +	blk_set_preempt_only(q);
> +	ret = -EBUSY;
> +	if (percpu_ref_read(&q->q_usage_counter) == 1) {
> +		percpu_ref_switch_to_atomic_sync(&q->q_usage_counter);
> +		if (percpu_ref_read(&q->q_usage_counter) == 1)
> +			ret = 0;
> +	}

I am not sure that using PREEMPT_ONLY can work well for runtime PM; there
are lots of users that depend on blk_freeze_queue() to make sure
there aren't any in-flight requests.

But now PREEMPT_ONLY opens a door for preempt requests during that big
window, so this approach may break current users of blk_freeze_queue().
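To make the gating semantics concrete, here is a toy Python model of how blk_queue_enter() behaves under the two mechanisms. All names and the dict-based "queue" are illustrative simplifications, not the kernel's actual data structures:

```python
# Toy model of request admission under PREEMPT_ONLY vs. a full freeze.
# Illustrative only; the real blk_queue_enter() works on q_usage_counter
# and waits on mq_freeze_wq instead of returning a string.

def blk_queue_enter(q, preempt):
    """Return how a blk_queue_enter() caller would fare."""
    if q["frozen"]:
        return "blocked"          # a full freeze stops every caller
    if q["preempt_only"] and not preempt:
        return "blocked"          # normal requests wait until the flag clears
    return "entered"

q = {"preempt_only": True, "frozen": False}
print(blk_queue_enter(q, preempt=True))   # prints "entered"
print(blk_queue_enter(q, preempt=False))  # prints "blocked"
```

The first case is both what fixes the starvation issue and the door described above: unlike blk_freeze_queue(), a PREEMPT_ONLY quiesce still admits preempt requests, so code that equates "quiesced" with "no requests in flight" can be surprised.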


Thanks,
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 8/9] blk-mq: Insert blk_pm_{add,put}_request() calls
  2018-08-02 18:29 ` [PATCH v3 8/9] blk-mq: Insert blk_pm_{add,put}_request() calls Bart Van Assche
@ 2018-08-02 23:53   ` Ming Lei
  2018-08-03  0:08     ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2018-08-02 23:53 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jianchao Wang,
	Alan Stern, Johannes Thumshirn

On Thu, Aug 02, 2018 at 11:29:43AM -0700, Bart Van Assche wrote:
> Make sure that blk_pm_add_request() is called exactly once before
> a request is added to a software queue, to the scheduler and also
> before .queue_rq() is called directly. Call blk_pm_put_request()
> after a request has finished.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> ---
>  block/blk-mq-sched.c | 13 +++++++++++--
>  block/blk-mq.c       |  8 ++++++++
>  2 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index cf9c66c6d35a..d87839b31d56 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -14,6 +14,7 @@
>  #include "blk-mq-debugfs.h"
>  #include "blk-mq-sched.h"
>  #include "blk-mq-tag.h"
> +#include "blk-pm.h"
>  #include "blk-wbt.h"
>  
>  void blk_mq_sched_free_hctx_data(struct request_queue *q,
> @@ -349,6 +350,8 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx,
>  {
>  	/* dispatch flush rq directly */
>  	if (rq->rq_flags & RQF_FLUSH_SEQ) {
> +		blk_pm_add_request(rq->q, rq);
> +
>  		spin_lock(&hctx->lock);
>  		list_add(&rq->queuelist, &hctx->dispatch);
>  		spin_unlock(&hctx->lock);
> @@ -380,6 +383,8 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
>  	if (blk_mq_sched_bypass_insert(hctx, !!e, rq))
>  		goto run;
>  
> +	blk_pm_add_request(q, rq);
> +
>  	if (e && e->type->ops.mq.insert_requests) {
>  		LIST_HEAD(list);
>  
> @@ -402,10 +407,14 @@ void blk_mq_sched_insert_requests(struct request_queue *q,
>  {
>  	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
>  	struct elevator_queue *e = hctx->queue->elevator;
> +	struct request *rq;
> +
> +	if (e && e->type->ops.mq.insert_requests) {
> +		list_for_each_entry(rq, list, queuelist)
> +			blk_pm_add_request(q, rq);
>  
> -	if (e && e->type->ops.mq.insert_requests)
>  		e->type->ops.mq.insert_requests(hctx, list, false);
> -	else {
> +	} else {
>  		/*
>  		 * try to issue requests directly if the hw queue isn't
>  		 * busy in case of 'none' scheduler, and this way may save
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8d845872ea02..592d81c37b07 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -36,6 +36,7 @@
>  #include "blk-mq-tag.h"
>  #include "blk-stat.h"
>  #include "blk-mq-sched.h"
> +#include "blk-pm.h"
>  #include "blk-rq-qos.h"
>  
>  static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
> @@ -476,6 +477,8 @@ static void __blk_mq_free_request(struct request *rq)
>  	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
>  	const int sched_tag = rq->internal_tag;
>  
> +	blk_pm_put_request(rq);
> +
>  	if (rq->tag != -1)
>  		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
>  	if (sched_tag != -1)
> @@ -1565,6 +1568,8 @@ void blk_mq_request_bypass_insert(struct request *rq, bool run_queue)
>  	struct blk_mq_ctx *ctx = rq->mq_ctx;
>  	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
>  
> +	blk_pm_add_request(rq->q, rq);
> +
>  	spin_lock(&hctx->lock);
>  	list_add_tail(&rq->queuelist, &hctx->dispatch);
>  	spin_unlock(&hctx->lock);
> @@ -1586,6 +1591,7 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
>  	list_for_each_entry(rq, list, queuelist) {
>  		BUG_ON(rq->mq_ctx != ctx);
>  		trace_block_rq_insert(hctx->queue, rq);
> +		blk_pm_add_request(rq->q, rq);
>  	}
>  
>  	spin_lock(&ctx->lock);
> @@ -1682,6 +1688,8 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>  	blk_qc_t new_cookie;
>  	blk_status_t ret;
>  
> +	blk_pm_add_request(q, rq);
> +
>  	new_cookie = request_to_qc_t(hctx, rq);
>  

blk_pm_add_request() calls pm_request_resume() to wake up the device, but
that is wrong because the resume is asynchronous, which can't guarantee the
device will be ready before .queue_rq() is called.
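The async-vs-sync distinction can be illustrated with a small userspace model. The class and function names mirror the kernel concepts (pm_request_resume() vs. pm_runtime_resume()) but nothing here is kernel code:

```python
# Toy model of asynchronous vs. synchronous runtime resume.
import threading
import time

class ToyDevice:
    """A device whose runtime resume takes a while to complete."""
    def __init__(self):
        self.state = "RPM_SUSPENDED"
        self._resumed = threading.Event()

    def request_resume(self):
        # Asynchronous, like pm_request_resume(): queue the resume on a
        # worker and return immediately, without waiting for it.
        def worker():
            time.sleep(0.05)              # resuming takes time
            self.state = "RPM_ACTIVE"
            self._resumed.set()
        threading.Thread(target=worker, daemon=True).start()

    def runtime_resume(self):
        # Synchronous, like pm_runtime_resume(): only return once the
        # device is actually up.
        self.request_resume()
        self._resumed.wait()

def dispatch_after_async_resume(dev):
    """Models dispatching right after pm_request_resume(): racy."""
    dev.request_resume()
    return dev.state                      # still RPM_SUSPENDED here

def dispatch_after_sync_resume(dev):
    """Models dispatching only after a synchronous resume: safe."""
    dev.runtime_resume()
    return dev.state                      # guaranteed RPM_ACTIVE
```

In the async case the dispatch observes the device before the resume worker has run, which is exactly the window in which .queue_rq() would see a suspended device.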

Thanks,
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 8/9] blk-mq: Insert blk_pm_{add,put}_request() calls
  2018-08-02 23:53   ` Ming Lei
@ 2018-08-03  0:08     ` Bart Van Assche
  2018-08-03  0:11       ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-08-03  0:08 UTC (permalink / raw)
  To: ming.lei; +Cc: hch, linux-block, jthumshirn, stern, axboe, jianchao.w.wang

On Fri, 2018-08-03 at 07:53 +0800, Ming Lei wrote:
> blk_pm_add_request() calls pm_request_resume() to wake up the device, but
> that is wrong because the resume is asynchronous, which can't guarantee the
> device will be ready before .queue_rq() is called.

That's a longstanding issue and not something that has been introduced by this
patch series. Hence patch 6/9 that issues a warning if pm_request_resume() would
be called.

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 8/9] blk-mq: Insert blk_pm_{add,put}_request() calls
  2018-08-03  0:08     ` Bart Van Assche
@ 2018-08-03  0:11       ` Ming Lei
  2018-08-03  1:03         ` Bart Van Assche
  0 siblings, 1 reply; 23+ messages in thread
From: Ming Lei @ 2018-08-03  0:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, linux-block, jthumshirn, stern, axboe, jianchao.w.wang

On Fri, Aug 03, 2018 at 12:08:54AM +0000, Bart Van Assche wrote:
> On Fri, 2018-08-03 at 07:53 +0800, Ming Lei wrote:
> > blk_pm_add_request() calls pm_request_resume() to wake up the device, but
> > that is wrong because the resume is asynchronous, which can't guarantee the
> > device will be ready before .queue_rq() is called.
> 
> That's a longstanding issue and not something that has been introduced by this
> patch series. Hence patch 6/9 that issues a warning if pm_request_resume() would
> be called.

It is introduced by your patch, please see blk_pm_allow_request().

Thanks,
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 8/9] blk-mq: Insert blk_pm_{add,put}_request() calls
  2018-08-03  0:11       ` Ming Lei
@ 2018-08-03  1:03         ` Bart Van Assche
  2018-08-03  1:11           ` Ming Lei
  0 siblings, 1 reply; 23+ messages in thread
From: Bart Van Assche @ 2018-08-03  1:03 UTC (permalink / raw)
  To: ming.lei; +Cc: hch, linux-block, jthumshirn, stern, axboe, jianchao.w.wang

On Fri, 2018-08-03 at 08:11 +0800, Ming Lei wrote:
> On Fri, Aug 03, 2018 at 12:08:54AM +0000, Bart Van Assche wrote:
> > On Fri, 2018-08-03 at 07:53 +0800, Ming Lei wrote:
> > > blk_pm_add_request() calls pm_request_resume() to wake up the device, but
> > > that is wrong because the resume is asynchronous, which can't guarantee the
> > > device will be ready before .queue_rq() is called.
> > 
> > That's a longstanding issue and not something that has been introduced by this
> > patch series. Hence patch 6/9 that issues a warning if pm_request_resume() would
> > be called.
> 
> It is introduced by your patch, please see blk_pm_allow_request().

Did you perhaps misread my patch? Please have a look at commit c8158819d506
("block: implement runtime pm strategy"). I think that commit from 2013
introduced the pm_request_resume() call in blk_pm_add_request().

By the way, if you would like to see that call removed, I'm fine with
removing that call.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 8/9] blk-mq: Insert blk_pm_{add,put}_request() calls
  2018-08-03  1:03         ` Bart Van Assche
@ 2018-08-03  1:11           ` Ming Lei
  0 siblings, 0 replies; 23+ messages in thread
From: Ming Lei @ 2018-08-03  1:11 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, linux-block, jthumshirn, stern, axboe, jianchao.w.wang

On Fri, Aug 03, 2018 at 01:03:36AM +0000, Bart Van Assche wrote:
> On Fri, 2018-08-03 at 08:11 +0800, Ming Lei wrote:
> > On Fri, Aug 03, 2018 at 12:08:54AM +0000, Bart Van Assche wrote:
> > > On Fri, 2018-08-03 at 07:53 +0800, Ming Lei wrote:
> > > > blk_pm_add_request() calls pm_request_resume() to wake up the device, but
> > > > that is wrong because the resume is asynchronous, which can't guarantee the
> > > > device will be ready before .queue_rq() is called.
> > > 
> > > That's a longstanding issue and not something that has been introduced by this
> > > patch series. Hence patch 6/9 that issues a warning if pm_request_resume() would
> > > be called.
> > 
> > It is introduced by your patch, please see blk_pm_allow_request().
> 
> Did you perhaps misread my patch? Please have a look at commit c8158819d506
> ("block: implement runtime pm strategy"). I think that commit from 2013
> introduced the pm_request_resume() call in blk_pm_add_request().
> 
> By the way, if you would like to see that call removed, I'm fine with
> removing that call.

pm_request_resume() isn't used wrongly now; the issue is introduced by
removing blk_pm_allow_request(), which is called in elv_next_request() to
avoid this issue, so that requests can be kept in the scheduler queue if
the device isn't active.
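For reference, the gating that the removed blk_pm_allow_request() implemented (quoted earlier in the thread in the blk-core.c diff) can be transcribed as a small check; this Python rendering is illustrative only:

```python
def blk_pm_allow_request(rpm_status, is_pm_request):
    """Python transcription of the removed C helper: normal requests are
    held back while the device is suspending, suspended, or resuming."""
    if rpm_status in ("RPM_RESUMING", "RPM_SUSPENDING"):
        return is_pm_request      # only RQF_PM requests may proceed
    if rpm_status == "RPM_SUSPENDED":
        return False              # nothing is dispatched at all
    return True                   # RPM_ACTIVE: everything proceeds
```

This is why, before the patch, requests could safely sit in the scheduler queue until the asynchronously requested resume had completed.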

Thanks,
Ming

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 5/9] block, scsi: Rework runtime power management
  2018-08-02 23:50   ` Ming Lei
@ 2018-08-03 16:16     ` Bart Van Assche
  0 siblings, 0 replies; 23+ messages in thread
From: Bart Van Assche @ 2018-08-03 16:16 UTC (permalink / raw)
  To: ming.lei
  Cc: hch, jthumshirn, linux-block, stern, martin.petersen, axboe,
	jianchao.w.wang

On Fri, 2018-08-03 at 07:50 +0800, Ming Lei wrote:
> I am not sure that using PREEMPT_ONLY can work well for runtime PM; there
> are lots of users that depend on blk_freeze_queue() to make sure
> there aren't any in-flight requests.

Please elaborate on "lots of users". The only use of that flag I have found
in the kernel tree that I think is not related to power management is in
ide_prep_sense(). All other uses of the RQF_PREEMPT /
BLK_MQ_REQ_PREEMPT / QUEUE_FLAG_PREEMPT_ONLY flags I found are IMHO related
to power management:
* generic_ide_resume()
* __scsi_execute()
* scsi_device_quiesce() / scsi_device_resume()

However, we may have to modify __scsi_execute() such that only requests
that are related to power management are submitted with the
BLK_MQ_REQ_PREEMPT flag set. I don't think that all __scsi_execute() callers
need that flag.

> But now PREEMPT_ONLY opens a door for preempt requests during that big
> window, so this approach may break current users of blk_freeze_queue().

The power management core serializes runtime power management against suspend,
resume and hibernation, so I think it is safe to use the PREEMPT_ONLY flag for
runtime power management even though it is already used for suspend, resume and
hibernation.

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 4/9] percpu-refcount: Introduce percpu_ref_read()
  2018-08-02 20:04     ` Bart Van Assche
@ 2018-08-06 17:18       ` Tejun Heo
  2018-08-06 17:28         ` Bart Van Assche
  2018-08-07 22:53         ` Bart Van Assche
  0 siblings, 2 replies; 23+ messages in thread
From: Tejun Heo @ 2018-08-06 17:18 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jianchao Wang,
	Ming Lei, Alan Stern, Johannes Thumshirn

Hello, Bart.

On Thu, Aug 02, 2018 at 01:04:43PM -0700, Bart Van Assche wrote:
> As you probably know one of the long term goals for the block layer
> is to switch to blk-mq and to drop the legacy block layer. Hence
> this patch series that adds support for runtime power management to
> blk-mq because today that feature is missing from blk-mq. The
> approach is the same as for the legacy block layer: if the
> autosuspend timer expires and no
> requests are in flight, suspend the block device. So we need a
> mechanism to track whether or not any requests are in flight. One
> possible approach is to check the value of q_usage_counter.
> percpu_ref_is_zero() could only be used to check q_usage_counter if
> that counter would be switched to atomic mode first and if the
> initial reference would be dropped too. I want to avoid the overhead
> of that switch to atomic mode whenever possible. Hence the proposal
> to introduce the percpu_ref_read() function. If the value returned
> by that function is larger than one then we know that requests are
> in flight and hence that the switch to atomic mode can be skipped.
> 
> Proposals for alternative approaches are welcome.

I'm worried that this is too inviting for misuses and subtle problems.
For example, your patch which uses this function uses
synchronize_rcu() to synchronize against per-cpu counts after the
first snoop; however, percpu_ref uses sched rcu, not the regular one,
so depending on the config, this will lead to *really* subtle
failures.  Even if that gets fixed, it's still leaking percpu-ref
internal details to its users - details which may change in the future
and will cause super subtle bugs.

I'd go for something a lot more specific, like percpu_ref_is_one(), so
that all the magics can be contained in percpu-ref implementation
proper.
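The double-read pattern under discussion - a cheap read of a sharded counter, then a switch to atomic mode and a re-read - can be sketched with a userspace stand-in. ToyPercpuRef and pre_runtime_suspend() are hypothetical names; the real percpu-refcount also waits for an RCU grace period when switching modes, which this toy omits:

```python
# Userspace stand-in for the check in blk_pre_runtime_suspend().
import threading

EBUSY = 16

class ToyPercpuRef:
    def __init__(self, nshards=4):
        self.shards = [0] * nshards   # stand-in for per-CPU counters
        self.count = 1                # initial reference, as after init
        self.atomic_mode = False
        self._lock = threading.Lock()

    def get(self, cpu=0):
        if self.atomic_mode:
            with self._lock:
                self.count += 1
        else:
            self.shards[cpu % len(self.shards)] += 1

    def put(self, cpu=0):
        if self.atomic_mode:
            with self._lock:
                self.count -= 1
        else:
            self.shards[cpu % len(self.shards)] -= 1

    def read(self):
        # In percpu mode this sum is only approximate while updaters run,
        # which is why the caller re-reads after switching to atomic mode.
        return self.count + sum(self.shards)

    def switch_to_atomic_sync(self):
        # Fold the shards into the single counter; real code must also
        # wait out concurrent per-CPU updaters first.
        with self._lock:
            self.count += sum(self.shards)
            self.shards = [0] * len(self.shards)
            self.atomic_mode = True

def pre_runtime_suspend(ref):
    """Only pay for the mode switch when the cheap read sees no users."""
    if ref.read() != 1:
        return -EBUSY                 # requests in flight: skip suspend
    ref.switch_to_atomic_sync()
    return 0 if ref.read() == 1 else -EBUSY
```

The fast path avoids the atomic-mode switch entirely whenever requests are in flight, which is the runtime-overhead concern raised against switching on every suspend attempt.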

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 4/9] percpu-refcount: Introduce percpu_ref_read()
  2018-08-06 17:18       ` Tejun Heo
@ 2018-08-06 17:28         ` Bart Van Assche
  2018-08-07 22:53         ` Bart Van Assche
  1 sibling, 0 replies; 23+ messages in thread
From: Bart Van Assche @ 2018-08-06 17:28 UTC (permalink / raw)
  To: tj; +Cc: hch, ming.lei, linux-block, jthumshirn, stern, axboe, jianchao.w.wang

On Mon, 2018-08-06 at 10:18 -0700, Tejun Heo wrote:
> I'm worried that this is too inviting for misuses and subtle problems.
> For example, your patch which uses this function uses
> synchronize_rcu() to synchronize against per-cpu counts after the
> first snoop; however, percpu_ref uses sched rcu, not the regular one,
> so depending on the config, this will lead to *really* subtle
> failures.  Even if that gets fixed, it's still leaking percpu-ref
> internal details to its users - details which may change in the future
> and will cause super subtle bugs.
> 
> I'd go for something a lot more specific, like percpu_ref_is_one(), so
> that all the magics can be contained in percpu-ref implementation
> proper.

Hello Tejun,

I will try to come up with an approach that does not require modifying the
per-cpu counter code.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 4/9] percpu-refcount: Introduce percpu_ref_read()
  2018-08-06 17:18       ` Tejun Heo
  2018-08-06 17:28         ` Bart Van Assche
@ 2018-08-07 22:53         ` Bart Van Assche
  1 sibling, 0 replies; 23+ messages in thread
From: Bart Van Assche @ 2018-08-07 22:53 UTC (permalink / raw)
  To: tj; +Cc: hch, ming.lei, linux-block, jthumshirn, stern, axboe, jianchao.w.wang

On Mon, 2018-08-06 at 10:18 -0700, Tejun Heo wrote:
> I'm worried that this is too inviting for misuses and subtle problems.
> For example, your patch which uses this function uses
> synchronize_rcu() to synchronize against per-cpu counts after the
> first snoop; however, percpu_ref uses sched rcu, not the regular one,
> so depending on the config, this will lead to *really* subtle
> failures.  Even if that gets fixed, it's still leaking percpu-ref
> internal details to its users - details which may change in the future
> and will cause super subtle bugs.
> 
> I'd go for something a lot more specific, like percpu_ref_is_one(), so
> that all the magics can be contained in percpu-ref implementation
> proper.

Hi Tejun,

Can you have a look at the new percpu_ref_is_in_use() function? Please also
note that the synchronize_rcu() call between the two percpu_ref_is_in_use()
calls is not related to the use of RCU in the per-cpu refcount implementation.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-08-07 22:53 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-02 18:29 [PATCH v3 0/9] blk-mq: Enable runtime power management Bart Van Assche
2018-08-02 18:29 ` [PATCH v3 1/9] block: Fix a comment in a header file Bart Van Assche
2018-08-02 18:29 ` [PATCH v3 2/9] block: Move power management functions into new source files Bart Van Assche
2018-08-02 18:29 ` [PATCH v3 3/9] block: Serialize queue freezing and blk_pre_runtime_suspend() Bart Van Assche
2018-08-02 18:29 ` [PATCH v3 4/9] percpu-refcount: Introduce percpu_ref_read() Bart Van Assche
2018-08-02 19:06   ` Tejun Heo
2018-08-02 20:04     ` Bart Van Assche
2018-08-06 17:18       ` Tejun Heo
2018-08-06 17:28         ` Bart Van Assche
2018-08-07 22:53         ` Bart Van Assche
2018-08-02 18:29 ` [PATCH v3 5/9] block, scsi: Rework runtime power management Bart Van Assche
2018-08-02 23:50   ` Ming Lei
2018-08-03 16:16     ` Bart Van Assche
2018-08-02 18:29 ` [PATCH v3 6/9] block: Warn if pm_runtime_get*() has not been called Bart Van Assche
2018-08-02 18:29 ` [PATCH v3 7/9] block: Remove blk_pm_requeue_request() Bart Van Assche
2018-08-02 18:29 ` [PATCH v3 8/9] blk-mq: Insert blk_pm_{add,put}_request() calls Bart Van Assche
2018-08-02 23:53   ` Ming Lei
2018-08-03  0:08     ` Bart Van Assche
2018-08-03  0:11       ` Ming Lei
2018-08-03  1:03         ` Bart Van Assche
2018-08-03  1:11           ` Ming Lei
2018-08-02 18:29 ` [PATCH v3 9/9] blk-mq: Enable support for runtime power management Bart Van Assche
2018-08-02 21:25 ` [PATCH v3 0/9] blk-mq: Enable " Jens Axboe
