All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] Make SCSI device suspend work reliably
@ 2017-09-08 23:52 Bart Van Assche
  2017-09-08 23:52 ` [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait() Bart Van Assche
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Bart Van Assche @ 2017-09-08 23:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche

Hello Jens,

Recently it was reported on the block layer mailing list that suspend
does not work reliably neither for the legacy block layer nor for blk-mq.
The purpose of this patch series is to make device suspend work reliably
without affecting the hot path significantly and without introducing any
race conditions between request queue cleanup and blk_get_request().

Please consider this patch series for kernel v4.15.

Thanks,

Bart.

Bart Van Assche (5):
  percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait()
  scsi: Change the type of the second last argument of scsi_execute()
  block: Introduce REQ_PM and remove RQF_PM
  block: Make SCSI device suspend work reliably
  blk-mq: Implement power management support

 block/blk-core.c                | 89 ++++++++++++++++++++++++++---------------
 block/blk-mq-debugfs.c          |  2 +-
 block/blk-mq.c                  | 34 ++++++++++++++++
 block/blk.h                     | 12 ++++++
 block/elevator.c                |  4 +-
 drivers/scsi/scsi_lib.c         | 15 +++----
 drivers/scsi/sd.c               |  4 +-
 drivers/scsi/ufs/ufshcd.c       | 11 ++---
 include/linux/blk_types.h       |  2 +
 include/linux/blkdev.h          |  3 +-
 include/linux/percpu-refcount.h |  1 +
 include/scsi/scsi_device.h      |  2 +-
 lib/percpu-refcount.c           | 21 ++++++++++
 13 files changed, 147 insertions(+), 53 deletions(-)

-- 
2.14.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait()
  2017-09-08 23:52 [PATCH 0/5] Make SCSI device suspend work reliably Bart Van Assche
@ 2017-09-08 23:52 ` Bart Van Assche
  2017-09-11  8:42   ` Johannes Thumshirn
  2017-09-11 13:13   ` Tejun Heo
  2017-09-08 23:52 ` [PATCH 2/5] scsi: Change the type of the second last argument of scsi_execute() Bart Van Assche
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: Bart Van Assche @ 2017-09-08 23:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Tejun Heo,
	Christoph Lameter, NeilBrown

The blk-mq core keeps track of the number of request queue users
through q->q_usage_count. Make it possible to switch this counter
to atomic mode from the context of the block layer power management
code by introducing percpu_ref_switch_to_atomic_nowait().

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: NeilBrown <neilb@suse.com>
---
 include/linux/percpu-refcount.h |  1 +
 lib/percpu-refcount.c           | 21 +++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index c13dceb87b60..0d4bfbb392d7 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -100,6 +100,7 @@ void percpu_ref_exit(struct percpu_ref *ref);
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_switch);
 void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref);
+void percpu_ref_switch_to_atomic_nowait(struct percpu_ref *ref);
 void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_kill);
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index fe03c6d52761..cf9152ff0892 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -277,6 +277,27 @@ void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref)
 }
 EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync);
 
+/**
+ * percpu_ref_switch_to_nowait - switch a percpu_ref to atomic mode
+ * @ref: percpu_ref to switch to atomic mode
+ *
+ * Schedule switching of @ref to atomic mode. All its percpu counts will be
+ * collected to the main atomic counter. @ref will stay in atomic mode across
+ * kill/reinit cycles until percpu_ref_switch_to_percpu() is called.
+ *
+ * This function does not block and can be called from any context.
+ */
+void percpu_ref_switch_to_atomic_nowait(struct percpu_ref *ref)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+	if (!ref->confirm_switch)
+		__percpu_ref_switch_to_atomic(ref, NULL);
+	spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+}
+EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_nowait);
+
 /**
  * percpu_ref_switch_to_percpu - switch a percpu_ref to percpu mode
  * @ref: percpu_ref to switch to percpu mode
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/5] scsi: Change the type of the second last argument of scsi_execute()
  2017-09-08 23:52 [PATCH 0/5] Make SCSI device suspend work reliably Bart Van Assche
  2017-09-08 23:52 ` [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait() Bart Van Assche
@ 2017-09-08 23:52 ` Bart Van Assche
  2017-09-08 23:52 ` [PATCH 3/5] block: Introduce REQ_PM and remove RQF_PM Bart Van Assche
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Bart Van Assche @ 2017-09-08 23:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche,
	Martin K . Petersen, Hannes Reinecke, Johannes Thumshirn,
	Rafael J . Wysocki, Ming Lei

The second last argument of scsi_execute() is either 0 or RQF_PM.
Change the type of that argument from req_flags_t into bool and
update the callers that pass RQF_PM. This patch does not change
any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c    |  8 ++++----
 drivers/scsi/sd.c          |  4 ++--
 drivers/scsi/ufs/ufshcd.c  | 11 ++++++-----
 include/scsi/scsi_device.h |  2 +-
 4 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..5be515c8c6bd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -227,8 +227,8 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
  * @sshdr:	optional decoded sense header
  * @timeout:	request timeout in seconds
  * @retries:	number of times to retry request
- * @flags:	flags for ->cmd_flags
- * @rq_flags:	flags for ->rq_flags
+ * @flags:	REQ_* flags for ->cmd_flags
+ * @is_pm_req:	whether or not this is a power management request
  * @resid:	optional residual length
  *
  * Returns the scsi_cmnd result field if a command was executed, or a negative
@@ -237,7 +237,7 @@ void scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
 int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 		 int data_direction, void *buffer, unsigned bufflen,
 		 unsigned char *sense, struct scsi_sense_hdr *sshdr,
-		 int timeout, int retries, u64 flags, req_flags_t rq_flags,
+		 int timeout, int retries, u64 flags, bool is_pm_req,
 		 int *resid)
 {
 	struct request *req;
@@ -260,7 +260,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	rq->retries = retries;
 	req->timeout = timeout;
 	req->cmd_flags |= flags;
-	req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
+	req->rq_flags |= (is_pm_req ? RQF_PM : 0) | RQF_QUIET | RQF_PREEMPT;
 
 	/*
 	 * head injection *required* here otherwise quiesce won't work
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index e2647f2d4430..ad22b16748c6 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1620,7 +1620,7 @@ static int sd_sync_cache(struct scsi_disk *sdkp, struct scsi_sense_hdr *sshdr)
 		 * flush everything.
 		 */
 		res = scsi_execute(sdp, cmd, DMA_NONE, NULL, 0, NULL, sshdr,
-				timeout, SD_MAX_RETRIES, 0, RQF_PM, NULL);
+				timeout, SD_MAX_RETRIES, 0, true, NULL);
 		if (res == 0)
 			break;
 	}
@@ -3469,7 +3469,7 @@ static int sd_start_stop_device(struct scsi_disk *sdkp, int start)
 		return -ENODEV;
 
 	res = scsi_execute(sdp, cmd, DMA_NONE, NULL, 0, NULL, &sshdr,
-			SD_TIMEOUT, SD_MAX_RETRIES, 0, RQF_PM, NULL);
+			SD_TIMEOUT, SD_MAX_RETRIES, 0, true, NULL);
 	if (res) {
 		sd_print_result(sdkp, "Start/Stop Unit failed", res);
 		if (driver_byte(res) & DRIVER_SENSE)
diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 5bc9dc14e075..7798acd91331 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -6976,7 +6976,7 @@ ufshcd_send_request_sense(struct ufs_hba *hba, struct scsi_device *sdp)
 
 	ret = scsi_execute(sdp, cmd, DMA_FROM_DEVICE, buffer,
 			UFSHCD_REQ_SENSE_SIZE, NULL, NULL,
-			msecs_to_jiffies(1000), 3, 0, RQF_PM, NULL);
+			msecs_to_jiffies(1000), 3, 0, true, NULL);
 	if (ret)
 		pr_err("%s: failed with err %d\n", __func__, ret);
 
@@ -7037,12 +7037,13 @@ static int ufshcd_set_dev_pwr_mode(struct ufs_hba *hba,
 	cmd[4] = pwr_mode << 4;
 
 	/*
-	 * Current function would be generally called from the power management
-	 * callbacks hence set the RQF_PM flag so that it doesn't resume the
-	 * already suspended childs.
+	 * This function is typically called from a power management
+	 * callback. Hence submit the START request as a power management
+	 * request too so that it doesn't resume the already suspended
+	 * children.
 	 */
 	ret = scsi_execute(sdp, cmd, DMA_NONE, NULL, 0, NULL, &sshdr,
-			START_STOP_TIMEOUT, 0, 0, RQF_PM, NULL);
+			START_STOP_TIMEOUT, 0, 0, true, NULL);
 	if (ret) {
 		sdev_printk(KERN_WARNING, sdp,
 			    "START_STOP failed for power mode: %d, result %x\n",
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 0979a5f3b69a..b23e771cd305 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -414,7 +414,7 @@ extern int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 			int data_direction, void *buffer, unsigned bufflen,
 			unsigned char *sense, struct scsi_sense_hdr *sshdr,
 			int timeout, int retries, u64 flags,
-			req_flags_t rq_flags, int *resid);
+			bool is_pm_req, int *resid);
 static inline int scsi_execute_req(struct scsi_device *sdev,
 	const unsigned char *cmd, int data_direction, void *buffer,
 	unsigned bufflen, struct scsi_sense_hdr *sshdr, int timeout,
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/5] block: Introduce REQ_PM and remove RQF_PM
  2017-09-08 23:52 [PATCH 0/5] Make SCSI device suspend work reliably Bart Van Assche
  2017-09-08 23:52 ` [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait() Bart Van Assche
  2017-09-08 23:52 ` [PATCH 2/5] scsi: Change the type of the second last argument of scsi_execute() Bart Van Assche
@ 2017-09-08 23:52 ` Bart Van Assche
  2017-09-08 23:52 ` [PATCH 4/5] block: Make SCSI device suspend work reliably Bart Van Assche
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Bart Van Assche @ 2017-09-08 23:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Hannes Reinecke,
	Johannes Thumshirn, Rafael J . Wysocki, Ming Lei

Tell the block layer at request allocation time whether or not a
request will be used to change the power management state of a
block device.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c          | 11 +++++++++--
 block/blk-mq-debugfs.c    |  2 +-
 block/elevator.c          |  4 ++--
 drivers/scsi/scsi_lib.c   |  9 +++++----
 include/linux/blk_types.h |  2 ++
 include/linux/blkdev.h    |  2 --
 6 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d709c0e3a2ac..bb53c6b58e8c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1412,6 +1412,13 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	return rq;
 }
 
+/**
+ * blk_get_request() - allocate a request
+ * @q: request queue.
+ * @op: bitwise or of operation (REQ_OP_*; see also enum req_opf) and flags
+ *      (REQ_*; see also enum req_flag_bits).
+ * @gfp_mask: request memory allocation flags.
+ */
 struct request *blk_get_request(struct request_queue *q, unsigned int op,
 				gfp_t gfp_mask)
 {
@@ -1529,7 +1536,7 @@ EXPORT_SYMBOL_GPL(part_round_stats);
 #ifdef CONFIG_PM
 static void blk_pm_put_request(struct request *rq)
 {
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM) && !--rq->q->nr_pending)
+	if (rq->q->dev && !(rq->cmd_flags & REQ_PM) && !--rq->q->nr_pending)
 		pm_runtime_mark_last_busy(rq->q->dev);
 }
 #else
@@ -2460,7 +2467,7 @@ static struct request *blk_pm_peek_request(struct request_queue *q,
 					   struct request *rq)
 {
 	if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
-	    (q->rpm_status != RPM_ACTIVE && !(rq->rq_flags & RQF_PM))))
+	    (q->rpm_status != RPM_ACTIVE && !(rq->cmd_flags & REQ_PM))))
 		return NULL;
 	else
 		return rq;
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 980e73095643..ea21124b190b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -266,6 +266,7 @@ static const char *const cmd_flag_name[] = {
 	CMD_FLAG_NAME(BACKGROUND),
 	CMD_FLAG_NAME(NOUNMAP),
 	CMD_FLAG_NAME(NOWAIT),
+	CMD_FLAG_NAME(PM),
 };
 #undef CMD_FLAG_NAME
 
@@ -286,7 +287,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(ELVPRIV),
 	RQF_NAME(IO_STAT),
 	RQF_NAME(ALLOCED),
-	RQF_NAME(PM),
 	RQF_NAME(HASHED),
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
diff --git a/block/elevator.c b/block/elevator.c
index 153926a90901..7898960bed2a 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -574,13 +574,13 @@ void elv_bio_merged(struct request_queue *q, struct request *rq,
 #ifdef CONFIG_PM
 static void blk_pm_requeue_request(struct request *rq)
 {
-	if (rq->q->dev && !(rq->rq_flags & RQF_PM))
+	if (rq->q->dev && !(rq->cmd_flags & REQ_PM))
 		rq->q->nr_pending--;
 }
 
 static void blk_pm_add_request(struct request_queue *q, struct request *rq)
 {
-	if (q->dev && !(rq->rq_flags & RQF_PM) && q->nr_pending++ == 0 &&
+	if (q->dev && !(rq->cmd_flags & REQ_PM) && q->nr_pending++ == 0 &&
 	    (q->rpm_status == RPM_SUSPENDED || q->rpm_status == RPM_SUSPENDING))
 		pm_request_resume(q->dev);
 }
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5be515c8c6bd..64ad70e8447c 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -242,11 +242,12 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 {
 	struct request *req;
 	struct scsi_request *rq;
+	unsigned int op_and_flags;
 	int ret = DRIVER_ERROR << 24;
 
-	req = blk_get_request(sdev->request_queue,
-			data_direction == DMA_TO_DEVICE ?
-			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+	op_and_flags = (data_direction == DMA_TO_DEVICE ? REQ_OP_SCSI_OUT :
+			REQ_OP_SCSI_IN) | (is_pm_req ? REQ_PM : 0);
+	req = blk_get_request(sdev->request_queue, op_and_flags, __GFP_RECLAIM);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
@@ -260,7 +261,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	rq->retries = retries;
 	req->timeout = timeout;
 	req->cmd_flags |= flags;
-	req->rq_flags |= (is_pm_req ? RQF_PM : 0) | RQF_QUIET | RQF_PREEMPT;
+	req->rq_flags |= RQF_QUIET | RQF_PREEMPT;
 
 	/*
 	 * head injection *required* here otherwise quiesce won't work
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index a2d2aa709cef..fdf4685a5856 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -229,6 +229,7 @@ enum req_flag_bits {
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
 
 	__REQ_NOWAIT,           /* Don't wait if request will block */
+	__REQ_PM,		/* Power management request */
 	__REQ_NR_BITS,		/* stops here */
 };
 
@@ -248,6 +249,7 @@ enum req_flag_bits {
 
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
 #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
+#define REQ_PM			(1ULL << __REQ_PM)
 
 #define REQ_FAILFAST_MASK \
 	(REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT | REQ_FAILFAST_DRIVER)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 460294bb0fa5..7f9a0743fc09 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -111,8 +111,6 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_IO_STAT		((__force req_flags_t)(1 << 13))
 /* request came from our alloc pool */
 #define RQF_ALLOCED		((__force req_flags_t)(1 << 14))
-/* runtime pm request */
-#define RQF_PM			((__force req_flags_t)(1 << 15))
 /* on IO scheduler merge hash */
 #define RQF_HASHED		((__force req_flags_t)(1 << 16))
 /* IO stats tracking on */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/5] block: Make SCSI device suspend work reliably
  2017-09-08 23:52 [PATCH 0/5] Make SCSI device suspend work reliably Bart Van Assche
                   ` (2 preceding siblings ...)
  2017-09-08 23:52 ` [PATCH 3/5] block: Introduce REQ_PM and remove RQF_PM Bart Van Assche
@ 2017-09-08 23:52 ` Bart Van Assche
  2017-09-12  2:29   ` Ming Lei
  2017-09-12 16:25   ` Ming Lei
  2017-09-08 23:52 ` [PATCH 5/5] blk-mq: Implement power management support Bart Van Assche
  2017-09-09 10:39 ` [PATCH 0/5] Make SCSI device suspend work reliably Ming Lei
  5 siblings, 2 replies; 22+ messages in thread
From: Bart Van Assche @ 2017-09-08 23:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Hannes Reinecke,
	Johannes Thumshirn, Rafael J . Wysocki, Ming Lei

Instead of allowing request allocation to succeed for suspended
request queues and only to process power management requests, make
blk_get_request() wait until the request queue is resumed for
requests that are not power management requests.

This patch avoids that resume does not occur if the maximum queue
depth is reached when a power management request is submitted.

Note: this patch affects the behavior of scsi_device_quiesce() only
if that function is called from inside a power management callback.
This patch does not affect the behavior of scsi_device_quiesce()
when a call of that function is triggered by writing "quiesce" into
/sys/class/scsi_device/*/device/state.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       | 60 +++++++++++++++++++++++++++-----------------------
 block/blk.h            | 12 ++++++++++
 include/linux/blkdev.h |  1 +
 3 files changed, 45 insertions(+), 28 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bb53c6b58e8c..cd2700c763ed 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1325,6 +1325,24 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
 	return ERR_PTR(-ENOMEM);
 }
 
+#ifdef CONFIG_PM
+static bool blk_wait_until_active(struct request_queue *q, bool wait)
+	__releases(q->queue_lock)
+	__acquires(q->queue_lock)
+{
+	if (wait)
+		wait_event_lock_irq(q->rpm_active_wq,
+				    q->rpm_status == RPM_ACTIVE,
+				    *q->queue_lock);
+	return q->rpm_status == RPM_ACTIVE;
+}
+#else
+static bool blk_wait_until_active(struct request_queue *q, bool wait)
+{
+	return true;
+}
+#endif
+
 /**
  * get_request - get a free request
  * @q: request_queue to allocate request from
@@ -1350,6 +1368,16 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 	lockdep_assert_held(q->queue_lock);
 	WARN_ON_ONCE(q->mq_ops);
 
+	WARN_ON_ONCE((op & REQ_PM) && blk_pm_suspended(q));
+
+	/*
+	 * Wait if the request queue is suspended or in the process of
+	 * suspending/resuming and the request being allocated will not be
+	 * used for power management purposes.
+	 */
+	if (!(op & REQ_PM) && !blk_wait_until_active(q, !(op & REQ_NOWAIT)))
+		return ERR_PTR(-EAGAIN);
+
 	rl = blk_get_rl(q, bio);	/* transferred to @rq on success */
 retry:
 	rq = __get_request(rl, op, bio, gfp_mask);
@@ -2458,28 +2486,6 @@ void blk_account_io_done(struct request *req)
 	}
 }
 
-#ifdef CONFIG_PM
-/*
- * Don't process normal requests when queue is suspended
- * or in the process of suspending/resuming
- */
-static struct request *blk_pm_peek_request(struct request_queue *q,
-					   struct request *rq)
-{
-	if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
-	    (q->rpm_status != RPM_ACTIVE && !(rq->cmd_flags & REQ_PM))))
-		return NULL;
-	else
-		return rq;
-}
-#else
-static inline struct request *blk_pm_peek_request(struct request_queue *q,
-						  struct request *rq)
-{
-	return rq;
-}
-#endif
-
 void blk_account_io_start(struct request *rq, bool new_io)
 {
 	struct hd_struct *part;
@@ -2538,11 +2544,6 @@ struct request *blk_peek_request(struct request_queue *q)
 	WARN_ON_ONCE(q->mq_ops);
 
 	while ((rq = __elv_next_request(q)) != NULL) {
-
-		rq = blk_pm_peek_request(q, rq);
-		if (!rq)
-			break;
-
 		if (!(rq->rq_flags & RQF_STARTED)) {
 			/*
 			 * This is the first time the device driver
@@ -3443,6 +3444,7 @@ void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 
 	q->dev = dev;
 	q->rpm_status = RPM_ACTIVE;
+	init_waitqueue_head(&q->rpm_active_wq);
 	pm_runtime_set_autosuspend_delay(q->dev, -1);
 	pm_runtime_use_autosuspend(q->dev);
 }
@@ -3512,6 +3514,7 @@ void blk_post_runtime_suspend(struct request_queue *q, int err)
 	} else {
 		q->rpm_status = RPM_ACTIVE;
 		pm_runtime_mark_last_busy(q->dev);
+		wake_up_all(&q->rpm_active_wq);
 	}
 	spin_unlock_irq(q->queue_lock);
 }
@@ -3561,8 +3564,8 @@ void blk_post_runtime_resume(struct request_queue *q, int err)
 	spin_lock_irq(q->queue_lock);
 	if (!err) {
 		q->rpm_status = RPM_ACTIVE;
-		__blk_run_queue(q);
 		pm_runtime_mark_last_busy(q->dev);
+		wake_up_all(&q->rpm_active_wq);
 		pm_request_autosuspend(q->dev);
 	} else {
 		q->rpm_status = RPM_SUSPENDED;
@@ -3590,6 +3593,7 @@ void blk_set_runtime_active(struct request_queue *q)
 	spin_lock_irq(q->queue_lock);
 	q->rpm_status = RPM_ACTIVE;
 	pm_runtime_mark_last_busy(q->dev);
+	wake_up_all(&q->rpm_active_wq);
 	pm_request_autosuspend(q->dev);
 	spin_unlock_irq(q->queue_lock);
 }
diff --git a/block/blk.h b/block/blk.h
index fcb9775b997d..f535ece723ab 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -361,4 +361,16 @@ static inline void blk_queue_bounce(struct request_queue *q, struct bio **bio)
 }
 #endif /* CONFIG_BOUNCE */
 
+#ifdef CONFIG_PM
+static inline bool blk_pm_suspended(struct request_queue *q)
+{
+	return q->rpm_status == RPM_SUSPENDED;
+}
+#else
+static inline bool blk_pm_suspended(struct request_queue *q)
+{
+	return false;
+}
+#endif
+
 #endif /* BLK_INTERNAL_H */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 7f9a0743fc09..08a709c0971b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -493,6 +493,7 @@ struct request_queue {
 #ifdef CONFIG_PM
 	struct device		*dev;
 	int			rpm_status;
+	struct wait_queue_head	rpm_active_wq;
 	unsigned int		nr_pending;
 #endif
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/5] blk-mq: Implement power management support
  2017-09-08 23:52 [PATCH 0/5] Make SCSI device suspend work reliably Bart Van Assche
                   ` (3 preceding siblings ...)
  2017-09-08 23:52 ` [PATCH 4/5] block: Make SCSI device suspend work reliably Bart Van Assche
@ 2017-09-08 23:52 ` Bart Van Assche
  2017-09-12  9:14   ` Ming Lei
  2017-09-09 10:39 ` [PATCH 0/5] Make SCSI device suspend work reliably Ming Lei
  5 siblings, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2017-09-08 23:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Bart Van Assche, Hannes Reinecke,
	Johannes Thumshirn, Rafael J . Wysocki, Ming Lei

Implement the following approach for blk-mq:
- Either make blk_get_request() wait or make it fail when a
  request queue is not in status RPM_ACTIVE.
- While suspending, suspended or resuming, only process power
  management requests (REQ_PM).

Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 20 ++++++++++++++++----
 block/blk-mq.c   | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cd2700c763ed..49a4cd5b255e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3438,10 +3438,6 @@ EXPORT_SYMBOL(blk_finish_plug);
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-	/* not support for RQF_PM and ->rpm_status in blk-mq yet */
-	if (q->mq_ops)
-		return;
-
 	q->dev = dev;
 	q->rpm_status = RPM_ACTIVE;
 	init_waitqueue_head(&q->rpm_active_wq);
@@ -3478,6 +3474,19 @@ int blk_pre_runtime_suspend(struct request_queue *q)
 	if (!q->dev)
 		return ret;
 
+	if (q->mq_ops) {
+		percpu_ref_switch_to_atomic_nowait(&q->q_usage_counter);
+		if (!percpu_ref_is_zero(&q->q_usage_counter)) {
+			ret = -EBUSY;
+			pm_runtime_mark_last_busy(q->dev);
+		} else {
+			spin_lock_irq(q->queue_lock);
+			q->rpm_status = RPM_SUSPENDING;
+			spin_unlock_irq(q->queue_lock);
+		}
+		return ret;
+	}
+
 	spin_lock_irq(q->queue_lock);
 	if (q->nr_pending) {
 		ret = -EBUSY;
@@ -3561,6 +3570,9 @@ void blk_post_runtime_resume(struct request_queue *q, int err)
 	if (!q->dev)
 		return;
 
+	if (q->mq_ops)
+		percpu_ref_switch_to_percpu(&q->q_usage_counter);
+
 	spin_lock_irq(q->queue_lock);
 	if (!err) {
 		q->rpm_status = RPM_ACTIVE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f18cff80050..cbd680dc194a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -383,6 +383,29 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 	return rq;
 }
 
+#ifdef CONFIG_PM
+static bool blk_mq_wait_until_active(struct request_queue *q, bool wait)
+{
+	if (!wait)
+		return false;
+	/*
+	 * Note: the q->rpm_status check below races against the changes of
+	 * that variable by the blk_{pre,post}_runtime_{suspend,resume}()
+	 * functions. The worst possible consequence of these races is that a
+	 * small number of requests gets passed to the block driver associated
+	 * with the request queue after rpm_status has been changed into
+	 * RPM_SUSPENDING and before it is changed into RPM_SUSPENDED.
+	 */
+	wait_event(q->rpm_active_wq, q->rpm_status == RPM_ACTIVE);
+	return true;
+}
+#else
+static bool blk_mq_wait_until_active(struct request_queue *q, bool nowait)
+{
+	return true;
+}
+#endif
+
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 		unsigned int flags)
 {
@@ -390,6 +413,17 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 	struct request *rq;
 	int ret;
 
+	WARN_ON_ONCE((op & REQ_PM) && blk_pm_suspended(q));
+
+	/*
+	 * Wait if the request queue is suspended or in the process of
+	 * suspending/resuming and the request being allocated will not be
+	 * used for power management purposes.
+	 */
+	if (!(op & REQ_PM) &&
+	    !blk_mq_wait_until_active(q, !(op & REQ_NOWAIT)))
+		return ERR_PTR(-EAGAIN);
+
 	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
 	if (ret)
 		return ERR_PTR(ret);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] Make SCSI device suspend work reliably
  2017-09-08 23:52 [PATCH 0/5] Make SCSI device suspend work reliably Bart Van Assche
                   ` (4 preceding siblings ...)
  2017-09-08 23:52 ` [PATCH 5/5] blk-mq: Implement power management support Bart Van Assche
@ 2017-09-09 10:39 ` Ming Lei
  2017-09-11 16:25   ` Bart Van Assche
  5 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2017-09-09 10:39 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Jens Axboe, linux-block, Christoph Hellwig

Bart,

On Fri, Sep 08, 2017 at 04:52:21PM -0700, Bart Van Assche wrote:
> Hello Jens,
> 
> Recently it was reported on the block layer mailing list that suspend
> does not work reliably neither for the legacy block layer nor for blk-mq.

What is the issue? Why is it not reliably? Please describe it clearly.

> The purpose of this patch series is to make device suspend work reliably
> without affecting the hot path significantly and without introducing any
> race conditions between request queue cleanup and blk_get_request().

If you mean the approach in my patchset, please say it clearly.
I replied you already, but I am happy to reply you again.

Looks you do not understand the root cause behind I/O hang
during suspend/resume reported by Oleksandr.

Let me explain it again:

- the issue is not suspend/resume only, it is about SCSI quiesce vs.
RQF_PREEMPT.

- when SCSI device is put into quiesce, only RQF_PREEMPT request is
allowed to dispatch to lld, and other requests can't be dispatched
successfully.

- so if requests pool are used up during SCSI quiesce, no request can
be allocated for RQF_PREEMPT and these requests can't be freed too,
so I/O hang is caused because the pool is often very limited, and
easy to be consumed up

Except for the suspend I/O hang reported by Oleksandr, we also have
other I/O hang related with SCSI quiesce, both belongs to the
same kind of issue.

I don't see the issue above addressed by this patchset, so maybe
you are trying to fix another PM specific issue(not reported by
Oleksandr), I am just confused.

So again, please describe the issue to be addressed clearly first!

-- 
Ming

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait()
  2017-09-08 23:52 ` [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait() Bart Van Assche
@ 2017-09-11  8:42   ` Johannes Thumshirn
  2017-09-11 16:10     ` Bart Van Assche
  2017-09-11 13:13   ` Tejun Heo
  1 sibling, 1 reply; 22+ messages in thread
From: Johannes Thumshirn @ 2017-09-11  8:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Tejun Heo,
	Christoph Lameter, NeilBrown

On Fri, Sep 08, 2017 at 04:52:22PM -0700, Bart Van Assche wrote:
> +/**
> + * percpu_ref_switch_to_nowait - switch a percpu_ref to atomic mode

This should probably read percpu_ref_switch_to_atomic_nowait, shouldn't it?

> +void percpu_ref_switch_to_atomic_nowait(struct percpu_ref *ref)

-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N�rnberg
GF: Felix Imend�rffer, Jane Smithard, Graham Norton
HRB 21284 (AG N�rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait()
  2017-09-08 23:52 ` [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait() Bart Van Assche
  2017-09-11  8:42   ` Johannes Thumshirn
@ 2017-09-11 13:13   ` Tejun Heo
  2017-09-11 16:09     ` Bart Van Assche
  1 sibling, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2017-09-11 13:13 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Christoph Lameter, NeilBrown

On Fri, Sep 08, 2017 at 04:52:22PM -0700, Bart Van Assche wrote:
> The blk-mq core keeps track of the number of request queue users
> through q->q_usage_count. Make it possible to switch this counter
> to atomic mode from the context of the block layer power management
> code by introducing percpu_ref_switch_to_atomic_nowait().

Do you ever have to switch back?  If so, how do you know whether the
previous transition finished?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait()
  2017-09-11 13:13   ` Tejun Heo
@ 2017-09-11 16:09     ` Bart Van Assche
  2017-09-11 16:37       ` tj
  0 siblings, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2017-09-11 16:09 UTC (permalink / raw)
  To: tj; +Cc: hch, neilb, linux-block, cl, axboe

T24gTW9uLCAyMDE3LTA5LTExIGF0IDA2OjEzIC0wNzAwLCBUZWp1biBIZW8gd3JvdGU6DQo+IE9u
IEZyaSwgU2VwIDA4LCAyMDE3IGF0IDA0OjUyOjIyUE0gLTA3MDAsIEJhcnQgVmFuIEFzc2NoZSB3
cm90ZToNCj4gPiBUaGUgYmxrLW1xIGNvcmUga2VlcHMgdHJhY2sgb2YgdGhlIG51bWJlciBvZiBy
ZXF1ZXN0IHF1ZXVlIHVzZXJzDQo+ID4gdGhyb3VnaCBxLT5xX3VzYWdlX2NvdW50LiBNYWtlIGl0
IHBvc3NpYmxlIHRvIHN3aXRjaCB0aGlzIGNvdW50ZXINCj4gPiB0byBhdG9taWMgbW9kZSBmcm9t
IHRoZSBjb250ZXh0IG9mIHRoZSBibG9jayBsYXllciBwb3dlciBtYW5hZ2VtZW50DQo+ID4gY29k
ZSBieSBpbnRyb2R1Y2luZyBwZXJjcHVfcmVmX3N3aXRjaF90b19hdG9taWNfbm93YWl0KCkuDQo+
IA0KPiBEbyB5b3UgZXZlciBoYXZlIHRvIHN3aXRjaCBiYWNrPyAgSWYgc28sIGhvdyBkbyB5b3Ug
a25vdyB3aGV0aGVyIHRoZQ0KPiBwcmV2aW91cyB0cmFuc2l0aW9uIGZpbmlzaGVkPw0KDQpIZWxs
byBUZWp1biwNCg0KVGhlIHdheSBJIHdvdWxkIGxpa2UgdG8gdXNlIHRoaXMgZnVuY3Rpb24gaXMg
dG8gY2hlY2sgZm9yIGNvbXBsZXRpb24gb2YgdGhlDQp0cmFuc2l0aW9uIGJ5IGNhbGxpbmcgcGVy
Y3B1X3JlZl9pc196ZXJvKCkuIFRoYXQgZnVuY3Rpb24gbmFtZWx5IG5vdCBvbmx5DQpjaGVja3Mg
dGhlIHZhbHVlIG9mIHRoZSByZWZjb3VudCBidXQgYWxzbyB3aGV0aGVyIGl0IGlzIGluIGF0b21p
YyBtb2RlLiBTZWUNCmFsc28gIltQQVRDSCA1LzVdIGJsay1tcTogSW1wbGVtZW50IHBvd2VyIG1h
bmFnZW1lbnQgc3VwcG9ydCINCihodHRwczovL3d3dy5zcGluaWNzLm5ldC9saXN0cy9saW51eC1i
bG9jay9tc2cxNzE0My5odG1sKS4gRG8geW91IHRoaW5rIHRoaXMNCndpbGwgd29yaz8NCg0KVGhh
bmtzLA0KDQpCYXJ0Lg==

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait()
  2017-09-11  8:42   ` Johannes Thumshirn
@ 2017-09-11 16:10     ` Bart Van Assche
  0 siblings, 0 replies; 22+ messages in thread
From: Bart Van Assche @ 2017-09-11 16:10 UTC (permalink / raw)
  To: jthumshirn; +Cc: hch, tj, linux-block, cl, neilb, axboe

T24gTW9uLCAyMDE3LTA5LTExIGF0IDEwOjQyICswMjAwLCBKb2hhbm5lcyBUaHVtc2hpcm4gd3Jv
dGU6DQo+IE9uIEZyaSwgU2VwIDA4LCAyMDE3IGF0IDA0OjUyOjIyUE0gLTA3MDAsIEJhcnQgVmFu
IEFzc2NoZSB3cm90ZToNCj4gPiArLyoqDQo+ID4gKyAqIHBlcmNwdV9yZWZfc3dpdGNoX3RvX25v
d2FpdCAtIHN3aXRjaCBhIHBlcmNwdV9yZWYgdG8gYXRvbWljIG1vZGUNCj4gDQo+IFRoaXMgc2hv
dWxkIHByb2JhYmx5IHJlYWQgcGVyY3B1X3JlZl9zd2l0Y2hfdG9fYXRvbWljX25vd2FpdCwgc2hv
dWxkbid0IGl0Pw0KDQpJbmRlZWQuIFRoYW5rcyBmb3IgaGF2aW5nIGNhdWdodCBhbmQgcmVwb3J0
ZWQgdGhpcy4gSSB3aWxsIGFkZHJlc3MgdGhpcw0Kd2hlbiBJIHJlc3VibWl0IHRoaXMgcGF0Y2gg
c2VyaWVzLg0KDQpCYXJ0Lg==

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] Make SCSI device suspend work reliably
  2017-09-09 10:39 ` [PATCH 0/5] Make SCSI device suspend work reliably Ming Lei
@ 2017-09-11 16:25   ` Bart Van Assche
  2017-09-12  2:17     ` Ming Lei
  0 siblings, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2017-09-11 16:25 UTC (permalink / raw)
  To: ming.lei; +Cc: hch, linux-block, axboe

T24gU2F0LCAyMDE3LTA5LTA5IGF0IDE4OjM5ICswODAwLCBNaW5nIExlaSB3cm90ZToNCj4gLSB0
aGUgaXNzdWUgaXMgbm90IHN1c3BlbmQvcmVzdW1lIG9ubHksIGl0IGlzIGFib3V0IFNDU0kgcXVp
ZXNjZSB2cy4NCj4gUlFGX1BSRUVNUFQuDQoNClRoZSBvbmx5IGluLWtlcm5lbCB1c2VyIG9mIHRo
ZSBTQ1NJIHF1aWVzY2UvcmVzdW1lIGZ1bmN0aW9uYWxpdHkgbmV4dCB0bw0KcnVudGltZSBwb3dl
ciBtYW5hZ2VtZW50IEkga25vdyBvZiBpcyB0aGUgU1BJIHRyYW5zcG9ydCwgbmFtZWx5DQpzcGlf
ZHZfZGV2aWNlKCkuIE5vYm9keSBoYXMgZXZlciByZXBvcnRlZCBhIGhhbmcgaW4gc3BpX2R2X2Rl
dmljZSgpIGFzIGZhcg0KYXMgSSBrbm93LiBTaW5jZSB0aGUgbnVtYmVyIG9mIHVzZXJzIG9mIHRo
ZSBTQ1NJIHBhcmFsbGVsIHRyYW5zcG9ydCBpcyB2ZXJ5DQpzbWFsbCBJIHByb3Bvc2UgdG8gd2Fp
dCB3aXRoIGZpeGluZyBzcGlfZHZfZGV2aWNlKCkgdW50aWwgc29tZW9uZSBzaG93cyB0aGF0DQpp
dCBpcyBwb3NzaWJsZSB0byB0cmlnZ2VyIHRoYXQgaGFuZy4gSWYgd2UgaGF2ZSB0byBhZGRyZXNz
IGEgaGFuZyBpbg0Kc3BpX2R2X2RldmljZSgpIEkgcHJvcG9zZSB0byBhZGRyZXNzIGl0IGluIHRo
ZSBzYW1lIHdheSBhcyBteSBwYXRjaCBzZXJpZXMNCmFkZHJlc3NlcyBwb3dlciBtYW5hZ2VtZW50
IHJlcXVlc3RzLCBuYW1lbHkgYnkgbGV0dGluZyBwcmV2aW91c2x5IHN1Ym1pdHRlZA0KcmVxdWVz
dHMgY29tcGxldGUgYW5kIGJ5IG1ha2luZyBibGtfZ2V0X3JlcXVlc3QoKSBjYWxsZXJzIHdhaXQg
dW50aWwgdGhlDQpxdWllc2NlZCBTQ1NJIGRldmljZSBoYXMgYmVlbiByZXN1bWVkLg0KDQo+IEkg
ZG9uJ3Qgc2VlIHRoZSBpc3N1ZSBhYm92ZSBhZGRyZXNzZWQgYnkgdGhpcyBwYXRjaHNldCwgc28g
bWF5YmUNCj4geW91IGFyZSB0cnlpbmcgdG8gZml4IGFub3RoZXIgUE0gc3BlY2lmaWMgaXNzdWUo
bm90IHJlcG9ydGVkIGJ5DQo+IE9sZWtzYW5kciksIEkgYW0ganVzdCBjb25mdXNlZC4NCg0KVGhh
dCdzIGNvcnJlY3QgLSBJJ20gb25seSB0cnlpbmcgdG8gYWRkcmVzcyB0aGUgUE0gaXNzdWVzLiBJ
IHdpbGwgbWVudGlvbg0KdGhhdCBleHBsaWNpdGx5IHdoZW4gSSByZXBvc3QgdGhpcyBwYXRjaCBz
ZXJpZXMuDQoNCkJhcnQu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait()
  2017-09-11 16:09     ` Bart Van Assche
@ 2017-09-11 16:37       ` tj
  2017-09-11 16:55         ` Bart Van Assche
  0 siblings, 1 reply; 22+ messages in thread
From: tj @ 2017-09-11 16:37 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, neilb, linux-block, cl, axboe

Hello,

On Mon, Sep 11, 2017 at 04:09:12PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-11 at 06:13 -0700, Tejun Heo wrote:
> > On Fri, Sep 08, 2017 at 04:52:22PM -0700, Bart Van Assche wrote:
> > > The blk-mq core keeps track of the number of request queue users
> > > through q->q_usage_count. Make it possible to switch this counter
> > > to atomic mode from the context of the block layer power management
> > > code by introducing percpu_ref_switch_to_atomic_nowait().
> > 
> > Do you ever have to switch back?  If so, how do you know whether the
> > previous transition finished?
> 
> The way I would like to use this function is to check for completion of the
> transition by calling percpu_ref_is_zero(). That function namely not only
> checks the value of the refcount but also whether it is in atomic mode. See
> also "[PATCH 5/5] blk-mq: Implement power management support"
> (https://www.spinics.net/lists/linux-block/msg17143.html). Do you think this
> will work?

Probably but that sounds really hairy.  I'd much prefer if it could be
done through the usual kill / confirm / release and re-init.  That's
not a possibility?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait()
  2017-09-11 16:37       ` tj
@ 2017-09-11 16:55         ` Bart Van Assche
  2017-09-11 17:20           ` Tejun Heo
  0 siblings, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2017-09-11 16:55 UTC (permalink / raw)
  To: tj; +Cc: hch, neilb, linux-block, cl, axboe

T24gTW9uLCAyMDE3LTA5LTExIGF0IDA5OjM3IC0wNzAwLCB0akBrZXJuZWwub3JnIHdyb3RlOg0K
PiBIZWxsbywNCj4gDQo+IE9uIE1vbiwgU2VwIDExLCAyMDE3IGF0IDA0OjA5OjEyUE0gKzAwMDAs
IEJhcnQgVmFuIEFzc2NoZSB3cm90ZToNCj4gPiBPbiBNb24sIDIwMTctMDktMTEgYXQgMDY6MTMg
LTA3MDAsIFRlanVuIEhlbyB3cm90ZToNCj4gPiA+IE9uIEZyaSwgU2VwIDA4LCAyMDE3IGF0IDA0
OjUyOjIyUE0gLTA3MDAsIEJhcnQgVmFuIEFzc2NoZSB3cm90ZToNCj4gPiA+ID4gVGhlIGJsay1t
cSBjb3JlIGtlZXBzIHRyYWNrIG9mIHRoZSBudW1iZXIgb2YgcmVxdWVzdCBxdWV1ZSB1c2Vycw0K
PiA+ID4gPiB0aHJvdWdoIHEtPnFfdXNhZ2VfY291bnQuIE1ha2UgaXQgcG9zc2libGUgdG8gc3dp
dGNoIHRoaXMgY291bnRlcg0KPiA+ID4gPiB0byBhdG9taWMgbW9kZSBmcm9tIHRoZSBjb250ZXh0
IG9mIHRoZSBibG9jayBsYXllciBwb3dlciBtYW5hZ2VtZW50DQo+ID4gPiA+IGNvZGUgYnkgaW50
cm9kdWNpbmcgcGVyY3B1X3JlZl9zd2l0Y2hfdG9fYXRvbWljX25vd2FpdCgpLg0KPiA+ID4gDQo+
ID4gPiBEbyB5b3UgZXZlciBoYXZlIHRvIHN3aXRjaCBiYWNrPyAgSWYgc28sIGhvdyBkbyB5b3Ug
a25vdyB3aGV0aGVyIHRoZQ0KPiA+ID4gcHJldmlvdXMgdHJhbnNpdGlvbiBmaW5pc2hlZD8NCj4g
PiANCj4gPiBUaGUgd2F5IEkgd291bGQgbGlrZSB0byB1c2UgdGhpcyBmdW5jdGlvbiBpcyB0byBj
aGVjayBmb3IgY29tcGxldGlvbiBvZiB0aGUNCj4gPiB0cmFuc2l0aW9uIGJ5IGNhbGxpbmcgcGVy
Y3B1X3JlZl9pc196ZXJvKCkuIFRoYXQgZnVuY3Rpb24gbmFtZWx5IG5vdCBvbmx5DQo+ID4gY2hl
Y2tzIHRoZSB2YWx1ZSBvZiB0aGUgcmVmY291bnQgYnV0IGFsc28gd2hldGhlciBpdCBpcyBpbiBh
dG9taWMgbW9kZS4gU2VlDQo+ID4gYWxzbyAiW1BBVENIIDUvNV0gYmxrLW1xOiBJbXBsZW1lbnQg
cG93ZXIgbWFuYWdlbWVudCBzdXBwb3J0Ig0KPiA+IChodHRwczovL3d3dy5zcGluaWNzLm5ldC9s
aXN0cy9saW51eC1ibG9jay9tc2cxNzE0My5odG1sKS4gRG8geW91IHRoaW5rIHRoaXMNCj4gPiB3
aWxsIHdvcms/DQo+IA0KPiBQcm9iYWJseSBidXQgdGhhdCBzb3VuZHMgcmVhbGx5IGhhaXJ5LiAg
SSdkIG11Y2ggcHJlZmVyIGlmIGl0IGNvdWxkIGJlDQo+IGRvbmUgdGhyb3VnaCB0aGUgdXN1YWwg
a2lsbCAvIGNvbmZpcm0gLyByZWxlYXNlIGFuZCByZS1pbml0LiAgVGhhdCdzDQo+IG5vdCBhIHBv
c3NpYmlsaXR5Pw0KDQpIZWxsbyBUZWp1biwNCg0KTW9kaWZ5aW5nIHRoaXMgcGF0Y2ggc2VyaWVz
IHN1Y2ggdGhhdCBpdCB1c2VzIHRoZSBjb25maXJtYXRpb24gbWVjaGFuaXNtDQp3aGVuIHN3aXRj
aGluZyBmcm9tIHBlci1jcHUgdG8gYXRvbWljIG1vZGUgc2hvdWxkIGJlIHBvc3NpYmxlLiBIb3dl
dmVyLCB0aGUNCndhaXRfZXZlbnRfbG9ja19pcnEoKSBjYWxsIGluIF9fcGVyY3B1X3JlZl9zd2l0
Y2hfbW9kZSgpIGlzIGFubm95aW5nIHdoZW4NCmNhbGxpbmcgcGVyY3B1X3JlZl9raWxsX2FuZF9j
b25maXJtKCkgZnJvbSBhdG9taWMgY29udGV4dC4gSG93IGFib3V0IGFkZGluZw0KYSBub3dhaXQg
dmVyc2lvbiBvZiBwZXJjcHVfcmVmX2tpbGxfYW5kX2NvbmZpcm0oKSB0aGF0IHJldHVybnMgYSBi
b29sZWFuDQppbmRpY2F0aW5nIHdoZXRoZXIgb3Igbm90IHRoaXMgZnVuY3Rpb24gc3VjY2VlZGVk
Pw0KDQpUaGFua3MsDQoNCkJhcnQu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait()
  2017-09-11 16:55         ` Bart Van Assche
@ 2017-09-11 17:20           ` Tejun Heo
  2017-09-11 17:29             ` Bart Van Assche
  0 siblings, 1 reply; 22+ messages in thread
From: Tejun Heo @ 2017-09-11 17:20 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, neilb, linux-block, cl, axboe

Hey,

On Mon, Sep 11, 2017 at 04:55:59PM +0000, Bart Van Assche wrote:
> > Probably but that sounds really hairy.  I'd much prefer if it could be
> > done through the usual kill / confirm / release and re-init.  That's
> > not a possibility?
> 
> Modifying this patch series such that it uses the confirmation mechanism
> when switching from per-cpu to atomic mode should be possible. However, the
> wait_event_lock_irq() call in __percpu_ref_switch_mode() is annoying when
> calling percpu_ref_kill_and_confirm() from atomic context. How about adding
> a nowait version of percpu_ref_kill_and_confirm() that returns a boolean
> indicating whether or not this function succeeded?

Hmm... so if you know that you won't try to switch again while
switching is in progress, it should be fine.  That can't be
guaranteed?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait()
  2017-09-11 17:20           ` Tejun Heo
@ 2017-09-11 17:29             ` Bart Van Assche
  0 siblings, 0 replies; 22+ messages in thread
From: Bart Van Assche @ 2017-09-11 17:29 UTC (permalink / raw)
  To: tj; +Cc: hch, neilb, linux-block, cl, axboe

T24gTW9uLCAyMDE3LTA5LTExIGF0IDEwOjIwIC0wNzAwLCBUZWp1biBIZW8gd3JvdGU6DQo+IEht
bS4uLiBzbyBpZiB5b3Uga25vdyB0aGF0IHlvdSB3b24ndCB0cnkgdG8gc3dpdGNoIGFnYWluIHdo
aWxlDQo+IHN3aXRjaGluZyBpcyBpbiBwcm9ncmVzcywgaXQgc2hvdWxkIGJlIGZpbmUuICBUaGF0
IGNhbid0IGJlDQo+IGd1YXJhbnRlZWQ/DQoNCkhlbGxvIFRlanVuLA0KDQpBZnRlciBoYXZpbmcg
aGFkIGFub3RoZXIgbG9vayBhdCB0aGUgcG93ZXIgbWFuYWdlbWVudCBkb2N1bWVudGF0aW9uIEkg
c2VlDQpub3cgdGhhdCBwb3dlciBtYW5hZ2VtZW50IGNhbGxiYWNrcyBhcmUgb25seSBjYWxsZWQg
ZnJvbSBhdG9taWMgY29udGV4dCBpZg0KcG1fcnVudGltZV9pcnFfc2FmZShkZXYpIGlzIGNhbGxl
ZCBmaXJzdC4gTmVpdGhlciB0aGUgYmxvY2sgbGF5ZXIgY29yZSBub3INCnRoZSBTQ1NJIGNvcmUg
Y2FsbCB0aGlzIGZ1bmN0aW9uIHNvIEkgd2lsbCB0cnkgdG8gcmV3b3JrIHRoaXMgcGF0Y2ggc2Vy
aWVzDQp3aXRob3V0IG1vZGlmeWluZyB0aGUgcGVyY3B1LXJlZmNvdW50IGltcGxlbWVudGF0aW9u
Lg0KDQpUaGFua3MsDQoNCkJhcnQu

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/5] Make SCSI device suspend work reliably
  2017-09-11 16:25   ` Bart Van Assche
@ 2017-09-12  2:17     ` Ming Lei
  0 siblings, 0 replies; 22+ messages in thread
From: Ming Lei @ 2017-09-12  2:17 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: hch, linux-block, axboe

On Mon, Sep 11, 2017 at 04:25:30PM +0000, Bart Van Assche wrote:
> On Sat, 2017-09-09 at 18:39 +0800, Ming Lei wrote:
> > - the issue is not suspend/resume only, it is about SCSI quiesce vs.
> > RQF_PREEMPT.
> 
> The only in-kernel user of the SCSI quiesce/resume functionality next to
> runtime power management I know of is the SPI transport, namely
> spi_dv_device(). Nobody has ever reported a hang in spi_dv_device() as far
> as I know. Since the number of users of the SCSI parallel transport is very
> small I propose to wait with fixing spi_dv_device() until someone shows that
> it is possible to trigger that hang. If we have to address a hang in
> spi_dv_device() I propose to address it in the same way as my patch series
> addresses power management requests, namely by letting previously submitted
> requests complete and by making blk_get_request() callers wait until the
> quiesced SCSI device has been resumed.

As I mentioned in V4 cover letter, Cathy saw this I/O hang issue, and
she has verified that my revised patchset V3 can fix the issue.

> 
> > I don't see the issue above addressed by this patchset, so maybe
> > you are trying to fix another PM specific issue(not reported by
> > Oleksandr), I am just confused.
> 
> That's correct - I'm only trying to address the PM issues. I will mention
> that explicitly when I repost this patch series.

Unfortunately it isn't enough, and the approach you took is wrong too,
I will comment it in your patch 4.

-- 
Ming

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] block: Make SCSI device suspend work reliably
  2017-09-08 23:52 ` [PATCH 4/5] block: Make SCSI device suspend work reliably Bart Van Assche
@ 2017-09-12  2:29   ` Ming Lei
  2017-09-12 15:45     ` Bart Van Assche
  2017-09-12 16:25   ` Ming Lei
  1 sibling, 1 reply; 22+ messages in thread
From: Ming Lei @ 2017-09-12  2:29 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Hannes Reinecke,
	Johannes Thumshirn, Rafael J . Wysocki

On Fri, Sep 08, 2017 at 04:52:25PM -0700, Bart Van Assche wrote:
> Instead of allowing request allocation to succeed for suspended
> request queues and only to process power management requests, make
> blk_get_request() wait until the request queue is resumed for
> requests that are not power management requests.
> 
> This patch avoids that resume does not occur if the maximum queue
> depth is reached when a power management request is submitted.
> 
> Note: this patch affects the behavior of scsi_device_quiesce() only
> if that function is called from inside a power management callback.
> This patch does not affect the behavior of scsi_device_quiesce()
> when a call of that function is triggered by writing "quiesce" into
> /sys/class/scsi_device/*/device/state.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c       | 60 +++++++++++++++++++++++++++-----------------------
>  block/blk.h            | 12 ++++++++++
>  include/linux/blkdev.h |  1 +
>  3 files changed, 45 insertions(+), 28 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index bb53c6b58e8c..cd2700c763ed 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1325,6 +1325,24 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
>  	return ERR_PTR(-ENOMEM);
>  }
>  
> +#ifdef CONFIG_PM
> +static bool blk_wait_until_active(struct request_queue *q, bool wait)
> +	__releases(q->queue_lock)
> +	__acquires(q->queue_lock)
> +{
> +	if (wait)
> +		wait_event_lock_irq(q->rpm_active_wq,
> +				    q->rpm_status == RPM_ACTIVE,
> +				    *q->queue_lock);
> +	return q->rpm_status == RPM_ACTIVE;
> +}
> +#else
> +static bool blk_wait_until_active(struct request_queue *q, bool wait)
> +{
> +	return true;
> +}
> +#endif
> +
>  /**
>   * get_request - get a free request
>   * @q: request_queue to allocate request from
> @@ -1350,6 +1368,16 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
>  	lockdep_assert_held(q->queue_lock);
>  	WARN_ON_ONCE(q->mq_ops);
>  
> +	WARN_ON_ONCE((op & REQ_PM) && blk_pm_suspended(q));
> +
> +	/*
> +	 * Wait if the request queue is suspended or in the process of
> +	 * suspending/resuming and the request being allocated will not be
> +	 * used for power management purposes.
> +	 */
> +	if (!(op & REQ_PM) && !blk_wait_until_active(q, !(op & REQ_NOWAIT)))
> +		return ERR_PTR(-EAGAIN);
> +

Firstly it is not enough to prevent new allocation only, because
requests pool may have been used up and all the allocated requests
just stay in block queue when SCSI device is put into quiesce.
So you may cause new I/O hang and wait forever here. That is why
I take freeze to do that because freezing queue can prevent new
allocation and drain in-queue requests both.

Secondly, all RQF_PREEMPT(not PM) is blocked here too, that may
cause regression since we usually need/allow RQF_PREEMPT to run
during SCSI quiesce.

-- 
Ming

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/5] blk-mq: Implement power management support
  2017-09-08 23:52 ` [PATCH 5/5] blk-mq: Implement power management support Bart Van Assche
@ 2017-09-12  9:14   ` Ming Lei
  0 siblings, 0 replies; 22+ messages in thread
From: Ming Lei @ 2017-09-12  9:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Hannes Reinecke,
	Johannes Thumshirn, Rafael J . Wysocki

On Fri, Sep 08, 2017 at 04:52:26PM -0700, Bart Van Assche wrote:
> Implement the following approach for blk-mq:
> - Either make blk_get_request() wait or make it fail when a
>   request queue is not in status RPM_ACTIVE.
> - While suspending, suspended or resuming, only process power
>   management requests (REQ_PM).
> 
> Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
> References: "I/O hangs after resuming from suspend-to-ram" (https://marc.info/?l=linux-block&m=150340235201348).

This patch is nothing to do with Oleksandr's report, please
remove the above two lines. For example, runttime PM can
be bypassed via sysfs, and suspend/resume still can work
well.

> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c | 20 ++++++++++++++++----
>  block/blk-mq.c   | 34 ++++++++++++++++++++++++++++++++++
>  2 files changed, 50 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index cd2700c763ed..49a4cd5b255e 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -3438,10 +3438,6 @@ EXPORT_SYMBOL(blk_finish_plug);
>   */
>  void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
>  {
> -	/* not support for RQF_PM and ->rpm_status in blk-mq yet */
> -	if (q->mq_ops)
> -		return;
> -
>  	q->dev = dev;
>  	q->rpm_status = RPM_ACTIVE;
>  	init_waitqueue_head(&q->rpm_active_wq);
> @@ -3478,6 +3474,19 @@ int blk_pre_runtime_suspend(struct request_queue *q)
>  	if (!q->dev)
>  		return ret;
>  
> +	if (q->mq_ops) {
> +		percpu_ref_switch_to_atomic_nowait(&q->q_usage_counter);
> +		if (!percpu_ref_is_zero(&q->q_usage_counter)) {
> +			ret = -EBUSY;
> +			pm_runtime_mark_last_busy(q->dev);
> +		} else {
> +			spin_lock_irq(q->queue_lock);
> +			q->rpm_status = RPM_SUSPENDING;
> +			spin_unlock_irq(q->queue_lock);
> +		}
> +		return ret;
> +	}
> +
>  	spin_lock_irq(q->queue_lock);
>  	if (q->nr_pending) {
>  		ret = -EBUSY;
> @@ -3561,6 +3570,9 @@ void blk_post_runtime_resume(struct request_queue *q, int err)
>  	if (!q->dev)
>  		return;
>  
> +	if (q->mq_ops)
> +		percpu_ref_switch_to_percpu(&q->q_usage_counter);
> +
>  	spin_lock_irq(q->queue_lock);
>  	if (!err) {
>  		q->rpm_status = RPM_ACTIVE;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3f18cff80050..cbd680dc194a 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -383,6 +383,29 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>  	return rq;
>  }
>  
> +#ifdef CONFIG_PM
> +static bool blk_mq_wait_until_active(struct request_queue *q, bool wait)
> +{
> +	if (!wait)
> +		return false;
> +	/*
> +	 * Note: the q->rpm_status check below races against the changes of
> +	 * that variable by the blk_{pre,post}_runtime_{suspend,resume}()
> +	 * functions. The worst possible consequence of these races is that a
> +	 * small number of requests gets passed to the block driver associated
> +	 * with the request queue after rpm_status has been changed into
> +	 * RPM_SUSPENDING and before it is changed into RPM_SUSPENDED.
> +	 */
> +	wait_event(q->rpm_active_wq, q->rpm_status == RPM_ACTIVE);
> +	return true;
> +}
> +#else
> +static bool blk_mq_wait_until_active(struct request_queue *q, bool nowait)
> +{
> +	return true;
> +}
> +#endif
> +
>  struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
>  		unsigned int flags)
>  {
> @@ -390,6 +413,17 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
>  	struct request *rq;
>  	int ret;
>  
> +	WARN_ON_ONCE((op & REQ_PM) && blk_pm_suspended(q));
> +
> +	/*
> +	 * Wait if the request queue is suspended or in the process of
> +	 * suspending/resuming and the request being allocated will not be
> +	 * used for power management purposes.
> +	 */
> +	if (!(op & REQ_PM) &&
> +	    !blk_mq_wait_until_active(q, !(op & REQ_NOWAIT)))
> +		return ERR_PTR(-EAGAIN);
> +
>  	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
>  	if (ret)
>  		return ERR_PTR(ret);
> -- 
> 2.14.1
> 

One issue is that pm_runtime_mark_last_busy() isn't set
accurately because it can't check if the freeing req is
the last active one, and set it if yes.


-- 
Ming

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] block: Make SCSI device suspend work reliably
  2017-09-12  2:29   ` Ming Lei
@ 2017-09-12 15:45     ` Bart Van Assche
  2017-09-12 16:10       ` Ming Lei
  0 siblings, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2017-09-12 15:45 UTC (permalink / raw)
  To: ming.lei; +Cc: hch, jthumshirn, linux-block, hare, axboe, rafael.j.wysocki

T24gVHVlLCAyMDE3LTA5LTEyIGF0IDEwOjI5ICswODAwLCBNaW5nIExlaSB3cm90ZToNCj4gT24g
RnJpLCBTZXAgMDgsIDIwMTcgYXQgMDQ6NTI6MjVQTSAtMDcwMCwgQmFydCBWYW4gQXNzY2hlIHdy
b3RlOg0KPiA+IEBAIC0xMzUwLDYgKzEzNjgsMTYgQEAgc3RhdGljIHN0cnVjdCByZXF1ZXN0ICpn
ZXRfcmVxdWVzdChzdHJ1Y3QgcmVxdWVzdF9xdWV1ZSAqcSwgdW5zaWduZWQgaW50IG9wLA0KPiA+
ICAJbG9ja2RlcF9hc3NlcnRfaGVsZChxLT5xdWV1ZV9sb2NrKTsNCj4gPiAgCVdBUk5fT05fT05D
RShxLT5tcV9vcHMpOw0KPiA+ICANCj4gPiArCVdBUk5fT05fT05DRSgob3AgJiBSRVFfUE0pICYm
IGJsa19wbV9zdXNwZW5kZWQocSkpOw0KPiA+ICsNCj4gPiArCS8qDQo+ID4gKwkgKiBXYWl0IGlm
IHRoZSByZXF1ZXN0IHF1ZXVlIGlzIHN1c3BlbmRlZCBvciBpbiB0aGUgcHJvY2VzcyBvZg0KPiA+
ICsJICogc3VzcGVuZGluZy9yZXN1bWluZyBhbmQgdGhlIHJlcXVlc3QgYmVpbmcgYWxsb2NhdGVk
IHdpbGwgbm90IGJlDQo+ID4gKwkgKiB1c2VkIGZvciBwb3dlciBtYW5hZ2VtZW50IHB1cnBvc2Vz
Lg0KPiA+ICsJICovDQo+ID4gKwlpZiAoIShvcCAmIFJFUV9QTSkgJiYgIWJsa193YWl0X3VudGls
X2FjdGl2ZShxLCAhKG9wICYgUkVRX05PV0FJVCkpKQ0KPiA+ICsJCXJldHVybiBFUlJfUFRSKC1F
QUdBSU4pOw0KPiA+ICsNCj4gDQo+IEZpcnN0bHkgaXQgaXMgbm90IGVub3VnaCB0byBwcmV2ZW50
IG5ldyBhbGxvY2F0aW9uIG9ubHksIGJlY2F1c2UNCj4gcmVxdWVzdHMgcG9vbCBtYXkgaGF2ZSBi
ZWVuIHVzZWQgdXAgYW5kIGFsbCB0aGUgYWxsb2NhdGVkIHJlcXVlc3RzDQo+IGp1c3Qgc3RheSBp
biBibG9jayBxdWV1ZSB3aGVuIFNDU0kgZGV2aWNlIGlzIHB1dCBpbnRvIHF1aWVzY2UuDQo+IFNv
IHlvdSBtYXkgY2F1c2UgbmV3IEkvTyBoYW5nIGFuZCB3YWl0IGZvcmV2ZXIgaGVyZS4gVGhhdCBp
cyB3aHkNCj4gSSB0YWtlIGZyZWV6ZSB0byBkbyB0aGF0IGJlY2F1c2UgZnJlZXppbmcgcXVldWUg
Y2FuIHByZXZlbnQgbmV3DQo+IGFsbG9jYXRpb24gYW5kIGRyYWluIGluLXF1ZXVlIHJlcXVlc3Rz
IGJvdGguDQoNClNvcnJ5IGJ1dCBJIGRpc2FncmVlLiBJZiBhbGwgdGFncyBhcmUgYWxsb2NhdGVk
IGFuZCBpdCBpcyBhdHRlbXB0ZWQgdG8NCnN1c3BlbmQgYSBxdWV1ZSB0aGVuIHRoaXMgcGF0Y2gg
bm90IG9ubHkgd2lsbCBwcmV2ZW50IGFsbG9jYXRpb24gb2YgbmV3DQpub24tUE0gcmVxdWVzdHMg
YnV0IGl0IHdpbGwgYWxzbyBsZXQgdGhlc2UgcHJldmlvdXNseSBzdWJtaXR0ZWQgbm9uLVBNDQpy
ZXF1ZXN0cyBmaW5pc2guIFNlZSBhbHNvIHRoZSBibGtfcGVla19yZXF1ZXN0KCkgY2hhbmdlcyBp
biBwYXRjaCA0LzUuDQpPbmNlIGEgcHJldmlvdXNseSBzdWJtaXR0ZWQgcmVxdWVzdCBmaW5pc2hl
ZCBhbGxvY2F0aW9uIG9mIHRoZSBQTSByZXF1ZXN0DQp3aWxsIHN1Y2NlZWQuDQoNCj4gU2Vjb25k
bHksIGFsbCBSUUZfUFJFRU1QVChub3QgUE0pIGlzIGJsb2NrZWQgaGVyZSB0b28sIHRoYXQgbWF5
DQo+IGNhdXNlIHJlZ3Jlc3Npb24gc2luY2Ugd2UgdXN1YWxseSBuZWVkL2FsbG93IFJRRl9QUkVF
TVBUIHRvIHJ1bg0KPiBkdXJpbmcgU0NTSSBxdWllc2NlLg0KDQpTb3JyeSBidXQgSSBkaXNhZ3Jl
ZSB3aXRoIHRoaXMgY29tbWVudCB0b28uIFBsZWFzZSBrZWVwIGluIG1pbmQgdGhhdCBteSBwYXRj
aA0Kb25seSBhZmZlY3RzIHRoZSBTQ1NJIHF1aWVzY2Ugc3RhdHVzIGlmIHRoYXQgc3RhdHVzIGhh
cyBiZWVuIHJlcXVlc3RlZCBieSB0aGUNCnBvd2VyIG1hbmFnZW1lbnQgc3Vic3lzdGVtLiBBZnRl
ciB0aGUgcG93ZXIgbWFuYWdlbWVudCBzdWJzeXN0ZW0gaGFzDQp0cmFuc2l0aW9uZWQgYSBTQ1NJ
IHF1ZXVlIGludG8gdGhlIHF1aWVzY2Ugc3RhdGUgdGhhdCBxdWV1ZSBoYXMgcmVhY2hlZCB0aGUN
ClJQTV9TVVNQRU5ERUQgc3RhdHVzLiBObyBuZXcgcmVxdWVzdHMgbXVzdCBiZSBzdWJtaXR0ZWQg
YWdhaW5zdCBhIHN1c3BlbmRlZA0KcXVldWUuIEl0IGlzIGVhc3kgdG8gc2VlIGluIHRoZSBsZWdh
Y3kgYmxvY2sgbGF5ZXIgdGhhdCBvbmx5IFBNIHJlcXVlc3RzIGFuZA0Kbm8gb3RoZXIgUlFGX1BS
RUVNUFQgcmVxdWVzdHMgYXJlIHByb2Nlc3NlZCBvbmNlIHRoZSBxdWV1ZSBoYXMgcmVhY2hlZCB0
aGUNCnN1c3BlbmRlZCBzdGF0dXM6DQoNCi8qDQogKiBEb24ndCBwcm9jZXNzIG5vcm1hbCByZXF1
ZXN0cyB3aGVuIHF1ZXVlIGlzIHN1c3BlbmRlZA0KICogb3IgaW4gdGhlIHByb2Nlc3Mgb2Ygc3Vz
cGVuZGluZy9yZXN1bWluZw0KICovDQpzdGF0aWMgc3RydWN0IHJlcXVlc3QgKmJsa19wbV9wZWVr
X3JlcXVlc3Qoc3RydWN0IHJlcXVlc3RfcXVldWUgKnEsDQoJCQkJCSAgIHN0cnVjdCByZXF1ZXN0
ICpycSkNCnsNCglpZiAocS0+ZGV2ICYmIChxLT5ycG1fc3RhdHVzID09IFJQTV9TVVNQRU5ERUQg
fHwNCgkgICAgKHEtPnJwbV9zdGF0dXMgIT0gUlBNX0FDVElWRSAmJiAhKHJxLT5jbWRfZmxhZ3Mg
JiBSRVFfUE0pKSkpDQoJCXJldHVybiBOVUxMOw0KCWVsc2UNCgkJcmV0dXJuIHJxOw0KfQ0KDQpC
YXJ0Lg==

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] block: Make SCSI device suspend work reliably
  2017-09-12 15:45     ` Bart Van Assche
@ 2017-09-12 16:10       ` Ming Lei
  0 siblings, 0 replies; 22+ messages in thread
From: Ming Lei @ 2017-09-12 16:10 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, jthumshirn, linux-block, hare, axboe, rafael.j.wysocki

On Tue, Sep 12, 2017 at 03:45:50PM +0000, Bart Van Assche wrote:
> On Tue, 2017-09-12 at 10:29 +0800, Ming Lei wrote:
> > On Fri, Sep 08, 2017 at 04:52:25PM -0700, Bart Van Assche wrote:
> > > @@ -1350,6 +1368,16 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
> > >  	lockdep_assert_held(q->queue_lock);
> > >  	WARN_ON_ONCE(q->mq_ops);
> > >  
> > > +	WARN_ON_ONCE((op & REQ_PM) && blk_pm_suspended(q));
> > > +
> > > +	/*
> > > +	 * Wait if the request queue is suspended or in the process of
> > > +	 * suspending/resuming and the request being allocated will not be
> > > +	 * used for power management purposes.
> > > +	 */
> > > +	if (!(op & REQ_PM) && !blk_wait_until_active(q, !(op & REQ_NOWAIT)))
> > > +		return ERR_PTR(-EAGAIN);
> > > +
> > 
> > Firstly it is not enough to prevent new allocation only, because
> > requests pool may have been used up and all the allocated requests
> > just stay in block queue when SCSI device is put into quiesce.
> > So you may cause new I/O hang and wait forever here. That is why
> > I take freeze to do that because freezing queue can prevent new
> > allocation and drain in-queue requests both.
> 
> Sorry but I disagree. If all tags are allocated and it is attempted to
> suspend a queue then this patch not only will prevent allocation of new
> non-PM requests but it will also let these previously submitted non-PM
> requests finish. See also the blk_peek_request() changes in patch 4/5.
> Once a previously submitted request finished allocation of the PM request
> will succeed.

You just simply removed blk_pm_peek_request() from blk_peek_request(),
how does that work on blk-mq?

Also that may not work because SCSI quiesce may just happen between
request allocation and run queue, then nothing can be dispatched
to driver.

> 
> > Secondly, all RQF_PREEMPT(not PM) is blocked here too, that may
> > cause regression since we usually need/allow RQF_PREEMPT to run
> > during SCSI quiesce.
> 
> Sorry but I disagree with this comment too. Please keep in mind that my patch
> only affects the SCSI quiesce status if that status has been requested by the
> power management subsystem. After the power management subsystem has
> transitioned a SCSI queue into the quiesce state that queue has reached the
> RPM_SUSPENDED status. No new requests must be submitted against a suspended

No, RPM_SUSPEND only means runtime suspend, you misunderstand it as
system suspend.

Actually the reported issue is during system suspend/resume, which can
happen without runtime PM, such as, runtime PM is disabled via sysfs.

> queue. It is easy to see in the legacy block layer that only PM requests and
> no other RQF_PREEMPT requests are processed once the queue has reached the
> suspended status:
> 
> /*
>  * Don't process normal requests when queue is suspended
>  * or in the process of suspending/resuming
>  */
> static struct request *blk_pm_peek_request(struct request_queue *q,
> 					   struct request *rq)
> {
> 	if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
> 	    (q->rpm_status != RPM_ACTIVE && !(rq->cmd_flags & REQ_PM))))

As I explained, rpm_status represents runtime PM status, nothing to
do with system suspend/resume.

So RRF_PREEMPT can be dispatched to driver during system suspend.

-- 
Ming

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/5] block: Make SCSI device suspend work reliably
  2017-09-08 23:52 ` [PATCH 4/5] block: Make SCSI device suspend work reliably Bart Van Assche
  2017-09-12  2:29   ` Ming Lei
@ 2017-09-12 16:25   ` Ming Lei
  1 sibling, 0 replies; 22+ messages in thread
From: Ming Lei @ 2017-09-12 16:25 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Hannes Reinecke,
	Johannes Thumshirn, Rafael J . Wysocki

On Fri, Sep 08, 2017 at 04:52:25PM -0700, Bart Van Assche wrote:
> Instead of allowing request allocation to succeed for suspended
> request queues and only to process power management requests, make
> blk_get_request() wait until the request queue is resumed for
> requests that are not power management requests.
> 
> This patch avoids that resume does not occur if the maximum queue
> depth is reached when a power management request is submitted.
> 
> Note: this patch affects the behavior of scsi_device_quiesce() only
> if that function is called from inside a power management callback.
> This patch does not affect the behavior of scsi_device_quiesce()
> when a call of that function is triggered by writing "quiesce" into
> /sys/class/scsi_device/*/device/state.
> 
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c       | 60 +++++++++++++++++++++++++++-----------------------
>  block/blk.h            | 12 ++++++++++
>  include/linux/blkdev.h |  1 +
>  3 files changed, 45 insertions(+), 28 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index bb53c6b58e8c..cd2700c763ed 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1325,6 +1325,24 @@ static struct request *__get_request(struct request_list *rl, unsigned int op,
>  	return ERR_PTR(-ENOMEM);
>  }
>  
> +#ifdef CONFIG_PM
> +static bool blk_wait_until_active(struct request_queue *q, bool wait)
> +	__releases(q->queue_lock)
> +	__acquires(q->queue_lock)
> +{
> +	if (wait)
> +		wait_event_lock_irq(q->rpm_active_wq,
> +				    q->rpm_status == RPM_ACTIVE,
> +				    *q->queue_lock);
> +	return q->rpm_status == RPM_ACTIVE;

If runtime PM is disabled via /sys/.../power/control, q->rpm_status
can be always ACTIVE, even during system suspend, then you can't
prevent any new request allocation at that time.


-- 
Ming

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2017-09-12 16:25 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-08 23:52 [PATCH 0/5] Make SCSI device suspend work reliably Bart Van Assche
2017-09-08 23:52 ` [PATCH 1/5] percpu-refcount: Introduce percpu_ref_switch_to_atomic_nowait() Bart Van Assche
2017-09-11  8:42   ` Johannes Thumshirn
2017-09-11 16:10     ` Bart Van Assche
2017-09-11 13:13   ` Tejun Heo
2017-09-11 16:09     ` Bart Van Assche
2017-09-11 16:37       ` tj
2017-09-11 16:55         ` Bart Van Assche
2017-09-11 17:20           ` Tejun Heo
2017-09-11 17:29             ` Bart Van Assche
2017-09-08 23:52 ` [PATCH 2/5] scsi: Change the type of the second last argument of scsi_execute() Bart Van Assche
2017-09-08 23:52 ` [PATCH 3/5] block: Introduce REQ_PM and remove RQF_PM Bart Van Assche
2017-09-08 23:52 ` [PATCH 4/5] block: Make SCSI device suspend work reliably Bart Van Assche
2017-09-12  2:29   ` Ming Lei
2017-09-12 15:45     ` Bart Van Assche
2017-09-12 16:10       ` Ming Lei
2017-09-12 16:25   ` Ming Lei
2017-09-08 23:52 ` [PATCH 5/5] blk-mq: Implement power management support Bart Van Assche
2017-09-12  9:14   ` Ming Lei
2017-09-09 10:39 ` [PATCH 0/5] Make SCSI device suspend work reliably Ming Lei
2017-09-11 16:25   ` Bart Van Assche
2017-09-12  2:17     ` Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.