* [PATCH V3 0/8] block/scsi: safe SCSI quiescing
From: Ming Lei @ 2017-09-02 13:08 UTC
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

Hi,

The current SCSI quiesce isn't safe: it is easy to trigger an I/O deadlock.

Once a SCSI device is put into QUIESCE, no new request except for RQF_PREEMPT
can be dispatched to SCSI successfully, and scsi_device_quiesce() simply
waits for completion of the I/Os already dispatched to the SCSI stack. That
isn't enough at all.

New requests can still be allocated, but none of the allocated
requests can be dispatched successfully, so the request pool is
easily exhausted.

Then a request with RQF_PREEMPT can't be allocated either, and the system
may hang forever, for example during system suspend or SCSI domain
validation.

I/O hangs during both system suspend[1] and SCSI domain validation
have been reported before.
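
For reference, a minimal sketch of the legacy-path deadlock (the function
names are from the current tree; the interleaving itself is illustrative):

	/* Context A: normal I/O submitted after the device is quiesced */
	rq = blk_get_request(q, REQ_OP_READ, GFP_NOIO);
	/*
	 * Allocation succeeds, but scsi_prep_state_check() returns
	 * BLKPREP_DEFER for SDEV_QUIESCE, so rq is neither dispatched
	 * nor freed; repeat until the request pool is used up.
	 */

	/* Context B: the suspend path now needs an RQF_PREEMPT request */
	rq = blk_get_request(q, REQ_OP_SCSI_IN, __GFP_RECLAIM);
	/*
	 * Sleeps in get_request() waiting for a free request that
	 * never comes back: I/O deadlock.
	 */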

This patchset introduces preempt freeze and solves the issue by
preempt-freezing the block queue during SCSI quiesce, while still allowing
RQF_PREEMPT requests to be allocated when the queue is preempt-frozen.

Both SCSI and SCSI_MQ have this I/O deadlock issue; this patchset fixes
both by introducing blk_freeze_queue_preempt() and
blk_unfreeze_queue_preempt(), and also unifies the current interfaces for
freezing the queue between the block legacy path and blk-mq.
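
At a glance, the new interface pairs up like this (a sketch of how patch 8
uses it; error handling omitted):

	blk_freeze_queue_preempt(q);	/* drain and gate normal I/O */
	/* only RQF_PREEMPT allocation, via __blk_get_request(), succeeds here */
	...
	blk_unfreeze_queue_preempt(q);	/* reopen normal request allocation */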

Oleksandr has verified that V2 of this patchset fixes his I/O hang
during the suspend/resume cycle.

V3:
	- introduce q->preempt_unfreezing to fix one bug of preempt freeze
	- call blk_queue_enter_live() only when queue is preempt frozen
	- clean up the implementation of preempt freeze a bit
	- only patches 6 and 7 are changed

V2:
	- drop the 1st patch in V1 because percpu_ref_is_dying() is
	enough, as pointed out by Tejun
	- introduce preempt version of blk_[freeze|unfreeze]_queue
	- sync between preempt freeze and normal freeze
	- fix warning from percpu-refcount as reported by Oleksandr


[1] https://marc.info/?t=150340250100013&r=3&w=2



Ming Lei (8):
  blk-mq: rename blk_mq_unfreeze_queue as blk_unfreeze_queue
  blk-mq: rename blk_mq_freeze_queue as blk_freeze_queue
  blk-mq: only run hw queues for blk-mq
  blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait
  block: tracking request allocation with q_usage_counter
  block: introduce preempt version of blk_[freeze|unfreeze]_queue
  block: allow to allocate req with REQF_PREEMPT when queue is preempt
    frozen
  SCSI: preempt freeze block queue when SCSI device is put into quiesce

 block/bfq-iosched.c      |   2 +-
 block/blk-cgroup.c       |   8 +--
 block/blk-core.c         |  53 ++++++++++++---
 block/blk-mq.c           | 170 +++++++++++++++++++++++++++++++++++++++--------
 block/blk-mq.h           |   1 -
 block/blk.h              |  17 +++++
 block/elevator.c         |   4 +-
 drivers/block/loop.c     |  16 ++---
 drivers/block/rbd.c      |   2 +-
 drivers/nvme/host/core.c |   8 +--
 drivers/scsi/scsi_lib.c  |  22 +++++-
 include/linux/blk-mq.h   |  15 +++--
 include/linux/blkdev.h   |  21 +++++-
 13 files changed, 273 insertions(+), 66 deletions(-)

-- 
2.9.5


* [PATCH V3 1/8] blk-mq: rename blk_mq_unfreeze_queue as blk_unfreeze_queue
From: Ming Lei @ 2017-09-02 13:08 UTC
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

We will support freezing the queue on the block legacy path too.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-cgroup.c       |  4 ++--
 block/blk-mq.c           | 10 +++++-----
 block/elevator.c         |  2 +-
 drivers/block/loop.c     |  8 ++++----
 drivers/nvme/host/core.c |  4 ++--
 include/linux/blk-mq.h   |  2 +-
 6 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0480892e97e5..02e8a47ac77c 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1337,7 +1337,7 @@ int blkcg_activate_policy(struct request_queue *q,
 	spin_unlock_irq(q->queue_lock);
 out_bypass_end:
 	if (q->mq_ops)
-		blk_mq_unfreeze_queue(q);
+		blk_unfreeze_queue(q);
 	else
 		blk_queue_bypass_end(q);
 	if (pd_prealloc)
@@ -1388,7 +1388,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
 	spin_unlock_irq(q->queue_lock);
 
 	if (q->mq_ops)
-		blk_mq_unfreeze_queue(q);
+		blk_unfreeze_queue(q);
 	else
 		blk_queue_bypass_end(q);
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d935f15c54da..82136e83951d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -172,7 +172,7 @@ void blk_mq_freeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
 
-void blk_mq_unfreeze_queue(struct request_queue *q)
+void blk_unfreeze_queue(struct request_queue *q)
 {
 	int freeze_depth;
 
@@ -183,7 +183,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 		wake_up_all(&q->mq_freeze_wq);
 	}
 }
-EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
+EXPORT_SYMBOL_GPL(blk_unfreeze_queue);
 
 /*
  * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
@@ -2250,7 +2250,7 @@ static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set,
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_freeze_queue(q);
 		queue_set_hctx_shared(q, shared);
-		blk_mq_unfreeze_queue(q);
+		blk_unfreeze_queue(q);
 	}
 }
 
@@ -2708,7 +2708,7 @@ static int __blk_mq_update_nr_requests(struct request_queue *q,
 	if (!ret)
 		q->nr_requests = nr;
 
-	blk_mq_unfreeze_queue(q);
+	blk_unfreeze_queue(q);
 
 	return ret;
 }
@@ -2757,7 +2757,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	}
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_unfreeze_queue(q);
+		blk_unfreeze_queue(q);
 }
 
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
diff --git a/block/elevator.c b/block/elevator.c
index 0e465809d3f3..371c8165c9e8 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -994,7 +994,7 @@ static int elevator_switch_mq(struct request_queue *q,
 		blk_add_trace_msg(q, "elv switch: none");
 
 out:
-	blk_mq_unfreeze_queue(q);
+	blk_unfreeze_queue(q);
 	return ret;
 }
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 2fbd4089c20e..5c11ea44d470 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -217,7 +217,7 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
 	else
 		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 }
 
 static int
@@ -605,7 +605,7 @@ static int loop_switch(struct loop_device *lo, struct file *file)
 	do_loop_switch(lo, &w);
 
 	/* unfreeze */
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 
 	return 0;
 }
@@ -1079,7 +1079,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_unbound;
 	/* This is safe: open() is still holding a reference. */
 	module_put(THIS_MODULE);
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev)
 		loop_reread_partitions(lo, bdev);
@@ -1191,7 +1191,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 	__loop_update_dio(lo, lo->use_dio);
 
  exit:
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 
 	if (!err && (info->lo_flags & LO_FLAGS_PARTSCAN) &&
 	     !(lo->lo_flags & LO_FLAGS_PARTSCAN)) {
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 37046ac2c441..5c76b0a96be2 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1226,7 +1226,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 
 	if (ctrl->oncs & NVME_CTRL_ONCS_DSM)
 		nvme_config_discard(ns);
-	blk_mq_unfreeze_queue(disk->queue);
+	blk_unfreeze_queue(disk->queue);
 }
 
 static int nvme_revalidate_disk(struct gendisk *disk)
@@ -2753,7 +2753,7 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_unfreeze_queue(ns->queue);
+		blk_unfreeze_queue(ns->queue);
 	mutex_unlock(&ctrl->namespaces_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_unfreeze);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 13f6c25fa461..2572e5641568 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -257,7 +257,7 @@ void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
 void blk_mq_freeze_queue(struct request_queue *q);
-void blk_mq_unfreeze_queue(struct request_queue *q);
+void blk_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
-- 
2.9.5


* [PATCH V3 2/8] blk-mq: rename blk_mq_freeze_queue as blk_freeze_queue
From: Ming Lei @ 2017-09-02 13:08 UTC
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

These APIs will be used by the legacy path too.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bfq-iosched.c      |  2 +-
 block/blk-cgroup.c       |  4 ++--
 block/blk-mq.c           | 17 ++++-------------
 block/blk-mq.h           |  1 -
 block/elevator.c         |  2 +-
 drivers/block/loop.c     |  8 ++++----
 drivers/block/rbd.c      |  2 +-
 drivers/nvme/host/core.c |  2 +-
 include/linux/blk-mq.h   |  2 +-
 9 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 509f39998011..ce2b00e897e2 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4757,7 +4757,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	 * The invocation of the next bfq_create_group_hierarchy
 	 * function is the head of a chain of function calls
 	 * (bfq_create_group_hierarchy->blkcg_activate_policy->
-	 * blk_mq_freeze_queue) that may lead to the invocation of the
+	 * blk_freeze_queue) that may lead to the invocation of the
 	 * has_work hook function. For this reason,
 	 * bfq_create_group_hierarchy is invoked only after all
 	 * scheduler data has been initialized, apart from the fields
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 02e8a47ac77c..87c15f3947d5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1296,7 +1296,7 @@ int blkcg_activate_policy(struct request_queue *q,
 		return 0;
 
 	if (q->mq_ops)
-		blk_mq_freeze_queue(q);
+		blk_freeze_queue(q);
 	else
 		blk_queue_bypass_start(q);
 pd_prealloc:
@@ -1363,7 +1363,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
 		return;
 
 	if (q->mq_ops)
-		blk_mq_freeze_queue(q);
+		blk_freeze_queue(q);
 	else
 		blk_queue_bypass_start(q);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 82136e83951d..8cf1f7cbef2b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -161,16 +161,7 @@ void blk_freeze_queue(struct request_queue *q)
 	blk_freeze_queue_start(q);
 	blk_mq_freeze_queue_wait(q);
 }
-
-void blk_mq_freeze_queue(struct request_queue *q)
-{
-	/*
-	 * ...just an alias to keep freeze and unfreeze actions balanced
-	 * in the blk_mq_* namespace
-	 */
-	blk_freeze_queue(q);
-}
-EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
+EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
 void blk_unfreeze_queue(struct request_queue *q)
 {
@@ -2248,7 +2239,7 @@ static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set,
 	lockdep_assert_held(&set->tag_list_lock);
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		blk_mq_freeze_queue(q);
+		blk_freeze_queue(q);
 		queue_set_hctx_shared(q, shared);
 		blk_unfreeze_queue(q);
 	}
@@ -2683,7 +2674,7 @@ static int __blk_mq_update_nr_requests(struct request_queue *q,
 	if (!set)
 		return -EINVAL;
 
-	blk_mq_freeze_queue(q);
+	blk_freeze_queue(q);
 
 	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
@@ -2747,7 +2738,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		return;
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_freeze_queue(q);
+		blk_freeze_queue(q);
 
 	set->nr_hw_queues = nr_hw_queues;
 	blk_mq_update_queue_map(set);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1b9742eb7399..7ce29ef1e6f3 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -30,7 +30,6 @@ struct blk_mq_ctx {
 } ____cacheline_aligned_in_smp;
 
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
-void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 371c8165c9e8..1164c8a3720f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -967,7 +967,7 @@ static int elevator_switch_mq(struct request_queue *q,
 {
 	int ret;
 
-	blk_mq_freeze_queue(q);
+	blk_freeze_queue(q);
 
 	if (q->elevator) {
 		if (q->elevator->registered)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5c11ea44d470..b2e708b7e1e6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -211,7 +211,7 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 	 * LO_FLAGS_READ_ONLY, both are set from kernel, and losetup
 	 * will get updated by ioctl(LOOP_GET_STATUS)
 	 */
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 	lo->use_dio = use_dio;
 	if (use_dio)
 		lo->lo_flags |= LO_FLAGS_DIRECT_IO;
@@ -599,7 +599,7 @@ static int loop_switch(struct loop_device *lo, struct file *file)
 	w.file = file;
 
 	/* freeze queue and wait for completion of scheduled requests */
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 
 	/* do the switch action */
 	do_loop_switch(lo, &w);
@@ -1046,7 +1046,7 @@ static int loop_clr_fd(struct loop_device *lo)
 		return -EINVAL;
 
 	/* freeze request queue during the transition */
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 
 	spin_lock_irq(&lo->lo_lock);
 	lo->lo_state = Lo_rundown;
@@ -1116,7 +1116,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 		return -EINVAL;
 
 	/* I/O need to be drained during transfer transition */
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 
 	err = loop_release_xfer(lo);
 	if (err)
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b008b6a98098..3a97ffcb3a81 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -6347,7 +6347,7 @@ static ssize_t do_rbd_remove(struct bus_type *bus,
 		 * Prevent new IO from being queued and wait for existing
 		 * IO to complete/fail.
 		 */
-		blk_mq_freeze_queue(rbd_dev->disk->queue);
+		blk_freeze_queue(rbd_dev->disk->queue);
 		blk_set_queue_dying(rbd_dev->disk->queue);
 	}
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5c76b0a96be2..986f2b4f9760 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1210,7 +1210,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	bs = 1 << ns->lba_shift;
 	ns->noiob = le16_to_cpu(id->noiob);
 
-	blk_mq_freeze_queue(disk->queue);
+	blk_freeze_queue(disk->queue);
 
 	if (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)
 		nvme_prep_integrity(disk, id, bs);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2572e5641568..8ae77e088c01 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -256,7 +256,7 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
-void blk_mq_freeze_queue(struct request_queue *q);
+void blk_freeze_queue(struct request_queue *q);
 void blk_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
-- 
2.9.5


* [PATCH V3 3/8] blk-mq: only run hw queues for blk-mq
From: Ming Lei @ 2017-09-02 13:08 UTC
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

Running the hw queues only makes sense for blk-mq; this patch just makes
that explicit.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8cf1f7cbef2b..4c532d8612e1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
-		blk_mq_run_hw_queues(q, false);
+		if (q->mq_ops)
+			blk_mq_run_hw_queues(q, false);
 	}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
-- 
2.9.5


* [PATCH V3 4/8] blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait
From: Ming Lei @ 2017-09-02 13:08 UTC
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

The only change on the legacy path is that blk_drain_queue() is now run
from blk_freeze_queue(), which is called in blk_cleanup_queue().

So this patch removes the explicit __blk_drain_queue() call in
blk_cleanup_queue().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c         | 17 +++++++++++++++--
 block/blk-mq.c           |  8 +++++---
 block/blk.h              |  1 +
 drivers/nvme/host/core.c |  2 +-
 include/linux/blk-mq.h   |  2 +-
 5 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d579501f24ba..ce2d3b6f6c62 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -530,6 +530,21 @@ static void __blk_drain_queue(struct request_queue *q, bool drain_all)
 }
 
 /**
+ * blk_drain_queue - drain requests from request_queue
+ * @q: queue to drain
+ *
+ * Drain requests from @q.  All pending requests are drained.
+ * The caller is responsible for ensuring that no new requests
+ * which need to be drained are queued.
+ */
+void blk_drain_queue(struct request_queue *q)
+{
+	spin_lock_irq(q->queue_lock);
+	__blk_drain_queue(q, true);
+	spin_unlock_irq(q->queue_lock);
+}
+
+/**
  * blk_queue_bypass_start - enter queue bypass mode
  * @q: queue of interest
  *
@@ -653,8 +668,6 @@ void blk_cleanup_queue(struct request_queue *q)
 	 */
 	blk_freeze_queue(q);
 	spin_lock_irq(lock);
-	if (!q->mq_ops)
-		__blk_drain_queue(q, true);
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c532d8612e1..24de78afbe9a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -131,11 +131,13 @@ void blk_freeze_queue_start(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
-void blk_mq_freeze_queue_wait(struct request_queue *q)
+void blk_freeze_queue_wait(struct request_queue *q)
 {
+	if (!q->mq_ops)
+		blk_drain_queue(q);
 	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
-EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
+EXPORT_SYMBOL_GPL(blk_freeze_queue_wait);
 
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout)
@@ -160,7 +162,7 @@ void blk_freeze_queue(struct request_queue *q)
 	 * exported to drivers as the only user for unfreeze is blk_mq.
 	 */
 	blk_freeze_queue_start(q);
-	blk_mq_freeze_queue_wait(q);
+	blk_freeze_queue_wait(q);
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
diff --git a/block/blk.h b/block/blk.h
index 6847c5435cca..242486e26a81 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -64,6 +64,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			struct bio *bio);
 void blk_queue_bypass_start(struct request_queue *q);
 void blk_queue_bypass_end(struct request_queue *q);
+void blk_drain_queue(struct request_queue *q);
 void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 986f2b4f9760..d34a9ffaa940 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2778,7 +2778,7 @@ void nvme_wait_freeze(struct nvme_ctrl *ctrl)
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_freeze_queue_wait(ns->queue);
+		blk_freeze_queue_wait(ns->queue);
 	mutex_unlock(&ctrl->namespaces_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8ae77e088c01..f90d78eb85df 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -259,7 +259,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 void blk_freeze_queue(struct request_queue *q);
 void blk_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
-void blk_mq_freeze_queue_wait(struct request_queue *q);
+void blk_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
-- 
2.9.5


* [PATCH V3 5/8] block: tracking request allocation with q_usage_counter
From: Ming Lei @ 2017-09-02 13:08 UTC
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

This usage is basically the same as blk-mq's, so that we can
support freezing the queue easily.
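
As a summary of the hunks below, the reference pairing on the legacy
path becomes:

	/*
	 *   blk_old_get_request() -> blk_queue_enter()      (get, may fail)
	 *   blk_queue_bio()       -> blk_queue_enter_live() (get)
	 *   __blk_put_request()   -> blk_queue_exit()       (put)
	 *
	 * so percpu_ref_is_zero(&q->q_usage_counter) now implies that no
	 * legacy request is allocated, matching the blk-mq semantics.
	 */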

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index ce2d3b6f6c62..85b15833a7a5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1405,16 +1405,21 @@ static struct request *blk_old_get_request(struct request_queue *q,
 					   unsigned int op, gfp_t gfp_mask)
 {
 	struct request *rq;
+	int ret = 0;
 
 	WARN_ON_ONCE(q->mq_ops);
 
 	/* create ioc upfront */
 	create_io_context(gfp_mask, q->node);
 
+	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+	if (ret)
+		return ERR_PTR(ret);
 	spin_lock_irq(q->queue_lock);
 	rq = get_request(q, op, NULL, gfp_mask);
 	if (IS_ERR(rq)) {
 		spin_unlock_irq(q->queue_lock);
+		blk_queue_exit(q);
 		return rq;
 	}
 
@@ -1586,6 +1591,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 		blk_free_request(rl, req);
 		freed_request(rl, sync, rq_flags);
 		blk_put_rl(rl);
+		blk_queue_exit(q);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
@@ -1867,8 +1873,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */
+	blk_queue_enter_live(q);
 	req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
 	if (IS_ERR(req)) {
+		blk_queue_exit(q);
 		__wbt_done(q->rq_wb, wb_acct);
 		if (PTR_ERR(req) == -ENOMEM)
 			bio->bi_status = BLK_STS_RESOURCE;
-- 
2.9.5


* [PATCH V3 6/8] block: introduce preempt version of blk_[freeze|unfreeze]_queue
From: Ming Lei @ 2017-09-02 13:08 UTC
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

These two APIs are required to allow allocation of RQF_PREEMPT
requests while the queue is preempt-frozen.

The following two points have to be guaranteed for one queue:

1) preempt freezing can be started only after all in-progress
normal & preempt freezings are completed

2) normal freezing can be started only if in-progress preempt
freezing is completed

This is because, for normal freezing, once blk_freeze_queue_wait()
returns, we have to make sure no request is entering the queue
any more.

An rwsem would have been perfect for this kind of sync, but we need
to support nested normal freezes, so a spin_lock together with the
normal_freezing & preempt_freezing flags is used for the sync between
normal freeze and preempt freeze.
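
Ignoring the dying-queue special case, the start-side gate reduces to the
following predicate (a restatement of queue_freeze_is_over() from the hunk
below; the flags are read under q->freeze_lock):

	/* may a freeze of the given kind start right now? */
	static bool can_start_freeze(struct request_queue *q, bool preempt)
	{
		if (preempt)	/* wait out every in-flight freeze */
			return !q->normal_freezing && !q->preempt_freezing;
		/* a normal freeze is blocked only by a preempt freeze */
		return !q->preempt_freezing;
	}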

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       |   2 +
 block/blk-mq.c         | 120 +++++++++++++++++++++++++++++++++++++++++++++++--
 block/blk.h            |  16 +++++++
 include/linux/blk-mq.h |   2 +
 include/linux/blkdev.h |   4 ++
 5 files changed, 141 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 85b15833a7a5..2549b0a0535d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -899,6 +899,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (blkcg_init_queue(q))
 		goto fail_ref;
 
+	spin_lock_init(&q->freeze_lock);
+
 	return q;
 
 fail_ref:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 24de78afbe9a..54b8d8b9f40e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -118,16 +118,75 @@ void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
 }
 
-void blk_freeze_queue_start(struct request_queue *q)
+static bool queue_freeze_is_over(struct request_queue *q, bool preempt)
+{
+	/*
+	 * For preempt freeze, we simply call blk_queue_enter_live()
+	 * before allocating one request of RQF_PREEMPT, so we have
+	 * to check if queue is dead, otherwise we may hang on dead
+	 * queue.
+	 *
+	 * For normal freeze, no need to check blk_queue_dying()
+	 * because it is checked in blk_queue_enter().
+	 */
+	if (preempt)
+		return !(q->normal_freezing + q->preempt_freezing) ||
+			blk_queue_dying(q);
+	return !q->preempt_freezing;
+}
+
+static bool __blk_freeze_queue_start(struct request_queue *q, bool preempt)
 {
 	int freeze_depth;
+	bool start_freeze = true;
+
+	/*
+	 * Wait for completion of another kind of freezing.
+	 *
+	 * We have to sync between normal freeze and preempt
+	 * freeze. Preempt freeze can be started only after all
+	 * pending normal & preempt freezings are completed;
+	 * meanwhile, normal freeze can be started only if there
+	 * is no pending preempt freezing.
+	 *
+	 * rwsem should have been perfect for this kind of sync,
+	 * but we need to support nested normal freeze, so use
+	 * spin_lock with two flags for syncing between normal
+	 * freeze and preempt freeze.
+	 */
+	spin_lock(&q->freeze_lock);
+	wait_event_cmd(q->mq_freeze_wq,
+		       queue_freeze_is_over(q, preempt),
+		       spin_unlock(&q->freeze_lock),
+		       spin_lock(&q->freeze_lock));
+
+	if (preempt && blk_queue_dying(q)) {
+		start_freeze = false;
+		goto unlock;
+	}
 
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
+		if (preempt) {
+			q->preempt_freezing = 1;
+			q->preempt_unfreezing = 0;
+		} else
+			q->normal_freezing = 1;
+		spin_unlock(&q->freeze_lock);
+
 		percpu_ref_kill(&q->q_usage_counter);
 		if (q->mq_ops)
 			blk_mq_run_hw_queues(q, false);
-	}
+	} else
+ unlock:
+		spin_unlock(&q->freeze_lock);
+
+	return start_freeze;
+}
+
+void blk_freeze_queue_start(struct request_queue *q)
+{
+	__blk_freeze_queue_start(q, false);
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
@@ -166,7 +225,7 @@ void blk_freeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
-void blk_unfreeze_queue(struct request_queue *q)
+static void __blk_unfreeze_queue(struct request_queue *q, bool preempt)
 {
 	int freeze_depth;
 
@@ -174,12 +233,67 @@ void blk_unfreeze_queue(struct request_queue *q)
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
 		percpu_ref_reinit(&q->q_usage_counter);
+
+		/*
+		 * clearing the freeze flag so that any pending
+		 * freeze can move on
+		 */
+		spin_lock(&q->freeze_lock);
+		if (preempt)
+			q->preempt_freezing = 0;
+		else
+			q->normal_freezing = 0;
+		spin_unlock(&q->freeze_lock);
 		wake_up_all(&q->mq_freeze_wq);
 	}
 }
+
+void blk_unfreeze_queue(struct request_queue *q)
+{
+	__blk_unfreeze_queue(q, false);
+}
 EXPORT_SYMBOL_GPL(blk_unfreeze_queue);
 
 /*
+ * Once this function returns, only requests with RQF_PREEMPT
+ * can be allocated.
+ */
+void blk_freeze_queue_preempt(struct request_queue *q)
+{
+	/*
+	 * If queue isn't in preempt_frozen, the queue has
+	 * to be dying, so do nothing since no I/O can
+	 * succeed any more.
+	 */
+	if (__blk_freeze_queue_start(q, true))
+		blk_freeze_queue_wait(q);
+}
+EXPORT_SYMBOL_GPL(blk_freeze_queue_preempt);
+
+/*
+ * It is the caller's responsibility to make sure no new
+ * request is allocated before calling this function.
+ */
+void blk_unfreeze_queue_preempt(struct request_queue *q)
+{
+	/*
+	 * If queue isn't in preempt_frozen, the queue should
+	 * be dying, so do nothing since no I/O can succeed.
+	 */
+	if (blk_queue_is_preempt_frozen(q)) {
+
+		/* no new request can be coming after unfreezing */
+		spin_lock(&q->freeze_lock);
+		q->preempt_unfreezing = 1;
+		spin_unlock(&q->freeze_lock);
+
+		blk_freeze_queue_wait(q);
+		__blk_unfreeze_queue(q, true);
+	}
+}
+EXPORT_SYMBOL_GPL(blk_unfreeze_queue_preempt);
+
+/*
  * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
  * mpt3sas driver such that this function can be removed.
  */
diff --git a/block/blk.h b/block/blk.h
index 242486e26a81..28e9be6a14c6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -80,6 +80,22 @@ static inline void blk_queue_enter_live(struct request_queue *q)
 	percpu_ref_get(&q->q_usage_counter);
 }
 
+static inline bool blk_queue_is_preempt_frozen(struct request_queue *q)
+{
+	bool preempt_frozen;
+	bool preempt_unfreezing;
+
+	if (!percpu_ref_is_dying(&q->q_usage_counter))
+		return false;
+
+	spin_lock(&q->freeze_lock);
+	preempt_frozen = q->preempt_freezing;
+	preempt_unfreezing = q->preempt_unfreezing;
+	spin_unlock(&q->freeze_lock);
+
+	return preempt_frozen && !preempt_unfreezing;
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
 bool __bio_integrity_endio(struct bio *);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f90d78eb85df..5ae8c82d6273 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -258,6 +258,8 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
 void blk_freeze_queue(struct request_queue *q);
 void blk_unfreeze_queue(struct request_queue *q);
+void blk_freeze_queue_preempt(struct request_queue *q);
+void blk_unfreeze_queue_preempt(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
 void blk_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f45f157b2910..5618d174100a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -565,6 +565,10 @@ struct request_queue {
 
 	int			bypass_depth;
 	atomic_t		mq_freeze_depth;
+	spinlock_t		freeze_lock;
+	unsigned		normal_freezing:1;
+	unsigned		preempt_freezing:1;
+	unsigned		preempt_unfreezing:1;
 
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
-- 
2.9.5


* [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
From: Ming Lei @ 2017-09-02 13:08 UTC
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

RQF_PREEMPT is a bit special because such a request is required
to be dispatched to the LLD even when the SCSI device is quiesced.

So this patch introduces __blk_get_request() to allow the block
layer to allocate a request while the queue is preempt-frozen, since
the following patch will preempt-freeze the queue before quiescing
the SCSI device, to support safe SCSI quiescing.
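
With this in place, a caller that must reach a quiesced device allocates
its request roughly as follows (a sketch; the real user is scsi_execute()
in the next patch, which also sets RQF_PREEMPT on the returned request):

	req = __blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN,
				__GFP_RECLAIM, BLK_REQ_PREEMPT);
	if (IS_ERR(req))
		return PTR_ERR(req);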

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       | 28 ++++++++++++++++++++--------
 block/blk-mq.c         | 14 ++++++++++++--
 include/linux/blk-mq.h |  7 ++++---
 include/linux/blkdev.h | 17 +++++++++++++++--
 4 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2549b0a0535d..f7a6fbb87dea 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1404,7 +1404,8 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 }
 
 static struct request *blk_old_get_request(struct request_queue *q,
-					   unsigned int op, gfp_t gfp_mask)
+					   unsigned int op, gfp_t gfp_mask,
+					   unsigned int flags)
 {
 	struct request *rq;
 	int ret = 0;
@@ -1414,9 +1415,20 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	/* create ioc upfront */
 	create_io_context(gfp_mask, q->node);
 
-	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+	/*
+	 * We need to allocate a req with RQF_PREEMPT during preempt freezing.
+	 * No normal freezing can be started when preempt freezing
+	 * is in-progress, and queue dying is checked before starting
+	 * preempt freezing, so it is safe to use blk_queue_enter_live()
+	 * in case of preempt freezing.
+	 */
+	if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_preempt_frozen(q))
+		blk_queue_enter_live(q);
+	else
+		ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
 	if (ret)
 		return ERR_PTR(ret);
+
 	spin_lock_irq(q->queue_lock);
 	rq = get_request(q, op, NULL, gfp_mask);
 	if (IS_ERR(rq)) {
@@ -1432,26 +1444,26 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	return rq;
 }
 
-struct request *blk_get_request(struct request_queue *q, unsigned int op,
-				gfp_t gfp_mask)
+struct request *__blk_get_request(struct request_queue *q, unsigned int op,
+				  gfp_t gfp_mask, unsigned int flags)
 {
 	struct request *req;
 
 	if (q->mq_ops) {
 		req = blk_mq_alloc_request(q, op,
-			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
-				0 : BLK_MQ_REQ_NOWAIT);
+			flags | ((gfp_mask & __GFP_DIRECT_RECLAIM) ?
+				0 : BLK_MQ_REQ_NOWAIT));
 		if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
 			q->mq_ops->initialize_rq_fn(req);
 	} else {
-		req = blk_old_get_request(q, op, gfp_mask);
+		req = blk_old_get_request(q, op, gfp_mask, flags);
 		if (!IS_ERR(req) && q->initialize_rq_fn)
 			q->initialize_rq_fn(req);
 	}
 
 	return req;
 }
-EXPORT_SYMBOL(blk_get_request);
+EXPORT_SYMBOL(__blk_get_request);
 
 /**
  * blk_requeue_request - put a request back on queue
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 54b8d8b9f40e..e81001d1da27 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -496,9 +496,19 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 {
 	struct blk_mq_alloc_data alloc_data = { .flags = flags };
 	struct request *rq;
-	int ret;
+	int ret = 0;
 
-	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
+	/*
+	 * We need to allocate a req with RQF_PREEMPT during preempt freezing.
+	 * No normal freezing can be started when preempt freezing
+	 * is in-progress, and queue dying is checked before starting
+	 * preempt freezing, so it is safe to use blk_queue_enter_live()
+	 * in case of preempt freezing.
+	 */
+	if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_preempt_frozen(q))
+		blk_queue_enter_live(q);
+	else
+		ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
 	if (ret)
 		return ERR_PTR(ret);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 5ae8c82d6273..596f433eb54c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -200,9 +200,10 @@ void blk_mq_free_request(struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
 
 enum {
-	BLK_MQ_REQ_NOWAIT	= (1 << 0), /* return when out of requests */
-	BLK_MQ_REQ_RESERVED	= (1 << 1), /* allocate from reserved pool */
-	BLK_MQ_REQ_INTERNAL	= (1 << 2), /* allocate internal/sched tag */
+	BLK_MQ_REQ_PREEMPT	= BLK_REQ_PREEMPT, /* allocate for RQF_PREEMPT */
+	BLK_MQ_REQ_NOWAIT	= (1 << 8), /* return when out of requests */
+	BLK_MQ_REQ_RESERVED	= (1 << 9), /* allocate from reserved pool */
+	BLK_MQ_REQ_INTERNAL	= (1 << 10), /* allocate internal/sched tag */
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 5618d174100a..ff371c42eb3f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -862,6 +862,11 @@ enum {
 	BLKPREP_INVALID,	/* invalid command, kill, return -EREMOTEIO */
 };
 
+/* passed to __blk_get_request */
+enum {
+	BLK_REQ_PREEMPT	= (1 << 0), /* allocate for RQF_PREEMPT */
+};
+
 extern unsigned long blk_max_low_pfn, blk_max_pfn;
 
 /*
@@ -944,8 +949,9 @@ extern void blk_rq_init(struct request_queue *q, struct request *rq);
 extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
 extern void blk_put_request(struct request *);
 extern void __blk_put_request(struct request_queue *, struct request *);
-extern struct request *blk_get_request(struct request_queue *, unsigned int op,
-				       gfp_t gfp_mask);
+extern struct request *__blk_get_request(struct request_queue *,
+					 unsigned int op, gfp_t gfp_mask,
+					 unsigned int flags);
 extern void blk_requeue_request(struct request_queue *, struct request *);
 extern int blk_lld_busy(struct request_queue *q);
 extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
@@ -996,6 +1002,13 @@ blk_status_t errno_to_blk_status(int errno);
 
 bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
 
+static inline struct request *blk_get_request(struct request_queue *q,
+					      unsigned int op,
+					      gfp_t gfp_mask)
+{
+	return __blk_get_request(q, op, gfp_mask, 0);
+}
+
 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
 {
 	return bdev->bd_disk->queue;	/* this is never NULL */
-- 
2.9.5


* [PATCH V3 8/8] SCSI: preempt freeze block queue when SCSI device is put into quiesce
From: Ming Lei @ 2017-09-02 13:08 UTC
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo, Ming Lei

Simply quiescing the SCSI device and waiting for completion of the I/O
dispatched to the SCSI queue isn't safe: it is easy to use up
requests, because none of the allocated requests can be dispatched
while the device is put in QUIESCE. Then no request can be allocated
for RQF_PREEMPT, and the system may hang somewhere, such as
when sending sync_cache or start_stop commands during the
system suspend path.

Before quiescing SCSI, this patch freezes the block queue in preempt
mode first, so no new normal request can enter the queue any more,
and all pending requests are drained once blk_freeze_queue_preempt()
returns. Only RQF_PREEMPT requests can be allocated during preempt
freeze.

This patch also uses __blk_get_request() for allocating the
request with RQF_PREEMPT, so that the allocation can
succeed even though the block queue is preempt-frozen.
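
The resulting call flow, as a sketch (function bodies per the hunk below):

	scsi_device_quiesce(sdev)
	    blk_freeze_queue_preempt(q)		/* drain + gate normal I/O */
	    scsi_device_set_state(sdev, SDEV_QUIESCE)

	scsi_execute(...)			/* RQF_PREEMPT, still works */

	scsi_device_resume(sdev)
	    scsi_device_set_state(sdev, SDEV_RUNNING)
	    blk_unfreeze_queue_preempt(q)	/* reopen normal allocation */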

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f6097b89d5d3..e1ad135cb209 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -243,10 +243,12 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	struct request *req;
 	struct scsi_request *rq;
 	int ret = DRIVER_ERROR << 24;
+	unsigned flag = sdev->sdev_state == SDEV_QUIESCE ? BLK_REQ_PREEMPT : 0;
 
-	req = blk_get_request(sdev->request_queue,
+	req = __blk_get_request(sdev->request_queue,
 			data_direction == DMA_TO_DEVICE ?
-			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM,
+			flag);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
@@ -2890,6 +2892,20 @@ scsi_device_quiesce(struct scsi_device *sdev)
 {
 	int err;
 
+	/*
+	 * Simply quiesing SCSI device isn't safe, it is easy
+	 * Simply quiescing the SCSI device isn't safe: it is easy
+	 * to use up requests, because none of the allocated requests
+	 * can be dispatched while the device is put in QUIESCE.
+	 * Then no request can be allocated and we may hang
+	 * somewhere, such as system suspend/resume.
+	 *
+	 * So we freeze the block queue in preempt mode first: no new
+	 * normal request can enter the queue any more, and all pending
+	 * requests are drained once blk_freeze_queue_preempt() returns.
+	 * Only RQF_PREEMPT is allowed in preempt freeze.
+	blk_freeze_queue_preempt(sdev->request_queue);
+
 	mutex_lock(&sdev->state_mutex);
 	err = scsi_device_set_state(sdev, SDEV_QUIESCE);
 	mutex_unlock(&sdev->state_mutex);
@@ -2926,6 +2942,8 @@ void scsi_device_resume(struct scsi_device *sdev)
 	    scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
 		scsi_run_queue(sdev->request_queue);
 	mutex_unlock(&sdev->state_mutex);
+
+	blk_unfreeze_queue_preempt(sdev->request_queue);
 }
 EXPORT_SYMBOL(scsi_device_resume);
 
-- 
2.9.5


* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
From: Ming Lei @ 2017-09-02 13:12 UTC
  To: Jens Axboe, linux-block, Christoph Hellwig, Bart Van Assche,
	linux-scsi, Martin K . Petersen, James E . J . Bottomley
  Cc: Oleksandr Natalenko, Johannes Thumshirn, Tejun Heo

On Sat, Sep 02, 2017 at 09:08:39PM +0800, Ming Lei wrote:
> RQF_PREEMPT is a bit special because such a request is required
> to be dispatched to the LLD even when the SCSI device is quiesced.
> 
> So this patch introduces __blk_get_request() to allow the block
> layer to allocate a request while the queue is preempt-frozen, since
> the following patch will preempt-freeze the queue before quiescing
> the SCSI device, to support safe SCSI quiescing.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-core.c       | 28 ++++++++++++++++++++--------
>  block/blk-mq.c         | 14 ++++++++++++--
>  include/linux/blk-mq.h |  7 ++++---
>  include/linux/blkdev.h | 17 +++++++++++++++--
>  4 files changed, 51 insertions(+), 15 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2549b0a0535d..f7a6fbb87dea 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1404,7 +1404,8 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
>  }
>  
>  static struct request *blk_old_get_request(struct request_queue *q,
> -					   unsigned int op, gfp_t gfp_mask)
> +					   unsigned int op, gfp_t gfp_mask,
> +					   unsigned int flags)
>  {
>  	struct request *rq;
>  	int ret = 0;
> @@ -1414,9 +1415,20 @@ static struct request *blk_old_get_request(struct request_queue *q,
>  	/* create ioc upfront */
>  	create_io_context(gfp_mask, q->node);
>  
> -	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
> +	/*
> +	 * We need to allocate a req with RQF_PREEMPT during preempt freezing.
> +	 * No normal freezing can be started when preempt freezing
> +	 * is in-progress, and queue dying is checked before starting
> +	 * preempt freezing, so it is safe to use blk_queue_enter_live()
> +	 * in case of preempt freezing.
> +	 */
> +	if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_preempt_frozen(q))
> +		blk_queue_enter_live(q);
> +	else
> +		ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
>  	if (ret)
>  		return ERR_PTR(ret);
> +
>  	spin_lock_irq(q->queue_lock);
>  	rq = get_request(q, op, NULL, gfp_mask);
>  	if (IS_ERR(rq)) {
> @@ -1432,26 +1444,26 @@ static struct request *blk_old_get_request(struct request_queue *q,
>  	return rq;
>  }
>  
> -struct request *blk_get_request(struct request_queue *q, unsigned int op,
> -				gfp_t gfp_mask)
> +struct request *__blk_get_request(struct request_queue *q, unsigned int op,
> +				  gfp_t gfp_mask, unsigned int flags)
>  {
>  	struct request *req;
>  
>  	if (q->mq_ops) {
>  		req = blk_mq_alloc_request(q, op,
> -			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
> -				0 : BLK_MQ_REQ_NOWAIT);
> +			flags | ((gfp_mask & __GFP_DIRECT_RECLAIM) ?
> +				0 : BLK_MQ_REQ_NOWAIT));
>  		if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
>  			q->mq_ops->initialize_rq_fn(req);
>  	} else {
> -		req = blk_old_get_request(q, op, gfp_mask);
> +		req = blk_old_get_request(q, op, gfp_mask, flags);
>  		if (!IS_ERR(req) && q->initialize_rq_fn)
>  			q->initialize_rq_fn(req);
>  	}
>  
>  	return req;
>  }
> -EXPORT_SYMBOL(blk_get_request);
> +EXPORT_SYMBOL(__blk_get_request);
>  
>  /**
>   * blk_requeue_request - put a request back on queue
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 54b8d8b9f40e..e81001d1da27 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -496,9 +496,19 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
>  {
>  	struct blk_mq_alloc_data alloc_data = { .flags = flags };
>  	struct request *rq;
> -	int ret;
> +	int ret = 0;
>  
> -	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
> +	/*
> +	 * We need to allocate a req with RQF_PREEMPT during preempt freezing.
> +	 * No normal freezing can be started when preempt freezing
> +	 * is in-progress, and queue dying is checked before starting
> +	 * preempt freezing, so it is safe to use blk_queue_enter_live()
> +	 * in case of preempt freezing.
> +	 */
> +	if ((flags & BLK_MQ_REQ_PREEMPT) && blk_queue_is_preempt_frozen(q))
> +		blk_queue_enter_live(q);
> +	else
> +		ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
>  	if (ret)
>  		return ERR_PTR(ret);
>  
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 5ae8c82d6273..596f433eb54c 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -200,9 +200,10 @@ void blk_mq_free_request(struct request *rq);
>  bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
>  
>  enum {
> -	BLK_MQ_REQ_NOWAIT	= (1 << 0), /* return when out of requests */
> -	BLK_MQ_REQ_RESERVED	= (1 << 1), /* allocate from reserved pool */
> -	BLK_MQ_REQ_INTERNAL	= (1 << 2), /* allocate internal/sched tag */
> +	BLK_MQ_REQ_PREEMPT	= BLK_REQ_PREEMPT, /* allocate for RQF_PREEMPT */
> +	BLK_MQ_REQ_NOWAIT	= (1 << 8), /* return when out of requests */
> +	BLK_MQ_REQ_RESERVED	= (1 << 9), /* allocate from reserved pool */
> +	BLK_MQ_REQ_INTERNAL	= (1 << 10), /* allocate internal/sched tag */
>  };
>  
>  struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 5618d174100a..ff371c42eb3f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -862,6 +862,11 @@ enum {
>  	BLKPREP_INVALID,	/* invalid command, kill, return -EREMOTEIO */
>  };
>  
> +/* passed to __blk_get_request */
> +enum {
> +	BLK_REQ_PREEMPT	= (1 << 0), /* allocate for RQF_PREEMPT */
> +};
> +
>  extern unsigned long blk_max_low_pfn, blk_max_pfn;
>  
>  /*
> @@ -944,8 +949,9 @@ extern void blk_rq_init(struct request_queue *q, struct request *rq);
>  extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
>  extern void blk_put_request(struct request *);
>  extern void __blk_put_request(struct request_queue *, struct request *);
> -extern struct request *blk_get_request(struct request_queue *, unsigned int op,
> -				       gfp_t gfp_mask);
> +extern struct request *__blk_get_request(struct request_queue *,
> +					 unsigned int op, gfp_t gfp_mask,
> +					 unsigned int flags);
>  extern void blk_requeue_request(struct request_queue *, struct request *);
>  extern int blk_lld_busy(struct request_queue *q);
>  extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
> @@ -996,6 +1002,13 @@ blk_status_t errno_to_blk_status(int errno);
>  
>  bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
>  
> +static inline struct request *blk_get_request(struct request_queue *q,
> +					      unsigned int op,
> +					      gfp_t gfp_mask)
> +{
> +	return __blk_get_request(q, op, gfp_mask, 0);
> +}
> +
>  static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
>  {
>  	return bdev->bd_disk->queue;	/* this is never NULL */
> -- 
> 2.9.5
> 

Hi Bart,

Please let us know if V3 addresses your previous concern about calling
blk_queue_enter_live() during preempt freezing.


Thanks,
Ming


* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
@ 2017-09-04  4:13       ` Bart Van Assche
  0 siblings, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2017-09-04  4:13 UTC (permalink / raw)
  To: linux-block, hch, Bart Van Assche, martin.petersen, ming.lei,
	linux-scsi, axboe, jejb
  Cc: tj, jthumshirn, oleksandr

On Sat, 2017-09-02 at 21:12 +0800, Ming Lei wrote:
> Please let us know if V3 addresses your previous concern about calling
> blk_queue_enter_live() during preempt freezing.

Do you understand how request queue cleanup works? The algorithm used for
request queue cleanup is as follows:
* Set the DYING flag. This flag makes all later blk_get_request() calls
  fail.
* Wait until all pending requests have finished.
* Set the DEAD flag. For the traditional block layer, this flag causes
  blk_run_queue() not to call .request_fn() anymore. For blk-mq it is
  guaranteed in another way that .queue_rq() won't be called anymore after
  this flag has been set.
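
In simplified code, that sequence is roughly the following; this is an
illustrative sketch, not the actual blk_cleanup_queue() implementation,
and wait_for_pending_requests() is a placeholder name:

	void blk_cleanup_queue_sketch(struct request_queue *q)
	{
		/* 1. DYING: every later blk_get_request() call fails. */
		blk_set_queue_dying(q);

		/* 2. Drain: wait until all pending requests have finished. */
		wait_for_pending_requests(q);

		/* 3. DEAD: .request_fn()/.queue_rq() are never invoked
		 *    again (the real code sets this under queue_lock). */
		queue_flag_set(QUEUE_FLAG_DEAD, q);
	}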

Allowing blk_get_request() to succeed after the DYING flag has been set is
completely wrong because that could result in a request being queued after
the DEAD flag has been set, resulting in either a hanging request or a kernel
crash. This is why it's completely wrong to add a blk_queue_enter_live() call
in blk_old_get_request() or blk_mq_alloc_request(). Hence my NAK for any
patch that adds a blk_queue_enter_live() call to any function called from
blk_get_request(). That includes the patch at the start of this e-mail thread.

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-04  4:13       ` Bart Van Assche
  (?)
@ 2017-09-04  7:16       ` Ming Lei
  2017-09-04 15:40           ` Bart Van Assche
  -1 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2017-09-04  7:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, hch, martin.petersen, linux-scsi, axboe, jejb, tj,
	jthumshirn, oleksandr

On Mon, Sep 04, 2017 at 04:13:26AM +0000, Bart Van Assche wrote:
> On Sat, 2017-09-02 at 21:12 +0800, Ming Lei wrote:
> > Please let us know if V3 addresses your previous concern about calling
> > blk_queue_enter_live() during preempt freezing.
> 
> Do you understand how request queue cleanup works? The algorithm used for
> request queue cleanup is as follows:
> * Set the DYING flag. This flag makes all later blk_get_request() calls
>   fail.

If you look at __blk_freeze_queue_start(), you will see that a preempt
freeze fails once the queue is dying, and blk_queue_enter_live() won't
be called at all on a dying queue, right?
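
A minimal sketch of that check (illustrative; the field names follow
patch 6, the helper body is condensed):

	static bool blk_freeze_queue_preempt_sketch(struct request_queue *q)
	{
		bool alive;

		spin_lock(&q->freeze_lock);
		alive = !blk_queue_dying(q);
		if (alive)
			q->preempt_freezing = 1;	/* only set on a live queue */
		spin_unlock(&q->freeze_lock);

		return alive;	/* false: blk_queue_enter_live() is never reached */
	}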

> * Wait until all pending requests have finished.
> * Set the DEAD flag. For the traditional block layer, this flag causes
>   blk_run_queue() not to call .request_fn() anymore. For blk-mq it is
>   guaranteed in another way that .queue_rq() won't be called anymore after
>   this flag has been set.
> 
> Allowing blk_get_request() to succeed after the DYING flag has been set is
> completely wrong because that could result in a request being queued after

See above: this patch changes nothing about that fact. Please look at
the patch carefully next time before posting such a long comment.

> the DEAD flag has been set, resulting in either a hanging request or a kernel
> crash. This is why it's completely wrong to add a blk_queue_enter_live() call
> in blk_old_get_request() or blk_mq_alloc_request(). Hence my NAK for any
> patch that adds a blk_queue_enter_live() call to any function called from
> blk_get_request(). That includes the patch at the start of this e-mail thread.
> 
> Bart.

-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V3 6/8] block: introduce preempt version of blk_[freeze|unfreeze]_queue
@ 2017-09-04 15:21     ` Bart Van Assche
  0 siblings, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2017-09-04 15:21 UTC (permalink / raw)
  To: linux-block, hch, Bart Van Assche, martin.petersen, ming.lei,
	linux-scsi, axboe, jejb
  Cc: tj, jthumshirn, oleksandr

On Sat, 2017-09-02 at 21:08 +0800, Ming Lei wrote:
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -565,6 +565,10 @@ struct request_queue {
>  
>  	int			bypass_depth;
>  	atomic_t		mq_freeze_depth;
> +	spinlock_t		freeze_lock;
> +	unsigned		normal_freezing:1;
> +	unsigned		preempt_freezing:1;
> +	unsigned		preempt_unfreezing:1;
>  
>  #if defined(CONFIG_BLK_DEV_BSG)
>  	bsg_job_fn		*bsg_job_fn;

Request queues already have too many states, and you want to make them
even more complicated by introducing several new state variables? Yikes!

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
@ 2017-09-04 15:40           ` Bart Van Assche
  0 siblings, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2017-09-04 15:40 UTC (permalink / raw)
  To: ming.lei
  Cc: linux-block, hch, jthumshirn, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, tj

On Mon, 2017-09-04 at 15:16 +0800, Ming Lei wrote:
> On Mon, Sep 04, 2017 at 04:13:26AM +0000, Bart Van Assche wrote:
> > Allowing blk_get_request() to succeed after the DYING flag has been set is
> > completely wrong because that could result in a request being queued after
> > the DEAD flag has been set, resulting in either a hanging request or a kernel
> > crash. This is why it's completely wrong to add a blk_queue_enter_live() call
> > in blk_old_get_request() or blk_mq_alloc_request(). Hence my NAK for any
> > patch that adds a blk_queue_enter_live() call to any function called from
> > blk_get_request(). That includes the patch at the start of this e-mail thread.
>
> See above: this patch changes nothing about that fact. Please look at
> the patch carefully next time before posting such a long comment.

Are you really sure that your patch does not allow blk_get_request() to
succeed after the DYING flag has been set? blk_mq_alloc_request() calls both
blk_queue_is_preempt_frozen() and blk_queue_enter_live() without holding
any lock. A thread that is running concurrently with blk_mq_get_request()
can unfreeze the queue after blk_queue_is_preempt_frozen() returned and
before blk_queue_enter_live() is called. This means that with your patch
series applied blk_get_request() can succeed after the DYING flag has been
set, which is something we don't want. Additionally, I don't think we want
to introduce any kind of locking in blk_mq_get_request() because that would
be a serialization point.
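
Spelled out as an interleaving (illustrative):

	CPU0: blk_mq_get_request()              CPU1
	--------------------------------------  ----------------------------
	blk_queue_is_preempt_frozen() == true
	                                        blk_unfreeze_queue_preempt()
	                                        queue is marked DYING
	blk_queue_enter_live()                  <- succeeds on a DYING queue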

Have you considered to use the blk-mq "reserved request" mechanism to avoid
starvation of power management requests instead of making the block layer
even more complicated than it already is?
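
For reference, the mechanism meant here is blk-mq's reserved-tag support;
a rough sketch against the v4.13-era API (op and flag spellings are from
memory and worth double-checking against the exact tree):

	/* At tag-set initialization, the driver sets tags aside: */
	set->reserved_tags = 1;		/* in struct blk_mq_tag_set */

	/* A reserved tag is then requested explicitly: */
	rq = blk_mq_alloc_request(q, REQ_OP_READ, BLK_MQ_REQ_RESERVED);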

Note: extending blk_mq_freeze/unfreeze_queue() to the legacy block layer
could be useful to make scsi_wait_for_queuecommand() more elegant. However,
I don't think we should spend our time on legacy block layer / SCSI core
changes. The code I'm referring to is the following:

/**
 * scsi_wait_for_queuecommand() - wait for ongoing queuecommand() calls
 * @sdev: SCSI device pointer.
 *
 * Wait until the ongoing shost->hostt->queuecommand() calls that are
 * invoked from scsi_request_fn() have finished.
 */
static void scsi_wait_for_queuecommand(struct scsi_device *sdev)
{
	WARN_ON_ONCE(sdev->host->use_blk_mq);

	while (scsi_request_fn_active(sdev))
		msleep(20);
}

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-04 15:40           ` Bart Van Assche
  (?)
@ 2017-09-04 16:08           ` Ming Lei
  2017-09-04 16:18               ` Bart Van Assche
  2017-09-05  1:40               ` Bart Van Assche
  -1 siblings, 2 replies; 31+ messages in thread
From: Ming Lei @ 2017-09-04 16:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, hch, jthumshirn, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, tj

On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-04 at 15:16 +0800, Ming Lei wrote:
> > On Mon, Sep 04, 2017 at 04:13:26AM +0000, Bart Van Assche wrote:
> > > Allowing blk_get_request() to succeed after the DYING flag has been set is
> > > completely wrong because that could result in a request being queued after
> > > the DEAD flag has been set, resulting in either a hanging request or a kernel
> > > crash. This is why it's completely wrong to add a blk_queue_enter_live() call
> > > in blk_old_get_request() or blk_mq_alloc_request(). Hence my NAK for any
> > > patch that adds a blk_queue_enter_live() call to any function called from
> > > blk_get_request(). That includes the patch at the start of this e-mail thread.
> >
> > See above: this patch changes nothing about that fact. Please look at
> > the patch carefully next time before posting such a long comment.
> 
> Are you really sure that your patch does not allow blk_get_request() to
> succeed after the DYING flag has been set? blk_mq_alloc_request() calls both

Yeah, I am pretty sure.

Firstly, blk_queue_freeze_preempt() is exclusive: it waits for completion
of all pending freezes (both normal and preempt), and no other freeze can
be started while a preempt freeze is in progress. It is actually a typical
read/write-lock use case, but we need to support nested normal freezing,
so we can't use an rwsem.
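
To spell out the rwsem point: if a normal freeze were down_read() and a
preempt freeze down_write(), a nested normal freeze would be a recursive
read lock, which can deadlock once a writer is already queued between the
two read acquisitions. Illustrative only, not patch code:

	down_read(&freeze_rwsem);	/* outer normal freeze */
	/* a preempt freeze now blocks in down_write(&freeze_rwsem);
	 * the queued writer blocks new readers */
	down_read(&freeze_rwsem);	/* nested normal freeze: may deadlock */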

Also, the DYING flag is checked before preempt freezing is started; the
API returns without setting the preempt_freezing flag if DYING is set.

Secondly, after preempt freezing is started:

	- for the legacy block path, dying is always tested at the entry of
	__get_request(), so no new request is allocated once the queue is dying.

	- for blk-mq, it is normal for the DYING flag to be set just
	between blk_queue_enter() and allocating the request, because
	we depend on the LLD to handle that case. We could even strengthen
	this by checking the dying flag in blk_queue_enter(), but that is
	just an improvement; it doesn't mean V3 is incorrect.

> blk_queue_is_preempt_frozen() and blk_queue_enter_live() without holding
> any lock. A thread that is running concurrently with blk_mq_get_request()
> can unfreeze the queue after blk_queue_is_preempt_frozen() returned and
> before blk_queue_enter_live() is called. This means that with your patch

Preempt freezing is exclusive, so no other freeze can be started at all;
the issue you are worried about can't happen.

> series applied blk_get_request() can succeed after the DYING flag has been
> set, which is something we don't want. Additionally, I don't think we want
> to introduce any kind of locking in blk_mq_get_request() because that would
> be a serialization point.

That needn't be worried about: as you saw, we check
percpu_ref_is_dying() first, and only then acquire the lock to check
the flags when the queue is freezing.

> 
> Have you considered to use the blk-mq "reserved request" mechanism to avoid
> starvation of power management requests instead of making the block layer
> even more complicated than it already is?

A reserved request is really a bad idea: it means the reserved request
can't be used for normal I/O. We all know the request/tag space is
precious, and some devices have quite a small tag space, such as SATA.
This will definitely affect performance.

Also, I don't think this approach is complicated; the idea is actually
simple, and the implementation isn't complicated either.

> 
> Note: extending blk_mq_freeze/unfreeze_queue() to the legacy block layer
> could be useful to make scsi_wait_for_queuecommand() more elegant. However,
> I don't think we should spend our time on legacy block layer / SCSI core
> changes. The code I'm referring to is the following:

That could be a side product of this approach, but this patchset just
focuses on fixing the I/O hang.

-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
@ 2017-09-04 16:18               ` Bart Van Assche
  0 siblings, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2017-09-04 16:18 UTC (permalink / raw)
  To: ming.lei
  Cc: linux-block, hch, jthumshirn, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, tj

On Tue, 2017-09-05 at 00:08 +0800, Ming Lei wrote:
> On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> > On Mon, 2017-09-04 at 15:16 +0800, Ming Lei wrote:
> > > On Mon, Sep 04, 2017 at 04:13:26AM +0000, Bart Van Assche wrote:
> > > > Allowing blk_get_request() to succeed after the DYING flag has been set is
> > > > completely wrong because that could result in a request being queued after
> > > > the DEAD flag has been set, resulting in either a hanging request or a kernel
> > > > crash. This is why it's completely wrong to add a blk_queue_enter_live() call
> > > > in blk_old_get_request() or blk_mq_alloc_request(). Hence my NAK for any
> > > > patch that adds a blk_queue_enter_live() call to any function called from
> > > > blk_get_request(). That includes the patch at the start of this e-mail thread.
> > > 
> > > See above: this patch changes nothing about that fact. Please look at
> > > the patch carefully next time before posting such a long comment.
> > 
> > Are you really sure that your patch does not allow blk_get_request() to
> > succeed after the DYING flag has been set? blk_mq_alloc_request() calls both
> > blk_queue_is_preempt_frozen() and blk_queue_enter_live() without holding
> > any lock. A thread that is running concurrently with blk_mq_get_request()
> > can unfreeze the queue after blk_queue_is_preempt_frozen() returned and
> > before blk_queue_enter_live() is called. This means that with your patch
> > series applied blk_get_request() can succeed after the DYING flag has been
> > set, which is something we don't want. Additionally, I don't think we want
> > to introduce any kind of locking in blk_mq_get_request() because that would
> > be a serialization point.
>
> Yeah, I am pretty sure.
> 
> Firstly, blk_queue_freeze_preempt() is exclusive: it waits for completion
> of all pending freezes (both normal and preempt), and no other freeze can
> be started while a preempt freeze is in progress. It is actually a typical
> read/write-lock use case, but we need to support nested normal freezing,
> so we can't use an rwsem.

You seem to overlook that blk_get_request() can be called from another thread
than the thread that is performing the freezing and unfreezing.

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V3 6/8] block: introduce preempt version of blk_[freeze|unfreeze]_queue
  2017-09-04 15:21     ` Bart Van Assche
  (?)
@ 2017-09-04 16:20     ` Ming Lei
  -1 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2017-09-04 16:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, hch, martin.petersen, linux-scsi, axboe, jejb, tj,
	jthumshirn, oleksandr

On Mon, Sep 04, 2017 at 03:21:08PM +0000, Bart Van Assche wrote:
> On Sat, 2017-09-02 at 21:08 +0800, Ming Lei wrote:
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -565,6 +565,10 @@ struct request_queue {
> >  
> >  	int			bypass_depth;
> >  	atomic_t		mq_freeze_depth;
> > +	spinlock_t		freeze_lock;
> > +	unsigned		normal_freezing:1;
> > +	unsigned		preempt_freezing:1;
> > +	unsigned		preempt_unfreezing:1;
> >  
> >  #if defined(CONFIG_BLK_DEV_BSG)
> >  	bsg_job_fn		*bsg_job_fn;
> 
> Request queues already have too many states, and you want to make them
> even more complicated by introducing several new state variables? Yikes!

The three flags are used in the freeze/unfreeze path only, and I don't
think they are too complicated to maintain. Actually each state is simple
enough (a summary sketch follows the list):

	- normal_freezing means the queue is in a normal freeze; it is set
	before blk_queue_freeze() returns. In this state, no request
	can be allocated from the queue, just like the current blk queue
	freezing.

	- preempt_freezing means the queue is in a preempt freeze; the flag
	is set before blk_queue_freeze_preempt() returns successfully. In
	this state, only RQF_PREEMPT requests are allowed to be allocated.

	- preempt_unfreezing means the queue is in a preempt unfreeze; it is
	set at the entry of blk_queue_unfreeze_preempt(). In this state,
	no request can be allocated from the queue.
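
A minimal sketch of the resulting allocation rules (illustrative only,
not the patch code):

	static bool sketch_can_alloc_request(struct request_queue *q,
					     bool preempt_req)
	{
		if (q->normal_freezing || q->preempt_unfreezing)
			return false;		/* nothing may be allocated */
		if (q->preempt_freezing)
			return preempt_req;	/* RQF_PREEMPT requests only */
		return true;			/* queue is not frozen */
	}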

-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-04 16:18               ` Bart Van Assche
  (?)
@ 2017-09-04 16:28               ` Ming Lei
  -1 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2017-09-04 16:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, hch, jthumshirn, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, tj

On Mon, Sep 04, 2017 at 04:18:32PM +0000, Bart Van Assche wrote:
> On Tue, 2017-09-05 at 00:08 +0800, Ming Lei wrote:
> > On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> > > On Mon, 2017-09-04 at 15:16 +0800, Ming Lei wrote:
> > > > On Mon, Sep 04, 2017 at 04:13:26AM +0000, Bart Van Assche wrote:
> > > > > Allowing blk_get_request() to succeed after the DYING flag has been set is
> > > > > completely wrong because that could result in a request being queued after
> > > > > the DEAD flag has been set, resulting in either a hanging request or a kernel
> > > > > crash. This is why it's completely wrong to add a blk_queue_enter_live() call
> > > > > in blk_old_get_request() or blk_mq_alloc_request(). Hence my NAK for any
> > > > > patch that adds a blk_queue_enter_live() call to any function called from
> > > > > blk_get_request(). That includes the patch at the start of this e-mail thread.
> > > > 
> > > > See above: this patch changes nothing about that fact. Please look at
> > > > the patch carefully next time before posting such a long comment.
> > > 
> > > Are you really sure that your patch does not allow blk_get_request() to
> > > succeed after the DYING flag has been set? blk_mq_alloc_request() calls both
> > > blk_queue_is_preempt_frozen() and blk_queue_enter_live() without holding
> > > any lock. A thread that is running concurrently with blk_mq_get_request()
> > > can unfreeze the queue after blk_queue_is_preempt_frozen() returned and
> > > before blk_queue_enter_live() is called. This means that with your patch
> > > series applied blk_get_request() can succeed after the DYING flag has been
> > > set, which is something we don't want. Additionally, I don't think we want
> > > to introduce any kind of locking in blk_mq_get_request() because that would
> > > be a serialization point.
> >
> > Yeah, I am pretty sure.
> > 
> > Firstly, blk_queue_freeze_preempt() is exclusive: it waits for completion
> > of all pending freezes (both normal and preempt), and no other freeze can
> > be started while a preempt freeze is in progress. It is actually a typical
> > read/write-lock use case, but we need to support nested normal freezing,
> > so we can't use an rwsem.
> 
> You seem to overlook that blk_get_request() can be called from another thread
> than the thread that is performing the freezing and unfreezing.

The freezing state of the queue can be observed from any context, so this
issue isn't related to which thread does the freezing; please see
blk_queue_is_preempt_frozen(), which is called from blk_get_request():

        if (!percpu_ref_is_dying(&q->q_usage_counter))
                return false;

        spin_lock(&q->freeze_lock);
        preempt_frozen = q->preempt_freezing;
        preempt_unfreezing = q->preempt_unfreezing;
        spin_unlock(&q->freeze_lock);

        return preempt_frozen && !preempt_unfreezing;


If the queue is in a preempt freeze, blk_queue_enter_live() is called
before allocating the request; otherwise blk_queue_enter() is called.

BTW, we can add a check for a dying queue in this helper as one
improvement.
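
A sketch of that branching in the allocation path (illustrative; the
actual patch code differs in detail):

	/* in blk_get_request()/blk_mq_alloc_request(), roughly: */
	if (blk_queue_is_preempt_frozen(q)) {
		blk_queue_enter_live(q);	/* RQF_PREEMPT may be allocated */
	} else {
		ret = blk_queue_enter(q, nowait);
		if (ret)
			return ERR_PTR(ret);	/* queue frozen or dying */
	}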

-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
@ 2017-09-05  1:40               ` Bart Van Assche
  0 siblings, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2017-09-05  1:40 UTC (permalink / raw)
  To: ming.lei
  Cc: linux-block, hch, jthumshirn, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, tj

On Tue, 2017-09-05 at 00:08 +0800, Ming Lei wrote:
> On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> > Have you considered to use the blk-mq "reserved request" mechanism to avoid
> > starvation of power management requests instead of making the block layer
> > even more complicated than it already is?
> 
> A reserved request is really a bad idea: it means the reserved request
> can't be used for normal I/O. We all know the request/tag space is
> precious, and some devices have quite a small tag space, such as SATA.
> This will definitely affect performance.

Sorry but I'm neither convinced that reserving a request for power management 
would be a bad idea nor that it would have a significant performance impact nor
that it would be complicated to implement. Have you noticed that the Linux ATA
implementation already reserves a request for internal use and thereby reduces
the queue depth from 32 to 31 (see also ATA_TAG_INTERNAL)? What I would like
to know is whether the performance impact of reserving a request is more or
less than 1%.

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-05  1:40               ` Bart Van Assche
  (?)
@ 2017-09-05  2:23               ` Ming Lei
  2017-09-08  3:08                 ` Ming Lei
  -1 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2017-09-05  2:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, hch, jthumshirn, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, tj

On Tue, Sep 05, 2017 at 01:40:11AM +0000, Bart Van Assche wrote:
> On Tue, 2017-09-05 at 00:08 +0800, Ming Lei wrote:
> > On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> > > Have you considered to use the blk-mq "reserved request" mechanism to avoid
> > > starvation of power management requests instead of making the block layer
> > > even more complicated than it already is?
> > 
> > A reserved request is really a bad idea: it means the reserved request
> > can't be used for normal I/O. We all know the request/tag space is
> > precious, and some devices have quite a small tag space, such as SATA.
> > This will definitely affect performance.
> 
> Sorry but I'm neither convinced that reserving a request for power management 
> would be a bad idea nor that it would have a significant performance impact nor
> that it would be complicated to implement. Have you noticed that the Linux ATA
> implementation already reserves a request for internal use and thereby reduces
> the queue depth from 32 to 31 (see also ATA_TAG_INTERNAL)? What I would like
> to know is whether the performance impact of reserving a request is more or
> less than 1%.

Firstly, we really can avoid the reservation: why should we waste
one precious tag just for PM, which may never even happen during a
machine's lifetime? For SATA, the internal tag is for EH, where
I believe the reservation is inevitable.

Secondly, reserving one tag decreases the possible concurrent I/O by 1,
which definitely hurts performance, especially for random I/O. Think
about why NVMe increases its queue depth so much. Not to mention that
there are some devices which have only one tag (.can_queue is 1);
how can you reserve one tag on that kind of device?

Finally, bad results will follow if you reserve one tag for PM only.
Suppose it is doable to reserve one tag: did you consider
how to do that? Tags can be shared host-wide, so do you want to
reserve one tag per request_queue?
	- If yes, lots of tags may be reserved/wasted for the unusual PM
	or similar commands; even worse, the whole tag space of the HBA may
	not be enough for the reservation if there are lots of LUNs on
	this HBA.

	- If not, and you reserve just one tag for the whole HBA, then all
	PM commands share that one reservation. During suspend/resume, all
	these PM commands have to run exclusively (serialized) for the disks
	attached to the HBA, which will slow down suspend/resume very
	much because there may be lots of LUNs on this HBA.

That is why I said reserving one tag is really bad, isn't it?
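
As a back-of-the-envelope illustration (assumed numbers, not
measurements): on a 32-tag SATA device one reserved tag costs about 3%
of the tag space, on a .can_queue == 1 device it costs all of it, and an
HBA-wide reserved tag shared by N LUNs serializes N PM commands behind a
single tag during each suspend/resume cycle.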


Thanks,
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-05  2:23               ` Ming Lei
@ 2017-09-08  3:08                 ` Ming Lei
  2017-09-08 17:28                     ` Bart Van Assche
  0 siblings, 1 reply; 31+ messages in thread
From: Ming Lei @ 2017-09-08  3:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, hch, jthumshirn, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, tj

On Tue, Sep 05, 2017 at 10:23:25AM +0800, Ming Lei wrote:
> On Tue, Sep 05, 2017 at 01:40:11AM +0000, Bart Van Assche wrote:
> > On Tue, 2017-09-05 at 00:08 +0800, Ming Lei wrote:
> > > On Mon, Sep 04, 2017 at 03:40:35PM +0000, Bart Van Assche wrote:
> > > > Have you considered to use the blk-mq "reserved request" mechanism to avoid
> > > > starvation of power management requests instead of making the block layer
> > > > even more complicated than it already is?
> > > 
> > > A reserved request is really a bad idea: it means the reserved request
> > > can't be used for normal I/O. We all know the request/tag space is
> > > precious, and some devices have quite a small tag space, such as SATA.
> > > This will definitely affect performance.
> > 
> > Sorry but I'm neither convinced that reserving a request for power management 
> > would be a bad idea nor that it would have a significant performance impact nor
> > that it would be complicated to implement. Have you noticed that the Linux ATA
> > implementation already reserves a request for internal use and thereby reduces
> > the queue depth from 32 to 31 (see also ATA_TAG_INTERNAL)? What I would like
> > to know is whether the performance impact of reserving a request is more or
> > less than 1%.
> 
> Firstly, we really can avoid the reservation: why should we waste
> one precious tag just for PM, which may never even happen during a
> machine's lifetime? For SATA, the internal tag is for EH, where
> I believe the reservation is inevitable.
> 
> Secondly, reserving one tag decreases the possible concurrent I/O by 1,
> which definitely hurts performance, especially for random I/O. Think
> about why NVMe increases its queue depth so much. Not to mention that
> there are some devices which have only one tag (.can_queue is 1);
> how can you reserve one tag on that kind of device?
> 
> Finally, bad results will follow if you reserve one tag for PM only.
> Suppose it is doable to reserve one tag: did you consider
> how to do that? Tags can be shared host-wide, so do you want to
> reserve one tag per request_queue?
> 	- If yes, lots of tags may be reserved/wasted for the unusual PM
> 	or similar commands; even worse, the whole tag space of the HBA may
> 	not be enough for the reservation if there are lots of LUNs on
> 	this HBA.
> 
> 	- If not, and you reserve just one tag for the whole HBA, then all
> 	PM commands share that one reservation. During suspend/resume, all
> 	these PM commands have to run exclusively (serialized) for the disks
> 	attached to the HBA, which will slow down suspend/resume very
> 	much because there may be lots of LUNs on this HBA.
> 
> That is why I said reserving one tag is really bad, isn't it?

Hi Bart,

It looks like I have replied to or clarified all of your
questions/comments on this patchset.

So could you let me know if you still object to this approach or
patchset?

If not, I think we need to move on and fix the real issue.

-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
@ 2017-09-08 17:28                     ` Bart Van Assche
  0 siblings, 0 replies; 31+ messages in thread
From: Bart Van Assche @ 2017-09-08 17:28 UTC (permalink / raw)
  To: ming.lei
  Cc: linux-block, hch, jthumshirn, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, tj

On Fri, 2017-09-08 at 11:08 +0800, Ming Lei wrote:
> It looks like I have replied to or clarified all of your
> questions/comments on this patchset.

No, you have not addressed all my comments, you know that you have not
addressed all my comments so you should not have written that you have
addressed all my comments. This patch series not only introduces ugly
changes in the request queue freeze mechanism but it also introduces an
unacceptable race condition between blk_get_request() and request queue
cleanup.

BTW, you don't have to spend more time on this patch series. I have
implemented an alternative and much cleaner approach to fix SCSI device
suspend. I'm currently testing that patch series.

Bart.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen
  2017-09-08 17:28                     ` Bart Van Assche
  (?)
@ 2017-09-09  7:21                     ` Ming Lei
  -1 siblings, 0 replies; 31+ messages in thread
From: Ming Lei @ 2017-09-09  7:21 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, hch, jthumshirn, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, tj

On Fri, Sep 08, 2017 at 05:28:16PM +0000, Bart Van Assche wrote:
> On Fri, 2017-09-08 at 11:08 +0800, Ming Lei wrote:
> > It looks like I have replied to or clarified all of your
> > questions/comments on this patchset.
> 
> No, you have not addressed all my comments, you know that you have not
> addressed all my comments so you should not have written that you have

I do not know.

> addressed all my comments. This patch series not only introduces ugly
> changes in the request queue freeze mechanism but it also introduces an
> unacceptable race condition between blk_get_request() and request queue
> cleanup.

So what is the race? Could you reply in the original thread?

> 
> BTW, you don't have to spend more time on this patch series. I have
> implemented an alternative and much cleaner approach to fix SCSI device
> suspend. I'm currently testing that patch series.

You do not understand the issue at all; it is not only related to
suspend. Please take a look at scsi_execute(): each request issued
via scsi_execute() needs to be dispatched successfully even when the
SCSI device is quiesced.
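
For context, scsi_execute() marks every request it issues as RQF_PREEMPT,
which is why such requests must remain allocatable while the device is
quiesced; abridged from drivers/scsi/scsi_lib.c of that era (worth
double-checking against the exact tree):

	req->rq_flags |= flags | RQF_QUIET | RQF_PREEMPT;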

-- 
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2017-09-09  7:21 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-02 13:08 [PATCH V3 0/8] block/scsi: safe SCSI quiescing Ming Lei
2017-09-02 13:08 ` [PATCH V3 1/8] blk-mq: rename blk_mq_unfreeze_queue as blk_unfreeze_queue Ming Lei
2017-09-02 13:08 ` [PATCH V3 2/8] blk-mq: rename blk_mq_freeze_queue as blk_freeze_queue Ming Lei
2017-09-02 13:08 ` [PATCH V3 3/8] blk-mq: only run hw queues for blk-mq Ming Lei
2017-09-02 13:08 ` [PATCH V3 4/8] blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait Ming Lei
2017-09-02 13:08 ` [PATCH V3 5/8] block: tracking request allocation with q_usage_counter Ming Lei
2017-09-02 13:08 ` [PATCH V3 6/8] block: introduce preempt version of blk_[freeze|unfreeze]_queue Ming Lei
2017-09-04 15:21   ` Bart Van Assche
2017-09-04 16:20     ` Ming Lei
2017-09-02 13:08 ` [PATCH V3 7/8] block: allow to allocate req with REQF_PREEMPT when queue is preempt frozen Ming Lei
2017-09-02 13:12   ` Ming Lei
2017-09-04  4:13     ` Bart Van Assche
2017-09-04  7:16       ` Ming Lei
2017-09-04 15:40         ` Bart Van Assche
2017-09-04 16:08           ` Ming Lei
2017-09-04 16:18             ` Bart Van Assche
2017-09-04 16:28               ` Ming Lei
2017-09-05  1:40             ` Bart Van Assche
2017-09-05  2:23               ` Ming Lei
2017-09-08  3:08                 ` Ming Lei
2017-09-08 17:28                   ` Bart Van Assche
2017-09-09  7:21                     ` Ming Lei
2017-09-02 13:08 ` [PATCH V3 8/8] SCSI: preempt freeze block queue when SCSI device is put into quiesce Ming Lei
2017-09-02 14:47 ` [PATCH V3 0/8] block/scsi: safe SCSI quiescing Oleksandr Natalenko
