* [PATCH V4 0/10] block/scsi: safe SCSI quiescing
@ 2017-09-11 11:10 Ming Lei
  2017-09-11 11:10 ` [PATCH V4 01/10] blk-mq: only run hw queues for blk-mq Ming Lei
                   ` (11 more replies)
  0 siblings, 12 replies; 27+ messages in thread
From: Ming Lei @ 2017-09-11 11:10 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley
  Cc: Bart Van Assche, Oleksandr Natalenko, Johannes Thumshirn,
	Cathy Avery, Ming Lei

Hi,

The current SCSI quiesce isn't safe: it is easy to trigger an I/O deadlock.

Once a SCSI device is put into QUIESCE, no new request except for
RQF_PREEMPT can be dispatched to SCSI successfully, and
scsi_device_quiesce() simply waits for completion of the I/Os already
dispatched to the SCSI stack. That isn't enough at all.

New requests can still come in, but none of the allocated requests can
be dispatched successfully, so the request pool can easily be used up.

Then a request with RQF_PREEMPT can't be allocated and waits forever;
meanwhile scsi_device_resume() waits for completion of the RQF_PREEMPT
request, so the system hangs forever, for example during system suspend
or while sending SCSI domain validation commands.

Both the I/O hang during system suspend [1] and the hang during SCSI
domain validation have been reported before.

This patchset introduces preempt freeze and solves the issue by
preempt-freezing the block queue during SCSI quiesce, while still
allowing requests with RQF_PREEMPT to be allocated when the queue is
in this state.

Oleksandr verified that V3 fixes the hang during suspend/resume, and
Cathy verified that the revised V3 fixes the hang during SCSI domain
validation.

Both SCSI and SCSI_MQ have this I/O deadlock issue; this patchset
fixes them both by introducing/unifying blk_freeze_queue_preempt() and
blk_unfreeze_queue_preempt(), and cleanup is done together.
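
In short, the flow after this series looks roughly like below (a
condensed sketch based on patches 07-10, locking and error handling
omitted):

	scsi_device_quiesce(sdev)
	    blk_freeze_queue_preempt(q);	/* drain normal I/O; from now on
						   only RQF_PREEMPT may enter */
	    scsi_device_set_state(sdev, SDEV_QUIESCE);

	/* PM and SPI domain validation can still allocate and dispatch
	 * RQF_PREEMPT requests via __blk_get_request(..., BLK_REQ_PREEMPT) */

	scsi_device_resume(sdev)
	    scsi_device_set_state(sdev, SDEV_RUNNING);
	    blk_unfreeze_queue_preempt(q);	/* normal I/O resumes */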

The patchset can be found in the following gitweb:

	https://github.com/ming1/linux/tree/blk_safe_scsi_quiesce_V4

V4:
	- reorganize patch order to make it more reasonable
	- support nested preempt freeze, as required by SCSI transport spi
	- check preempt freezing in the slow path of blk_queue_enter()
	- add "SCSI: transport_spi: resume a quiesced device"
	- wake up freeze queue in setting dying for both blk-mq and legacy
	- rename blk_mq_[freeze|unfreeze]_queue() in one patch
	- rename .mq_freeze_wq and .mq_freeze_depth
	- improve comment

V3:
	- introduce q->preempt_unfreezing to fix one bug of preempt freeze
	- call blk_queue_enter_live() only when queue is preempt frozen
	- cleanup a bit on the implementation of preempt freeze
	- only patch 6 and 7 are changed

V2:
	- drop the 1st patch in V1 because percpu_ref_is_dying() is
	enough, as pointed out by Tejun
	- introduce preempt version of blk_[freeze|unfreeze]_queue
	- sync between preempt freeze and normal freeze
	- fix warning from percpu-refcount as reported by Oleksandr


[1] https://marc.info/?t=150340250100013&r=3&w=2


Thanks,
Ming


Ming Lei (10):
  blk-mq: only run hw queues for blk-mq
  block: tracking request allocation with q_usage_counter
  blk-mq: rename blk_mq_[freeze|unfreeze]_queue
  blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait
  block: rename .mq_freeze_wq and .mq_freeze_depth
  block: pass flags to blk_queue_enter()
  block: introduce preempt version of blk_[freeze|unfreeze]_queue
  block: allow to allocate req with RQF_PREEMPT when queue is preempt
    frozen
  SCSI: transport_spi: resume a quiesced device
  SCSI: preempt freeze block queue when SCSI device is put into quiesce

 block/bfq-iosched.c               |   2 +-
 block/blk-cgroup.c                |   8 +-
 block/blk-core.c                  |  95 ++++++++++++++++----
 block/blk-mq.c                    | 180 ++++++++++++++++++++++++++++----------
 block/blk-mq.h                    |   1 -
 block/blk-timeout.c               |   2 +-
 block/blk.h                       |  12 +++
 block/elevator.c                  |   4 +-
 drivers/block/loop.c              |  24 ++---
 drivers/block/rbd.c               |   2 +-
 drivers/nvme/host/core.c          |   8 +-
 drivers/scsi/scsi_lib.c           |  25 +++++-
 drivers/scsi/scsi_transport_spi.c |   3 +
 fs/block_dev.c                    |   4 +-
 include/linux/blk-mq.h            |  15 ++--
 include/linux/blkdev.h            |  32 +++++--
 16 files changed, 313 insertions(+), 104 deletions(-)

-- 
2.9.5


* [PATCH V4 01/10] blk-mq: only run hw queues for blk-mq
  2017-09-11 11:10 [PATCH V4 0/10] block/scsi: safe SCSI quiescing Ming Lei
@ 2017-09-11 11:10 ` Ming Lei
  2017-09-11 11:10 ` [PATCH V4 02/10] block: tracking request allocation with q_usage_counter Ming Lei
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2017-09-11 11:10 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley
  Cc: Bart Van Assche, Oleksandr Natalenko, Johannes Thumshirn,
	Cathy Avery, Ming Lei

Running the hardware queues only makes sense for blk-mq, so make that
explicit by checking q->mq_ops before calling blk_mq_run_hw_queues().

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f18cff80050..9c364497cc44 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
-		blk_mq_run_hw_queues(q, false);
+		if (q->mq_ops)
+			blk_mq_run_hw_queues(q, false);
 	}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
-- 
2.9.5


* [PATCH V4 02/10] block: tracking request allocation with q_usage_counter
  2017-09-11 11:10 [PATCH V4 0/10] block/scsi: safe SCSI quiescing Ming Lei
  2017-09-11 11:10 ` [PATCH V4 01/10] blk-mq: only run hw queues for blk-mq Ming Lei
@ 2017-09-11 11:10 ` Ming Lei
  2017-09-11 11:10 ` [PATCH V4 03/10] blk-mq: rename blk_mq_[freeze|unfreeze]_queue Ming Lei
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2017-09-11 11:10 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley
  Cc: Bart Van Assche, Oleksandr Natalenko, Johannes Thumshirn,
	Cathy Avery, Ming Lei

The usage is basically the same as with blk-mq, so that freezing a
legacy queue can be supported easily.

Also 'wake_up_all(&q->mq_freeze_wq)' has to be moved into
blk_set_queue_dying() since both legacy and blk-mq
may wait on the .mq_freeze_wq wait queue.
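
With this in place, the request lifetime on the legacy path pairs with
the usage counter roughly like below (just a condensed view of the diff
that follows):

	blk_old_get_request()
	    blk_queue_enter(q, ...);	/* take a q_usage_counter reference */
	    rq = get_request(q, ...);

	blk_queue_bio()
	    blk_queue_enter_live(q);	/* extra reference for the request, on top
					   of the one generic_make_request() holds */
	    req = get_request(q, bio->bi_opf, bio, GFP_NOIO);

	__blk_put_request()
	    blk_free_request(rl, req);
	    blk_queue_exit(q);		/* reference dropped when the request is freed */

So every allocated legacy request pins .q_usage_counter, and a freeze
can simply wait for the counter to reach zero, just as blk-mq does.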

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c | 14 ++++++++++++++
 block/blk-mq.c   |  7 -------
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d709c0e3a2ac..7e436ee04e08 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -610,6 +610,12 @@ void blk_set_queue_dying(struct request_queue *q)
 		}
 		spin_unlock_irq(q->queue_lock);
 	}
+
+	/*
+	 * We need to ensure that processes currently waiting on
+	 * the queue are notified as well.
+	 */
+	wake_up_all(&q->mq_freeze_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -1392,16 +1398,21 @@ static struct request *blk_old_get_request(struct request_queue *q,
 					   unsigned int op, gfp_t gfp_mask)
 {
 	struct request *rq;
+	int ret = 0;
 
 	WARN_ON_ONCE(q->mq_ops);
 
 	/* create ioc upfront */
 	create_io_context(gfp_mask, q->node);
 
+	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+	if (ret)
+		return ERR_PTR(ret);
 	spin_lock_irq(q->queue_lock);
 	rq = get_request(q, op, NULL, gfp_mask);
 	if (IS_ERR(rq)) {
 		spin_unlock_irq(q->queue_lock);
+		blk_queue_exit(q);
 		return rq;
 	}
 
@@ -1573,6 +1584,7 @@ void __blk_put_request(struct request_queue *q, struct request *req)
 		blk_free_request(rl, req);
 		freed_request(rl, sync, rq_flags);
 		blk_put_rl(rl);
+		blk_queue_exit(q);
 	}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
@@ -1854,8 +1866,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	 * Grab a free request. This is might sleep but can not fail.
 	 * Returns with the queue unlocked.
 	 */
+	blk_queue_enter_live(q);
 	req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
 	if (IS_ERR(req)) {
+		blk_queue_exit(q);
 		__wbt_done(q->rq_wb, wb_acct);
 		if (PTR_ERR(req) == -ENOMEM)
 			bio->bi_status = BLK_STS_RESOURCE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9c364497cc44..eee86adc0f53 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -256,13 +256,6 @@ void blk_mq_wake_waiters(struct request_queue *q)
 	queue_for_each_hw_ctx(q, hctx, i)
 		if (blk_mq_hw_queue_mapped(hctx))
 			blk_mq_tag_wakeup_all(hctx->tags, true);
-
-	/*
-	 * If we are called because the queue has now been marked as
-	 * dying, we need to ensure that processes currently waiting on
-	 * the queue are notified as well.
-	 */
-	wake_up_all(&q->mq_freeze_wq);
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
-- 
2.9.5


* [PATCH V4 03/10] blk-mq: rename blk_mq_[freeze|unfreeze]_queue
  2017-09-11 11:10 [PATCH V4 0/10] block/scsi: safe SCSI quiescing Ming Lei
  2017-09-11 11:10 ` [PATCH V4 01/10] blk-mq: only run hw queues for blk-mq Ming Lei
  2017-09-11 11:10 ` [PATCH V4 02/10] block: tracking request allocation with q_usage_counter Ming Lei
@ 2017-09-11 11:10 ` Ming Lei
  2017-09-11 11:10 ` [PATCH V4 04/10] blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait Ming Lei
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2017-09-11 11:10 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley
  Cc: Bart Van Assche, Oleksandr Natalenko, Johannes Thumshirn,
	Cathy Avery, Ming Lei

We will support freezing the queue on the legacy block path too.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/bfq-iosched.c      |  2 +-
 block/blk-cgroup.c       |  8 ++++----
 block/blk-mq.c           | 27 +++++++++------------------
 block/blk-mq.h           |  1 -
 block/elevator.c         |  4 ++--
 drivers/block/loop.c     | 24 ++++++++++++------------
 drivers/block/rbd.c      |  2 +-
 drivers/nvme/host/core.c |  6 +++---
 include/linux/blk-mq.h   |  4 ++--
 9 files changed, 34 insertions(+), 44 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a4783da90ba8..a18f36bfbdf0 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4758,7 +4758,7 @@ static int bfq_init_queue(struct request_queue *q, struct elevator_type *e)
 	 * The invocation of the next bfq_create_group_hierarchy
 	 * function is the head of a chain of function calls
 	 * (bfq_create_group_hierarchy->blkcg_activate_policy->
-	 * blk_mq_freeze_queue) that may lead to the invocation of the
+	 * blk_freeze_queue) that may lead to the invocation of the
 	 * has_work hook function. For this reason,
 	 * bfq_create_group_hierarchy is invoked only after all
 	 * scheduler data has been initialized, apart from the fields
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d3f56baee936..ffc984381e4b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1298,7 +1298,7 @@ int blkcg_activate_policy(struct request_queue *q,
 		return 0;
 
 	if (q->mq_ops)
-		blk_mq_freeze_queue(q);
+		blk_freeze_queue(q);
 	else
 		blk_queue_bypass_start(q);
 pd_prealloc:
@@ -1339,7 +1339,7 @@ int blkcg_activate_policy(struct request_queue *q,
 	spin_unlock_irq(q->queue_lock);
 out_bypass_end:
 	if (q->mq_ops)
-		blk_mq_unfreeze_queue(q);
+		blk_unfreeze_queue(q);
 	else
 		blk_queue_bypass_end(q);
 	if (pd_prealloc)
@@ -1365,7 +1365,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
 		return;
 
 	if (q->mq_ops)
-		blk_mq_freeze_queue(q);
+		blk_freeze_queue(q);
 	else
 		blk_queue_bypass_start(q);
 
@@ -1390,7 +1390,7 @@ void blkcg_deactivate_policy(struct request_queue *q,
 	spin_unlock_irq(q->queue_lock);
 
 	if (q->mq_ops)
-		blk_mq_unfreeze_queue(q);
+		blk_unfreeze_queue(q);
 	else
 		blk_queue_bypass_end(q);
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index eee86adc0f53..aacf47f15b9a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -162,18 +162,9 @@ void blk_freeze_queue(struct request_queue *q)
 	blk_freeze_queue_start(q);
 	blk_mq_freeze_queue_wait(q);
 }
+EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
-void blk_mq_freeze_queue(struct request_queue *q)
-{
-	/*
-	 * ...just an alias to keep freeze and unfreeze actions balanced
-	 * in the blk_mq_* namespace
-	 */
-	blk_freeze_queue(q);
-}
-EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
-
-void blk_mq_unfreeze_queue(struct request_queue *q)
+void blk_unfreeze_queue(struct request_queue *q)
 {
 	int freeze_depth;
 
@@ -184,7 +175,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 		wake_up_all(&q->mq_freeze_wq);
 	}
 }
-EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
+EXPORT_SYMBOL_GPL(blk_unfreeze_queue);
 
 /*
  * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
@@ -2176,9 +2167,9 @@ static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set,
 	lockdep_assert_held(&set->tag_list_lock);
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		blk_mq_freeze_queue(q);
+		blk_freeze_queue(q);
 		queue_set_hctx_shared(q, shared);
-		blk_mq_unfreeze_queue(q);
+		blk_unfreeze_queue(q);
 	}
 }
 
@@ -2609,7 +2600,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	if (!set)
 		return -EINVAL;
 
-	blk_mq_freeze_queue(q);
+	blk_freeze_queue(q);
 
 	ret = 0;
 	queue_for_each_hw_ctx(q, hctx, i) {
@@ -2634,7 +2625,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	if (!ret)
 		q->nr_requests = nr;
 
-	blk_mq_unfreeze_queue(q);
+	blk_unfreeze_queue(q);
 
 	return ret;
 }
@@ -2652,7 +2643,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		return;
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_freeze_queue(q);
+		blk_freeze_queue(q);
 
 	set->nr_hw_queues = nr_hw_queues;
 	blk_mq_update_queue_map(set);
@@ -2662,7 +2653,7 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 	}
 
 	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_unfreeze_queue(q);
+		blk_unfreeze_queue(q);
 }
 
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 98252b79b80b..f605398fc684 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -26,7 +26,6 @@ struct blk_mq_ctx {
 } ____cacheline_aligned_in_smp;
 
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
-void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
 void blk_mq_wake_waiters(struct request_queue *q);
diff --git a/block/elevator.c b/block/elevator.c
index 153926a90901..c3524091138f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -950,7 +950,7 @@ static int elevator_switch_mq(struct request_queue *q,
 {
 	int ret;
 
-	blk_mq_freeze_queue(q);
+	blk_freeze_queue(q);
 
 	if (q->elevator) {
 		if (q->elevator->registered)
@@ -977,7 +977,7 @@ static int elevator_switch_mq(struct request_queue *q,
 		blk_add_trace_msg(q, "elv switch: none");
 
 out:
-	blk_mq_unfreeze_queue(q);
+	blk_unfreeze_queue(q);
 	return ret;
 }
 
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 85de67334695..1791a0ea8ad9 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -211,7 +211,7 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 	 * LO_FLAGS_READ_ONLY, both are set from kernel, and losetup
 	 * will get updated by ioctl(LOOP_GET_STATUS)
 	 */
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 	lo->use_dio = use_dio;
 	if (use_dio) {
 		queue_flag_clear_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
@@ -220,7 +220,7 @@ static void __loop_update_dio(struct loop_device *lo, bool dio)
 		queue_flag_set_unlocked(QUEUE_FLAG_NOMERGES, lo->lo_queue);
 		lo->lo_flags &= ~LO_FLAGS_DIRECT_IO;
 	}
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 }
 
 static int
@@ -659,14 +659,14 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
 		goto out_putf;
 
 	/* and ... switch */
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 	mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask);
 	lo->lo_backing_file = file;
 	lo->old_gfp_mask = mapping_gfp_mask(file->f_mapping);
 	mapping_set_gfp_mask(file->f_mapping,
 			     lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
 	loop_update_dio(lo);
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 
 	fput(old_file);
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN)
@@ -1017,7 +1017,7 @@ static int loop_clr_fd(struct loop_device *lo)
 		return -EINVAL;
 
 	/* freeze request queue during the transition */
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 
 	spin_lock_irq(&lo->lo_lock);
 	lo->lo_state = Lo_rundown;
@@ -1053,7 +1053,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	lo->lo_state = Lo_unbound;
 	/* This is safe: open() is still holding a reference. */
 	module_put(THIS_MODULE);
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 
 	if (lo->lo_flags & LO_FLAGS_PARTSCAN && bdev)
 		loop_reread_partitions(lo, bdev);
@@ -1089,7 +1089,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 		return -EINVAL;
 
 	/* I/O need to be drained during transfer transition */
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 
 	err = loop_release_xfer(lo);
 	if (err)
@@ -1147,7 +1147,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 	__loop_update_dio(lo, lo->use_dio);
 
  exit:
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 
 	if (!err && (info->lo_flags & LO_FLAGS_PARTSCAN) &&
 	     !(lo->lo_flags & LO_FLAGS_PARTSCAN)) {
@@ -1329,14 +1329,14 @@ static int loop_set_block_size(struct loop_device *lo, unsigned long arg)
 	if (arg < 512 || arg > PAGE_SIZE || !is_power_of_2(arg))
 		return -EINVAL;
 
-	blk_mq_freeze_queue(lo->lo_queue);
+	blk_freeze_queue(lo->lo_queue);
 
 	blk_queue_logical_block_size(lo->lo_queue, arg);
 	blk_queue_physical_block_size(lo->lo_queue, arg);
 	blk_queue_io_min(lo->lo_queue, arg);
 	loop_update_dio(lo);
 
-	blk_mq_unfreeze_queue(lo->lo_queue);
+	blk_unfreeze_queue(lo->lo_queue);
 
 	return 0;
 }
@@ -1598,8 +1598,8 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		 * Otherwise keep thread (if running) and config,
 		 * but flush possible ongoing bios in thread.
 		 */
-		blk_mq_freeze_queue(lo->lo_queue);
-		blk_mq_unfreeze_queue(lo->lo_queue);
+		blk_freeze_queue(lo->lo_queue);
+		blk_unfreeze_queue(lo->lo_queue);
 	}
 
 	mutex_unlock(&lo->lo_ctl_mutex);
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index b008b6a98098..3a97ffcb3a81 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -6347,7 +6347,7 @@ static ssize_t do_rbd_remove(struct bus_type *bus,
 		 * Prevent new IO from being queued and wait for existing
 		 * IO to complete/fail.
 		 */
-		blk_mq_freeze_queue(rbd_dev->disk->queue);
+		blk_freeze_queue(rbd_dev->disk->queue);
 		blk_set_queue_dying(rbd_dev->disk->queue);
 	}
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 277a7a02cba5..62808c4536f5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1209,7 +1209,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 	bs = 1 << ns->lba_shift;
 	ns->noiob = le16_to_cpu(id->noiob);
 
-	blk_mq_freeze_queue(disk->queue);
+	blk_freeze_queue(disk->queue);
 
 	if (ctrl->ops->flags & NVME_F_METADATA_SUPPORTED)
 		nvme_prep_integrity(disk, id, bs);
@@ -1225,7 +1225,7 @@ static void __nvme_revalidate_disk(struct gendisk *disk, struct nvme_id_ns *id)
 
 	if (ctrl->oncs & NVME_CTRL_ONCS_DSM)
 		nvme_config_discard(ns);
-	blk_mq_unfreeze_queue(disk->queue);
+	blk_unfreeze_queue(disk->queue);
 }
 
 static int nvme_revalidate_disk(struct gendisk *disk)
@@ -2879,7 +2879,7 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl)
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_unfreeze_queue(ns->queue);
+		blk_unfreeze_queue(ns->queue);
 	mutex_unlock(&ctrl->namespaces_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_unfreeze);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 50c6485cb04f..355d74507656 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -253,8 +253,8 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async);
 void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
-void blk_mq_freeze_queue(struct request_queue *q);
-void blk_mq_unfreeze_queue(struct request_queue *q);
+void blk_freeze_queue(struct request_queue *q);
+void blk_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
-- 
2.9.5


* [PATCH V4 04/10] blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait
  2017-09-11 11:10 [PATCH V4 0/10] block/scsi: safe SCSI quiescing Ming Lei
                   ` (2 preceding siblings ...)
  2017-09-11 11:10 ` [PATCH V4 03/10] blk-mq: rename blk_mq_[freeze|unfreeze]_queue Ming Lei
@ 2017-09-11 11:10 ` Ming Lei
  2017-09-11 11:10 ` [PATCH V4 05/10] block: rename .mq_freeze_wq and .mq_freeze_depth Ming Lei
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2017-09-11 11:10 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley
  Cc: Bart Van Assche, Oleksandr Natalenko, Johannes Thumshirn,
	Cathy Avery, Ming Lei

Besides the rename, the only change on the legacy path is that
blk_drain_queue() is now run from blk_freeze_queue_wait(), which is
called via blk_freeze_queue() in blk_cleanup_queue().

So this patch removes the explicit call of __blk_drain_queue() in
blk_cleanup_queue().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c         | 17 +++++++++++++++--
 block/blk-mq.c           |  8 +++++---
 block/blk.h              |  1 +
 drivers/nvme/host/core.c |  2 +-
 include/linux/blk-mq.h   |  2 +-
 5 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7e436ee04e08..2347107eeca4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -530,6 +530,21 @@ static void __blk_drain_queue(struct request_queue *q, bool drain_all)
 }
 
 /**
+ * blk_drain_queue - drain requests from request_queue
+ * @q: queue to drain
+ *
+ * Drain requests from @q.  All pending requests are drained.
+ * The caller is responsible for ensuring that no new requests
+ * which need to be drained are queued.
+ */
+void blk_drain_queue(struct request_queue *q)
+{
+	spin_lock_irq(q->queue_lock);
+	__blk_drain_queue(q, true);
+	spin_unlock_irq(q->queue_lock);
+}
+
+/**
  * blk_queue_bypass_start - enter queue bypass mode
  * @q: queue of interest
  *
@@ -659,8 +674,6 @@ void blk_cleanup_queue(struct request_queue *q)
 	 */
 	blk_freeze_queue(q);
 	spin_lock_irq(lock);
-	if (!q->mq_ops)
-		__blk_drain_queue(q, true);
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index aacf47f15b9a..345943fea998 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -131,11 +131,13 @@ void blk_freeze_queue_start(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
 
-void blk_mq_freeze_queue_wait(struct request_queue *q)
+void blk_freeze_queue_wait(struct request_queue *q)
 {
+	if (!q->mq_ops)
+		blk_drain_queue(q);
 	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
-EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
+EXPORT_SYMBOL_GPL(blk_freeze_queue_wait);
 
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout)
@@ -160,7 +162,7 @@ void blk_freeze_queue(struct request_queue *q)
 	 * exported to drivers as the only user for unfreeze is blk_mq.
 	 */
 	blk_freeze_queue_start(q);
-	blk_mq_freeze_queue_wait(q);
+	blk_freeze_queue_wait(q);
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
diff --git a/block/blk.h b/block/blk.h
index fcb9775b997d..21eed59d96db 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -64,6 +64,7 @@ void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
 			struct bio *bio);
 void blk_queue_bypass_start(struct request_queue *q);
 void blk_queue_bypass_end(struct request_queue *q);
+void blk_drain_queue(struct request_queue *q);
 void __blk_queue_free_tags(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
 
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 62808c4536f5..8506f7402a8d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2904,7 +2904,7 @@ void nvme_wait_freeze(struct nvme_ctrl *ctrl)
 
 	mutex_lock(&ctrl->namespaces_mutex);
 	list_for_each_entry(ns, &ctrl->namespaces, list)
-		blk_mq_freeze_queue_wait(ns->queue);
+		blk_freeze_queue_wait(ns->queue);
 	mutex_unlock(&ctrl->namespaces_mutex);
 }
 EXPORT_SYMBOL_GPL(nvme_wait_freeze);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 355d74507656..62c3d1f7d12a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -256,7 +256,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 void blk_freeze_queue(struct request_queue *q);
 void blk_unfreeze_queue(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
-void blk_mq_freeze_queue_wait(struct request_queue *q);
+void blk_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout);
 int blk_mq_reinit_tagset(struct blk_mq_tag_set *set,
-- 
2.9.5


* [PATCH V4 05/10] block: rename .mq_freeze_wq and .mq_freeze_depth
  2017-09-11 11:10 [PATCH V4 0/10] block/scsi: safe SCSI quiescing Ming Lei
                   ` (3 preceding siblings ...)
  2017-09-11 11:10 ` [PATCH V4 04/10] blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait Ming Lei
@ 2017-09-11 11:10 ` Ming Lei
  2017-09-11 11:10 ` [PATCH V4 06/10] block: pass flags to blk_queue_enter() Ming Lei
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2017-09-11 11:10 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley
  Cc: Bart Van Assche, Oleksandr Natalenko, Johannes Thumshirn,
	Cathy Avery, Ming Lei

Both are now used for legacy and blk-mq, so rename them to
.freeze_wq and .freeze_depth to avoid confusion.

No functional change.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       | 12 ++++++------
 block/blk-mq.c         | 12 ++++++------
 include/linux/blkdev.h |  4 ++--
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2347107eeca4..eec5881a9e74 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -630,7 +630,7 @@ void blk_set_queue_dying(struct request_queue *q)
 	 * We need to ensure that processes currently waiting on
 	 * the queue are notified as well.
 	 */
-	wake_up_all(&q->mq_freeze_wq);
+	wake_up_all(&q->freeze_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -793,14 +793,14 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
 		/*
 		 * read pair of barrier in blk_freeze_queue_start(),
 		 * we need to order reading __PERCPU_REF_DEAD flag of
-		 * .q_usage_counter and reading .mq_freeze_depth or
+		 * .q_usage_counter and reading .freeze_depth or
 		 * queue dying flag, otherwise the following wait may
 		 * never return if the two reads are reordered.
 		 */
 		smp_rmb();
 
-		ret = wait_event_interruptible(q->mq_freeze_wq,
-				!atomic_read(&q->mq_freeze_depth) ||
+		ret = wait_event_interruptible(q->freeze_wq,
+				!atomic_read(&q->freeze_depth) ||
 				blk_queue_dying(q));
 		if (blk_queue_dying(q))
 			return -ENODEV;
@@ -819,7 +819,7 @@ static void blk_queue_usage_counter_release(struct percpu_ref *ref)
 	struct request_queue *q =
 		container_of(ref, struct request_queue, q_usage_counter);
 
-	wake_up_all(&q->mq_freeze_wq);
+	wake_up_all(&q->freeze_wq);
 }
 
 static void blk_rq_timed_out_timer(unsigned long data)
@@ -891,7 +891,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	q->bypass_depth = 1;
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
-	init_waitqueue_head(&q->mq_freeze_wq);
+	init_waitqueue_head(&q->freeze_wq);
 
 	/*
 	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 345943fea998..205ae2d3da14 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -122,7 +122,7 @@ void blk_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
-	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
+	freeze_depth = atomic_inc_return(&q->freeze_depth);
 	if (freeze_depth == 1) {
 		percpu_ref_kill(&q->q_usage_counter);
 		if (q->mq_ops)
@@ -135,14 +135,14 @@ void blk_freeze_queue_wait(struct request_queue *q)
 {
 	if (!q->mq_ops)
 		blk_drain_queue(q);
-	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
+	wait_event(q->freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_wait);
 
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 				     unsigned long timeout)
 {
-	return wait_event_timeout(q->mq_freeze_wq,
+	return wait_event_timeout(q->freeze_wq,
 					percpu_ref_is_zero(&q->q_usage_counter),
 					timeout);
 }
@@ -170,11 +170,11 @@ void blk_unfreeze_queue(struct request_queue *q)
 {
 	int freeze_depth;
 
-	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
+	freeze_depth = atomic_dec_return(&q->freeze_depth);
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
 		percpu_ref_reinit(&q->q_usage_counter);
-		wake_up_all(&q->mq_freeze_wq);
+		wake_up_all(&q->freeze_wq);
 	}
 }
 EXPORT_SYMBOL_GPL(blk_unfreeze_queue);
@@ -2424,7 +2424,7 @@ void blk_mq_free_queue(struct request_queue *q)
 /* Basically redo blk_mq_init_queue with queue frozen */
 static void blk_mq_queue_reinit(struct request_queue *q)
 {
-	WARN_ON_ONCE(!atomic_read(&q->mq_freeze_depth));
+	WARN_ON_ONCE(!atomic_read(&q->freeze_depth));
 
 	blk_mq_debugfs_unregister_hctxs(q);
 	blk_mq_sysfs_unregister(q);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 460294bb0fa5..b8053bcc6b5f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -564,7 +564,7 @@ struct request_queue {
 	struct mutex		sysfs_lock;
 
 	int			bypass_depth;
-	atomic_t		mq_freeze_depth;
+	atomic_t		freeze_depth;
 
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
@@ -576,7 +576,7 @@ struct request_queue {
 	struct throtl_data *td;
 #endif
 	struct rcu_head		rcu_head;
-	wait_queue_head_t	mq_freeze_wq;
+	wait_queue_head_t	freeze_wq;
 	struct percpu_ref	q_usage_counter;
 	struct list_head	all_q_node;
 
-- 
2.9.5


* [PATCH V4 06/10] block: pass flags to blk_queue_enter()
  2017-09-11 11:10 [PATCH V4 0/10] block/scsi: safe SCSI quiescing Ming Lei
                   ` (4 preceding siblings ...)
  2017-09-11 11:10 ` [PATCH V4 05/10] block: rename .mq_freeze_wq and .mq_freeze_depth Ming Lei
@ 2017-09-11 11:10 ` Ming Lei
  2017-09-11 11:10 ` [PATCH V4 07/10] block: introduce preempt version of blk_[freeze|unfreeze]_queue Ming Lei
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2017-09-11 11:10 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley
  Cc: Bart Van Assche, Oleksandr Natalenko, Johannes Thumshirn,
	Cathy Avery, Ming Lei

We need to pass a PREEMPT flag to blk_queue_enter() so that a
request with RQF_PREEMPT can be allocated in a following patch.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       | 10 ++++++----
 block/blk-mq.c         |  5 +++--
 block/blk-timeout.c    |  2 +-
 fs/block_dev.c         |  4 ++--
 include/linux/blkdev.h |  7 ++++++-
 5 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index eec5881a9e74..04327a60061e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -779,7 +779,7 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
-int blk_queue_enter(struct request_queue *q, bool nowait)
+int blk_queue_enter(struct request_queue *q, unsigned flags)
 {
 	while (true) {
 		int ret;
@@ -787,7 +787,7 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
 		if (percpu_ref_tryget_live(&q->q_usage_counter))
 			return 0;
 
-		if (nowait)
+		if (flags & BLK_REQ_NOWAIT)
 			return -EBUSY;
 
 		/*
@@ -1418,7 +1418,8 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	/* create ioc upfront */
 	create_io_context(gfp_mask, q->node);
 
-	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ?
+			BLK_REQ_NOWAIT : 0);
 	if (ret)
 		return ERR_PTR(ret);
 	spin_lock_irq(q->queue_lock);
@@ -2225,7 +2226,8 @@ blk_qc_t generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bio->bi_disk->queue;
 
-		if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
+		if (likely(blk_queue_enter(q, (bio->bi_opf & REQ_NOWAIT) ?
+						BLK_REQ_NOWAIT : 0) == 0)) {
 			struct bio_list lower, same;
 
 			/* Create a fresh bio_list for all subordinate requests */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 205ae2d3da14..358b2ca33010 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -377,7 +377,8 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 	struct request *rq;
 	int ret;
 
-	ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
+	ret = blk_queue_enter(q, (flags & BLK_MQ_REQ_NOWAIT) ?
+			BLK_REQ_NOWAIT : 0);
 	if (ret)
 		return ERR_PTR(ret);
 
@@ -416,7 +417,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	if (hctx_idx >= q->nr_hw_queues)
 		return ERR_PTR(-EIO);
 
-	ret = blk_queue_enter(q, true);
+	ret = blk_queue_enter(q, BLK_REQ_NOWAIT);
 	if (ret)
 		return ERR_PTR(ret);
 
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 17ec83bb0900..e803106a5e5b 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -134,7 +134,7 @@ void blk_timeout_work(struct work_struct *work)
 	struct request *rq, *tmp;
 	int next_set = 0;
 
-	if (blk_queue_enter(q, true))
+	if (blk_queue_enter(q, BLK_REQ_NOWAIT))
 		return;
 	spin_lock_irqsave(q->queue_lock, flags);
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index bb715b2fcfb8..44e6502e377e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -674,7 +674,7 @@ int bdev_read_page(struct block_device *bdev, sector_t sector,
 	if (!ops->rw_page || bdev_get_integrity(bdev))
 		return result;
 
-	result = blk_queue_enter(bdev->bd_queue, false);
+	result = blk_queue_enter(bdev->bd_queue, 0);
 	if (result)
 		return result;
 	result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, false);
@@ -710,7 +710,7 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
 
 	if (!ops->rw_page || bdev_get_integrity(bdev))
 		return -EOPNOTSUPP;
-	result = blk_queue_enter(bdev->bd_queue, false);
+	result = blk_queue_enter(bdev->bd_queue, 0);
 	if (result)
 		return result;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b8053bcc6b5f..54450715915b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -857,6 +857,11 @@ enum {
 	BLKPREP_INVALID,	/* invalid command, kill, return -EREMOTEIO */
 };
 
+/* passed to blk_queue_enter */
+enum {
+	BLK_REQ_NOWAIT = (1 << 0),
+};
+
 extern unsigned long blk_max_low_pfn, blk_max_pfn;
 
 /*
@@ -962,7 +967,7 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
-extern int blk_queue_enter(struct request_queue *q, bool nowait);
+extern int blk_queue_enter(struct request_queue *q, unsigned flags);
 extern void blk_queue_exit(struct request_queue *q);
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_start_queue_async(struct request_queue *q);
-- 
2.9.5


* [PATCH V4 07/10] block: introduce preempt version of blk_[freeze|unfreeze]_queue
  2017-09-11 11:10 [PATCH V4 0/10] block/scsi: safe SCSI quiescing Ming Lei
                   ` (5 preceding siblings ...)
  2017-09-11 11:10 ` [PATCH V4 06/10] block: pass flags to blk_queue_enter() Ming Lei
@ 2017-09-11 11:10 ` Ming Lei
  2017-09-11 11:10 ` [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen Ming Lei
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2017-09-11 11:10 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley
  Cc: Bart Van Assche, Oleksandr Natalenko, Johannes Thumshirn,
	Cathy Avery, Ming Lei

These two APIs are required to allow allocation of RQF_PREEMPT
requests while the queue is preempt frozen.

We have to guarantee that normal freeze and preempt freeze run
exclusively, because for normal freezing, once
blk_freeze_queue_wait() returns, no request may enter the queue
any more.

Another issue to pay attention to is the race between preempt freeze
and blk_cleanup_queue(); it is avoided by not allowing a preempt
freeze to start after the queue becomes dying, otherwise the preempt
freeze may hang forever.
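
The exclusion itself boils down to the following wait loop, a condensed
view of __blk_freeze_queue_start() in the diff below (the dying check
for the preempt case is omitted here):

	spin_lock(&q->freeze_lock);
	wait_event_cmd(q->freeze_wq,
		       /* preempt freeze waits until no normal freeze is
			* active and vice versa; the same kind may nest
			* via freeze_depth */
		       preempt ? !q->normal_freezing : !q->preempt_freezing,
		       spin_unlock(&q->freeze_lock),
		       spin_lock(&q->freeze_lock));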

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       |   2 +
 block/blk-mq.c         | 133 +++++++++++++++++++++++++++++++++++++++++++------
 block/blk.h            |  11 ++++
 include/linux/blk-mq.h |   2 +
 include/linux/blkdev.h |   6 +++
 5 files changed, 140 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 04327a60061e..ade9b5484a6e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -905,6 +905,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	if (blkcg_init_queue(q))
 		goto fail_ref;
 
+	spin_lock_init(&q->freeze_lock);
+
 	return q;
 
 fail_ref:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 358b2ca33010..096c5f0ea518 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -118,19 +118,6 @@ void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
 }
 
-void blk_freeze_queue_start(struct request_queue *q)
-{
-	int freeze_depth;
-
-	freeze_depth = atomic_inc_return(&q->freeze_depth);
-	if (freeze_depth == 1) {
-		percpu_ref_kill(&q->q_usage_counter);
-		if (q->mq_ops)
-			blk_mq_run_hw_queues(q, false);
-	}
-}
-EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
-
 void blk_freeze_queue_wait(struct request_queue *q)
 {
 	if (!q->mq_ops)
@@ -148,6 +135,69 @@ int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait_timeout);
 
+static bool queue_freeze_is_over(struct request_queue *q,
+		bool preempt, bool *queue_dying)
+{
+	/*
+	 * preempt freeze has to be prevented after queue is set as
+	 * dying, otherwise we may hang forever
+	 */
+	if (preempt) {
+		spin_lock_irq(q->queue_lock);
+		*queue_dying = !!blk_queue_dying(q);
+		spin_unlock_irq(q->queue_lock);
+
+		return !q->normal_freezing || *queue_dying;
+	}
+	return !q->preempt_freezing;
+}
+
+static void __blk_freeze_queue_start(struct request_queue *q, bool preempt)
+{
+	int freeze_depth;
+	bool queue_dying;
+
+	/*
+	 * Make sure normal freeze and preempt freeze are run
+	 * exclusively, but each kind itself is allowed to be
+	 * run concurrently, even nested.
+	 */
+	spin_lock(&q->freeze_lock);
+	wait_event_cmd(q->freeze_wq,
+		       queue_freeze_is_over(q, preempt, &queue_dying),
+		       spin_unlock(&q->freeze_lock),
+		       spin_lock(&q->freeze_lock));
+
+	if (preempt && queue_dying)
+		goto unlock;
+
+	freeze_depth = atomic_inc_return(&q->freeze_depth);
+	if (freeze_depth == 1) {
+		if (preempt) {
+			q->preempt_freezing = 1;
+			q->preempt_unfreezing = 0;
+		} else
+			q->normal_freezing = 1;
+		spin_unlock(&q->freeze_lock);
+
+		percpu_ref_kill(&q->q_usage_counter);
+		if (q->mq_ops)
+			blk_mq_run_hw_queues(q, false);
+
+		/* have to drain I/O here for preempt quiesce */
+		if (preempt)
+			blk_freeze_queue_wait(q);
+	} else
+ unlock:
+		spin_unlock(&q->freeze_lock);
+}
+
+void blk_freeze_queue_start(struct request_queue *q)
+{
+	__blk_freeze_queue_start(q, false);
+}
+EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
+
 /*
  * Guarantee no request is in use, so we can change any data structure of
  * the queue afterward.
@@ -166,20 +216,75 @@ void blk_freeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue);
 
-void blk_unfreeze_queue(struct request_queue *q)
+static void blk_start_unfreeze_queue_preempt(struct request_queue *q)
+{
+	/* no new request can be coming after unfreezing */
+	spin_lock(&q->freeze_lock);
+	q->preempt_unfreezing = 1;
+	spin_unlock(&q->freeze_lock);
+
+	blk_freeze_queue_wait(q);
+}
+
+static void __blk_unfreeze_queue(struct request_queue *q, bool preempt)
 {
 	int freeze_depth;
 
 	freeze_depth = atomic_dec_return(&q->freeze_depth);
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
+		if (preempt)
+			blk_start_unfreeze_queue_preempt(q);
+
 		percpu_ref_reinit(&q->q_usage_counter);
+
+		/*
+		 * clearing the freeze flag so that any pending
+		 * freeze can move on
+		 */
+		spin_lock(&q->freeze_lock);
+		if (preempt)
+			q->preempt_freezing = 0;
+		else
+			q->normal_freezing = 0;
+		spin_unlock(&q->freeze_lock);
 		wake_up_all(&q->freeze_wq);
 	}
 }
+
+void blk_unfreeze_queue(struct request_queue *q)
+{
+	__blk_unfreeze_queue(q, false);
+}
 EXPORT_SYMBOL_GPL(blk_unfreeze_queue);
 
 /*
+ * Once this function is returned, only allow to get request
+ * of RQF_PREEMPT.
+ */
+void blk_freeze_queue_preempt(struct request_queue *q)
+{
+	/*
+	 * If queue isn't in preempt_frozen, the queue has
+	 * to be dying, so do nothing since no I/O can
+	 * succeed any more.
+	 */
+	__blk_freeze_queue_start(q, true);
+}
+EXPORT_SYMBOL_GPL(blk_freeze_queue_preempt);
+
+void blk_unfreeze_queue_preempt(struct request_queue *q)
+{
+	/*
+	 * If queue isn't in preempt_frozen, the queue should
+	 * be dying, so do nothing since no I/O can succeed.
+	 */
+	if (blk_queue_is_preempt_frozen(q))
+		__blk_unfreeze_queue(q, true);
+}
+EXPORT_SYMBOL_GPL(blk_unfreeze_queue_preempt);
+
+/*
  * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
  * mpt3sas driver such that this function can be removed.
  */
diff --git a/block/blk.h b/block/blk.h
index 21eed59d96db..243b2e7e5098 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -79,6 +79,17 @@ static inline void blk_queue_enter_live(struct request_queue *q)
 	percpu_ref_get(&q->q_usage_counter);
 }
 
+static inline bool blk_queue_is_preempt_frozen(struct request_queue *q)
+{
+	bool preempt_frozen;
+
+	spin_lock(&q->freeze_lock);
+	preempt_frozen = q->preempt_freezing && !q->preempt_unfreezing;
+	spin_unlock(&q->freeze_lock);
+
+	return preempt_frozen;
+}
+
 #ifdef CONFIG_BLK_DEV_INTEGRITY
 void blk_flush_integrity(void);
 bool __bio_integrity_endio(struct bio *);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 62c3d1f7d12a..54b160bcb6a2 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -255,6 +255,8 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv);
 void blk_freeze_queue(struct request_queue *q);
 void blk_unfreeze_queue(struct request_queue *q);
+void blk_freeze_queue_preempt(struct request_queue *q);
+void blk_unfreeze_queue_preempt(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
 void blk_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 54450715915b..3c14c9588dcf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -566,6 +566,12 @@ struct request_queue {
 	int			bypass_depth;
 	atomic_t		freeze_depth;
 
+	/* for run normal freeze and preempt freeze exclusive */
+	spinlock_t		freeze_lock;
+	unsigned		normal_freezing:1;
+	unsigned		preempt_freezing:1;
+	unsigned		preempt_unfreezing:1;
+
 #if defined(CONFIG_BLK_DEV_BSG)
 	bsg_job_fn		*bsg_job_fn;
 	struct bsg_class_device bsg_dev;
-- 
2.9.5


* [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen
  2017-09-11 11:10 [PATCH V4 0/10] block/scsi: safe SCSI quiescing Ming Lei
                   ` (6 preceding siblings ...)
  2017-09-11 11:10 ` [PATCH V4 07/10] block: introduce preempt version of blk_[freeze|unfreeze]_queue Ming Lei
@ 2017-09-11 11:10 ` Ming Lei
  2017-09-11 16:03     ` Bart Van Assche
  2017-09-11 11:10 ` [PATCH V4 09/10] SCSI: transport_spi: resume a quiesced device Ming Lei
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2017-09-11 11:10 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley
  Cc: Bart Van Assche, Oleksandr Natalenko, Johannes Thumshirn,
	Cathy Avery, Ming Lei

RQF_PREEMPT is a bit special because such a request is required to be
dispatched to the LLD even when the SCSI device is quiesced.

So this patch introduces __blk_get_request() to allow the block layer
to allocate a request while the queue is preempt frozen, since we will
preempt-freeze the queue before quiescing the SCSI device in a
following patch to support safe SCSI quiescing.
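
The intended caller-side usage (this exact call is added on the SCSI
side by a later patch of this series) looks like:

	req = __blk_get_request(sdev->request_queue,
			data_direction == DMA_TO_DEVICE ?
			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM,
			BLK_REQ_PREEMPT);

With BLK_REQ_PREEMPT set, blk_queue_enter() grabs .q_usage_counter
unconditionally while the queue is preempt frozen, so the allocation
can't get stuck behind the freeze.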

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-core.c       | 48 ++++++++++++++++++++++++++++++++++++++----------
 block/blk-mq.c         |  3 +--
 include/linux/blk-mq.h |  7 ++++---
 include/linux/blkdev.h | 17 ++++++++++++++---
 4 files changed, 57 insertions(+), 18 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ade9b5484a6e..1c8e264753f0 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -787,6 +787,35 @@ int blk_queue_enter(struct request_queue *q, unsigned flags)
 		if (percpu_ref_tryget_live(&q->q_usage_counter))
 			return 0;
 
+		/*
+		 * If queue is preempt frozen and caller need to allocate
+		 * request for RQF_PREEMPT, we grab the .q_usage_counter
+		 * unconditionally and return successfully.
+		 *
+		 * There isn't race with queue cleanup because:
+		 *
+		 * 1) it is guaranteed that preempt freeze can't be
+		 * started after queue is set as dying
+		 *
+		 * 2) normal freeze runs exclusively with preempt
+		 * freeze, so even after queue is set as dying
+		 * afterwards, blk_queue_cleanup() won't move on
+		 * until preempt freeze is done
+		 *
+		 * 3) blk_queue_dying() needn't to be checked here
+		 * 	- for legacy path, it will be checked in
+		 * 	__get_request()
+		 * 	- blk-mq depends on driver to handle dying well
+		 * 	because it is normal for queue to be set as dying
+		 * 	just between blk_queue_enter() and allocating new
+		 * 	request.
+		 */
+		if ((flags & BLK_REQ_PREEMPT) &&
+				blk_queue_is_preempt_frozen(q)) {
+			blk_queue_enter_live(q);
+			return 0;
+		}
+
 		if (flags & BLK_REQ_NOWAIT)
 			return -EBUSY;
 
@@ -1410,7 +1439,8 @@ static struct request *get_request(struct request_queue *q, unsigned int op,
 }
 
 static struct request *blk_old_get_request(struct request_queue *q,
-					   unsigned int op, gfp_t gfp_mask)
+					   unsigned int op, gfp_t gfp_mask,
+					   unsigned int flags)
 {
 	struct request *rq;
 	int ret = 0;
@@ -1420,8 +1450,7 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	/* create ioc upfront */
 	create_io_context(gfp_mask, q->node);
 
-	ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ?
-			BLK_REQ_NOWAIT : 0);
+	ret = blk_queue_enter(q, flags & BLK_REQ_BITS_MASK);
 	if (ret)
 		return ERR_PTR(ret);
 	spin_lock_irq(q->queue_lock);
@@ -1439,26 +1468,25 @@ static struct request *blk_old_get_request(struct request_queue *q,
 	return rq;
 }
 
-struct request *blk_get_request(struct request_queue *q, unsigned int op,
-				gfp_t gfp_mask)
+struct request *__blk_get_request(struct request_queue *q, unsigned int op,
+				  gfp_t gfp_mask, unsigned int flags)
 {
 	struct request *req;
 
+	flags |= gfp_mask & __GFP_DIRECT_RECLAIM ? 0 : BLK_REQ_NOWAIT;
 	if (q->mq_ops) {
-		req = blk_mq_alloc_request(q, op,
-			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
-				0 : BLK_MQ_REQ_NOWAIT);
+		req = blk_mq_alloc_request(q, op, flags);
 		if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
 			q->mq_ops->initialize_rq_fn(req);
 	} else {
-		req = blk_old_get_request(q, op, gfp_mask);
+		req = blk_old_get_request(q, op, gfp_mask, flags);
 		if (!IS_ERR(req) && q->initialize_rq_fn)
 			q->initialize_rq_fn(req);
 	}
 
 	return req;
 }
-EXPORT_SYMBOL(blk_get_request);
+EXPORT_SYMBOL(__blk_get_request);
 
 /**
  * blk_requeue_request - put a request back on queue
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 096c5f0ea518..720559724f97 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -482,8 +482,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 	struct request *rq;
 	int ret;
 
-	ret = blk_queue_enter(q, (flags & BLK_MQ_REQ_NOWAIT) ?
-			BLK_REQ_NOWAIT : 0);
+	ret = blk_queue_enter(q, flags & BLK_REQ_BITS_MASK);
 	if (ret)
 		return ERR_PTR(ret);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 54b160bcb6a2..3cb3856f83db 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -197,9 +197,10 @@ void blk_mq_free_request(struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
 
 enum {
-	BLK_MQ_REQ_NOWAIT	= (1 << 0), /* return when out of requests */
-	BLK_MQ_REQ_RESERVED	= (1 << 1), /* allocate from reserved pool */
-	BLK_MQ_REQ_INTERNAL	= (1 << 2), /* allocate internal/sched tag */
+	BLK_MQ_REQ_NOWAIT	= BLK_REQ_NOWAIT, /* return when out of requests */
+	BLK_MQ_REQ_PREEMPT	= BLK_REQ_PREEMPT, /* allocate for RQF_PREEMPT */
+	BLK_MQ_REQ_RESERVED	= (1 << BLK_REQ_MQ_START_BIT), /* allocate from reserved pool */
+	BLK_MQ_REQ_INTERNAL	= (1 << (BLK_REQ_MQ_START_BIT + 1)), /* allocate internal/sched tag */
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3c14c9588dcf..99c252bb5361 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -865,7 +865,10 @@ enum {
 
 /* passed to blk_queue_enter */
 enum {
-	BLK_REQ_NOWAIT = (1 << 0),
+	BLK_REQ_NOWAIT		= (1 << 0),
+	BLK_REQ_PREEMPT		= (1 << 1),
+	BLK_REQ_MQ_START_BIT	= 2,
+	BLK_REQ_BITS_MASK	= (1U << BLK_REQ_MQ_START_BIT) - 1,
 };
 
 extern unsigned long blk_max_low_pfn, blk_max_pfn;
@@ -950,8 +953,9 @@ extern void blk_rq_init(struct request_queue *q, struct request *rq);
 extern void blk_init_request_from_bio(struct request *req, struct bio *bio);
 extern void blk_put_request(struct request *);
 extern void __blk_put_request(struct request_queue *, struct request *);
-extern struct request *blk_get_request(struct request_queue *, unsigned int op,
-				       gfp_t gfp_mask);
+extern struct request *__blk_get_request(struct request_queue *,
+					 unsigned int op, gfp_t gfp_mask,
+					 unsigned int flags);
 extern void blk_requeue_request(struct request_queue *, struct request *);
 extern int blk_lld_busy(struct request_queue *q);
 extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
@@ -1002,6 +1006,13 @@ blk_status_t errno_to_blk_status(int errno);
 
 bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie);
 
+static inline struct request *blk_get_request(struct request_queue *q,
+					      unsigned int op,
+					      gfp_t gfp_mask)
+{
+	return __blk_get_request(q, op, gfp_mask, 0);
+}
+
 static inline struct request_queue *bdev_get_queue(struct block_device *bdev)
 {
 	return bdev->bd_disk->queue;	/* this is never NULL */
-- 
2.9.5


* [PATCH V4 09/10] SCSI: transport_spi: resume a quiesced device
  2017-09-11 11:10 [PATCH V4 0/10] block/scsi: safe SCSI quiescing Ming Lei
                   ` (7 preceding siblings ...)
  2017-09-11 11:10 ` [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen Ming Lei
@ 2017-09-11 11:10 ` Ming Lei
  2017-09-11 11:10 ` [PATCH V4 10/10] SCSI: preempt freeze block queue when SCSI device is put into quiesce Ming Lei
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2017-09-11 11:10 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley
  Cc: Bart Van Assche, Oleksandr Natalenko, Johannes Thumshirn,
	Cathy Avery, Ming Lei

We are going to preempt-freeze the queue in scsi_device_quiesce() and
unfreeze it in scsi_device_resume(), so call scsi_device_resume() for
the device that was quiesced by scsi_device_quiesce().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_transport_spi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/scsi_transport_spi.c b/drivers/scsi/scsi_transport_spi.c
index d0219e36080c..bdecf99b6ab1 100644
--- a/drivers/scsi/scsi_transport_spi.c
+++ b/drivers/scsi/scsi_transport_spi.c
@@ -1040,6 +1040,9 @@ spi_dv_device(struct scsi_device *sdev)
 
 	scsi_target_resume(starget);
 
+	/* undo what scsi_device_quiesce() did */
+	scsi_device_resume(sdev);
+
 	spi_initial_dv(starget) = 1;
 
  out_free:
-- 
2.9.5


* [PATCH V4 10/10] SCSI: preempt freeze block queue when SCSI device is put into quiesce
  2017-09-11 11:10 [PATCH V4 0/10] block/scsi: safe SCSI quiescing Ming Lei
                   ` (8 preceding siblings ...)
  2017-09-11 11:10 ` [PATCH V4 09/10] SCSI: transport_spi: resume a quiesced device Ming Lei
@ 2017-09-11 11:10 ` Ming Lei
  2017-09-11 21:24   ` Oleksandr Natalenko
  2017-09-12 19:03 ` Cathy Avery
  11 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2017-09-11 11:10 UTC (permalink / raw)
  To: Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley
  Cc: Bart Van Assche, Oleksandr Natalenko, Johannes Thumshirn,
	Cathy Avery, Ming Lei

Simply quiescing the SCSI device and waiting for completion of the I/O
already dispatched to the SCSI queue isn't safe: it is easy to use up
the request pool because none of the requests allocated beforehand can
be dispatched once the device is put into QUIESCE. Then no request can
be allocated for RQF_PREEMPT, and the system may hang somewhere, such
as when sending sync_cache or start_stop commands during the system
suspend path.

Before quiescing SCSI, this patch first freezes the block queue in
preempt mode, so no new normal request can enter the queue any more,
and all pending requests are drained once blk_freeze_queue_preempt()
returns. Then RQF_PREEMPT requests can be allocated successfully
during the preempt freeze.
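
Note that the freeze also has to stay balanced on the failure path;
condensed from the diff below (state_mutex locking omitted):

	blk_freeze_queue_preempt(sdev->request_queue);
	err = scsi_device_set_state(sdev, SDEV_QUIESCE);
	if (err) {
		blk_unfreeze_queue_preempt(sdev->request_queue);
		return err;
	}

The matching blk_unfreeze_queue_preempt() for the success case is done
in scsi_device_resume().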

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cf6a80fe297..751a956b7b2b 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -252,9 +252,10 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	struct scsi_request *rq;
 	int ret = DRIVER_ERROR << 24;
 
-	req = blk_get_request(sdev->request_queue,
+	req = __blk_get_request(sdev->request_queue,
 			data_direction == DMA_TO_DEVICE ?
-			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+			REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM,
+			BLK_REQ_PREEMPT);
 	if (IS_ERR(req))
 		return ret;
 	rq = scsi_req(req);
@@ -2928,12 +2929,28 @@ scsi_device_quiesce(struct scsi_device *sdev)
 {
 	int err;
 
+	/*
+	 * Simply quiescing the SCSI device isn't safe: it is easy
+	 * to use up the request pool because the already allocated
+	 * requests can't be dispatched while the device is in QUIESCE.
+	 * Then no request can be allocated and we may hang
+	 * somewhere, such as system suspend/resume.
+	 *
+	 * So we freeze the block queue in preempt mode first; no new
+	 * normal request can enter the queue any more, and all pending
+	 * requests are drained once blk_freeze_queue_preempt()
+	 * returns. Only RQF_PREEMPT is allowed in preempt freeze.
+	 */
+	blk_freeze_queue_preempt(sdev->request_queue);
+
 	mutex_lock(&sdev->state_mutex);
 	err = scsi_device_set_state(sdev, SDEV_QUIESCE);
 	mutex_unlock(&sdev->state_mutex);
 
-	if (err)
+	if (err) {
+		blk_unfreeze_queue_preempt(sdev->request_queue);
 		return err;
+	}
 
 	scsi_run_queue(sdev->request_queue);
 	while (atomic_read(&sdev->device_busy)) {
@@ -2964,6 +2981,8 @@ void scsi_device_resume(struct scsi_device *sdev)
 	    scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
 		scsi_run_queue(sdev->request_queue);
 	mutex_unlock(&sdev->state_mutex);
+
+	blk_unfreeze_queue_preempt(sdev->request_queue);
 }
 EXPORT_SYMBOL(scsi_device_resume);
 
-- 
2.9.5

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen
@ 2017-09-11 16:03     ` Bart Van Assche
  0 siblings, 0 replies; 27+ messages in thread
From: Bart Van Assche @ 2017-09-11 16:03 UTC (permalink / raw)
  To: linux-scsi, hch, jejb, linux-block, axboe, ming.lei, martin.petersen
  Cc: Bart Van Assche, jthumshirn, oleksandr, cavery

On Mon, 2017-09-11 at 19:10 +0800, Ming Lei wrote:
> @@ -787,6 +787,35 @@ int blk_queue_enter(struct request_queue *q, unsigned flags)
>  		if (percpu_ref_tryget_live(&q->q_usage_counter))
>  			return 0;
>  
> +		/*
> +		 * If queue is preempt frozen and caller need to allocate
> +		 * request for RQF_PREEMPT, we grab the .q_usage_counter
> +		 * unconditionally and return successfully.
> +		 *
> +		 * There isn't race with queue cleanup because:
> +		 *
> +		 * 1) it is guaranteed that preempt freeze can't be
> +		 * started after queue is set as dying
> +		 *
> +		 * 2) normal freeze runs exclusively with preempt
> +		 * freeze, so even after queue is set as dying
> +		 * afterwards, blk_queue_cleanup() won't move on
> +		 * until preempt freeze is done
> +		 *
> +		 * 3) blk_queue_dying() needn't to be checked here
> +		 * 	- for legacy path, it will be checked in
> +		 * 	__get_request()

For the legacy block layer core, what do you think will happen if the
"dying" state is set by another thread after __get_request() has passed the
blk_queue_dying() check?

> +		 * 	- blk-mq depends on driver to handle dying well
> +		 * 	because it is normal for queue to be set as dying
> +		 * 	just between blk_queue_enter() and allocating new
> +		 * 	request.

The above comment is not correct. The block layer core handles the "dying"
state. Block drivers other than dm-mpath should not have to query this state
directly.

> +		 */
> +		if ((flags & BLK_REQ_PREEMPT) &&
> +				blk_queue_is_preempt_frozen(q)) {
> +			blk_queue_enter_live(q);
> +			return 0;
> +		}
> +

Sorry but to me it looks like the above code introduces a race condition
between blk_queue_cleanup() and blk_get_request() for at least blk-mq.
Consider e.g. the following scenario:
* A first thread preempt-freezes a request queue.
* A second thread calls blk_get_request() with BLK_REQ_PREEMPT set. That
  results in a call of blk_queue_is_preempt_frozen().
* A context switch occurs to the first thread.
* The first thread preempt-unfreezes the same request queue and calls
  blk_queue_cleanup(). That last function changes the request queue state
  into DYING and waits until all pending requests have finished.
* The second thread continues and calls blk_queue_enter_live(), allocates
  a request and submits it.

In other words, a request gets submitted against a dying queue. This must
not happen. See also my explanation of queue shutdown from a few days ago
(https://marc.info/?l=linux-block&m=150449845831789).

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V4 0/10] block/scsi: safe SCSI quiescing
@ 2017-09-11 21:24   ` Oleksandr Natalenko
  0 siblings, 0 replies; 27+ messages in thread
From: Oleksandr Natalenko @ 2017-09-11 21:24 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley, Bart Van Assche,
	Johannes Thumshirn, Cathy Avery

For v4 with regard to suspend/resume:

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>

On pondělí 11. září 2017 13:10:11 CEST Ming Lei wrote:
> Hi,
> 
> The current SCSI quiesce isn't safe and easy to trigger I/O deadlock.
> 
> Once SCSI device is put into QUIESCE, no new request except for
> RQF_PREEMPT can be dispatched to SCSI successfully, and
> scsi_device_quiesce() just simply waits for completion of I/Os
> dispatched to SCSI stack. It isn't enough at all.
> 
> Because new request still can be comming, but all the allocated
> requests can't be dispatched successfully, so request pool can be
> consumed up easily.
> 
> Then request with RQF_PREEMPT can't be allocated and wait forever,
> meantime scsi_device_resume() waits for completion of RQF_PREEMPT,
> then system hangs forever, such as during system suspend or
> sending SCSI domain alidation.
> 
> Both IO hang inside system suspend[1] or SCSI domain validation
> were reported before.
> 
> This patch introduces preempt freeze, and solves the issue
> by preempt freezing block queue during SCSI quiesce, and allows
> to allocate request of RQF_PREEMPT when queue is in this state.
> 
> Oleksandr verified that V3 does fix the hang during suspend/resume,
> and Cathy verified that revised V3 fixes hang in sending
> SCSI domain validation.
> 
> Both SCSI and SCSI_MQ have this IO deadlock issue, this patch fixes
> them all by introducing/unifying blk_freeze_queue_preempt() and
> blk_unfreeze_queue_preempt(), and cleanup is done together.
> 
> The patchset can be found in the following gitweb:
> 
> 	https://github.com/ming1/linux/tree/blk_safe_scsi_quiesce_V4
> 
> V4:
> 	- reorganize patch order to make it more reasonable
> 	- support nested preempt freeze, as required by SCSI transport spi
> 	- check preempt freezing in slow path of of blk_queue_enter()
> 	- add "SCSI: transport_spi: resume a quiesced device"
> 	- wake up freeze queue in setting dying for both blk-mq and legacy
> 	- rename blk_mq_[freeze|unfreeze]_queue() in one patch
> 	- rename .mq_freeze_wq and .mq_freeze_depth
> 	- improve comment
> 
> V3:
> 	- introduce q->preempt_unfreezing to fix one bug of preempt freeze
> 	- call blk_queue_enter_live() only when queue is preempt frozen
> 	- cleanup a bit on the implementation of preempt freeze
> 	- only patch 6 and 7 are changed
> 
> V2:
> 	- drop the 1st patch in V1 because percpu_ref_is_dying() is
> 	enough as pointed by Tejun
> 	- introduce preempt version of blk_[freeze|unfreeze]_queue
> 	- sync between preempt freeze and normal freeze
> 	- fix warning from percpu-refcount as reported by Oleksandr
> 
> 
> [1] https://marc.info/?t=150340250100013&r=3&w=2
> 
> 
> Thanks,
> Ming
> 
> 
> Ming Lei (10):
>   blk-mq: only run hw queues for blk-mq
>   block: tracking request allocation with q_usage_counter
>   blk-mq: rename blk_mq_[freeze|unfreeze]_queue
>   blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait
>   block: rename .mq_freeze_wq and .mq_freeze_depth
>   block: pass flags to blk_queue_enter()
>   block: introduce preempt version of blk_[freeze|unfreeze]_queue
>   block: allow to allocate req with RQF_PREEMPT when queue is preempt
>     frozen
>   SCSI: transport_spi: resume a quiesced device
>   SCSI: preempt freeze block queue when SCSI device is put into quiesce
> 
>  block/bfq-iosched.c               |   2 +-
>  block/blk-cgroup.c                |   8 +-
>  block/blk-core.c                  |  95 ++++++++++++++++----
>  block/blk-mq.c                    | 180 ++++++++++++++++++++++++++++----------
>  block/blk-mq.h                    |   1 -
>  block/blk-timeout.c               |   2 +-
>  block/blk.h                       |  12 +++
>  block/elevator.c                  |   4 +-
>  drivers/block/loop.c              |  24 ++---
>  drivers/block/rbd.c               |   2 +-
>  drivers/nvme/host/core.c          |   8 +-
>  drivers/scsi/scsi_lib.c           |  25 +++++-
>  drivers/scsi/scsi_transport_spi.c |   3 +
>  fs/block_dev.c                    |   4 +-
>  include/linux/blk-mq.h            |  15 ++--
>  include/linux/blkdev.h            |  32 +++++--
>  16 files changed, 313 insertions(+), 104 deletions(-)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen
  2017-09-11 16:03     ` Bart Van Assche
  (?)
@ 2017-09-12  3:40     ` Ming Lei
  2017-09-13 16:48       ` Ming Lei
  -1 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2017-09-12  3:40 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-scsi, hch, jejb, linux-block, axboe, martin.petersen,
	jthumshirn, oleksandr, cavery

On Mon, Sep 11, 2017 at 04:03:55PM +0000, Bart Van Assche wrote:
> On Mon, 2017-09-11 at 19:10 +0800, Ming Lei wrote:
> > @@ -787,6 +787,35 @@ int blk_queue_enter(struct request_queue *q, unsigned flags)
> >  		if (percpu_ref_tryget_live(&q->q_usage_counter))
> >  			return 0;
> >  
> > +		/*
> > +		 * If queue is preempt frozen and caller need to allocate
> > +		 * request for RQF_PREEMPT, we grab the .q_usage_counter
> > +		 * unconditionally and return successfully.
> > +		 *
> > +		 * There isn't race with queue cleanup because:
> > +		 *
> > +		 * 1) it is guaranteed that preempt freeze can't be
> > +		 * started after queue is set as dying
> > +		 *
> > +		 * 2) normal freeze runs exclusively with preempt
> > +		 * freeze, so even after queue is set as dying
> > +		 * afterwards, blk_queue_cleanup() won't move on
> > +		 * until preempt freeze is done
> > +		 *
> > +		 * 3) blk_queue_dying() needn't to be checked here
> > +		 * 	- for legacy path, it will be checked in
> > +		 * 	__get_request()
> 
> For the legacy block layer core, what do you think will happen if the
> "dying" state is set by another thread after __get_request() has passed the
> blk_queue_dying() check?

Without this patchset, the block core still needs to handle the above
situation, so your question isn't related to this patchset.

Also, q->queue_lock is required both when setting dying and when checking
dying in __get_request(). But the lock can be released inside
__get_request(), so it is possible to allocate a request after the
queue is set as dying, and that request can be dispatched to a
dying queue too in the legacy path.
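
To make that ordering concrete, here is a minimal userspace model of it
(purely illustrative, not kernel code; get_request_model() and set_dying()
are made-up stand-ins for __get_request() and for marking the queue dying).
The "dying" check passes, the lock is dropped, the flag gets set, and the
request is still handed out:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static bool dying;

/* stand-in for __get_request(); window() models whatever another
 * thread does while the lock is dropped */
static int get_request_model(void (*window)(void))
{
	pthread_mutex_lock(&queue_lock);
	if (dying) {			/* dying is checked under the lock */
		pthread_mutex_unlock(&queue_lock);
		return -1;
	}
	pthread_mutex_unlock(&queue_lock); /* dropped to wait for a free request */
	window();			/* dying can be set right here */
	pthread_mutex_lock(&queue_lock);
	pthread_mutex_unlock(&queue_lock);
	return 0;			/* request handed out anyway */
}

static void set_dying(void)
{
	dying = true;
}

int main(void)
{
	int ret = get_request_model(set_dying);

	printf("request allocated: %d, queue dying: %d\n", ret == 0, dying);
	return 0;
}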

> 
> > +		 * 	- blk-mq depends on driver to handle dying well
> > +		 * 	because it is normal for queue to be set as dying
> > +		 * 	just between blk_queue_enter() and allocating new
> > +		 * 	request.
> 
> The above comment is not correct. The block layer core handles the "dying"
> state. Block drivers other than dm-mpath should not have to query this state
> directly.

If blk-mq doesn't query the dying state, how does it know the queue is
dying and handle that state? Also, blk-mq isn't different from legacy
with respect to depending on the driver to handle dying.

> 
> > +		 */
> > +		if ((flags & BLK_REQ_PREEMPT) &&
> > +				blk_queue_is_preempt_frozen(q)) {
> > +			blk_queue_enter_live(q);
> > +			return 0;
> > +		}
> > +
> 
> Sorry but to me it looks like the above code introduces a race condition
> between blk_queue_cleanup() and blk_get_request() for at least blk-mq.
> Consider e.g. the following scenario:
> * A first thread preempt-freezes a request queue.
> * A second thread calls blk_get_request() with BLK_REQ_PREEMPT set. That
>   results in a call of blk_queue_is_preempt_frozen().
> * A context switch occurs to the first thread.
> * The first thread preempt-unfreezes the same request queue and calls
>   blk_queue_cleanup(). That last function changes the request queue state
>   into DYING and waits until all pending requests have finished.
> * The second thread continues and calls blk_queue_enter_live(), allocates
>   a request and submits it.

OK, that looks like a race I didn't think of, but it can be fixed easily
by calling blk_queue_enter_live() while holding q->freeze_lock, and it
won't cause a performance issue either since it is in the slow path.

For example, we can introduce the following code in blk_queue_enter():

		if ((flags & BLK_REQ_PREEMPT) &&
				blk_queue_enter_preempt_freeze(q))
			return 0;

static inline bool blk_queue_enter_preempt_freeze(struct request_queue *q)
{
	bool preempt_frozen;

	spin_lock(&q->freeze_lock);
	preempt_frozen = q->preempt_freezing && !q->preempt_unfreezing;
	if (preempt_frozen)
		blk_queue_enter_live(q);
	spin_unlock(&q->freeze_lock);

	return preempt_frozen;
}
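
To show why doing the check and the grab in one critical section matters,
here is a minimal userspace model (purely illustrative, not kernel code;
the fields only mimic q->preempt_freezing, the DYING flag and
q->q_usage_counter). The stale-check variant replays your scenario and
enters a dying queue; the variant that looks at the current state inside
the critical section does not:

#include <stdbool.h>
#include <stdio.h>

struct queue_model {
	bool frozen;	/* stand-in for q->preempt_freezing */
	bool dying;	/* stand-in for the DYING flag      */
	int  usage;	/* stand-in for q->q_usage_counter  */
};

/* the frozen check happened earlier, outside any lock; only the
 * counter grab happens now */
static bool enter_stale(struct queue_model *q, bool frozen_seen_earlier)
{
	if (frozen_seen_earlier) {
		q->usage++;		/* may run after cleanup has started */
		return true;
	}
	return false;
}

/* models check + grab done in one critical section: it always acts
 * on the current state, never on a stale sample */
static bool enter_atomic(struct queue_model *q)
{
	if (q->frozen) {
		q->usage++;
		return true;
	}
	return false;
}

int main(void)
{
	struct queue_model q = { .frozen = true };

	bool seen = q.frozen;	/* thread 2 samples the frozen state  */
	q.frozen = false;	/* thread 1 unfreezes ...             */
	q.dying = true;		/* ... and starts cleaning up         */

	printf("stale check enters dying queue:  %d\n",
	       enter_stale(&q, seen) && q.dying);
	q.usage = 0;
	printf("atomic check enters dying queue: %d\n",
	       enter_atomic(&q) && q.dying);
	return 0;
}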

> 
> In other words, a request gets submitted against a dying queue. This must
> not happen. See also my explanation of queue shutdown from a few days ago

That is not correct; think about why queue dead is checked in
__blk_run_queue_uncond() instead of queue dying. We still need to
submit requests to the driver when the queue is dying, and the driver
knows how to handle that.

> (https://marc.info/?l=linux-block&m=150449845831789).

> from (https://marc.info/?l=linux-block&m=150449845831789).
>> Do you understand how request queue cleanup works? The algorithm used for
>> request queue cleanup is as follows:
>> * Set the DYING flag. This flag makes all later blk_get_request() calls
>>   fail.

Your description isn't true for either legacy or blk-mq:

For legacy, as you can see, q->queue_lock can be released in
__get_request(); at that time the queue can be set as dying,
but the request can still be allocated successfully and submitted
to the driver. It has worked this way for a long time.

For blk-mq, a request is allocated as follows:

        ret = blk_queue_enter(q, ...);
        if (ret)
                return ERR_PTR(ret);
        rq = blk_mq_get_request(q, ...);

Setting the queue as dying can happen just between blk_queue_enter()
and blk_mq_get_request(), so it is normal to see requests sent
to the driver after the queue is dying.

We shouldn't worry about that, because either cleaning up the queue or
setting dying is triggered by the driver, so the driver knows that the
queue is dying at that time and can handle the request well in this
situation.

>> * Wait until all pending requests have finished.
>> * Set the DEAD flag. For the traditional block layer, this flag causes
>>   blk_run_queue() not to call .request_fn() anymore. For blk-mq it is
>>   guaranteed in another way that .queue_rq() won't be called anymore after
>>   this flag has been set.

That is true, but still not related to this patch.


-- 
Ming

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V4 0/10] block/scsi: safe SCSI quiescing
  2017-09-11 11:10 [PATCH V4 0/10] block/scsi: safe SCSI quiescing Ming Lei
                   ` (10 preceding siblings ...)
  2017-09-11 21:24   ` Oleksandr Natalenko
@ 2017-09-12 19:03 ` Cathy Avery
  11 siblings, 0 replies; 27+ messages in thread
From: Cathy Avery @ 2017-09-12 19:03 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, Christoph Hellwig, linux-scsi,
	Martin K . Petersen, James E . J . Bottomley
  Cc: Bart Van Assche, Oleksandr Natalenko, Johannes Thumshirn

On 09/11/2017 07:10 AM, Ming Lei wrote:
> Hi,
>
> The current SCSI quiesce isn't safe and easy to trigger I/O deadlock.
>
> Once SCSI device is put into QUIESCE, no new request except for
> RQF_PREEMPT can be dispatched to SCSI successfully, and
> scsi_device_quiesce() just simply waits for completion of I/Os
> dispatched to SCSI stack. It isn't enough at all.
>
> Because new request still can be comming, but all the allocated
> requests can't be dispatched successfully, so request pool can be
> consumed up easily.
>
> Then request with RQF_PREEMPT can't be allocated and wait forever,
> meantime scsi_device_resume() waits for completion of RQF_PREEMPT,
> then system hangs forever, such as during system suspend or
> sending SCSI domain alidation.
>
> Both IO hang inside system suspend[1] or SCSI domain validation
> were reported before.
>
> This patch introduces preempt freeze, and solves the issue
> by preempt freezing block queue during SCSI quiesce, and allows
> to allocate request of RQF_PREEMPT when queue is in this state.
>
> Oleksandr verified that V3 does fix the hang during suspend/resume,
> and Cathy verified that revised V3 fixes hang in sending
> SCSI domain validation.
>
> Both SCSI and SCSI_MQ have this IO deadlock issue, this patch fixes
> them all by introducing/unifying blk_freeze_queue_preempt() and
> blk_unfreeze_queue_preempt(), and cleanup is done together.
>
> The patchset can be found in the following gitweb:
>
> 	https://github.com/ming1/linux/tree/blk_safe_scsi_quiesce_V4
>
> V4:
> 	- reorganize patch order to make it more reasonable
> 	- support nested preempt freeze, as required by SCSI transport spi
> 	- check preempt freezing in slow path of of blk_queue_enter()
> 	- add "SCSI: transport_spi: resume a quiesced device"
> 	- wake up freeze queue in setting dying for both blk-mq and legacy
> 	- rename blk_mq_[freeze|unfreeze]_queue() in one patch
> 	- rename .mq_freeze_wq and .mq_freeze_depth
> 	- improve comment
>
> V3:
> 	- introduce q->preempt_unfreezing to fix one bug of preempt freeze
> 	- call blk_queue_enter_live() only when queue is preempt frozen
> 	- cleanup a bit on the implementation of preempt freeze
> 	- only patch 6 and 7 are changed
>
> V2:
> 	- drop the 1st patch in V1 because percpu_ref_is_dying() is
> 	enough as pointed by Tejun
> 	- introduce preempt version of blk_[freeze|unfreeze]_queue
> 	- sync between preempt freeze and normal freeze
> 	- fix warning from percpu-refcount as reported by Oleksandr
>
>
> [1]https://marc.info/?t=150340250100013&r=3&w=2
>
>
> Thanks,
> Ming
>
>
> Ming Lei (10):
>    blk-mq: only run hw queues for blk-mq
>    block: tracking request allocation with q_usage_counter
>    blk-mq: rename blk_mq_[freeze|unfreeze]_queue
>    blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait
>    block: rename .mq_freeze_wq and .mq_freeze_depth
>    block: pass flags to blk_queue_enter()
>    block: introduce preempt version of blk_[freeze|unfreeze]_queue
>    block: allow to allocate req with RQF_PREEMPT when queue is preempt
>      frozen
>    SCSI: transport_spi: resume a quiesced device
>    SCSI: preempt freeze block queue when SCSI device is put into quiesce
>
>   block/bfq-iosched.c               |   2 +-
>   block/blk-cgroup.c                |   8 +-
>   block/blk-core.c                  |  95 ++++++++++++++++----
>   block/blk-mq.c                    | 180 ++++++++++++++++++++++++++++----------
>   block/blk-mq.h                    |   1 -
>   block/blk-timeout.c               |   2 +-
>   block/blk.h                       |  12 +++
>   block/elevator.c                  |   4 +-
>   drivers/block/loop.c              |  24 ++---
>   drivers/block/rbd.c               |   2 +-
>   drivers/nvme/host/core.c          |   8 +-
>   drivers/scsi/scsi_lib.c           |  25 +++++-
>   drivers/scsi/scsi_transport_spi.c |   3 +
>   fs/block_dev.c                    |   4 +-
>   include/linux/blk-mq.h            |  15 ++--
>   include/linux/blkdev.h            |  32 +++++--
>   16 files changed, 313 insertions(+), 104 deletions(-)
>

I've tested this patch set for spi_transport issuing a domain validation 
under low blk_request conditions.

Tested-by: Cathy Avery <cavery@redhat.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen
  2017-09-12  3:40     ` Ming Lei
@ 2017-09-13 16:48       ` Ming Lei
  2017-09-13 17:28           ` Bart Van Assche
  0 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2017-09-13 16:48 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-scsi, hch, jejb, linux-block, axboe, martin.petersen,
	jthumshirn, oleksandr, cavery

On Tue, Sep 12, 2017 at 11:40:57AM +0800, Ming Lei wrote:
> On Mon, Sep 11, 2017 at 04:03:55PM +0000, Bart Van Assche wrote:
> > On Mon, 2017-09-11 at 19:10 +0800, Ming Lei wrote:
> > > @@ -787,6 +787,35 @@ int blk_queue_enter(struct request_queue *q, unsigned flags)
> > >  		if (percpu_ref_tryget_live(&q->q_usage_counter))
> > >  			return 0;
> > >  
> > > +		/*
> > > +		 * If queue is preempt frozen and caller need to allocate
> > > +		 * request for RQF_PREEMPT, we grab the .q_usage_counter
> > > +		 * unconditionally and return successfully.
> > > +		 *
> > > +		 * There isn't race with queue cleanup because:
> > > +		 *
> > > +		 * 1) it is guaranteed that preempt freeze can't be
> > > +		 * started after queue is set as dying
> > > +		 *
> > > +		 * 2) normal freeze runs exclusively with preempt
> > > +		 * freeze, so even after queue is set as dying
> > > +		 * afterwards, blk_queue_cleanup() won't move on
> > > +		 * until preempt freeze is done
> > > +		 *
> > > +		 * 3) blk_queue_dying() needn't to be checked here
> > > +		 * 	- for legacy path, it will be checked in
> > > +		 * 	__get_request()
> > 
> > For the legacy block layer core, what do you think will happen if the
> > "dying" state is set by another thread after __get_request() has passed the
> > blk_queue_dying() check?
> 
> Without this patchset, the block core still needs to handle the above
> situation, so your question isn't related to this patchset.
> 
> Also, q->queue_lock is required both when setting dying and when checking
> dying in __get_request(). But the lock can be released inside
> __get_request(), so it is possible to allocate a request after the
> queue is set as dying, and that request can be dispatched to a
> dying queue too in the legacy path.
> 
> > 
> > > +		 * 	- blk-mq depends on driver to handle dying well
> > > +		 * 	because it is normal for queue to be set as dying
> > > +		 * 	just between blk_queue_enter() and allocating new
> > > +		 * 	request.
> > 
> > The above comment is not correct. The block layer core handles the "dying"
> > state. Block drivers other than dm-mpath should not have to query this state
> > directly.
> 
> If blk-mq doesn't query the dying state, how does it know the queue is
> dying and handle that state? Also, blk-mq isn't different from legacy
> with respect to depending on the driver to handle dying.
> 
> > 
> > > +		 */
> > > +		if ((flags & BLK_REQ_PREEMPT) &&
> > > +				blk_queue_is_preempt_frozen(q)) {
> > > +			blk_queue_enter_live(q);
> > > +			return 0;
> > > +		}
> > > +
> > 
> > Sorry but to me it looks like the above code introduces a race condition
> > between blk_queue_cleanup() and blk_get_request() for at least blk-mq.
> > Consider e.g. the following scenario:
> > * A first thread preempt-freezes a request queue.
> > * A second thread calls blk_get_request() with BLK_REQ_PREEMPT set. That
> >   results in a call of blk_queue_is_preempt_frozen().
> > * A context switch occurs to the first thread.
> > * The first thread preempt-unfreezes the same request queue and calls
> >   blk_queue_cleanup(). That last function changes the request queue state
> >   into DYING and waits until all pending requests have finished.
> > * The second thread continues and calls blk_queue_enter_live(), allocates
> >   a request and submits it.
> 
> OK, that looks like a race I didn't think of, but it can be fixed easily
> by calling blk_queue_enter_live() while holding q->freeze_lock, and it
> won't cause a performance issue either since it is in the slow path.
> 
> For example, we can introduce the following code in blk_queue_enter():
> 
> 		if ((flags & BLK_REQ_PREEMPT) &&
> 				blk_queue_enter_preempt_freeze(q))
> 			return 0;
> 
> static inline bool blk_queue_enter_preempt_freeze(struct request_queue *q)
> {
> 	bool preempt_frozen;
> 
> 	spin_lock(&q->freeze_lock);
> 	preempt_frozen = q->preempt_freezing && !q->preempt_unfreezing;
> 	if (preempt_frozen)
> 		blk_queue_enter_live(q);
> 	spin_unlock(&q->freeze_lock);
> 
> 	return preempt_frozen;
> }
> 
> > 
> > In other words, a request gets submitted against a dying queue. This must
> > not happen. See also my explanation of queue shutdown from a few days ago
> 
> That is not correct; think about why queue dead is checked in
> __blk_run_queue_uncond() instead of queue dying. We still need to
> submit requests to the driver when the queue is dying, and the driver
> knows how to handle that.
> 
> > (https://marc.info/?l=linux-block&m=150449845831789).
> 
> > from (https://marc.info/?l=linux-block&m=150449845831789).
> >> Do you understand how request queue cleanup works? The algorithm used for
> >> request queue cleanup is as follows:
> >> * Set the DYING flag. This flag makes all later blk_get_request() calls
> >>   fail.
> 
> Your description isn't true for either legacy or blk-mq:
> 
> For legacy, as you can see, q->queue_lock can be released in
> __get_request(); at that time the queue can be set as dying,
> but the request can still be allocated successfully and submitted
> to the driver. It has worked this way for a long time.
> 
> For blk-mq, a request is allocated as follows:
> 
>         ret = blk_queue_enter(q, ...);
>         if (ret)
>                 return ERR_PTR(ret);
>         rq = blk_mq_get_request(q, ...);
> 
> Setting the queue as dying can happen just between blk_queue_enter()
> and blk_mq_get_request(), so it is normal to see requests sent
> to the driver after the queue is dying.
> 
> We shouldn't worry about that, because either cleaning up the queue or
> setting dying is triggered by the driver, so the driver knows that the
> queue is dying at that time and can handle the request well in this
> situation.
> 
> >> * Wait until all pending requests have finished.
> >> * Set the DEAD flag. For the traditional block layer, this flag causes
> >>   blk_run_queue() not to call .request_fn() anymore. For blk-mq it is
> >>   guaranteed in another way that .queue_rq() won't be called anymore after
> >>   this flag has been set.
> 
> That is true, but still not related to this patch.

Hi Bart,

Could you please let me know if your concern about race between
preempt freeze and blk_cleanup_queue() is addressed in my last
reply?


-- 
Ming

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen
@ 2017-09-13 17:28           ` Bart Van Assche
  0 siblings, 0 replies; 27+ messages in thread
From: Bart Van Assche @ 2017-09-13 17:28 UTC (permalink / raw)
  To: ming.lei
  Cc: jthumshirn, hch, linux-block, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, cavery

On Thu, 2017-09-14 at 00:48 +0800, Ming Lei wrote:
> Could you please let me know if your concern about race between
> preempt freeze and blk_cleanup_queue() is addressed in my last
> reply?

Shouldn't we wait until v4.13-rc1 has been released before spending more
energy on this? Certain patches that have been sent to Linus, e.g. commit
157f377beb71 ("block: directly insert blk-mq request from
blk_insert_cloned_request()"), may affect the test results. I have not
been able to reproduce the hang during resume with Jens' latest for-next
branch.

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen
  2017-09-13 17:28           ` Bart Van Assche
  (?)
@ 2017-09-13 17:48           ` Ming Lei
  2017-09-13 19:07               ` Bart Van Assche
  -1 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2017-09-13 17:48 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jthumshirn, hch, linux-block, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, cavery

On Wed, Sep 13, 2017 at 05:28:24PM +0000, Bart Van Assche wrote:
> On Thu, 2017-09-14 at 00:48 +0800, Ming Lei wrote:
> > Could you please let me know if your concern about race between
> > preempt freeze and blk_cleanup_queue() is addressed in my last
> > reply?
> 
> Shouldn't we wait until v4.13-rc1 has been released before spending more
> energy on this? Certain patches that have been sent to Linus, e.g. commit

The discussion can be had at any time, right?

I believe I have replied to all of your comments; if you don't agree
with any of my follow-ups, please point it out.

> 157f377beb71 ("block: directly insert blk-mq request from
> blk_insert_cloned_request()"), may affect the test results. I have not

No, that patch only changes blk_insert_cloned_request() which is used
by dm-rq(mpath) only, nothing to do with the reported issue during
suspend and sending SCSI Domain validation.


-- 
Ming

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen
@ 2017-09-13 19:07               ` Bart Van Assche
  0 siblings, 0 replies; 27+ messages in thread
From: Bart Van Assche @ 2017-09-13 19:07 UTC (permalink / raw)
  To: ming.lei
  Cc: jthumshirn, hch, linux-block, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, cavery

On Thu, 2017-09-14 at 01:48 +0800, Ming Lei wrote:
> No, that patch only changes blk_insert_cloned_request() which is used
> by dm-rq(mpath) only, nothing to do with the reported issue during
> suspend and sending SCSI Domain validation.

There may be other ways to fix the SCSI domain validation code.

Regarding the hang during resume: I have not been able to reproduce that
hang with Jens' latest for-next branch. If that hang turns out not to
be reproducible with v4.13-rc1, why change the behavior of the block
layer with regard to PREEMPT requests?

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen
  2017-09-13 19:07               ` Bart Van Assche
  (?)
@ 2017-09-14  1:15               ` Ming Lei
  2017-09-14 13:37                   ` Bart Van Assche
  -1 siblings, 1 reply; 27+ messages in thread
From: Ming Lei @ 2017-09-14  1:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jthumshirn, hch, linux-block, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, cavery

On Wed, Sep 13, 2017 at 07:07:53PM +0000, Bart Van Assche wrote:
> On Thu, 2017-09-14 at 01:48 +0800, Ming Lei wrote:
> > No, that patch only changes blk_insert_cloned_request() which is used
> > by dm-rq(mpath) only, nothing to do with the reported issue during
> > suspend and sending SCSI Domain validation.
> 
> There may be other ways to fix the SCSI domain validation code.

Again, the issue isn't in domain validation, it is in quiesce,
so we need to fix quiesce instead of working around transport_spi.

Also, what is the other way? Why not this patchset?

> 
> Regarding the hang during resume: I have not been able to reproduce that
> hang with Jens' latest for-next branch. If that hang turns out not to
> be reproducible with v4.13-rc1, why change the behavior of the block
> layer with regard to PREEMPT requests?

Previously you objected to this patchset because you said there
is a race; now you want to switch to another reason?

So I suppose there isn't the race you worried about.

PREEMPT is special but widely used: unlike NVMe, traditional
SCSI doesn't have a queue dedicated to administration purposes
only, so RQF_PREEMPT requests need to be dispatched to the same
queue even when normal I/O is not allowed at that time.

Because this is a common/generic issue with quiesce, we have to
avoid I/O hangs with quiesce. This approach is generic enough to
cover all issues of this kind, and it has no side effect on the
normal I/O path of the block layer.

Can you make sure that the following two points hold, so that
there is no I/O hang during suspend/resume?

	- no normal I/O is triggered during suspend

	- all I/O requests in the queue are drained out before
	entering suspend

-- 
Ming

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen
@ 2017-09-14 13:37                   ` Bart Van Assche
  0 siblings, 0 replies; 27+ messages in thread
From: Bart Van Assche @ 2017-09-14 13:37 UTC (permalink / raw)
  To: ming.lei
  Cc: jthumshirn, hch, linux-block, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, cavery

On Thu, 2017-09-14 at 09:15 +0800, Ming Lei wrote:
> On Wed, Sep 13, 2017 at 07:07:53PM +0000, Bart Van Assche wrote:
> > On Thu, 2017-09-14 at 01:48 +0800, Ming Lei wrote:
> > > No, that patch only changes blk_insert_cloned_request() which is used
> > > by dm-rq(mpath) only, nothing to do with the reported issue during
> > > suspend and sending SCSI Domain validation.
> > 
> > There may be other ways to fix the SCSI domain validation code.
> 
> Again, the issue isn't in domain validation, it is in quiesce,
> so we need to fix quiesce instead of working around transport_spi.
> 
> Also, what is the other way? Why not this patchset?

Sorry if I had not made this clear enough but I don't like the approach of
this patch series so please do not expect any "Reviewed-by" tags from me.
As the discussion about v4 of this patch series made clear the interaction
between blk_cleanup_queue() and the changes introduced by this patch series
in blk_get_request() is subtle and hard to analyze. The blk-mq core is
already complicated. In my view patches that make the blk-mq core simpler
are much more welcome than patches that make the blk-mq core more
complicated.

Since I expect that any fix for the interaction between blk-mq and power
management will be integrated in kernel v4.15 at earliest there is no reason
to rush. My proposal is to wait a few weeks and to see whether anyone comes
up with a better solution.

Bart.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen
  2017-09-14 13:37                   ` Bart Van Assche
  (?)
@ 2017-09-14 16:18                   ` Ming Lei
  -1 siblings, 0 replies; 27+ messages in thread
From: Ming Lei @ 2017-09-14 16:18 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: jthumshirn, hch, linux-block, martin.petersen, linux-scsi, axboe,
	oleksandr, jejb, cavery

On Thu, Sep 14, 2017 at 01:37:14PM +0000, Bart Van Assche wrote:
> On Thu, 2017-09-14 at 09:15 +0800, Ming Lei wrote:
> > On Wed, Sep 13, 2017 at 07:07:53PM +0000, Bart Van Assche wrote:
> > > On Thu, 2017-09-14 at 01:48 +0800, Ming Lei wrote:
> > > > No, that patch only changes blk_insert_cloned_request() which is used
> > > > by dm-rq(mpath) only, nothing to do with the reported issue during
> > > > suspend and sending SCSI Domain validation.
> > > 
> > > There may be other ways to fix the SCSI domain validation code.
> > 
> > Again, the issue isn't in domain validation, it is in quiesce,
> > so we need to fix quiesce instead of working around transport_spi.
> > 
> > Also, what is the other way? Why not this patchset?
> 
> Sorry if I had not made this clear enough but I don't like the approach of
> this patch series so please do not expect any "Reviewed-by" tags from me.
> As the discussion about v4 of this patch series made clear the interaction
> between blk_cleanup_queue() and the changes introduced by this patch series
> in blk_get_request() is subtle and hard to analyze. The blk-mq core is

No, it isn't subtle at all. As I explained, queue dying can be
set while a request is being allocated in both legacy and blk-mq,
and the driver is required to handle requests after the queue
becomes dying; it has worked this way for a long time.

Is that really hard to analyze?

> already complicated. In my view patches that make the blk-mq core simpler
> are much more welcome than patches that make the blk-mq core more
> complicated.

Sorry, I can't agree that this patchset is too complicated; it just
touches the quiesce interface. As for other changes, such as holding the
queue usage counter, it follows blk-mq's way, and we can reuse that way
for legacy too.

> 
> Since I expect that any fix for the interaction between blk-mq and power
> management will be integrated in kernel v4.15 at earliest there is no reason

Again, it isn't related to PM only, it is actually about
SCSI quiesce.

> to rush. My proposal is to wait a few weeks and to see whether anyone comes
> up with a better solution.

I am open to any solution and happy to review it if someone posts one,
but it should cover at least the two kinds of reported issues.

However, I won't wait for that, since people have been troubled by this
a lot; in Oleksandr's case, the system is simply dead after
one suspend. And the I/O hang when sending SCSI domain validation was
actually reported from a production system too.


-- 
Ming

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2017-09-14 16:18 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-11 11:10 [PATCH V4 0/10] block/scsi: safe SCSI quiescing Ming Lei
2017-09-11 11:10 ` [PATCH V4 01/10] blk-mq: only run hw queues for blk-mq Ming Lei
2017-09-11 11:10 ` [PATCH V4 02/10] block: tracking request allocation with q_usage_counter Ming Lei
2017-09-11 11:10 ` [PATCH V4 03/10] blk-mq: rename blk_mq_[freeze|unfreeze]_queue Ming Lei
2017-09-11 11:10 ` [PATCH V4 04/10] blk-mq: rename blk_mq_freeze_queue_wait as blk_freeze_queue_wait Ming Lei
2017-09-11 11:10 ` [PATCH V4 05/10] block: rename .mq_freeze_wq and .mq_freeze_depth Ming Lei
2017-09-11 11:10 ` [PATCH V4 06/10] block: pass flags to blk_queue_enter() Ming Lei
2017-09-11 11:10 ` [PATCH V4 07/10] block: introduce preempt version of blk_[freeze|unfreeze]_queue Ming Lei
2017-09-11 11:10 ` [PATCH V4 08/10] block: allow to allocate req with RQF_PREEMPT when queue is preempt frozen Ming Lei
2017-09-11 16:03   ` Bart Van Assche
2017-09-11 16:03     ` Bart Van Assche
2017-09-12  3:40     ` Ming Lei
2017-09-13 16:48       ` Ming Lei
2017-09-13 17:28         ` Bart Van Assche
2017-09-13 17:28           ` Bart Van Assche
2017-09-13 17:48           ` Ming Lei
2017-09-13 19:07             ` Bart Van Assche
2017-09-13 19:07               ` Bart Van Assche
2017-09-14  1:15               ` Ming Lei
2017-09-14 13:37                 ` Bart Van Assche
2017-09-14 13:37                   ` Bart Van Assche
2017-09-14 16:18                   ` Ming Lei
2017-09-11 11:10 ` [PATCH V4 09/10] SCSI: transport_spi: resume a quiesced device Ming Lei
2017-09-11 11:10 ` [PATCH V4 10/10] SCSI: preempt freeze block queue when SCSI device is put into quiesce Ming Lei
2017-09-11 21:24 ` [PATCH V4 0/10] block/scsi: safe SCSI quiescing Oleksandr Natalenko
2017-09-11 21:24   ` Oleksandr Natalenko
2017-09-12 19:03 ` Cathy Avery

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.