* [PATCH 0/2 V2] Zoned block device support fixes
@ 2017-08-04  7:52 Damien Le Moal
  2017-08-04  7:52 ` [PATCH 1/2] block: Zoned block device single-threaded submission Damien Le Moal
  2017-08-04  7:52 ` [PATCH 2/2] sd_zbc: Write unlock zones from sd_uninit_cmnd() Damien Le Moal
  0 siblings, 2 replies; 8+ messages in thread
From: Damien Le Moal @ 2017-08-04  7:52 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, Jens Axboe
  Cc: Hannes Reinecke, Bart Van Assche, Christoph Hellwig

This small series addresses a couple of problems with zoned block devices
detected with 4.13-rc.

The first patch ensures that a well-behaved user of a host-managed zoned
block device (an application doing direct disk accesses, f2fs or dm-zoned)
will not see unaligned write errors due to reordering of write commands at
dispatch time.
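
To illustrate why reordering surfaces as an unaligned write error, here is a
minimal userspace sketch (not kernel code). It assumes a simplified model in
which a sequential zone accepts a write only if it starts exactly at the
zone write pointer. The names zone_model, zone_model_write() and
zone_model_failed_writes() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of one sequential-write-required zone. */
struct zone_model {
	uint64_t start;	/* first sector of the zone */
	uint64_t len;	/* zone size in sectors */
	uint64_t wp;	/* write pointer: next sector that may be written */
};

/*
 * Accept a write only if it starts exactly at the write pointer and fits
 * in the remaining space of the zone. Returns false for what a
 * host-managed drive would report as an unaligned write.
 */
static bool zone_model_write(struct zone_model *z, uint64_t sector,
			     uint64_t nr_sectors)
{
	assert(z->wp >= z->start && z->wp <= z->start + z->len);

	if (sector != z->wp || nr_sectors > z->start + z->len - z->wp)
		return false;

	z->wp += nr_sectors;
	return true;
}

/*
 * Issue two back-to-back 8-sector writes to an empty zone, optionally
 * swapped as a dispatch-time reordering would do. Returns the number of
 * rejected writes.
 */
static int zone_model_failed_writes(bool reordered)
{
	struct zone_model z = { .start = 0, .len = 1024, .wp = 0 };
	uint64_t first = reordered ? 8 : 0;
	uint64_t second = reordered ? 0 : 8;
	int failed = 0;

	if (!zone_model_write(&z, first, 8))
		failed++;
	if (!zone_model_write(&z, second, 8))
		failed++;

	return failed;
}
```

In this model the in-order case completes both writes, while the swapped
case loses the write that arrived ahead of the write pointer.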

The second patch addresses a request dispatch deadlock that can very easily
trigger with f2fs or dm-zoned when scsi-mq is enabled. The root cause of this
problem is the high probability of unintended reordering of sequential writes
in the dispatch queue due to concurrent requeue and insert events. This
patch only fixes the deadlock problem and is not a fix for the reordering
problem.
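
The deadlock and its fix can be sketched in userspace as follows. This is a
simplified model under stated assumptions, not the sd_zbc code: the zone
lock bitmap is a plain array, and cmd_model, model_prep(), model_unprep()
and model_requeue_scenario() are hypothetical names. Only the flag value
mirrors SCMD_ZONE_WRITE_LOCK from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_ZONES		8
#define MODEL_ZONE_WRITE_LOCK	(1u << 2)	/* mirrors SCMD_ZONE_WRITE_LOCK */

/* Hypothetical stand-in for a write command targeting one zone. */
struct cmd_model {
	unsigned int zno;
	unsigned int flags;
};

/* Stand-in for the per-disk zone write lock bitmap. */
static bool model_zones_wlock[MODEL_NR_ZONES];

/* Prepare step: take the zone lock, or report that the command must wait. */
static bool model_prep(struct cmd_model *cmd)
{
	assert(cmd->zno < MODEL_NR_ZONES);

	if (model_zones_wlock[cmd->zno])
		return false;	/* corresponds to BLKPREP_DEFER */

	model_zones_wlock[cmd->zno] = true;
	cmd->flags |= MODEL_ZONE_WRITE_LOCK;
	return true;
}

/* Unprepare step: release the lock only if this command owns it. */
static void model_unprep(struct cmd_model *cmd)
{
	if (!(cmd->flags & MODEL_ZONE_WRITE_LOCK))
		return;

	cmd->flags &= ~MODEL_ZONE_WRITE_LOCK;
	model_zones_wlock[cmd->zno] = false;
}

/*
 * Write A locks zone 3 and is then requeued without executing, ending up
 * behind write B to the same zone. Returns true if B can be prepared.
 */
static bool model_requeue_scenario(bool unlock_on_requeue)
{
	struct cmd_model a = { .zno = 3, .flags = 0 };
	struct cmd_model b = { .zno = 3, .flags = 0 };

	model_zones_wlock[a.zno] = false;	/* start from an unlocked zone */

	if (!model_prep(&a))
		return false;

	/* A is requeued here; with the fix, requeue drops the zone lock. */
	if (unlock_on_requeue)
		model_unprep(&a);

	return model_prep(&b);
}
```

Without the unlock on requeue, B is deferred forever while A waits behind
it. With the unlock, B makes progress, but it now runs ahead of A, which is
the reordering problem that this series does not address.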

This means that, for now, host-managed zoned block devices cannot be used
reliably with scsi-mq enabled when several write BIOs are queued
asynchronously to the same zone.

The second patch requires the patch 2f2d7c92dda
"scsi-mq: Always unprepare before requeuing a request" sent by Bart.

Damien Le Moal (1):
  sd_zbc: Write unlock zone from sd_uninit_cmnd()

Hannes Reinecke (1):
  block: Zoned block device single-threaded submission

 block/blk-core.c         | 7 +++++++
 drivers/scsi/sd.c        | 3 +++
 drivers/scsi/sd_zbc.c    | 9 +++++----
 include/scsi/scsi_cmnd.h | 1 +
 4 files changed, 16 insertions(+), 4 deletions(-)

-- 
2.13.3

* [PATCH 1/2] block: Zoned block device single-threaded submission
  2017-08-04  7:52 [PATCH 0/2 V2] Zoned block device support fixes Damien Le Moal
@ 2017-08-04  7:52 ` Damien Le Moal
  2017-08-04 15:54   ` Bart Van Assche
  2017-08-05 11:34   ` Christoph Hellwig
  2017-08-04  7:52 ` [PATCH 2/2] sd_zbc: Write unlock zones from sd_uninit_cmnd() Damien Le Moal
  1 sibling, 2 replies; 8+ messages in thread
From: Damien Le Moal @ 2017-08-04  7:52 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, Jens Axboe
  Cc: Hannes Reinecke, Bart Van Assche, Christoph Hellwig

From: Hannes Reinecke <hare@suse.de>

The scsi_request_fn() dispatch function internally unlocks the request
queue before submitting a request to the underlying LLD. This can
potentially lead to write request reordering if the context executing
scsi_request_fn() is preempted before the request is submitted to the
LLD and another context starts executing the same function.

This is not a problem for regular disks but leads to write I/O errors
on host-managed zoned block devices and reduces the effectiveness of
sequential write optimizations for host-aware disks.
(Note: the zone write lock in place in the scsi command init code will
prevent multiple writes from being issued simultaneously to the same
zone to avoid HBA level reordering issues, but this locking mechanism
cannot prevent reordering at the dispatch level.)

Prevent this from happening by limiting the number of contexts that can
simultaneously execute the queue request_fn() function to a single
thread.

A similar patch was originally proposed by Hannes Reinecke in a first
set of patches implementing ZBC support but ultimately not included in
the final support implementation. See commit 92f5e2a295
"block: add flag for single-threaded submission" in the tree
https://git.kernel.org/pub/scm/linux/kernel/git/hare/scsi-devel.git/log/?h=zac.v3

Authorship thus goes to Hannes.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 block/blk-core.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index dbecbf4a64e0..cf590cbddcfd 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -371,7 +371,14 @@ inline void __blk_run_queue_uncond(struct request_queue *q)
 	 * running such a request function concurrently. Keep track of the
 	 * number of active request_fn invocations such that blk_drain_queue()
 	 * can wait until all these request_fn calls have finished.
+	 *
+	 * For zoned block devices, do not allow multiple threads to
+	 * dequeue requests as this can lead to write request reordering
+	 * during the time the queue is unlocked.
 	 */
+	if (blk_queue_is_zoned(q) && q->request_fn_active)
+		return;
+
 	q->request_fn_active++;
 	q->request_fn(q);
 	q->request_fn_active--;
-- 
2.13.3

* [PATCH 2/2] sd_zbc: Write unlock zones from sd_uninit_cmnd()
  2017-08-04  7:52 [PATCH 0/2 V2] Zoned block device support fixes Damien Le Moal
  2017-08-04  7:52 ` [PATCH 1/2] block: Zoned block device single-threaded submission Damien Le Moal
@ 2017-08-04  7:52 ` Damien Le Moal
  2017-08-04 15:47   ` Bart Van Assche
  2017-08-05 11:34   ` Christoph Hellwig
  1 sibling, 2 replies; 8+ messages in thread
From: Damien Le Moal @ 2017-08-04  7:52 UTC (permalink / raw)
  To: linux-scsi, Martin K . Petersen, Jens Axboe
  Cc: Hannes Reinecke, Bart Van Assche, Christoph Hellwig

Releasing the write lock of a zone when the write command that
acquired the lock completes can cause deadlocks with scsi-mq due to
potential queue reordering if the lock owning request is requeued and
not executed.

Since sd_uninit_command() is always called when a request is requeued,
call sd_zbc_write_unlock_zone() from that function for write requests
that acquired a zone lock. Acquisition of a zone lock by a write command
is indicated using the new command flag SCMD_ZONE_WRITE_LOCK.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/scsi/sd.c        | 3 +++
 drivers/scsi/sd_zbc.c    | 9 +++++----
 include/scsi/scsi_cmnd.h | 1 +
 3 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index bea36adeee17..e2647f2d4430 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1277,6 +1277,9 @@ static void sd_uninit_command(struct scsi_cmnd *SCpnt)
 {
 	struct request *rq = SCpnt->request;
 
+	if (SCpnt->flags & SCMD_ZONE_WRITE_LOCK)
+		sd_zbc_write_unlock_zone(SCpnt);
+
 	if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
 		__free_page(rq->special_vec.bv_page);
 
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index 96855df9f49d..6423ae70477e 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -294,6 +294,9 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
 	    test_and_set_bit(zno, sdkp->zones_wlock))
 		return BLKPREP_DEFER;
 
+	WARN_ON(cmd->flags & SCMD_ZONE_WRITE_LOCK);
+	cmd->flags |= SCMD_ZONE_WRITE_LOCK;
+
 	return BLKPREP_OK;
 }
 
@@ -302,9 +305,10 @@ void sd_zbc_write_unlock_zone(struct scsi_cmnd *cmd)
 	struct request *rq = cmd->request;
 	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
 
-	if (sdkp->zones_wlock) {
+	if (sdkp->zones_wlock && cmd->flags & SCMD_ZONE_WRITE_LOCK) {
 		unsigned int zno = sd_zbc_zone_no(sdkp, blk_rq_pos(rq));
 		WARN_ON_ONCE(!test_bit(zno, sdkp->zones_wlock));
+		cmd->flags &= ~SCMD_ZONE_WRITE_LOCK;
 		clear_bit_unlock(zno, sdkp->zones_wlock);
 		smp_mb__after_atomic();
 	}
@@ -335,9 +339,6 @@ void sd_zbc_complete(struct scsi_cmnd *cmd,
 	case REQ_OP_WRITE_ZEROES:
 	case REQ_OP_WRITE_SAME:
 
-		/* Unlock the zone */
-		sd_zbc_write_unlock_zone(cmd);
-
 		if (result &&
 		    sshdr->sense_key == ILLEGAL_REQUEST &&
 		    sshdr->asc == 0x21)
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index a1266d318c85..6af198d8120b 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -57,6 +57,7 @@ struct scsi_pointer {
 /* for scmd->flags */
 #define SCMD_TAGGED		(1 << 0)
 #define SCMD_UNCHECKED_ISA_DMA	(1 << 1)
+#define SCMD_ZONE_WRITE_LOCK	(1 << 2)
 
 struct scsi_cmnd {
 	struct scsi_request req;
-- 
2.13.3

* Re: [PATCH 2/2] sd_zbc: Write unlock zones from sd_uninit_cmnd()
  2017-08-04  7:52 ` [PATCH 2/2] sd_zbc: Write unlock zones from sd_uninit_cmnd() Damien Le Moal
@ 2017-08-04 15:47   ` Bart Van Assche
  2017-08-05 11:34   ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Bart Van Assche @ 2017-08-04 15:47 UTC (permalink / raw)
  To: linux-scsi, Damien Le Moal, martin.petersen, axboe; +Cc: hch, hare

On Fri, 2017-08-04 at 16:52 +0900, Damien Le Moal wrote:
> Releasing the write lock of a zone when the write command that
> acquired the lock completes can cause deadlocks with scsi-mq due to
> potential queue reordering if the lock owning request is requeued and
> not executed.
> 
> Since sd_uninit_cmnd() is always called when a request is requeued,
> call sd_zbc_write_unlock_zone() from that function for write requests
> that acquired a zone lock. Acquisition of a zone lock by a write command
> is indicated using the new command flag SCMD_ZONE_WRITE_LOCK.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>

Hello Damien,

Should "Cc: <stable@vger.kernel.org>" be added to this patch?

> diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
> index 96855df9f49d..6423ae70477e 100644
> --- a/drivers/scsi/sd_zbc.c
> +++ b/drivers/scsi/sd_zbc.c
> @@ -294,6 +294,9 @@ int sd_zbc_write_lock_zone(struct scsi_cmnd *cmd)
>  	    test_and_set_bit(zno, sdkp->zones_wlock))
>  		return BLKPREP_DEFER;
>  
> +	WARN_ON(cmd->flags & SCMD_ZONE_WRITE_LOCK);
> +	cmd->flags |= SCMD_ZONE_WRITE_LOCK;
> +
>  	return BLKPREP_OK;
>  }

Did you perhaps intend WARN_ON_ONCE() instead of WARN_ON()?

Thanks,

Bart.

* Re: [PATCH 1/2] block: Zoned block device single-threaded submission
  2017-08-04  7:52 ` [PATCH 1/2] block: Zoned block device single-threaded submission Damien Le Moal
@ 2017-08-04 15:54   ` Bart Van Assche
  2017-08-05 11:34   ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Bart Van Assche @ 2017-08-04 15:54 UTC (permalink / raw)
  To: linux-scsi, Damien Le Moal, martin.petersen, axboe; +Cc: hch, hare

On Fri, 2017-08-04 at 16:52 +0900, Damien Le Moal wrote:
> From: Hannes Reinecke <hare@suse.de>
> 
> The scsi_request_fn() dispatch function internally unlocks the request
> queue before submitting a request to the underlying LLD. This can
> potentially lead to write request reordering if the context executing
> scsi_request_fn() is preempted before the request is submitted to the
> LLD and another context starts executing the same function.
> 
> This is not a problem for regular disks but leads to write I/O errors
> on host-managed zoned block devices and reduces the effectiveness of
> sequential write optimizations for host-aware disks.
> (Note: the zone write lock in place in the scsi command init code will
> prevent multiple writes from being issued simultaneously to the same
> zone to avoid HBA level reordering issues, but this locking mechanism
> cannot prevent reordering at the dispatch level.)
> 
> Prevent this from happening by limiting the number of contexts that can
> simultaneously execute the queue request_fn() function to a single
> thread.
> 
> A similar patch was originally proposed by Hannes Reinecke in a first
> set of patches implementing ZBC support but ultimately not included in
> the final support implementation. See commit 92f5e2a295
> "block: add flag for single-threaded submission" in the tree
> https://git.kernel.org/pub/scm/linux/kernel/git/hare/scsi-devel.git/log/?h=zac.v3
> 
> Authorship thus goes to Hannes.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> ---
>  block/blk-core.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index dbecbf4a64e0..cf590cbddcfd 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -371,7 +371,14 @@ inline void __blk_run_queue_uncond(struct request_queue *q)
>  	 * running such a request function concurrently. Keep track of the
>  	 * number of active request_fn invocations such that blk_drain_queue()
>  	 * can wait until all these request_fn calls have finished.
> +	 *
> +	 * For zoned block devices, do not allow multiple threads to
> +	 * dequeue requests as this can lead to write request reordering
> +	 * during the time the queue is unlocked.
>  	 */
> +	if (blk_queue_is_zoned(q) && q->request_fn_active)
> +		return;
> +
>  	q->request_fn_active++;
>  	q->request_fn(q);
>  	q->request_fn_active--;

Hello Damien,

Since serialization of request queue processing is only needed for ZBC and
since all ZBC devices use the SCSI core, could this serialization have been
achieved by modifying the SCSI core, e.g. by adding the following before the
for-loop in scsi_request_fn():

        if (blk_queue_is_zoned(q) && q->request_fn_active > 1)
                return;
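
The two placements test the counter differently because
__blk_run_queue_uncond() increments request_fn_active before calling
request_fn(): the check in the patch runs before the increment and tests for
a non-zero count, while a check inside scsi_request_fn() already counts the
current invocation and so tests for a count above 1. The following is a
single-threaded userspace sketch of that difference, not kernel code.
queue_model and the model_*() functions are hypothetical, and a nested call
stands in for a second context entering while the queue is unlocked.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the request_queue fields used by the check. */
struct queue_model {
	bool zoned;
	bool check_in_request_fn;	/* false: patch placement, true: SCSI core */
	bool reenter;			/* simulate one concurrent caller */
	unsigned int request_fn_active;
	unsigned int dispatch_loops;	/* times the dispatch body actually ran */
};

static void model_run_queue(struct queue_model *q);

static void model_request_fn(struct queue_model *q)
{
	/* SCSI core placement: the counter already includes this call. */
	if (q->check_in_request_fn && q->zoned && q->request_fn_active > 1)
		return;

	q->dispatch_loops++;

	/* A second context arrives while the queue lock is dropped. */
	if (q->reenter) {
		q->reenter = false;
		model_run_queue(q);
	}
}

static void model_run_queue(struct queue_model *q)
{
	/* Patch placement: the counter does not yet include this call. */
	if (!q->check_in_request_fn && q->zoned && q->request_fn_active)
		return;

	q->request_fn_active++;
	model_request_fn(q);
	q->request_fn_active--;
}

/* Returns how many invocations got past the guard and dispatched. */
static unsigned int model_dispatchers(bool zoned, bool check_in_request_fn)
{
	struct queue_model q = {
		.zoned = zoned,
		.check_in_request_fn = check_in_request_fn,
		.reenter = true,
	};

	model_run_queue(&q);
	assert(q.request_fn_active == 0);
	return q.dispatch_loops;
}
```

In this model both placements let exactly one invocation dispatch for a
zoned queue and leave non-zoned queues unaffected.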

Thanks,

Bart.

* Re: [PATCH 1/2] block: Zoned block device single-threaded submission
  2017-08-04  7:52 ` [PATCH 1/2] block: Zoned block device single-threaded submission Damien Le Moal
  2017-08-04 15:54   ` Bart Van Assche
@ 2017-08-05 11:34   ` Christoph Hellwig
  2017-08-07  6:15     ` Damien Le Moal
  1 sibling, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2017-08-05 11:34 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, Jens Axboe, Hannes Reinecke,
	Bart Van Assche, Christoph Hellwig

We'll need a blk-mq version as well, otherwise: NAK.

* Re: [PATCH 2/2] sd_zbc: Write unlock zones from sd_uninit_cmnd()
  2017-08-04  7:52 ` [PATCH 2/2] sd_zbc: Write unlock zones from sd_uninit_cmnd() Damien Le Moal
  2017-08-04 15:47   ` Bart Van Assche
@ 2017-08-05 11:34   ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2017-08-05 11:34 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: linux-scsi, Martin K . Petersen, Jens Axboe, Hannes Reinecke,
	Bart Van Assche, Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 1/2] block: Zoned block device single-threaded submission
  2017-08-05 11:34   ` Christoph Hellwig
@ 2017-08-07  6:15     ` Damien Le Moal
  0 siblings, 0 replies; 8+ messages in thread
From: Damien Le Moal @ 2017-08-07  6:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-scsi, Martin K . Petersen, Jens Axboe, Hannes Reinecke,
	Bart Van Assche

Christoph,

On 8/5/17 20:34, Christoph Hellwig wrote:
> We'll need a blk-mq version as well, otherwise: NAK.

Not that I have not tried, but I do not see how this is possible without
in the end making blk-mq/scsi-mq work exactly like the sq path for a ZBC
disk, that is, adding locks/barriers in many places to prevent the three
different mq contexts (submission, run and requeue) from potentially
messing with the dispatch queue order. I do not see any solution simple
enough to be considered RC material.

This patch ensures that for 4.13 we at least have the legacy single
queue I/O path that is safe for zoned block devices. The other patch I
sent (+ Bart's "always unprep" patch) ensures that mq does not deadlock
(and only that: unaligned write errors can still happen with ZBC drives).

Going forward, considering only blk-mq/scsi-mq (since the legacy path
will eventually go away), I think that trying to ensure per-zone
sequential writes at the SCSI layer is not a sustainable approach. It
will add too many constraints on the mq path/queue management and will
only make the mq code more complex and any issue with sequential writes
very hard to debug.

I thought of another simpler and easier to maintain approach: extending
the writeback throttling code to implement an "only one write per
sequential zone" I/O pattern, which will always result in sequential
writes within a zone no matter what blk-mq, the mq schedulers or the
scsi dispatch code do. In effect, this is exactly the same as what the
zone locking does currently, but all the implementation would be limited
to the higher submit_bio() level. This would allow removing all the ZBC
specific code in the I/O path (single threaded dispatch, zone lock) and
would not require touching the mq I/O path. So overall, a much cleaner
and easier to maintain approach.
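
As a rough illustration of the "only one write per sequential zone" idea,
here is a userspace sketch for a single zone. It is not WBT code and makes
no claim about how such throttling would be implemented in the kernel:
zone_throttle, throttle_submit(), throttle_complete() and throttle_demo()
are hypothetical names, and the fixed-size FIFO is an assumption made to
keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define THROTTLE_MAX	16

/* Hypothetical per-zone state: at most one write in flight at a time. */
struct zone_throttle {
	bool in_flight;
	unsigned long pending[THROTTLE_MAX];	/* FIFO of parked start sectors */
	size_t head, count;
	unsigned long issued[THROTTLE_MAX];	/* order seen by the lower layers */
	size_t nr_issued;
};

static void throttle_issue(struct zone_throttle *zt, unsigned long sector)
{
	assert(zt->nr_issued < THROTTLE_MAX);
	zt->issued[zt->nr_issued++] = sector;
	zt->in_flight = true;
}

/* Issue at once if the zone is idle, otherwise park the write in the FIFO. */
static bool throttle_submit(struct zone_throttle *zt, unsigned long sector)
{
	if (!zt->in_flight) {
		throttle_issue(zt, sector);
		return true;
	}
	if (zt->count == THROTTLE_MAX)
		return false;	/* FIFO full: caller must retry later */

	zt->pending[(zt->head + zt->count) % THROTTLE_MAX] = sector;
	zt->count++;
	return true;
}

/* On completion, release the zone and issue the oldest parked write. */
static void throttle_complete(struct zone_throttle *zt)
{
	assert(zt->in_flight);
	zt->in_flight = false;

	if (zt->count) {
		unsigned long sector = zt->pending[zt->head];

		zt->head = (zt->head + 1) % THROTTLE_MAX;
		zt->count--;
		throttle_issue(zt, sector);
	}
}

/* Submit three writes at once, then complete them one at a time. */
static bool throttle_demo(void)
{
	struct zone_throttle zt = { 0 };
	const unsigned long sectors[] = { 0, 8, 16 };
	size_t i;

	for (i = 0; i < 3; i++)
		if (!throttle_submit(&zt, sectors[i]))
			return false;
	if (zt.nr_issued != 1)	/* only the first write may be in flight */
		return false;

	for (i = 0; i < 3; i++)
		throttle_complete(&zt);
	if (zt.in_flight || zt.nr_issued != 3)
		return false;

	for (i = 0; i < 3; i++)
		if (zt.issued[i] != sectors[i])
			return false;
	return true;
}
```

Because the lower layers never see more than one write per zone, nothing
below the submission point has a second write to reorder against the first.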

Of course, this kind of writeback throttling could be implemented in
each zoned block device user (currently only f2fs and dm-zoned, but
likely more coming). But that would lead to a lot of duplicated code. So
integrating that into submit_bio()/WBT makes sense to me.

What do you think?

Of course, I may be missing something really simple to solve the problem
in blk-mq. I would be happy to tackle the implementation & testing if
someone has an idea.

Best regards.

-- 
Damien Le Moal,
Western Digital
