All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] Improve zoned storage write performance
@ 2022-06-14 17:49 Bart Van Assche
  2022-06-14 17:49 ` [PATCH 1/5] block: Introduce the blk_rq_is_seq_write() function Bart Van Assche
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Bart Van Assche @ 2022-06-14 17:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim,
	Bart Van Assche

Hi Jens,

Measurements have shown that limiting the queue depth to one per sequential
zone has a significant negative performance impact on zoned UFS devices. Hence
this patch series that increases the queue depth for write commands for
sequential zones.

Please consider this patch series for kernel v5.20.

Thanks,

Bart.

Bart Van Assche (5):
  block: Introduce the blk_rq_is_seq_write() function
  scsi: Retry unaligned zoned writes
  nvme: Make the number of retries request specific
  nvme: Increase the number of retries for zoned writes
  block/mq-deadline: Remove zone locking

 block/mq-deadline.c       | 74 ++++-----------------------------------
 drivers/nvme/host/core.c  |  7 +++-
 drivers/nvme/host/nvme.h  |  1 +
 drivers/scsi/scsi_error.c |  6 ++++
 drivers/scsi/sd.c         |  2 ++
 include/linux/blk-mq.h    | 14 ++++++++
 6 files changed, 35 insertions(+), 69 deletions(-)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/5] block: Introduce the blk_rq_is_seq_write() function
  2022-06-14 17:49 [PATCH 0/5] Improve zoned storage write performance Bart Van Assche
@ 2022-06-14 17:49 ` Bart Van Assche
  2022-06-16 20:41   ` Jens Axboe
  2022-06-14 17:49 ` [PATCH 2/5] scsi: Retry unaligned zoned writes Bart Van Assche
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2022-06-14 17:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim,
	Bart Van Assche

Introduce a function that makes it easy to verify whether a write
request is for a sequential write required or sequential write preferred
zone.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 include/linux/blk-mq.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e2d9daf7e8dd..3e7feb48105f 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1129,6 +1129,15 @@ static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
 	return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq));
 }
 
+/**
+ * blk_rq_is_seq_write() - Whether @rq is a write request for a sequential zone.
+ * @rq: Request to examine.
+ */
+static inline bool blk_rq_is_seq_write(struct request *rq)
+{
+	return req_op(rq) == REQ_OP_WRITE && blk_rq_zone_is_seq(rq);
+}
+
 bool blk_req_needs_zone_write_lock(struct request *rq);
 bool blk_req_zone_write_trylock(struct request *rq);
 void __blk_req_zone_write_lock(struct request *rq);
@@ -1159,6 +1168,11 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
 	return !blk_req_zone_is_write_locked(rq);
 }
 #else /* CONFIG_BLK_DEV_ZONED */
+static inline bool blk_rq_is_seq_write(struct request *rq)
+{
+	return false;
+}
+
 static inline bool blk_req_needs_zone_write_lock(struct request *rq)
 {
 	return false;

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-14 17:49 [PATCH 0/5] Improve zoned storage write performance Bart Van Assche
  2022-06-14 17:49 ` [PATCH 1/5] block: Introduce the blk_rq_is_seq_write() function Bart Van Assche
@ 2022-06-14 17:49 ` Bart Van Assche
  2022-06-14 21:47   ` Khazhy Kumykov
  2022-06-14 23:29   ` Damien Le Moal
  2022-06-14 17:49 ` [PATCH 3/5] nvme: Make the number of retries request specific Bart Van Assche
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 24+ messages in thread
From: Bart Van Assche @ 2022-06-14 17:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim,
	Bart Van Assche

From ZBC-2: "The device server terminates with CHECK CONDITION status, with
the sense key set to ILLEGAL REQUEST, and the additional sense code set to
UNALIGNED WRITE COMMAND a write command, other than an entire medium write
same command, that specifies: a) the starting LBA in a sequential write
required zone set to a value that is not equal to the write pointer for that
sequential write required zone; or b) an ending LBA that is not equal to the
last logical block within a physical block (see SBC-5)."

I am not aware of any other conditions that may trigger the UNALIGNED
WRITE COMMAND response.

Retry unaligned writes in preparation of removing zone locking.

Increase the number of retries for write commands sent to a sequential
zone to the maximum number of outstanding commands.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_error.c | 6 ++++++
 drivers/scsi/sd.c         | 2 ++
 2 files changed, 8 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 49ef864df581..8e22d4ba22a3 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -674,6 +674,12 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
 		fallthrough;
 
 	case ILLEGAL_REQUEST:
+		/*
+		 * Unaligned write command. Retry immediately to handle
+		 * out-of-order zoned writes.
+		 */
+		if (sshdr.asc == 0x21 && sshdr.ascq == 0x04)
+			return NEEDS_RETRY;
 		if (sshdr.asc == 0x20 || /* Invalid command operation code */
 		    sshdr.asc == 0x21 || /* Logical block address out of range */
 		    sshdr.asc == 0x22 || /* Invalid function */
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index a1a2ac09066f..8d68bd20723e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1202,6 +1202,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	cmd->transfersize = sdp->sector_size;
 	cmd->underflow = nr_blocks << 9;
 	cmd->allowed = sdkp->max_retries;
+	if (blk_rq_is_seq_write(rq))
+		cmd->allowed += rq->q->nr_hw_queues * rq->q->nr_requests;
 	cmd->sdb.length = nr_blocks * sdp->sector_size;
 
 	SCSI_LOG_HLQUEUE(1,

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/5] nvme: Make the number of retries request specific
  2022-06-14 17:49 [PATCH 0/5] Improve zoned storage write performance Bart Van Assche
  2022-06-14 17:49 ` [PATCH 1/5] block: Introduce the blk_rq_is_seq_write() function Bart Van Assche
  2022-06-14 17:49 ` [PATCH 2/5] scsi: Retry unaligned zoned writes Bart Van Assche
@ 2022-06-14 17:49 ` Bart Van Assche
  2022-06-14 17:49 ` [PATCH 4/5] nvme: Increase the number of retries for zoned writes Bart Van Assche
  2022-06-14 17:49 ` [PATCH 5/5] block/mq-deadline: Remove zone locking Bart Van Assche
  4 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2022-06-14 17:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim,
	Bart Van Assche, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni

Add support for specifying the number of retries per NVMe request.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/nvme/host/core.c | 3 ++-
 drivers/nvme/host/nvme.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 24165daee3c8..fe0d09fc70ba 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -339,7 +339,7 @@ static inline enum nvme_disposition nvme_decide_disposition(struct request *req)
 
 	if (blk_noretry_request(req) ||
 	    (nvme_req(req)->status & NVME_SC_DNR) ||
-	    nvme_req(req)->retries >= nvme_max_retries)
+	    nvme_req(req)->retries >= nvme_req(req)->max_retries)
 		return COMPLETE;
 
 	if (req->cmd_flags & REQ_NVME_MPATH) {
@@ -632,6 +632,7 @@ static inline void nvme_clear_nvme_request(struct request *req)
 {
 	nvme_req(req)->status = 0;
 	nvme_req(req)->retries = 0;
+	nvme_req(req)->max_retries = nvme_max_retries;
 	nvme_req(req)->flags = 0;
 	req->rq_flags |= RQF_DONTPREP;
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 9b72b6ecf33c..15bb36923d09 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -160,6 +160,7 @@ struct nvme_request {
 	union nvme_result	result;
 	u8			genctr;
 	u8			retries;
+	u8			max_retries;
 	u8			flags;
 	u16			status;
 	struct nvme_ctrl	*ctrl;

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/5] nvme: Increase the number of retries for zoned writes
  2022-06-14 17:49 [PATCH 0/5] Improve zoned storage write performance Bart Van Assche
                   ` (2 preceding siblings ...)
  2022-06-14 17:49 ` [PATCH 3/5] nvme: Make the number of retries request specific Bart Van Assche
@ 2022-06-14 17:49 ` Bart Van Assche
  2022-06-14 18:03   ` Keith Busch
  2022-06-14 17:49 ` [PATCH 5/5] block/mq-deadline: Remove zone locking Bart Van Assche
  4 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2022-06-14 17:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim,
	Bart Van Assche, Keith Busch, Sagi Grimberg, Chaitanya Kulkarni

Before removing zone locking, increase the number of retries for zoned
writes to the maximum number of outstanding commands.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/nvme/host/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fe0d09fc70ba..347a06118282 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -854,6 +854,10 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 	if (req->cmd_flags & REQ_RAHEAD)
 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
 
+	if (blk_rq_is_seq_write(req))
+		nvme_req(req)->max_retries += req->q->nr_hw_queues *
+			req->q->nr_requests;
+
 	cmnd->rw.opcode = op;
 	cmnd->rw.flags = 0;
 	cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/5] block/mq-deadline: Remove zone locking
  2022-06-14 17:49 [PATCH 0/5] Improve zoned storage write performance Bart Van Assche
                   ` (3 preceding siblings ...)
  2022-06-14 17:49 ` [PATCH 4/5] nvme: Increase the number of retries for zoned writes Bart Van Assche
@ 2022-06-14 17:49 ` Bart Van Assche
  2022-06-14 23:40   ` Damien Le Moal
  4 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2022-06-14 17:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim,
	Bart Van Assche

Measurements have shown that limiting the queue depth to one has a
significant negative performance impact on zoned UFS devices. Hence this
patch that removes zone locking from the mq-deadline scheduler. This
patch is based on the following assumptions:
- Applications submit write requests to sequential write required zones
  in order.
- If such write requests get reordered by the software or hardware queue
  mechanism, nr_hw_queues * nr_requests - 1 retries are sufficient to
  reorder the write requests.
- It happens infrequently that zoned write requests are reordered by the
  block layer.
- Either no I/O scheduler is used or an I/O scheduler is used that
  submits write requests per zone in LBA order.

DD_BE_PRIO is selected for sequential writes to preserve the LBA order.

See also commit 5700f69178e9 ("mq-deadline: Introduce zone locking
support").

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 74 ++++-----------------------------------------
 1 file changed, 6 insertions(+), 68 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 6ed602b2f80a..e168fc9a980a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -104,7 +104,6 @@ struct deadline_data {
 	int prio_aging_expire;
 
 	spinlock_t lock;
-	spinlock_t zone_lock;
 };
 
 /* Maps an I/O priority class to a deadline scheduler priority. */
@@ -285,30 +284,10 @@ static struct request *
 deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 		      enum dd_data_dir data_dir)
 {
-	struct request *rq;
-	unsigned long flags;
-
 	if (list_empty(&per_prio->fifo_list[data_dir]))
 		return NULL;
 
-	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
-		return rq;
-
-	/*
-	 * Look for a write request that can be dispatched, that is one with
-	 * an unlocked target zone.
-	 */
-	spin_lock_irqsave(&dd->zone_lock, flags);
-	list_for_each_entry(rq, &per_prio->fifo_list[DD_WRITE], queuelist) {
-		if (blk_req_can_dispatch_to_zone(rq))
-			goto out;
-	}
-	rq = NULL;
-out:
-	spin_unlock_irqrestore(&dd->zone_lock, flags);
-
-	return rq;
+	return rq_entry_fifo(per_prio->fifo_list[data_dir].next);
 }
 
 /*
@@ -319,29 +298,7 @@ static struct request *
 deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 		      enum dd_data_dir data_dir)
 {
-	struct request *rq;
-	unsigned long flags;
-
-	rq = per_prio->next_rq[data_dir];
-	if (!rq)
-		return NULL;
-
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
-		return rq;
-
-	/*
-	 * Look for a write request that can be dispatched, that is one with
-	 * an unlocked target zone.
-	 */
-	spin_lock_irqsave(&dd->zone_lock, flags);
-	while (rq) {
-		if (blk_req_can_dispatch_to_zone(rq))
-			break;
-		rq = deadline_latter_request(rq);
-	}
-	spin_unlock_irqrestore(&dd->zone_lock, flags);
-
-	return rq;
+	return per_prio->next_rq[data_dir];
 }
 
 /*
@@ -467,10 +424,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	ioprio_class = dd_rq_ioclass(rq);
 	prio = ioprio_class_to_prio[ioprio_class];
 	dd->per_prio[prio].stats.dispatched++;
-	/*
-	 * If the request needs its target zone locked, do it.
-	 */
-	blk_req_zone_write_lock(rq);
 	rq->rq_flags |= RQF_STARTED;
 	return rq;
 }
@@ -640,7 +593,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->fifo_batch = fifo_batch;
 	dd->prio_aging_expire = prio_aging_expire;
 	spin_lock_init(&dd->lock);
-	spin_lock_init(&dd->zone_lock);
 
 	q->elevator = eq;
 	return 0;
@@ -716,17 +668,13 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
 	struct dd_per_prio *per_prio;
 	enum dd_prio prio;
+	bool seq_write = blk_rq_is_seq_write(rq);
 	LIST_HEAD(free);
 
 	lockdep_assert_held(&dd->lock);
 
-	/*
-	 * This may be a requeue of a write request that has locked its
-	 * target zone. If it is the case, this releases the zone lock.
-	 */
-	blk_req_zone_write_unlock(rq);
-
-	prio = ioprio_class_to_prio[ioprio_class];
+	prio = seq_write ? DD_BE_PRIO :
+		ioprio_class_to_prio[ioprio_class];
 	per_prio = &dd->per_prio[prio];
 	if (!rq->elv.priv[0]) {
 		per_prio->stats.inserted++;
@@ -740,7 +688,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 
 	trace_block_rq_insert(rq);
 
-	if (at_head) {
+	if (at_head && !seq_write) {
 		list_add(&rq->queuelist, &per_prio->dispatch);
 		rq->fifo_time = jiffies;
 	} else {
@@ -819,16 +767,6 @@ static void dd_finish_request(struct request *rq)
 		return;
 
 	atomic_inc(&per_prio->stats.completed);
-
-	if (blk_queue_is_zoned(q)) {
-		unsigned long flags;
-
-		spin_lock_irqsave(&dd->zone_lock, flags);
-		blk_req_zone_write_unlock(rq);
-		if (!list_empty(&per_prio->fifo_list[DD_WRITE]))
-			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
-		spin_unlock_irqrestore(&dd->zone_lock, flags);
-	}
 }
 
 static bool dd_has_work_for_prio(struct dd_per_prio *per_prio)

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/5] nvme: Increase the number of retries for zoned writes
  2022-06-14 17:49 ` [PATCH 4/5] nvme: Increase the number of retries for zoned writes Bart Van Assche
@ 2022-06-14 18:03   ` Keith Busch
  0 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2022-06-14 18:03 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim, Sagi Grimberg,
	Chaitanya Kulkarni

On Tue, Jun 14, 2022 at 10:49:42AM -0700, Bart Van Assche wrote:
> @@ -854,6 +854,10 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
>  	if (req->cmd_flags & REQ_RAHEAD)
>  		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
>  
> +	if (blk_rq_is_seq_write(req))
> +		nvme_req(req)->max_retries += req->q->nr_hw_queues *
> +			req->q->nr_requests;

Even with the weakest nvme devices, the result of this is unlikely to fit in a
u8.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-14 17:49 ` [PATCH 2/5] scsi: Retry unaligned zoned writes Bart Van Assche
@ 2022-06-14 21:47   ` Khazhy Kumykov
  2022-06-14 22:39     ` Bart Van Assche
  2022-06-14 23:29   ` Damien Le Moal
  1 sibling, 1 reply; 24+ messages in thread
From: Khazhy Kumykov @ 2022-06-14 21:47 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Jaegeuk Kim

[-- Attachment #1: Type: text/plain, Size: 3299 bytes --]

On Tue, Jun 14, 2022 at 10:49 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> From ZBC-2: "The device server terminates with CHECK CONDITION status, with
> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
> same command, that specifies: a) the starting LBA in a sequential write
> required zone set to a value that is not equal to the write pointer for that
> sequential write required zone; or b) an ending LBA that is not equal to the
> last logical block within a physical block (see SBC-5)."
>
> I am not aware of any other conditions that may trigger the UNALIGNED
> WRITE COMMAND response.
>
> Retry unaligned writes in preparation of removing zone locking.
Is /just/ retrying effective here? A series of writes to the same zone
would all need to be sent in order - in the worst case (requests
somehow ordered in reverse order) this becomes quadratic as only 1
request "succeeds" out of the N outstanding requests, with the rest
all needing to retry. (Imagine a user writes an entire "zone" - which
could be split into hundreds of requests).

Block layer / schedulers are free to do this reordering, which I
understand does happen whenever we need to requeue - and would result
in a retry of all writes after the first re-ordered request. (side
note: fwiw "requests somehow in reverse order" can happen - bfq
inherited cfq's odd behavior of sometimes issuing sequential IO in
reverse order due to back_seek, e.g.)

>
> Increase the number of retries for write commands sent to a sequential
> zone to the maximum number of outstanding commands.
>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  drivers/scsi/scsi_error.c | 6 ++++++
>  drivers/scsi/sd.c         | 2 ++
>  2 files changed, 8 insertions(+)
>
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 49ef864df581..8e22d4ba22a3 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -674,6 +674,12 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
>                 fallthrough;
>
>         case ILLEGAL_REQUEST:
> +               /*
> +                * Unaligned write command. Retry immediately to handle
> +                * out-of-order zoned writes.
> +                */
> +               if (sshdr.asc == 0x21 && sshdr.ascq == 0x04)
> +                       return NEEDS_RETRY;
>                 if (sshdr.asc == 0x20 || /* Invalid command operation code */
>                     sshdr.asc == 0x21 || /* Logical block address out of range */
>                     sshdr.asc == 0x22 || /* Invalid function */
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index a1a2ac09066f..8d68bd20723e 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1202,6 +1202,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>         cmd->transfersize = sdp->sector_size;
>         cmd->underflow = nr_blocks << 9;
>         cmd->allowed = sdkp->max_retries;
> +       if (blk_rq_is_seq_write(rq))
> +               cmd->allowed += rq->q->nr_hw_queues * rq->q->nr_requests;
>         cmd->sdb.length = nr_blocks * sdp->sector_size;
>
>         SCSI_LOG_HLQUEUE(1,

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 3999 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-14 21:47   ` Khazhy Kumykov
@ 2022-06-14 22:39     ` Bart Van Assche
  2022-06-14 23:50       ` Damien Le Moal
  0 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2022-06-14 22:39 UTC (permalink / raw)
  To: Khazhy Kumykov
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Jaegeuk Kim

On 6/14/22 14:47, Khazhy Kumykov wrote:
> On Tue, Jun 14, 2022 at 10:49 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>
>>  From ZBC-2: "The device server terminates with CHECK CONDITION status, with
>> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
>> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
>> same command, that specifies: a) the starting LBA in a sequential write
>> required zone set to a value that is not equal to the write pointer for that
>> sequential write required zone; or b) an ending LBA that is not equal to the
>> last logical block within a physical block (see SBC-5)."
>>
>> I am not aware of any other conditions that may trigger the UNALIGNED
>> WRITE COMMAND response.
>>
>> Retry unaligned writes in preparation of removing zone locking.
> Is /just/ retrying effective here? A series of writes to the same zone
> would all need to be sent in order - in the worst case (requests
> somehow ordered in reverse order) this becomes quadratic as only 1
> request "succeeds" out of the N outstanding requests, with the rest
> all needing to retry. (Imagine a user writes an entire "zone" - which
> could be split into hundreds of requests).
> 
> Block layer / schedulers are free to do this reordering, which I
> understand does happen whenever we need to requeue - and would result
> in a retry of all writes after the first re-ordered request. (side
> note: fwiw "requests somehow in reverse order" can happen - bfq
> inherited cfq's odd behavior of sometimes issuing sequential IO in
> reverse order due to back_seek, e.g.)

Hi Khazhy,

For zoned block devices I propose to only support those I/O schedulers 
that either preserve the LBA order or fix the LBA order if two or more 
out-of-order requests are received by the I/O scheduler.

I agree that in the worst case the number of retries is proportional to 
the square of the number of pending requests. However, for the use case 
that matters most to me, F2FS on top of a UFS device, we haven't seen 
any retries in our tests without I/O scheduler. This is probably because 
of how F2FS submits writes combined with the UFS controller only 
supporting a single hardware queue. I expect to see a small number of 
retries once UFS controllers become available that support multiple 
hardware queues.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-14 17:49 ` [PATCH 2/5] scsi: Retry unaligned zoned writes Bart Van Assche
  2022-06-14 21:47   ` Khazhy Kumykov
@ 2022-06-14 23:29   ` Damien Le Moal
  2022-06-14 23:56     ` Bart Van Assche
  1 sibling, 1 reply; 24+ messages in thread
From: Damien Le Moal @ 2022-06-14 23:29 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim

On 6/15/22 02:49, Bart Van Assche wrote:
> From ZBC-2: "The device server terminates with CHECK CONDITION status, with
> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
> same command, that specifies: a) the starting LBA in a sequential write
> required zone set to a value that is not equal to the write pointer for that
> sequential write required zone; or b) an ending LBA that is not equal to the
> last logical block within a physical block (see SBC-5)."
> 
> I am not aware of any other conditions that may trigger the UNALIGNED
> WRITE COMMAND response.
> 
> Retry unaligned writes in preparation of removing zone locking.

Arg. No. No way. AHCI will totally break with that because most AHCI
adapters do not send commands to the drive in the order they are delivered
to the LLD. In more details, the order in which tag bit in the AHCI ready
register are set does not determine the order of command delivery to the
disk. So if zone locking is removed, you constantly get unaligned write
errors.

> 
> Increase the number of retries for write commands sent to a sequential
> zone to the maximum number of outstanding commands.
> 
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  drivers/scsi/scsi_error.c | 6 ++++++
>  drivers/scsi/sd.c         | 2 ++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 49ef864df581..8e22d4ba22a3 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -674,6 +674,12 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
>  		fallthrough;
>  
>  	case ILLEGAL_REQUEST:
> +		/*
> +		 * Unaligned write command. Retry immediately to handle
> +		 * out-of-order zoned writes.
> +		 */
> +		if (sshdr.asc == 0x21 && sshdr.ascq == 0x04)
> +			return NEEDS_RETRY;
>  		if (sshdr.asc == 0x20 || /* Invalid command operation code */
>  		    sshdr.asc == 0x21 || /* Logical block address out of range */
>  		    sshdr.asc == 0x22 || /* Invalid function */
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index a1a2ac09066f..8d68bd20723e 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1202,6 +1202,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>  	cmd->transfersize = sdp->sector_size;
>  	cmd->underflow = nr_blocks << 9;
>  	cmd->allowed = sdkp->max_retries;
> +	if (blk_rq_is_seq_write(rq))
> +		cmd->allowed += rq->q->nr_hw_queues * rq->q->nr_requests;
>  	cmd->sdb.length = nr_blocks * sdp->sector_size;
>  
>  	SCSI_LOG_HLQUEUE(1,


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/5] block/mq-deadline: Remove zone locking
  2022-06-14 17:49 ` [PATCH 5/5] block/mq-deadline: Remove zone locking Bart Van Assche
@ 2022-06-14 23:40   ` Damien Le Moal
  0 siblings, 0 replies; 24+ messages in thread
From: Damien Le Moal @ 2022-06-14 23:40 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim

On 6/15/22 02:49, Bart Van Assche wrote:
> Measurements have shown that limiting the queue depth to one has a
> significant negative performance impact on zoned UFS devices. Hence this
> patch that removes zone locking from the mq-deadline scheduler. This
> patch is based on the following assumptions:
> - Applications submit write requests to sequential write required zones
>   in order.
> - If such write requests get reordered by the software or hardware queue
>   mechanism, nr_hw_queues * nr_requests - 1 retries are sufficient to
>   reorder the write requests.

As mentioned in my previous reply for patch 2, AHCI will not behave like
that, at all. Retrying will be useless most of the time because the
adapter send commands to the drive randomly from the set of commands that
are marked ready in the ready register. So no. I am opposed to
unconditionally removing zone write locking.

If UFS LLD can deliver commands in order then use a queue flag to say
"zone write locking not needed" to disable it for that device class. There
probably are some SAS HBAs that could benefit from this too, but I have
seen so many reordering bugs with these (e.g. requeue at tail of a write
that got a TSF) that I would not want to remove zone write locking for these.

And I also do not want to start getting 10 calls a day from customers
complaining about very bad write performance due to all these retries
which are likely going to be slower in the end than writing at QD=1 per zone.

> - It happens infrequently that zoned write requests are reordered by the
>   block layer.

What make you say that ? It only takes 2 contexts trying to dispatch
commands to different queues. Or a write process being rescheduled to a
different CPU/queue.

> - Either no I/O scheduler is used or an I/O scheduler is used that
>   submits write requests per zone in LBA order.

And unfortunately, as the AHCI example shows, having the scheduler
dispatch requests in LBA order is not enough.

> 
> DD_BE_PRIO is selected for sequential writes to preserve the LBA order.

So if the application wanted the writes to have RT policy so that these
commands get the high priority bit set on SATA disks, that will not be
honored. No to that too.

> 
> See also commit 5700f69178e9 ("mq-deadline: Introduce zone locking
> support").
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 74 ++++-----------------------------------------
>  1 file changed, 6 insertions(+), 68 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 6ed602b2f80a..e168fc9a980a 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -104,7 +104,6 @@ struct deadline_data {
>  	int prio_aging_expire;
>  
>  	spinlock_t lock;
> -	spinlock_t zone_lock;
>  };
>  
>  /* Maps an I/O priority class to a deadline scheduler priority. */
> @@ -285,30 +284,10 @@ static struct request *
>  deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  		      enum dd_data_dir data_dir)
>  {
> -	struct request *rq;
> -	unsigned long flags;
> -
>  	if (list_empty(&per_prio->fifo_list[data_dir]))
>  		return NULL;
>  
> -	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
> -	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
> -		return rq;
> -
> -	/*
> -	 * Look for a write request that can be dispatched, that is one with
> -	 * an unlocked target zone.
> -	 */
> -	spin_lock_irqsave(&dd->zone_lock, flags);
> -	list_for_each_entry(rq, &per_prio->fifo_list[DD_WRITE], queuelist) {
> -		if (blk_req_can_dispatch_to_zone(rq))
> -			goto out;
> -	}
> -	rq = NULL;
> -out:
> -	spin_unlock_irqrestore(&dd->zone_lock, flags);
> -
> -	return rq;
> +	return rq_entry_fifo(per_prio->fifo_list[data_dir].next);
>  }
>  
>  /*
> @@ -319,29 +298,7 @@ static struct request *
>  deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  		      enum dd_data_dir data_dir)
>  {
> -	struct request *rq;
> -	unsigned long flags;
> -
> -	rq = per_prio->next_rq[data_dir];
> -	if (!rq)
> -		return NULL;
> -
> -	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
> -		return rq;
> -
> -	/*
> -	 * Look for a write request that can be dispatched, that is one with
> -	 * an unlocked target zone.
> -	 */
> -	spin_lock_irqsave(&dd->zone_lock, flags);
> -	while (rq) {
> -		if (blk_req_can_dispatch_to_zone(rq))
> -			break;
> -		rq = deadline_latter_request(rq);
> -	}
> -	spin_unlock_irqrestore(&dd->zone_lock, flags);
> -
> -	return rq;
> +	return per_prio->next_rq[data_dir];
>  }
>  
>  /*
> @@ -467,10 +424,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  	ioprio_class = dd_rq_ioclass(rq);
>  	prio = ioprio_class_to_prio[ioprio_class];
>  	dd->per_prio[prio].stats.dispatched++;
> -	/*
> -	 * If the request needs its target zone locked, do it.
> -	 */
> -	blk_req_zone_write_lock(rq);
>  	rq->rq_flags |= RQF_STARTED;
>  	return rq;
>  }
> @@ -640,7 +593,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  	dd->fifo_batch = fifo_batch;
>  	dd->prio_aging_expire = prio_aging_expire;
>  	spin_lock_init(&dd->lock);
> -	spin_lock_init(&dd->zone_lock);
>  
>  	q->elevator = eq;
>  	return 0;
> @@ -716,17 +668,13 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
>  	struct dd_per_prio *per_prio;
>  	enum dd_prio prio;
> +	bool seq_write = blk_rq_is_seq_write(rq);
>  	LIST_HEAD(free);
>  
>  	lockdep_assert_held(&dd->lock);
>  
> -	/*
> -	 * This may be a requeue of a write request that has locked its
> -	 * target zone. If it is the case, this releases the zone lock.
> -	 */
> -	blk_req_zone_write_unlock(rq);
> -
> -	prio = ioprio_class_to_prio[ioprio_class];
> +	prio = seq_write ? DD_BE_PRIO :
> +		ioprio_class_to_prio[ioprio_class];
>  	per_prio = &dd->per_prio[prio];
>  	if (!rq->elv.priv[0]) {
>  		per_prio->stats.inserted++;
> @@ -740,7 +688,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  
>  	trace_block_rq_insert(rq);
>  
> -	if (at_head) {
> +	if (at_head && !seq_write) {
>  		list_add(&rq->queuelist, &per_prio->dispatch);
>  		rq->fifo_time = jiffies;
>  	} else {
> @@ -819,16 +767,6 @@ static void dd_finish_request(struct request *rq)
>  		return;
>  
>  	atomic_inc(&per_prio->stats.completed);
> -
> -	if (blk_queue_is_zoned(q)) {
> -		unsigned long flags;
> -
> -		spin_lock_irqsave(&dd->zone_lock, flags);
> -		blk_req_zone_write_unlock(rq);
> -		if (!list_empty(&per_prio->fifo_list[DD_WRITE]))
> -			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
> -		spin_unlock_irqrestore(&dd->zone_lock, flags);
> -	}
>  }
>  
>  static bool dd_has_work_for_prio(struct dd_per_prio *per_prio)


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-14 22:39     ` Bart Van Assche
@ 2022-06-14 23:50       ` Damien Le Moal
  2022-06-14 23:54         ` Bart Van Assche
  0 siblings, 1 reply; 24+ messages in thread
From: Damien Le Moal @ 2022-06-14 23:50 UTC (permalink / raw)
  To: Bart Van Assche, Khazhy Kumykov
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Jaegeuk Kim

On 6/15/22 07:39, Bart Van Assche wrote:
> On 6/14/22 14:47, Khazhy Kumykov wrote:
>> On Tue, Jun 14, 2022 at 10:49 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>>  From ZBC-2: "The device server terminates with CHECK CONDITION status, with
>>> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
>>> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
>>> same command, that specifies: a) the starting LBA in a sequential write
>>> required zone set to a value that is not equal to the write pointer for that
>>> sequential write required zone; or b) an ending LBA that is not equal to the
>>> last logical block within a physical block (see SBC-5)."
>>>
>>> I am not aware of any other conditions that may trigger the UNALIGNED
>>> WRITE COMMAND response.
>>>
>>> Retry unaligned writes in preparation of removing zone locking.
>> Is /just/ retrying effective here? A series of writes to the same zone
>> would all need to be sent in order - in the worst case (requests
>> somehow ordered in reverse order) this becomes quadratic as only 1
>> request "succeeds" out of the N outstanding requests, with the rest
>> all needing to retry. (Imagine a user writes an entire "zone" - which
>> could be split into hundreds of requests).
>>
>> Block layer / schedulers are free to do this reordering, which I
>> understand does happen whenever we need to requeue - and would result
>> in a retry of all writes after the first re-ordered request. (side
>> note: fwiw "requests somehow in reverse order" can happen - bfq
>> inherited cfq's odd behavior of sometimes issuing sequential IO in
>> reverse order due to back_seek, e.g.)
> 
> Hi Khazhy,
> 
> For zoned block devices I propose to only support those I/O schedulers 
> that either preserve the LBA order or fix the LBA order if two or more 
> out-of-order requests are received by the I/O scheduler.

We try that "fix" with the work for zoned btrfs. It does not work. Even
adding a delay to wait for out of order requests (if there is a hole in a
write sequence) does not reliably work as FSes may sometimes take 10s of
seconds to issue all write requests that can be all ordered into a nice
write stream. Even with that delay increased to minutes, we were still
seeing unaligned write errors.

> 
> I agree that in the worst case the number of retries is proportional to 
> the square of the number of pending requests. However, for the use case 
> that matters most to me, F2FS on top of a UFS device, we haven't seen 
> any retries in our tests without I/O scheduler. This is probably because 
> of how F2FS submits writes combined with the UFS controller only 
> supporting a single hardware queue. I expect to see a small number of 
> retries once UFS controllers become available that support multiple 
> hardware queues.
> 
> Thanks,
> 
> Bart.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-14 23:50       ` Damien Le Moal
@ 2022-06-14 23:54         ` Bart Van Assche
  2022-06-15  0:55           ` Damien Le Moal
  0 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2022-06-14 23:54 UTC (permalink / raw)
  To: Damien Le Moal, Khazhy Kumykov
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Jaegeuk Kim

On 6/14/22 16:50, Damien Le Moal wrote:
> We try that "fix" with the work for zoned btrfs. It does not work. Even
> adding a delay to wait for out of order requests (if there is a hole in a
> write sequence) does not reliably work as FSes may sometimes take 10s of
> seconds to issue all write requests that can be all ordered into a nice
> write stream. Even with that delay increased to minutes, we were still
> seeing unaligned write errors.

Please clarify. Doesn't BTRFS use REQ_OP_ZONE_APPEND for zoned storage? 
REQ_OP_ZONE_APPEND requests do not trigger write errors if reordered.

Additionally, our tests with F2FS on top of this patch series pass.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-14 23:29   ` Damien Le Moal
@ 2022-06-14 23:56     ` Bart Van Assche
  2022-06-15  1:09       ` Damien Le Moal
  2022-06-15  5:49       ` Christoph Hellwig
  0 siblings, 2 replies; 24+ messages in thread
From: Bart Van Assche @ 2022-06-14 23:56 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim

On 6/14/22 16:29, Damien Le Moal wrote:
> On 6/15/22 02:49, Bart Van Assche wrote:
>>  From ZBC-2: "The device server terminates with CHECK CONDITION status, with
>> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
>> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
>> same command, that specifies: a) the starting LBA in a sequential write
>> required zone set to a value that is not equal to the write pointer for that
>> sequential write required zone; or b) an ending LBA that is not equal to the
>> last logical block within a physical block (see SBC-5)."
>>
>> I am not aware of any other conditions that may trigger the UNALIGNED
>> WRITE COMMAND response.
>>
>> Retry unaligned writes in preparation of removing zone locking.
> 
> Arg. No. No way. AHCI will totally break with that because most AHCI
> adapters do not send commands to the drive in the order they are delivered
> to the LLD. In more details, the order in which tag bit in the AHCI ready
> register are set does not determine the order of command delivery to the
> disk. So if zone locking is removed, you constantly get unaligned write
> errors.

The performance penalty of zone locking is not acceptable for our use 
case. Does this mean that zone locking needs to be preserved for AHCI 
but not for UFS?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-14 23:54         ` Bart Van Assche
@ 2022-06-15  0:55           ` Damien Le Moal
  0 siblings, 0 replies; 24+ messages in thread
From: Damien Le Moal @ 2022-06-15  0:55 UTC (permalink / raw)
  To: Bart Van Assche, Khazhy Kumykov
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Jaegeuk Kim

On 6/15/22 08:54, Bart Van Assche wrote:
> On 6/14/22 16:50, Damien Le Moal wrote:
>> We try that "fix" with the work for zoned btrfs. It does not work. Even
>> adding a delay to wait for out of order requests (if there is a hole in a
>> write sequence) does not reliably work as FSes may sometimes take 10s of
>> seconds to issue all write requests that can be all ordered into a nice
>> write stream. Even with that delay increased to minutes, we were still
>> seeing unaligned write errors.
> 
> Please clarify. Doesn't BTRFS use REQ_OP_ZONE_APPEND for zoned storage? 
> REQ_OP_ZONE_APPEND requests do not trigger write errors if reordered.

The trial I am mentioning above was before we moved to using zone append
and implementing scsi emulation for it.

Note that btrfs uses zone append write for data only. Metadata writes
still require regular write commands.

> 
> Additionally, our tests with F2FS on top of this patch series pass.

Try on an AHCI SMR drive and I am 100% sure that you will see unaligned
write errors.

> 
> Thanks,
> 
> Bart.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-14 23:56     ` Bart Van Assche
@ 2022-06-15  1:09       ` Damien Le Moal
  2022-06-15  5:49       ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Damien Le Moal @ 2022-06-15  1:09 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim

On 6/15/22 08:56, Bart Van Assche wrote:
> On 6/14/22 16:29, Damien Le Moal wrote:
>> On 6/15/22 02:49, Bart Van Assche wrote:
>>>  From ZBC-2: "The device server terminates with CHECK CONDITION status, with
>>> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
>>> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
>>> same command, that specifies: a) the starting LBA in a sequential write
>>> required zone set to a value that is not equal to the write pointer for that
>>> sequential write required zone; or b) an ending LBA that is not equal to the
>>> last logical block within a physical block (see SBC-5)."
>>>
>>> I am not aware of any other conditions that may trigger the UNALIGNED
>>> WRITE COMMAND response.
>>>
>>> Retry unaligned writes in preparation of removing zone locking.
>>
>> Arg. No. No way. AHCI will totally break with that because most AHCI
>> adapters do not send commands to the drive in the order they are delivered
>> to the LLD. In more details, the order in which tag bit in the AHCI ready
>> register are set does not determine the order of command delivery to the
>> disk. So if zone locking is removed, you constantly get unaligned write
>> errors.
> 
> The performance penalty of zone locking is not acceptable for our use 
> case. Does this mean that zone locking needs to be preserved for AHCI 
> but not for UFS?

I did mention that: if for a UFS device it is OK to not have zone write
locking, then sure, have mq-deadline not use it and eventually even do not
set ELEVATOR_F_ZBD_SEQ_WRITE for the device queue. But AHCI and SAS HBAs
definitely still need it. NVMe too since all it would take to see an
unaligned write is to have the writer context being rescheduled to a
different CPU or multiple contexts simultaneously writing. Also note that
the command requeue path uses a workqueue and that also results in
reordering, potentially with large delays. I seriously doubt that any
reasonable amount of retry will prevent unaligned write errors if there is
a requeue.

Another solution would be to try to hold the zone write lock for a shorter
interval. All we need is to guarantee in order delivery to the device. We
do not care about completion order. So theoretically, all we need, is to
have the LLD unlock the zone after it issues a write to a device. That is
very tricky to do though as that could be very racy. And that is not
always possible too. E.g., for AHCI, the "command delivered to the device"
essentially boils down to "command tag marked as ready in ready register".
But then you need to wait for that bit to be cleared before setting any
other bit for the next write command in sequence (the bit being cleared
means that the drive got the command). And with the current dispatch push
model, that is not easily possible. We would need to go back to legacy
command pull model.

Also note that for ATA & SAS, with recent drives, the performance penalty
of zone write locking is almost nill as long as the drive is running with
write-cache enabled. And even with write-cache disabled, recent drives are
almost as fast as the WCE case.

> 
> Thanks,
> 
> Bart.
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-14 23:56     ` Bart Van Assche
  2022-06-15  1:09       ` Damien Le Moal
@ 2022-06-15  5:49       ` Christoph Hellwig
  2022-06-15  7:21         ` Damien Le Moal
                           ` (2 more replies)
  1 sibling, 3 replies; 24+ messages in thread
From: Christoph Hellwig @ 2022-06-15  5:49 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Damien Le Moal, Jens Axboe, linux-block, Christoph Hellwig,
	Damien Le Moal, Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim

On Tue, Jun 14, 2022 at 04:56:52PM -0700, Bart Van Assche wrote:
> The performance penalty of zone locking is not acceptable for our use case. 
> Does this mean that zone locking needs to be preserved for AHCI but not for 
> UFS?

It means you use case needs to use zone append, and we need to make sure
it is added to SCSI assuming your are on SCSI based on your other comments.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-15  5:49       ` Christoph Hellwig
@ 2022-06-15  7:21         ` Damien Le Moal
  2022-06-15 19:38           ` Bart Van Assche
  2022-06-15 19:42         ` Bart Van Assche
  2022-06-16 18:36         ` Bart Van Assche
  2 siblings, 1 reply; 24+ messages in thread
From: Damien Le Moal @ 2022-06-15  7:21 UTC (permalink / raw)
  To: Christoph Hellwig, Bart Van Assche
  Cc: Jens Axboe, linux-block, Damien Le Moal, Martin K . Petersen,
	Khazhy Kumykov, Jaegeuk Kim

On 6/15/22 14:49, Christoph Hellwig wrote:
> On Tue, Jun 14, 2022 at 04:56:52PM -0700, Bart Van Assche wrote:
>> The performance penalty of zone locking is not acceptable for our use case. 
>> Does this mean that zone locking needs to be preserved for AHCI but not for 
>> UFS?
> 
> It means you use case needs to use zone append, and we need to make sure
> it is added to SCSI assuming your are on SCSI based on your other comments.

For scsi, we already have the zone append emulation in the sd driver which
issues regular writes together with taking the zone lock. So UFS has zone
append, not as a native command though. But if for UFS device there are no
issue of command reordering underneath sd, then the zone append emulation
can be done without taking the zone write lock, which would result in the
same performance as what a native zone append command would give.

At least for the short term, that could be a good solution until native
zone append is added to zbc specs. That last part may face a lot of
pushback because of the difficulty of having that same command on ATA side.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-15  7:21         ` Damien Le Moal
@ 2022-06-15 19:38           ` Bart Van Assche
  2022-06-16  0:14             ` Damien Le Moal
  0 siblings, 1 reply; 24+ messages in thread
From: Bart Van Assche @ 2022-06-15 19:38 UTC (permalink / raw)
  To: Damien Le Moal, Christoph Hellwig
  Cc: Jens Axboe, linux-block, Damien Le Moal, Martin K . Petersen,
	Khazhy Kumykov, Jaegeuk Kim

On 6/15/22 00:21, Damien Le Moal wrote:
> On 6/15/22 14:49, Christoph Hellwig wrote:
>> On Tue, Jun 14, 2022 at 04:56:52PM -0700, Bart Van Assche wrote:
>>> The performance penalty of zone locking is not acceptable for our use case.
>>> Does this mean that zone locking needs to be preserved for AHCI but not for
>>> UFS?
>>
>> It means you use case needs to use zone append, and we need to make sure
>> it is added to SCSI assuming your are on SCSI based on your other comments.
> 
> For scsi, we already have the zone append emulation in the sd driver which
> issues regular writes together with taking the zone lock. So UFS has zone
> append, not as a native command though. But if for UFS device there are no
> issue of command reordering underneath sd, then the zone append emulation
> can be done without taking the zone write lock, which would result in the
> same performance as what a native zone append command would give.
> 
> At least for the short term, that could be a good solution until native
> zone append is added to zbc specs. That last part may face a lot of
> pushback because of the difficulty of having that same command on ATA side.

Is there more information available about the difficulties of defining a 
ZONE APPEND command for ATA? It seems to me that there is enough space 
available in the ATA Status Return Descriptor to report the LBA back to 
the host?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-15  5:49       ` Christoph Hellwig
  2022-06-15  7:21         ` Damien Le Moal
@ 2022-06-15 19:42         ` Bart Van Assche
  2022-06-16 18:36         ` Bart Van Assche
  2 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2022-06-15 19:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Damien Le Moal, Jens Axboe, linux-block, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim

On 6/14/22 22:49, Christoph Hellwig wrote:
> On Tue, Jun 14, 2022 at 04:56:52PM -0700, Bart Van Assche wrote:
>> The performance penalty of zone locking is not acceptable for our use case.
>> Does this mean that zone locking needs to be preserved for AHCI but not for
>> UFS?
> 
> It means you use case needs to use zone append, and we need to make sure
> it is added to SCSI assuming your are on SCSI based on your other comments.

Unfortunately I do not know enough about F2FS to comment on whether or 
not making it use REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE is feasible.

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-15 19:38           ` Bart Van Assche
@ 2022-06-16  0:14             ` Damien Le Moal
  0 siblings, 0 replies; 24+ messages in thread
From: Damien Le Moal @ 2022-06-16  0:14 UTC (permalink / raw)
  To: Bart Van Assche, Christoph Hellwig
  Cc: Jens Axboe, linux-block, Damien Le Moal, Martin K . Petersen,
	Khazhy Kumykov, Jaegeuk Kim

On 2022/06/16 4:38, Bart Van Assche wrote:
> On 6/15/22 00:21, Damien Le Moal wrote:
>> On 6/15/22 14:49, Christoph Hellwig wrote:
>>> On Tue, Jun 14, 2022 at 04:56:52PM -0700, Bart Van Assche wrote:
>>>> The performance penalty of zone locking is not acceptable for our use case.
>>>> Does this mean that zone locking needs to be preserved for AHCI but not for
>>>> UFS?
>>>
>>> It means you use case needs to use zone append, and we need to make sure
>>> it is added to SCSI assuming your are on SCSI based on your other comments.
>>
>> For scsi, we already have the zone append emulation in the sd driver which
>> issues regular writes together with taking the zone lock. So UFS has zone
>> append, not as a native command though. But if for UFS device there are no
>> issue of command reordering underneath sd, then the zone append emulation
>> can be done without taking the zone write lock, which would result in the
>> same performance as what a native zone append command would give.
>>
>> At least for the short term, that could be a good solution until native
>> zone append is added to zbc specs. That last part may face a lot of
>> pushback because of the difficulty of having that same command on ATA side.
> 
> Is there more information available about the difficulties of defining a 
> ZONE APPEND command for ATA? It seems to me that there is enough space 
> available in the ATA Status Return Descriptor to report the LBA back to 
> the host?

Unfortunately no, there is not enough space. See table 354 of ACS-5
specifications for the normal output of NCQ commands: 1B error, 1B status and 4B
sactive field. That is all you get. No room for an LBA. And even worse: sata IO
does not define any FIS to do that (return an LBA). So adding zone append to ATA
without relying on an extra command to get the written LBA (similarly to how you
get sense data from ATA drives using an extra command) would need both ATA/ACS
changes *and* SATA-IO changes, which mean that all HBAs/AHCI/SATA adapters
existing today would not work.

But in the context of UFS only, it should be possible to add zone append to ZBC
without going into ZAC. Since we have zone append emulation already, that would
not break anything in Linux stack.

> 
> Thanks,
> 
> Bart.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/5] scsi: Retry unaligned zoned writes
  2022-06-15  5:49       ` Christoph Hellwig
  2022-06-15  7:21         ` Damien Le Moal
  2022-06-15 19:42         ` Bart Van Assche
@ 2022-06-16 18:36         ` Bart Van Assche
  2 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2022-06-16 18:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Damien Le Moal, Jens Axboe, linux-block, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim

On 6/14/22 22:49, Christoph Hellwig wrote:
> On Tue, Jun 14, 2022 at 04:56:52PM -0700, Bart Van Assche wrote:
>> The performance penalty of zone locking is not acceptable for our use case.
>> Does this mean that zone locking needs to be preserved for AHCI but not for
>> UFS?
> 
> It means you use case needs to use zone append, and we need to make sure
> it is added to SCSI assuming your are on SCSI based on your other comments.

Using ZONE APPEND instead of SCSI WRITE commands would have the 
following consequences:
* Data of a single file might be stored out-of-order on the storage medium.
* A translation table would have to be stored on the storage medium with 
one entry per ZONE APPEND command with the LBAs of where the ZONE APPEND 
commands wrote their data. I think that translation table can be 
considered as another L2P table. One of our motivations to switch to 
zoned storage is to get rid of nested L2P tables.

Both aspects above would have a negative impact on sequential read 
performance. Hence our preference to use WRITE commands for zoned 
storage instead of ZONE APPEND commands.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] block: Introduce the blk_rq_is_seq_write() function
  2022-06-14 17:49 ` [PATCH 1/5] block: Introduce the blk_rq_is_seq_write() function Bart Van Assche
@ 2022-06-16 20:41   ` Jens Axboe
  2022-06-16 21:16     ` Bart Van Assche
  0 siblings, 1 reply; 24+ messages in thread
From: Jens Axboe @ 2022-06-16 20:41 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim

On 6/14/22 11:49 AM, Bart Van Assche wrote:
> Introduce a function that makes it easy to verify whether a write
> request is for a sequential write required or sequential write preferred
> zone.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  include/linux/blk-mq.h | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index e2d9daf7e8dd..3e7feb48105f 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -1129,6 +1129,15 @@ static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
>  	return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq));
>  }
>  
> +/**
> + * blk_rq_is_seq_write() - Whether @rq is a write request for a sequential zone.
> + * @rq: Request to examine.
> + */
> +static inline bool blk_rq_is_seq_write(struct request *rq)
> +{
> +	return req_op(rq) == REQ_OP_WRITE && blk_rq_zone_is_seq(rq);
> +}

This should include something telling us it's a zone thing, because it
sounds generic. blk_rq_is_zoned_seq_write()?


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/5] block: Introduce the blk_rq_is_seq_write() function
  2022-06-16 20:41   ` Jens Axboe
@ 2022-06-16 21:16     ` Bart Van Assche
  0 siblings, 0 replies; 24+ messages in thread
From: Bart Van Assche @ 2022-06-16 21:16 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Damien Le Moal,
	Martin K . Petersen, Khazhy Kumykov, Jaegeuk Kim

On 6/16/22 13:41, Jens Axboe wrote:
> On 6/14/22 11:49 AM, Bart Van Assche wrote:
>> Introduce a function that makes it easy to verify whether a write
>> request is for a sequential write required or sequential write preferred
>> zone.
>>
>> Cc: Damien Le Moal <damien.lemoal@wdc.com>
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>> ---
>>   include/linux/blk-mq.h | 14 ++++++++++++++
>>   1 file changed, 14 insertions(+)
>>
>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index e2d9daf7e8dd..3e7feb48105f 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -1129,6 +1129,15 @@ static inline unsigned int blk_rq_zone_is_seq(struct request *rq)
>>   	return blk_queue_zone_is_seq(rq->q, blk_rq_pos(rq));
>>   }
>>   
>> +/**
>> + * blk_rq_is_seq_write() - Whether @rq is a write request for a sequential zone.
>> + * @rq: Request to examine.
>> + */
>> +static inline bool blk_rq_is_seq_write(struct request *rq)
>> +{
>> +	return req_op(rq) == REQ_OP_WRITE && blk_rq_zone_is_seq(rq);
>> +}
> 
> This should include something telling us it's a zone thing, because it
> sounds generic. blk_rq_is_zoned_seq_write()?

Agreed. I will rename this function before I repost this patch series.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-06-16 21:16 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-14 17:49 [PATCH 0/5] Improve zoned storage write performance Bart Van Assche
2022-06-14 17:49 ` [PATCH 1/5] block: Introduce the blk_rq_is_seq_write() function Bart Van Assche
2022-06-16 20:41   ` Jens Axboe
2022-06-16 21:16     ` Bart Van Assche
2022-06-14 17:49 ` [PATCH 2/5] scsi: Retry unaligned zoned writes Bart Van Assche
2022-06-14 21:47   ` Khazhy Kumykov
2022-06-14 22:39     ` Bart Van Assche
2022-06-14 23:50       ` Damien Le Moal
2022-06-14 23:54         ` Bart Van Assche
2022-06-15  0:55           ` Damien Le Moal
2022-06-14 23:29   ` Damien Le Moal
2022-06-14 23:56     ` Bart Van Assche
2022-06-15  1:09       ` Damien Le Moal
2022-06-15  5:49       ` Christoph Hellwig
2022-06-15  7:21         ` Damien Le Moal
2022-06-15 19:38           ` Bart Van Assche
2022-06-16  0:14             ` Damien Le Moal
2022-06-15 19:42         ` Bart Van Assche
2022-06-16 18:36         ` Bart Van Assche
2022-06-14 17:49 ` [PATCH 3/5] nvme: Make the number of retries request specific Bart Van Assche
2022-06-14 17:49 ` [PATCH 4/5] nvme: Increase the number of retries for zoned writes Bart Van Assche
2022-06-14 18:03   ` Keith Busch
2022-06-14 17:49 ` [PATCH 5/5] block/mq-deadline: Remove zone locking Bart Van Assche
2022-06-14 23:40   ` Damien Le Moal

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.