* [PATCH v3 00/18] Add Command Duration Limits support
@ 2023-01-24 19:02 Niklas Cassel
  2023-01-24 19:02 ` [PATCH v3 01/18] block: introduce duration-limits priority class Niklas Cassel
                   ` (17 more replies)
  0 siblings, 18 replies; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC
  To: Paolo Valente, Jens Axboe, Damien Le Moal, James E.J. Bottomley,
	Martin K. Petersen
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block, Niklas Cassel

Hello,

This series adds support for Command Duration Limits.
The series is based on linux-next tag: next-20230124
The series can also be found in git:
https://github.com/floatious/linux/commits/cdl-v3


=================
CDL in ATA / SCSI
=================
Command Duration Limits is defined in:
T13 ATA Command Set - 5 (ACS-5) and
T10 SCSI Primary Commands - 6 (SPC-6) respectively
(a simpler version of CDL is defined in T10 SPC-5).

CDL defines Duration Limits Descriptors (DLDs): 7 DLDs for read commands
and 7 DLDs for write commands.
Simply put, a DLD contains a limit and a policy.

A command can specify that a certain limit should be applied by setting
the DLD index field (3 bits, so 0-7) in the command itself.

The DLD index points to one of the 7 DLDs.
DLD index 0 means no descriptor, so no limit.
DLD index 1-7 means DLD 1-7.

A DLD can have a few different policies, but the two major ones are:
-Policy 0xF (abort): the command is completed with a command aborted error
(ATA) or status CHECK CONDITION (SCSI), with sense data indicating that
the command timed out.
-Policy 0xD (complete-unavailable): the command is completed without
error (ATA) or with status GOOD (SCSI), with sense data indicating that
the command timed out. Note that the command will not have transferred any
data to/from the device when it timed out, even though it returned
success.

Regardless of the CDL policy, in case of a CDL timeout, the I/O will
result in a -ETIME error to user-space.
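
As an illustration of how an application might react to this, here is a
minimal user-space sketch that separates a CDL timeout from other I/O
failures. The enum and the function name are hypothetical and not part of
this series; the only thing taken from the text above is that a CDL
timeout is reported as -ETIME.

```c
#include <errno.h>

/* Hypothetical outcome categories; not part of the patch series. */
enum io_outcome {
	IO_DONE,		/* I/O completed and transferred data */
	IO_DURATION_LIMIT,	/* duration limit exceeded (-ETIME) */
	IO_FAILED,		/* any other error, e.g. -EIO */
};

/*
 * Classify a completion result given in the negative-errno convention
 * used by io_uring (cqe->res): >= 0 is a byte count, < 0 is -errno.
 */
static enum io_outcome classify_io_result(long res)
{
	if (res >= 0)
		return IO_DONE;
	if (res == -ETIME)
		return IO_DURATION_LIMIT;
	return IO_FAILED;
}
```

A caller could, for example, reissue an IO_DURATION_LIMIT request with DLD
index 0 (no limit) or redirect it to another replica, while treating
IO_FAILED as a real error.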

The DLDs are defined in the CDL log page(s) and are readable and writable.
For convenience, the kernel provides a sysfs interface for reading the
descriptors. If a user really wants to change the descriptors, they can do
so using a user-space application that sends passthrough commands.
One such application is cdl-tools:
https://github.com/westerndigitalcorporation/cdl-tools


==============================
How to use CDL from user-space
==============================
Since CDL is mutually exclusive with NCQ priority
(see ncq_prio_enable and sas_ncq_prio_enable in
Documentation/ABI/testing/sysfs-block-device),
CDL has to be enabled using:
echo 1 > /sys/block/$bdev/device/duration_limits/enable

In order for user-space to be able to select a specific DLD for an I/O,
we have decided to reuse the I/O priority API.

This means that we introduce a new priority class (IOPRIO_CLASS_DL).
When using this class, the existing I/O priority levels (0-7) directly
indicate the DLD index to use.

By reusing the I/O priority API, the user can select the DLD to use either
per I/O (io_uring sqe->ioprio or libaio iocb->aio_reqprio) or per thread
(ioprio_set()).
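
The encoding described above can be sketched in user-space C as follows.
This is only a sketch under stated assumptions: the EX_* names are local to
the example, the class shift of 13 mirrors the existing ioprio uapi header,
and the class value 4 is what IOPRIO_CLASS_DL evaluates to with this series
applied (it follows IOPRIO_CLASS_IDLE in the enum).

```c
#include <stdint.h>

/* Local copies of values from include/uapi/linux/ioprio.h (see above). */
#define EX_IOPRIO_CLASS_SHIFT	13
#define EX_IOPRIO_CLASS_DL	4	/* assumption: value with this series */
#define EX_IOPRIO_DL_NR_LEVELS	8

/*
 * Encode a DLD index (0 = no limit, 1-7 = descriptor) as an ioprio value.
 * Returns 0 on success, -1 if the index does not fit the level field;
 * *ioprio is left untouched on failure.
 */
static int cdl_ioprio_value(unsigned int dld_index, uint16_t *ioprio)
{
	if (dld_index >= EX_IOPRIO_DL_NR_LEVELS)
		return -1;
	*ioprio = (uint16_t)((EX_IOPRIO_CLASS_DL << EX_IOPRIO_CLASS_SHIFT) |
			     dld_index);
	return 0;
}
```

The resulting value could be stored in sqe->ioprio for a single io_uring
request, or applied to the calling thread with
syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, value). A kernel without
this series is expected to reject the class.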


=======
Testing
=======
With the following fio patch that simply adds the new priority class:
https://github.com/westerndigitalcorporation/cdl-tools/blob/main/patches/fio-3.29-and-newer/0001-os-linux-Add-IORPIO_CLASS_DL-definition.patch

CDL can be tested using fio, e.g.:
fio --ioengine=io_uring --cmdprio_percentage=10 --cmdprio_class=4 --cmdprio=DLD_index

A simple way to test is to use a DLD with a very short duration limit,
and send large reads. Regardless of the CDL policy, in case of a CDL
timeout, the I/O will result in a -ETIME error to user-space.

We also provide a CDL test suite located in the cdl-tools repo, see:
https://github.com/westerndigitalcorporation/cdl-tools/blob/main/README.md#testing-a-system-command-duration-limits-support


We have tested this patch series using:
-real hardware
-the following QEMU implementation:
https://github.com/floatious/qemu/tree/cdl
(NOTE: the QEMU implementation requires you to define the CDL policy at compile
time, so you currently need to recompile QEMU when switching between policies.)


===================
Further information
===================
For further information about CDL, see Damien's slides:

Presented at SDC 2021:
https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf

Presented at Lund Linux Con 2022:
https://drive.google.com/file/d/1I6ChFc0h4JY9qZdO1bY5oCAdYCSZVqWw/view?usp=sharing


================
Changes since V2
================
-Reordered the patches by subsystem, so that the different subsystem maintainers
 can pick up a single range of patches to their respective tree.
-Dropped extern keyword when modifying SCSI function declarations. (Christoph)
-Renamed flag SCMD_EH_SUCCESS_CMD to SCMD_FORCE_EH_SUCCESS. (Christoph)
-Improved commit message for patch "block: introduce duration-limits priority
 class". (Christoph)
-Added a new patch (10/18) that removes unnecessary !cmd checks. (Christoph)
-Modified ata_eh_request_sense(), instead of taking an extra parameter,
 let the caller set scsicmd->result. (Christoph)
-Dropped the patch that changed ata_scsi_set_sense(), let CDL specific code
 call scsi_build_sense_buffer() directly instead. (Christoph)
-Picked up Reviewed-by tags from Hannes and Christoph.


For older change logs, see previous patch series versions:
https://lore.kernel.org/linux-scsi/20230112140412.667308-1-niklas.cassel@wdc.com/
https://lore.kernel.org/linux-scsi/20221208105947.2399894-1-niklas.cassel@wdc.com/


Kind regards,
Niklas & Damien

Damien Le Moal (12):
  block: introduce duration-limits priority class
  block: introduce BLK_STS_DURATION_LIMIT
  scsi: support retrieving sub-pages of mode pages
  scsi: support service action in scsi_report_opcode()
  scsi: sd: detect support for command duration limits
  scsi: sd: set read/write commands CDL index
  ata: libata: detect support for command duration limits
  ata: libata-scsi: handle CDL bits in ata_scsiop_maint_in()
  ata: libata-scsi: add support for CDL pages mode sense
  ata: libata: add ATA feature control sub-page translation
  ata: libata: set read/write commands CDL index
  Documentation: sysfs-block-device: document command duration limits

Niklas Cassel (6):
  scsi: core: allow libata to complete successful commands via EH
  scsi: rename and move get_scsi_ml_byte()
  scsi: sd: handle read/write CDL timeout failures
  ata: libata-scsi: remove unnecessary !cmd checks
  ata: libata: change ata_eh_request_sense() to not set CHECK_CONDITION
  ata: libata: handle completion of CDL commands using policy 0xD

 Documentation/ABI/testing/sysfs-block-device | 150 ++++
 block/bfq-iosched.c                          |  10 +
 block/blk-core.c                             |   3 +
 block/blk-ioprio.c                           |   3 +
 block/ioprio.c                               |   3 +-
 block/mq-deadline.c                          |   1 +
 drivers/ata/libata-core.c                    | 215 ++++-
 drivers/ata/libata-eh.c                      | 130 ++-
 drivers/ata/libata-sata.c                    | 103 ++-
 drivers/ata/libata-scsi.c                    | 371 ++++++--
 drivers/ata/libata.h                         |   2 +-
 drivers/scsi/Makefile                        |   2 +-
 drivers/scsi/scsi.c                          |  28 +-
 drivers/scsi/scsi_error.c                    |  49 +-
 drivers/scsi/scsi_lib.c                      |  15 +-
 drivers/scsi/scsi_priv.h                     |   6 +
 drivers/scsi/scsi_transport_sas.c            |   2 +-
 drivers/scsi/sd.c                            |  37 +-
 drivers/scsi/sd.h                            |  71 ++
 drivers/scsi/sd_cdl.c                        | 894 +++++++++++++++++++
 drivers/scsi/sr.c                            |   2 +-
 include/linux/ata.h                          |  11 +-
 include/linux/blk_types.h                    |   6 +
 include/linux/ioprio.h                       |   2 +-
 include/linux/libata.h                       |  42 +-
 include/scsi/scsi_cmnd.h                     |   5 +
 include/scsi/scsi_device.h                   |  13 +-
 include/uapi/linux/ioprio.h                  |   7 +
 28 files changed, 2039 insertions(+), 144 deletions(-)
 create mode 100644 drivers/scsi/sd_cdl.c

-- 
2.39.1



* [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-24 19:27   ` Bart Van Assche
  2023-01-27 12:43   ` Hannes Reinecke
  2023-01-24 19:02 ` [PATCH v3 02/18] block: introduce BLK_STS_DURATION_LIMIT Niklas Cassel
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC
  To: Paolo Valente, Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block, Niklas Cassel

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Introduce the IOPRIO_CLASS_DL priority class to indicate that IOs should
be executed using duration-limits targets. The duration target to apply
to a command is indicated using the priority level. Up to 8 levels are
supported, with level 0 indicating "no limit".

This priority class has effect only if the target device supports the
command duration limits feature and this feature is enabled by the user.

While it is recommended not to use an I/O scheduler when using the
IOPRIO_CLASS_DL priority class, if using the BFQ or mq-deadline scheduler,
IOPRIO_CLASS_DL is mapped to IOPRIO_CLASS_RT.

The reason for this is twofold:
1) Each priority level for the IOPRIO_CLASS_DL priority class represents a
duration limit descriptor (DLD) inside the device. Users can configure
these limits themselves using passthrough commands, so from a block layer
perspective, Linux has no idea of how each DLD is actually configured.

By mapping a command to IOPRIO_CLASS_RT, the chance that a command exceeds
its duration limit (because it was held too long in the scheduler) is
decreased. It is still possible to use the IOPRIO_CLASS_DL priority class
for "low priority" IOs by configuring a large limit in the respective DLD.

2) On ATA drives, IOPRIO_CLASS_DL commands and NCQ priority commands
(IOPRIO_CLASS_RT) cannot be used together. A mix of CDL and high priority
commands cannot be sent to a device. By mapping IOPRIO_CLASS_DL to
IOPRIO_CLASS_RT, we ensure that a device will never receive a mix of these
two incompatible priority classes.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 block/bfq-iosched.c         | 10 ++++++++++
 block/blk-ioprio.c          |  3 +++
 block/ioprio.c              |  3 ++-
 block/mq-deadline.c         |  1 +
 include/linux/ioprio.h      |  2 +-
 include/uapi/linux/ioprio.h |  7 +++++++
 6 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 815b884d6c5a..7add9346c585 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5545,6 +5545,14 @@ bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
 		bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE;
 		bfqq->new_ioprio = 7;
 		break;
+	case IOPRIO_CLASS_DL:
+		/*
+		 * For the duration-limits class, we want the disk to do the
+		 * scheduling. So map all levels to the highest RT level.
+		 */
+		bfqq->new_ioprio = 0;
+		bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
+		break;
 	}
 
 	if (bfqq->new_ioprio >= IOPRIO_NR_LEVELS) {
@@ -5673,6 +5681,8 @@ static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
 		return &bfqg->async_bfqq[1][ioprio][act_idx];
 	case IOPRIO_CLASS_IDLE:
 		return &bfqg->async_idle_bfqq[act_idx];
+	case IOPRIO_CLASS_DL:
+		return &bfqg->async_bfqq[0][0][act_idx];
 	default:
 		return NULL;
 	}
diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
index 8bb6b8eba4ce..dfb5c3f447f4 100644
--- a/block/blk-ioprio.c
+++ b/block/blk-ioprio.c
@@ -27,6 +27,7 @@
  * @POLICY_RESTRICT_TO_BE: modify IOPRIO_CLASS_NONE and IOPRIO_CLASS_RT into
  *		IOPRIO_CLASS_BE.
  * @POLICY_ALL_TO_IDLE: change the I/O priority class into IOPRIO_CLASS_IDLE.
+ * @POLICY_ALL_TO_DL: change the I/O priority class into IOPRIO_CLASS_DL.
  *
  * See also <linux/ioprio.h>.
  */
@@ -35,6 +36,7 @@ enum prio_policy {
 	POLICY_NONE_TO_RT	= 1,
 	POLICY_RESTRICT_TO_BE	= 2,
 	POLICY_ALL_TO_IDLE	= 3,
+	POLICY_ALL_TO_DL	= 4,
 };
 
 static const char *policy_name[] = {
@@ -42,6 +44,7 @@ static const char *policy_name[] = {
 	[POLICY_NONE_TO_RT]	= "none-to-rt",
 	[POLICY_RESTRICT_TO_BE]	= "restrict-to-be",
 	[POLICY_ALL_TO_IDLE]	= "idle",
+	[POLICY_ALL_TO_DL]	= "duration-limits",
 };
 
 static struct blkcg_policy ioprio_policy;
diff --git a/block/ioprio.c b/block/ioprio.c
index 32a456b45804..1b3a9da82597 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -37,6 +37,7 @@ int ioprio_check_cap(int ioprio)
 
 	switch (class) {
 		case IOPRIO_CLASS_RT:
+		case IOPRIO_CLASS_DL:
 			/*
 			 * Originally this only checked for CAP_SYS_ADMIN,
 			 * which was implicitly allowed for pid 0 by security
@@ -47,7 +48,7 @@ int ioprio_check_cap(int ioprio)
 			if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_NICE))
 				return -EPERM;
 			fallthrough;
-			/* rt has prio field too */
+			/* RT and DL have prio field too */
 		case IOPRIO_CLASS_BE:
 			if (data >= IOPRIO_NR_LEVELS || data < 0)
 				return -EINVAL;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f10c2a0d18d4..526d0ea4dbf9 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -113,6 +113,7 @@ static const enum dd_prio ioprio_class_to_prio[] = {
 	[IOPRIO_CLASS_RT]	= DD_RT_PRIO,
 	[IOPRIO_CLASS_BE]	= DD_BE_PRIO,
 	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
+	[IOPRIO_CLASS_DL]	= DD_RT_PRIO,
 };
 
 static inline struct rb_root *
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 7578d4f6a969..2f3fc2fbd668 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -20,7 +20,7 @@ static inline bool ioprio_valid(unsigned short ioprio)
 {
 	unsigned short class = IOPRIO_PRIO_CLASS(ioprio);
 
-	return class > IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_IDLE;
+	return class > IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_DL;
 }
 
 /*
diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index f70f2596a6bf..15908b9e9d8c 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -29,6 +29,7 @@ enum {
 	IOPRIO_CLASS_RT,
 	IOPRIO_CLASS_BE,
 	IOPRIO_CLASS_IDLE,
+	IOPRIO_CLASS_DL,
 };
 
 /*
@@ -37,6 +38,12 @@ enum {
 #define IOPRIO_NR_LEVELS	8
 #define IOPRIO_BE_NR		IOPRIO_NR_LEVELS
 
+/*
+ * The Duration limits class allows 8 levels: level 0 for "no limit" and levels
+ * 1 to 7, each corresponding to a read or write limit descriptor.
+ */
+#define IOPRIO_DL_NR_LEVELS	8
+
 enum {
 	IOPRIO_WHO_PROCESS = 1,
 	IOPRIO_WHO_PGRP,
-- 
2.39.1



* [PATCH v3 02/18] block: introduce BLK_STS_DURATION_LIMIT
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
  2023-01-24 19:02 ` [PATCH v3 01/18] block: introduce duration-limits priority class Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-24 19:29   ` Bart Van Assche
  2023-01-24 19:02 ` [PATCH v3 03/18] scsi: core: allow libata to complete successful commands via EH Niklas Cassel
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC
  To: Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block, Niklas Cassel

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
report commands that failed due to a command duration limit being
exceeded. This new status is mapped to the ETIME error code to allow
users to differentiate "soft" duration limit failures from other more
serious hardware related errors.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c          | 3 +++
 include/linux/blk_types.h | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 46d12b3344c9..9ca31b779fc1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -170,6 +170,9 @@ static const struct {
 	[BLK_STS_ZONE_OPEN_RESOURCE]	= { -ETOOMANYREFS, "open zones exceeded" },
 	[BLK_STS_ZONE_ACTIVE_RESOURCE]	= { -EOVERFLOW, "active zones exceeded" },
 
+	/* Command duration limit device-side timeout */
+	[BLK_STS_DURATION_LIMIT]	= { -ETIME, "duration limit exceeded" },
+
 	/* everything else not covered above: */
 	[BLK_STS_IOERR]		= { -EIO,	"I/O" },
 };
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..cde997590765 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -166,6 +166,12 @@ typedef u16 blk_short_t;
  */
 #define BLK_STS_OFFLINE		((__force blk_status_t)17)
 
+/*
+ * BLK_STS_DURATION_LIMIT is returned from the driver when the target device
+ * aborted the command because it exceeded one of its Command Duration Limits.
+ */
+#define BLK_STS_DURATION_LIMIT	((__force blk_status_t)18)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
-- 
2.39.1



* [PATCH v3 03/18] scsi: core: allow libata to complete successful commands via EH
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
  2023-01-24 19:02 ` [PATCH v3 01/18] block: introduce duration-limits priority class Niklas Cassel
  2023-01-24 19:02 ` [PATCH v3 02/18] block: introduce BLK_STS_DURATION_LIMIT Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-24 19:02 ` [PATCH v3 04/18] scsi: rename and move get_scsi_ml_byte() Niklas Cassel
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC
  To: James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block, Niklas Cassel

In SCSI, we get the sense data as part of the completion. For ATA,
however, we need to fetch the sense data as an extra step. For an
aborted ATA command the sense data is fetched via libata's
->eh_strategy_handler().

For Command Duration Limits policy 0xD:
The device shall complete the command without error with the additional
sense code set to DATA CURRENTLY UNAVAILABLE.

In order to handle this policy in libata, we intend to send a successful
command via SCSI EH, and let libata's ->eh_strategy_handler() fetch the
sense data for the good command. This is similar to how we handle an
aborted ATA command, just that we need to read the Successful NCQ
Commands log instead of the NCQ Command Error log.

When we get a SATA completion with successful commands, ATA_SENSE will
be set, indicating that some commands in the completion have sense data.

The sense_valid bitmask in the Sense Data for Successful NCQ Commands
log indicates exactly which commands have sense data, which might be a
subset of all the commands that were completed in the same completion.
(Yet all will have ATA_SENSE set, since the status is per completion.)

The successful commands that have e.g. a "DATA CURRENTLY UNAVAILABLE"
sense data will have a SCSI ML byte set, so scsi_eh_flush_done_q() will
not set the scmd->result to DID_TIME_OUT for these commands. However,
the successful commands that did not have sense data, must not get their
result marked as DID_TIME_OUT by SCSI EH.

Add a new flag SCMD_FORCE_EH_SUCCESS, which tells SCSI EH to not mark a
command as DID_TIME_OUT, even if it has scmd->result == SAM_STAT_GOOD.
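
The effect of the flag can be modelled outside the kernel. The sketch
below is a stand-alone approximation of the check in
scsi_eh_flush_done_q(); struct model_cmd and model_eh_finish() are
hypothetical names for illustration, while the constants mirror the
kernel's DID_TIME_OUT value and the flag bit added by this patch.

```c
#define DID_TIME_OUT		0x03
#define SCMD_FORCE_EH_SUCCESS	(1 << 3)

/* Hypothetical stand-in for the two struct scsi_cmnd fields involved. */
struct model_cmd {
	int result;
	int flags;
};

/*
 * Mark a command that went through EH as timed out, unless it already
 * carries a result or the LLD asked for it to complete successfully.
 */
static void model_eh_finish(struct model_cmd *cmd)
{
	if (!cmd->result && !(cmd->flags & SCMD_FORCE_EH_SUCCESS))
		cmd->result |= (DID_TIME_OUT << 16);
}
```

A command with result == 0 and no flag ends up with DID_TIME_OUT in its
host byte. The same command with SCMD_FORCE_EH_SUCCESS set keeps
result == 0, and a command whose result is already non-zero is left
untouched in both cases.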

This will be used by libata in a follow-up patch.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/scsi_error.c | 3 ++-
 include/scsi/scsi_cmnd.h  | 5 +++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 2aa2c2aee6e7..cf5ec5f5f4f6 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -2165,7 +2165,8 @@ void scsi_eh_flush_done_q(struct list_head *done_q)
 			 * scsi_eh_get_sense), scmd->result is already
 			 * set, do not set DID_TIME_OUT.
 			 */
-			if (!scmd->result)
+			if (!scmd->result &&
+			    !(scmd->flags & SCMD_FORCE_EH_SUCCESS))
 				scmd->result |= (DID_TIME_OUT << 16);
 			SCSI_LOG_ERROR_RECOVERY(3,
 				scmd_printk(KERN_INFO, scmd,
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index c2cb5f69635c..526def14e7fb 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -52,6 +52,11 @@ struct scsi_pointer {
 #define SCMD_TAGGED		(1 << 0)
 #define SCMD_INITIALIZED	(1 << 1)
 #define SCMD_LAST		(1 << 2)
+/*
+ * libata uses SCSI EH to fetch sense data for successful commands.
+ * SCSI EH should not overwrite scmd->result when SCMD_FORCE_EH_SUCCESS is set.
+ */
+#define SCMD_FORCE_EH_SUCCESS	(1 << 3)
 #define SCMD_FAIL_IF_RECOVERING	(1 << 4)
 /* flags preserved across unprep / reprep */
 #define SCMD_PRESERVED_FLAGS	(SCMD_INITIALIZED | SCMD_FAIL_IF_RECOVERING)
-- 
2.39.1



* [PATCH v3 04/18] scsi: rename and move get_scsi_ml_byte()
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (2 preceding siblings ...)
  2023-01-24 19:02 ` [PATCH v3 03/18] scsi: core: allow libata to complete successful commands via EH Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-24 19:32   ` Bart Van Assche
  2023-01-24 19:02 ` [PATCH v3 05/18] scsi: support retrieving sub-pages of mode pages Niklas Cassel
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC
  To: James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block, Niklas Cassel, Mike Christie

SCSI has two different getters:
- get_XXX_byte() (in scsi_cmnd.h) which takes a struct scsi_cmnd *, and
- XXX_byte() (in scsi.h) which takes a scmd->result.
The proper name for get_scsi_ml_byte() should thus be without the get_
prefix, as it takes a scmd->result. Rename the function to rectify this.
(This change was suggested by Mike Christie.)

Additionally, move get_scsi_ml_byte() to scsi_priv.h since both scsi_lib.c
and scsi_error.c will need to use this helper in a follow-up patch.
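
For readers unfamiliar with the layout of scmd->result, here is a small
stand-alone sketch of how the bytes are extracted. The ex_* helper names
are made up for the example; the shifts follow the helper moved by this
patch (ML byte in bits 8-15) and the usual placement of the status byte
(bits 0-7) and host byte (bits 16-23).

```c
#include <stdint.h>

/* SCSI status byte, e.g. SAM_STAT_CHECK_CONDITION (0x02). */
static inline uint8_t ex_status_byte(int result)
{
	return result & 0xff;
}

/* Mid-layer byte, same computation as scsi_ml_byte() in this patch. */
static inline uint8_t ex_ml_byte(int result)
{
	return (result >> 8) & 0xff;
}

/* Host byte, e.g. DID_TIME_OUT (0x03). */
static inline uint8_t ex_host_byte(int result)
{
	return (result >> 16) & 0xff;
}
```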

Cc: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi_lib.c  | 7 +------
 drivers/scsi/scsi_priv.h | 5 +++++
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index abe93ec8b7d0..b1dcd7eb831e 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -577,11 +577,6 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
 	return false;
 }
 
-static inline u8 get_scsi_ml_byte(int result)
-{
-	return (result >> 8) & 0xff;
-}
-
 /**
  * scsi_result_to_blk_status - translate a SCSI result code into blk_status_t
  * @result:	scsi error code
@@ -594,7 +589,7 @@ static blk_status_t scsi_result_to_blk_status(int result)
 	 * Check the scsi-ml byte first in case we converted a host or status
 	 * byte.
 	 */
-	switch (get_scsi_ml_byte(result)) {
+	switch (scsi_ml_byte(result)) {
 	case SCSIML_STAT_OK:
 		break;
 	case SCSIML_STAT_RESV_CONFLICT:
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index 96284a0e13fe..74324fba4281 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -29,6 +29,11 @@ enum scsi_ml_status {
 	SCSIML_STAT_TGT_FAILURE		= 0x04,	/* Permanent target failure */
 };
 
+static inline u8 scsi_ml_byte(int result)
+{
+	return (result >> 8) & 0xff;
+}
+
 /*
  * Scsi Error Handler Flags
  */
-- 
2.39.1



* [PATCH v3 05/18] scsi: support retrieving sub-pages of mode pages
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (3 preceding siblings ...)
  2023-01-24 19:02 ` [PATCH v3 04/18] scsi: rename and move get_scsi_ml_byte() Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-24 19:34   ` Bart Van Assche
  2023-01-24 19:02 ` [PATCH v3 06/18] scsi: support service action in scsi_report_opcode() Niklas Cassel
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC
  To: James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block, Niklas Cassel

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Allow scsi_mode_sense() to retrieve sub-pages of mode pages by adding
the subpage argument. Change all the current call sites to pass
subpage 0.
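
To show where the new argument ends up on the wire, here is a stand-alone
sketch that builds a MODE SENSE(10) CDB with a subpage code in byte 3,
which is the same byte the patch fills in via cmd[3]. The function name is
hypothetical, and the page/subpage pair used in the usage note is an
assumption about where SPC-6 places one of the CDL pages.

```c
#include <stdint.h>
#include <string.h>

#define EX_MODE_SENSE_10	0x5a

/*
 * Build a MODE SENSE(10) CDB.
 * @dbd:       non-zero to disable block descriptors (DBD bit, byte 1)
 * @page:      page code (6 bits, byte 2)
 * @subpage:   subpage code (byte 3)
 * @alloc_len: allocation length (bytes 7-8, big endian)
 */
static void ex_build_mode_sense10(uint8_t cdb[10], int dbd, uint8_t page,
				  uint8_t subpage, uint16_t alloc_len)
{
	memset(cdb, 0, 10);
	cdb[0] = EX_MODE_SENSE_10;
	cdb[1] = dbd ? 0x08 : 0x00;
	cdb[2] = page & 0x3f;
	cdb[3] = subpage;
	cdb[7] = alloc_len >> 8;
	cdb[8] = alloc_len & 0xff;
}
```

For example, ex_build_mode_sense10(cdb, 1, 0x0a, 0x07, 512) requests
subpage 0x07 of the control mode page (0x0a), which is assumed here to be
one of the CDL descriptor pages.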

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi_lib.c           | 4 +++-
 drivers/scsi/scsi_transport_sas.c | 2 +-
 drivers/scsi/sd.c                 | 9 ++++-----
 drivers/scsi/sr.c                 | 2 +-
 include/scsi/scsi_device.h        | 8 ++++----
 5 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index b1dcd7eb831e..e1a021dd4da2 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2143,6 +2143,7 @@ EXPORT_SYMBOL_GPL(scsi_mode_select);
  *	@sdev:	SCSI device to be queried
  *	@dbd:	set to prevent mode sense from returning block descriptors
  *	@modepage: mode page being requested
+ *	@subpage: sub-page of the mode page being requested
  *	@buffer: request buffer (may not be smaller than eight bytes)
  *	@len:	length of request buffer.
  *	@timeout: command timeout
@@ -2154,7 +2155,7 @@ EXPORT_SYMBOL_GPL(scsi_mode_select);
  *	Returns zero if successful, or a negative error number on failure
  */
 int
-scsi_mode_sense(struct scsi_device *sdev, int dbd, int modepage,
+scsi_mode_sense(struct scsi_device *sdev, int dbd, int modepage, int subpage,
 		  unsigned char *buffer, int len, int timeout, int retries,
 		  struct scsi_mode_data *data, struct scsi_sense_hdr *sshdr)
 {
@@ -2174,6 +2175,7 @@ scsi_mode_sense(struct scsi_device *sdev, int dbd, int modepage,
 	dbd = sdev->set_dbd_for_ms ? 8 : dbd;
 	cmd[1] = dbd & 0x18;	/* allows DBD and LLBA bits */
 	cmd[2] = modepage;
+	cmd[3] = subpage;
 
 	sshdr = exec_args.sshdr;
 
diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index 74b99f2b0b74..d704c484a251 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -1245,7 +1245,7 @@ int sas_read_port_mode_page(struct scsi_device *sdev)
 	if (!buffer)
 		return -ENOMEM;
 
-	error = scsi_mode_sense(sdev, 1, 0x19, buffer, BUF_SIZE, 30*HZ, 3,
+	error = scsi_mode_sense(sdev, 1, 0x19, 0, buffer, BUF_SIZE, 30*HZ, 3,
 				&mode_data, NULL);
 
 	if (error)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 2aa3b0393b96..7582e02a8d5a 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -184,7 +184,7 @@ cache_type_store(struct device *dev, struct device_attribute *attr,
 		return count;
 	}
 
-	if (scsi_mode_sense(sdp, 0x08, 8, buffer, sizeof(buffer), SD_TIMEOUT,
+	if (scsi_mode_sense(sdp, 0x08, 8, 0, buffer, sizeof(buffer), SD_TIMEOUT,
 			    sdkp->max_retries, &data, NULL))
 		return -EINVAL;
 	len = min_t(size_t, sizeof(buffer), data.length - data.header_length -
@@ -2616,9 +2616,8 @@ sd_do_mode_sense(struct scsi_disk *sdkp, int dbd, int modepage,
 	if (sdkp->device->use_10_for_ms && len < 8)
 		len = 8;
 
-	return scsi_mode_sense(sdkp->device, dbd, modepage, buffer, len,
-			       SD_TIMEOUT, sdkp->max_retries, data,
-			       sshdr);
+	return scsi_mode_sense(sdkp->device, dbd, modepage, 0, buffer, len,
+			       SD_TIMEOUT, sdkp->max_retries, data, sshdr);
 }
 
 /*
@@ -2875,7 +2874,7 @@ static void sd_read_app_tag_own(struct scsi_disk *sdkp, unsigned char *buffer)
 	if (sdkp->protection_type == 0)
 		return;
 
-	res = scsi_mode_sense(sdp, 1, 0x0a, buffer, 36, SD_TIMEOUT,
+	res = scsi_mode_sense(sdp, 1, 0x0a, 0, buffer, 36, SD_TIMEOUT,
 			      sdkp->max_retries, &data, &sshdr);
 
 	if (res < 0 || !data.header_length ||
diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 9e51dcd30bfd..09fdb0e269d9 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -830,7 +830,7 @@ static int get_capabilities(struct scsi_cd *cd)
 	scsi_test_unit_ready(cd->device, SR_TIMEOUT, MAX_RETRIES, &sshdr);
 
 	/* ask for mode page 0x2a */
-	rc = scsi_mode_sense(cd->device, 0, 0x2a, buffer, ms_len,
+	rc = scsi_mode_sense(cd->device, 0, 0x2a, 0, buffer, ms_len,
 			     SR_TIMEOUT, 3, &data, NULL);
 
 	if (rc < 0 || data.length > ms_len ||
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 7e95ec45138f..15e005982032 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -419,10 +419,10 @@ extern int scsi_track_queue_full(struct scsi_device *, int);
 
 extern int scsi_set_medium_removal(struct scsi_device *, char);
 
-extern int scsi_mode_sense(struct scsi_device *sdev, int dbd, int modepage,
-			   unsigned char *buffer, int len, int timeout,
-			   int retries, struct scsi_mode_data *data,
-			   struct scsi_sense_hdr *);
+int scsi_mode_sense(struct scsi_device *sdev, int dbd, int modepage,
+		    int subpage, unsigned char *buffer, int len, int timeout,
+		    int retries, struct scsi_mode_data *data,
+		    struct scsi_sense_hdr *);
 extern int scsi_mode_select(struct scsi_device *sdev, int pf, int sp,
 			    unsigned char *buffer, int len, int timeout,
 			    int retries, struct scsi_mode_data *data,
-- 
2.39.1



* [PATCH v3 06/18] scsi: support service action in scsi_report_opcode()
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (4 preceding siblings ...)
  2023-01-24 19:02 ` [PATCH v3 05/18] scsi: support retrieving sub-pages of mode pages Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-24 19:36   ` Bart Van Assche
  2023-01-24 19:02 ` [PATCH v3 07/18] scsi: sd: detect support for command duration limits Niklas Cassel
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC
  To: James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block, Niklas Cassel

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

The REPORT_SUPPORTED_OPERATION_CODES command allows checking for support
of commands that have the same opcode but different service actions,
such as READ 32 and WRITE 32. However, the current implementation of
scsi_report_opcode() only allows checking an operation code without a
service action differentiation.

Add the "sa" argument to scsi_report_opcode() to allow passing a service
action. If a non-zero service action is specified, the reporting
options field value is set to 3 to have the service action field taken
into account by the device. If no service action field is specified
(zero), the reporting options field is set to 1 as before.
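
As an illustration only, the CDB layout described above can be sketched in
standalone C. The helper name rsoc_build_cdb() is hypothetical and is not part
of this patch; the opcode values are redefined locally so the sketch builds
outside the kernel tree.

```c
#include <stdint.h>
#include <string.h>

/* Values redefined locally for this standalone sketch. */
#define MAINTENANCE_IN				0xa3
#define MI_REPORT_SUPPORTED_OPERATION_CODES	0x0c

/*
 * Hypothetical helper: fill a 12-byte MAINTENANCE IN CDB for
 * REPORT SUPPORTED OPERATION CODES, mirroring the logic of the patch.
 */
void rsoc_build_cdb(uint8_t cdb[12], uint8_t opcode, uint16_t sa,
		    uint32_t alloc_len)
{
	memset(cdb, 0, 12);
	cdb[0] = MAINTENANCE_IN;
	cdb[1] = MI_REPORT_SUPPORTED_OPERATION_CODES;
	/* Reporting options: 1 = opcode only, 3 = opcode + service action */
	cdb[2] = sa ? 3 : 1;
	cdb[3] = opcode;
	if (sa) {
		/* Requested service action, big endian */
		cdb[4] = (uint8_t)(sa >> 8);
		cdb[5] = (uint8_t)(sa & 0xff);
	}
	/* Allocation length, big endian */
	cdb[6] = (uint8_t)(alloc_len >> 24);
	cdb[7] = (uint8_t)(alloc_len >> 16);
	cdb[8] = (uint8_t)(alloc_len >> 8);
	cdb[9] = (uint8_t)(alloc_len & 0xff);
}
```

For example, querying READ 32 would pass opcode 0x7f (VARIABLE_LENGTH_CMD) with
service action 0x0009, which selects reporting options 3, while querying
READ 16 (opcode 0x88) with sa set to 0 keeps reporting options 1.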

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/scsi.c        | 28 +++++++++++++++++++---------
 drivers/scsi/sd.c          | 10 +++++-----
 include/scsi/scsi_device.h |  5 +++--
 3 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index 00ee47a04403..579c3153b9f3 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -494,18 +494,22 @@ void scsi_attach_vpd(struct scsi_device *sdev)
 }
 
 /**
- * scsi_report_opcode - Find out if a given command opcode is supported
+ * scsi_report_opcode - Find out if a given command is supported
  * @sdev:	scsi device to query
  * @buffer:	scratch buffer (must be at least 20 bytes long)
  * @len:	length of buffer
- * @opcode:	opcode for command to look up
- *
- * Uses the REPORT SUPPORTED OPERATION CODES to look up the given
- * opcode. Returns -EINVAL if RSOC fails, 0 if the command opcode is
- * unsupported and 1 if the device claims to support the command.
+ * @opcode:	opcode for the command to look up
+ * @sa:		service action for the command to look up
+ *
+ * Uses the REPORT SUPPORTED OPERATION CODES to check support for the
+ * command identified with @opcode and @sa. If the command does not
+ * have a service action, @sa must be 0. Returns -EINVAL if RSOC fails,
+ * 0 if the command is not supported and 1 if the device claims to
+ * support the command.
  */
 int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer,
-		       unsigned int len, unsigned char opcode)
+		       unsigned int len, unsigned char opcode,
+		       unsigned short sa)
 {
 	unsigned char cmd[16];
 	struct scsi_sense_hdr sshdr;
@@ -529,8 +533,14 @@ int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer,
 	memset(cmd, 0, 16);
 	cmd[0] = MAINTENANCE_IN;
 	cmd[1] = MI_REPORT_SUPPORTED_OPERATION_CODES;
-	cmd[2] = 1;		/* One command format */
-	cmd[3] = opcode;
+	if (!sa) {
+		cmd[2] = 1;	/* One command format */
+		cmd[3] = opcode;
+	} else {
+		cmd[2] = 3;	/* One command format with service action */
+		cmd[3] = opcode;
+		put_unaligned_be16(sa, &cmd[4]);
+	}
 	put_unaligned_be32(request_len, &cmd[6]);
 	memset(buffer, 0, len);
 
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7582e02a8d5a..45945bfeee92 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3058,7 +3058,7 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
 		return;
 	}
 
-	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY) < 0) {
+	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, INQUIRY, 0) < 0) {
 		struct scsi_vpd *vpd;
 
 		sdev->no_report_opcodes = 1;
@@ -3074,10 +3074,10 @@ static void sd_read_write_same(struct scsi_disk *sdkp, unsigned char *buffer)
 		rcu_read_unlock();
 	}
 
-	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME_16) == 1)
+	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME_16, 0) == 1)
 		sdkp->ws16 = 1;
 
-	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME) == 1)
+	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE, WRITE_SAME, 0) == 1)
 		sdkp->ws10 = 1;
 }
 
@@ -3089,9 +3089,9 @@ static void sd_read_security(struct scsi_disk *sdkp, unsigned char *buffer)
 		return;
 
 	if (scsi_report_opcode(sdev, buffer, SD_BUF_SIZE,
-			SECURITY_PROTOCOL_IN) == 1 &&
+			SECURITY_PROTOCOL_IN, 0) == 1 &&
 	    scsi_report_opcode(sdev, buffer, SD_BUF_SIZE,
-			SECURITY_PROTOCOL_OUT) == 1)
+			SECURITY_PROTOCOL_OUT, 0) == 1)
 		sdkp->security = 1;
 }
 
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 15e005982032..8978c2a58702 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -431,8 +431,9 @@ extern int scsi_test_unit_ready(struct scsi_device *sdev, int timeout,
 				int retries, struct scsi_sense_hdr *sshdr);
 extern int scsi_get_vpd_page(struct scsi_device *, u8 page, unsigned char *buf,
 			     int buf_len);
-extern int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer,
-			      unsigned int len, unsigned char opcode);
+int scsi_report_opcode(struct scsi_device *sdev, unsigned char *buffer,
+		       unsigned int len, unsigned char opcode,
+		       unsigned short sa);
 extern int scsi_device_set_state(struct scsi_device *sdev,
 				 enum scsi_device_state state);
 extern struct scsi_event *sdev_evt_alloc(enum scsi_device_event evt_type,
-- 
2.39.1



* [PATCH v3 07/18] scsi: sd: detect support for command duration limits
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (5 preceding siblings ...)
  2023-01-24 19:02 ` [PATCH v3 06/18] scsi: support service action in scsi_report_opcode() Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-24 19:39   ` Bart Van Assche
  2023-01-27 13:00   ` Hannes Reinecke
  2023-01-24 19:02 ` [PATCH v3 08/18] scsi: sd: set read/write commands CDL index Niklas Cassel
                   ` (10 subsequent siblings)
  17 siblings, 2 replies; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC (permalink / raw)
  To: James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block, Niklas Cassel

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Detect if a disk supports command duration limits. Support for
the READ 16, WRITE 16, READ 32 and WRITE 32 commands is tested using
the function scsi_report_opcode(). For a disk supporting command
duration limits, the rwcdlp and cdlp bits reported for each command
identify the mode page holding the command duration limits descriptors
that apply to that command.
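
As a rough standalone sketch of the decoding this patch performs in
sd_cdl_check_cmd_support(), the rwcdlp and cdlp bits of the one command format
reply can be mapped to a page as follows. The enum and function names here are
hypothetical stand-ins for illustration; the bit positions are the ones used by
the patch.

```c
#include <stdint.h>

/* Hypothetical stand-in for the sd_cdlp enum added by this patch. */
enum cdl_page {
	CDL_PAGE_A,
	CDL_PAGE_B,
	CDL_PAGE_T2A,
	CDL_PAGE_T2B,
	CDL_PAGE_NONE,
};

/*
 * Decode the CDL page from the first two bytes of the one command
 * format reply: rwcdlp is bit 0 of byte 0, cdlp is bits 4:3 of byte 1.
 */
enum cdl_page cdl_page_from_rsoc(const uint8_t *buf)
{
	uint8_t rwcdlp = buf[0] & 0x01;
	uint8_t cdlp = (buf[1] & 0x18) >> 3;

	switch (cdlp) {
	case 0x01:
		return rwcdlp ? CDL_PAGE_T2A : CDL_PAGE_A;
	case 0x02:
		return rwcdlp ? CDL_PAGE_T2B : CDL_PAGE_B;
	default:
		return CDL_PAGE_NONE;
	}
}
```

Any other cdlp value is treated as "no duration limits page" for the command,
matching the fallthrough to SD_CDLP_NONE in the patch.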

Support for duration limits is advertised through sysfs using the new
"duration_limits" sysfs sub-directory of the generic device directory,
that is, /sys/block/sdX/device/duration_limits. Within this new
directory, the limit descriptors that apply to read and write operations
are exposed within the read and write directories, with descriptor
attributes grouped together in directories. The overall sysfs structure
created is:

/sys/block/sde/device/duration_limits/
├── perf_vs_duration_guideline
├── read
│   ├── 1
│   │   ├── duration_guideline
│   │   ├── duration_guideline_policy
│   │   ├── max_active_time
│   │   ├── max_active_time_policy
│   │   ├── max_inactive_time
│   │   └── max_inactive_time_policy
│   ├── 2
│   │   ├── duration_guideline
...
│   └── page
└── write
    ├── 1
    │   ├── duration_guideline
    │   ├── duration_guideline_policy
...

For each of the read and write descriptor directories, the page
attribute file indicates the command duration limit page providing the
descriptors. The possible values for the page attribute are "A", "B",
"T2A" and "T2B".

The new "duration_limits" attributes directory is added only for disks
that support command duration limits.
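
For illustration, a user space program could locate and read a descriptor
attribute from this hierarchy along the following lines. This is a minimal
sketch, assuming the directory layout shown above; the helper names
cdl_attr_path() and cdl_read_attr() are hypothetical and not provided by this
series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the sysfs path of a descriptor attribute,
 * e.g. /sys/block/sde/device/duration_limits/read/1/duration_guideline.
 * Returns 0 on success, -1 on invalid index or truncated path.
 */
int cdl_attr_path(char *out, size_t len, const char *disk, bool write,
		  int desc, const char *attr)
{
	int n;

	/* Descriptor directories are numbered 1 to 7 */
	if (desc < 1 || desc > 7)
		return -1;

	n = snprintf(out, len, "/sys/block/%s/device/duration_limits/%s/%d/%s",
		     disk, write ? "write" : "read", desc, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: read the first line of an attribute file into
 * val, without the trailing newline. Returns 0 on success, -1 on error.
 */
int cdl_read_attr(const char *path, char *val, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(val, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	val[strcspn(val, "\n")] = '\0';
	return 0;
}
```

On a disk without command duration limits support the directory does not
exist, so cdl_read_attr() simply fails; a caller can treat that failure as
"CDL not supported".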

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 drivers/scsi/Makefile |   2 +-
 drivers/scsi/sd.c     |   2 +
 drivers/scsi/sd.h     |  61 ++++
 drivers/scsi/sd_cdl.c | 764 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 828 insertions(+), 1 deletion(-)
 create mode 100644 drivers/scsi/sd_cdl.c

diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index f055bfd54a68..0e48cb6d21d6 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -170,7 +170,7 @@ scsi_mod-$(CONFIG_BLK_DEV_BSG)	+= scsi_bsg.o
 
 hv_storvsc-y			:= storvsc_drv.o
 
-sd_mod-objs	:= sd.o
+sd_mod-objs	:= sd.o sd_cdl.o
 sd_mod-$(CONFIG_BLK_DEV_INTEGRITY) += sd_dif.o
 sd_mod-$(CONFIG_BLK_DEV_ZONED) += sd_zbc.o
 
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 45945bfeee92..7879a5470773 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3326,6 +3326,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
 		sd_read_write_same(sdkp, buffer);
 		sd_read_security(sdkp, buffer);
 		sd_config_protection(sdkp);
+		sd_read_cdl(sdkp, buffer);
 	}
 
 	/*
@@ -3646,6 +3647,7 @@ static void scsi_disk_release(struct device *dev)
 
 	ida_free(&sd_index_ida, sdkp->index);
 	sd_zbc_free_zone_info(sdkp);
+	sd_cdl_release(sdkp);
 	put_device(&sdkp->device->sdev_gendev);
 	free_opal_dev(sdkp->opal_dev);
 
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 5eea762f84d1..e60d33bd222a 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -81,6 +81,62 @@ struct zoned_disk_info {
 	u32		zone_blocks;
 };
 
+/*
+ * Command duration limits sub-pages for the control mode page 0Ah.
+ */
+enum sd_cdlp {
+	SD_CDLP_A,
+	SD_CDLP_B,
+	SD_CDLP_T2A,
+	SD_CDLP_T2B,
+	SD_CDLP_NONE,
+
+	SD_CDL_MAX_PAGES = SD_CDLP_NONE,
+};
+
+enum sd_cdl_cmd {
+	SD_CDL_READ_16,
+	SD_CDL_WRITE_16,
+	SD_CDL_READ_32,
+	SD_CDL_WRITE_32,
+
+	SD_CDL_CMD_MAX,
+};
+
+enum sd_cdl_rw {
+	SD_CDL_READ,
+	SD_CDL_WRITE,
+	SD_CDL_RW,
+};
+
+struct sd_cdl_desc {
+	struct kobject	kobj;
+	u64		max_inactive_time;
+	u64		max_active_time;
+	u64		duration;
+	u8		max_inactive_policy;
+	u8		max_active_policy;
+	u8		duration_policy;
+	u8		cdlp;
+};
+
+#define SD_CDL_MAX_DESC		7
+
+struct sd_cdl_page {
+	struct kobject		kobj;
+	bool			sysfs_registered;
+	enum sd_cdl_rw		rw;
+	enum sd_cdlp		cdlp;
+	struct sd_cdl_desc      descs[SD_CDL_MAX_DESC];
+};
+
+struct sd_cdl {
+	struct kobject		kobj;
+	bool			sysfs_registered;
+	u8			perf_vs_duration_guideline;
+	struct sd_cdl_page	pages[SD_CDL_RW];
+};
+
 struct scsi_disk {
 	struct scsi_device *device;
 
@@ -131,6 +187,7 @@ struct scsi_disk {
 	u8		provisioning_mode;
 	u8		zeroing_mode;
 	u8		nr_actuators;		/* Number of actuators */
+	struct sd_cdl	*cdl;
 	unsigned	ATO : 1;	/* state of disk ATO bit */
 	unsigned	cache_override : 1; /* temp override of WCE,RCD */
 	unsigned	WCE : 1;	/* state of disk WCE bit */
@@ -295,4 +352,8 @@ static inline blk_status_t sd_zbc_prepare_zone_append(struct scsi_cmnd *cmd,
 void sd_print_sense_hdr(struct scsi_disk *sdkp, struct scsi_sense_hdr *sshdr);
 void sd_print_result(const struct scsi_disk *sdkp, const char *msg, int result);
 
+/* Command duration limits support (in sd_cdl.c) */
+void sd_read_cdl(struct scsi_disk *sdkp, unsigned char *buf);
+void sd_cdl_release(struct scsi_disk *sdkp);
+
 #endif /* _SCSI_DISK_H */
diff --git a/drivers/scsi/sd_cdl.c b/drivers/scsi/sd_cdl.c
new file mode 100644
index 000000000000..513cd989f19a
--- /dev/null
+++ b/drivers/scsi/sd_cdl.c
@@ -0,0 +1,764 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * SCSI Command Duration Limits (CDL)
+ *
+ * Copyright (C) 2021 Western Digital Corporation or its affiliates.
+ */
+#include <linux/vmalloc.h>
+#include <linux/mutex.h>
+
+#include <asm/unaligned.h>
+
+#include <scsi/scsi.h>
+#include <scsi/scsi_cmnd.h>
+
+#include "sd.h"
+
+/*
+ * Command duration limits sub-pages for the control mode page 0Ah.
+ */
+static const struct sd_cdlp_info {
+	u8	subpage;
+	char	*name;
+} cdl_page[SD_CDL_MAX_PAGES + 1] = {
+	{ 0x03,	"A"	},
+	{ 0x04,	"B"	},
+	{ 0x07,	"T2A"	},
+	{ 0x08,	"T2B"	},
+	{ 0x00,	"none"	},
+};
+
+static const struct sd_cdl_cmd_info {
+	u8	opcode;
+	u16	sa;
+	char	*name;
+} cdl_cmd[SD_CDL_CMD_MAX] = {
+	{ READ_16,		0,		"READ_16"	},
+	{ WRITE_16,		0,		"WRITE_16"	},
+	{ VARIABLE_LENGTH_CMD,	READ_32,	"READ_32"	},
+	{ VARIABLE_LENGTH_CMD,	WRITE_32,	"WRITE_32"	},
+};
+
+static const char *sd_cdl_perf_name(u8 val)
+{
+	switch (val) {
+	case 0x00:
+		return "0";
+	case 0x01:
+		return "0.5";
+	case 0x02:
+		return "1.0";
+	case 0x03:
+		return "1.5";
+	case 0x04:
+		return "2.0";
+	case 0x05:
+		return "2.5";
+	case 0x06:
+		return "3";
+	case 0x07:
+		return "4";
+	case 0x08:
+		return "5";
+	case 0x09:
+		return "8";
+	case 0x0A:
+		return "10";
+	case 0x0B:
+		return "15";
+	case 0x0C:
+		return "20";
+	default:
+		return "?";
+	}
+}
+
+static const char *sd_cdl_policy_name(u8 policy)
+{
+	switch (policy) {
+	case 0x00:
+		return "complete-earliest";
+	case 0x01:
+		return "continue-next-limit";
+	case 0x02:
+		return "continue-no-limit";
+	case 0x0d:
+		return "complete-unavailable";
+	case 0x0e:
+		return "abort-recovery";
+	case 0x0f:
+		return "abort";
+	default:
+		return "?";
+	}
+}
+
+/*
+ * Command duration limits descriptors sysfs plumbing.
+ */
+struct sd_cdl_desc_sysfs_entry {
+	struct attribute attr;
+	ssize_t (*show)(struct sd_cdl_desc *desc, char *buf);
+};
+
+#define CDL_DESC_ATTR_RO(_name)	\
+	static struct sd_cdl_desc_sysfs_entry				\
+	cdl_desc_##_name##_entry = {					\
+		.attr	= { .name = __stringify(_name), .mode = 0444 },	\
+		.show	= cdl_desc_##_name##_show,			\
+	}
+
+static ssize_t cdl_desc_max_inactive_time_show(struct sd_cdl_desc *desc,
+					       char *buf)
+{
+	return sysfs_emit(buf, "%llu\n", desc->max_inactive_time);
+}
+CDL_DESC_ATTR_RO(max_inactive_time);
+
+static ssize_t cdl_desc_max_inactive_time_policy_show(struct sd_cdl_desc *desc,
+						      char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			sd_cdl_policy_name(desc->max_inactive_policy));
+}
+CDL_DESC_ATTR_RO(max_inactive_time_policy);
+
+static ssize_t cdl_desc_max_active_time_show(struct sd_cdl_desc *desc,
+					     char *buf)
+{
+	return sysfs_emit(buf, "%llu\n", desc->max_active_time);
+}
+CDL_DESC_ATTR_RO(max_active_time);
+
+static ssize_t cdl_desc_max_active_time_policy_show(struct sd_cdl_desc *desc,
+						    char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			sd_cdl_policy_name(desc->max_active_policy));
+}
+CDL_DESC_ATTR_RO(max_active_time_policy);
+
+static ssize_t cdl_desc_duration_guideline_show(struct sd_cdl_desc *desc,
+						char *buf)
+{
+	return sysfs_emit(buf, "%llu\n", desc->duration);
+}
+CDL_DESC_ATTR_RO(duration_guideline);
+
+static ssize_t cdl_desc_duration_guideline_policy_show(struct sd_cdl_desc *desc,
+						       char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+		sd_cdl_policy_name(desc->duration_policy));
+}
+CDL_DESC_ATTR_RO(duration_guideline_policy);
+
+static umode_t sd_cdl_desc_attr_visible(struct kobject *kobj,
+					struct attribute *attr, int n)
+{
+	struct sd_cdl_desc *desc = container_of(kobj, struct sd_cdl_desc, kobj);
+
+	/*
+	 * Descriptors in pages A and B only have the duration guideline
+	 * field.
+	 */
+	if ((desc->cdlp == SD_CDLP_A || desc->cdlp == SD_CDLP_B) &&
+	    (attr != &cdl_desc_duration_guideline_entry.attr))
+		return 0;
+
+	return attr->mode;
+}
+
+static struct attribute *sd_cdl_desc_attrs[] = {
+	&cdl_desc_max_inactive_time_entry.attr,
+	&cdl_desc_max_inactive_time_policy_entry.attr,
+	&cdl_desc_max_active_time_entry.attr,
+	&cdl_desc_max_active_time_policy_entry.attr,
+	&cdl_desc_duration_guideline_entry.attr,
+	&cdl_desc_duration_guideline_policy_entry.attr,
+	NULL,
+};
+
+static const struct attribute_group sd_cdl_desc_group = {
+	.attrs = sd_cdl_desc_attrs,
+	.is_visible = sd_cdl_desc_attr_visible,
+};
+__ATTRIBUTE_GROUPS(sd_cdl_desc);
+
+static ssize_t sd_cdl_desc_sysfs_show(struct kobject *kobj,
+				      struct attribute *attr, char *buf)
+{
+	struct sd_cdl_desc_sysfs_entry *entry =
+		container_of(attr, struct sd_cdl_desc_sysfs_entry, attr);
+	struct sd_cdl_desc *desc = container_of(kobj, struct sd_cdl_desc, kobj);
+
+	return entry->show(desc, buf);
+}
+
+static const struct sysfs_ops sd_cdl_desc_sysfs_ops = {
+	.show	= sd_cdl_desc_sysfs_show,
+};
+
+static void sd_cdl_sysfs_nop_release(struct kobject *kobj) { }
+
+static struct kobj_type sd_cdl_desc_ktype = {
+	.sysfs_ops	= &sd_cdl_desc_sysfs_ops,
+	.default_groups	= sd_cdl_desc_groups,
+	.release	= sd_cdl_sysfs_nop_release,
+};
+
+/*
+ * Duration limits page sysfs plumbing.
+ */
+struct sd_cdl_page_sysfs_entry {
+	struct attribute attr;
+	ssize_t (*show)(struct sd_cdl_page *page, char *buf);
+};
+
+#define CDL_PAGE_ATTR_RO(_name)	\
+	static struct sd_cdl_page_sysfs_entry				\
+	cdl_page_##_name##_entry = {					\
+		.attr	= { .name = __stringify(_name), .mode = 0444 },	\
+		.show	= cdl_page_##_name##_show,			\
+	}
+
+static ssize_t cdl_page_page_show(struct sd_cdl_page *page, char *buf)
+{
+	return sysfs_emit(buf, "%s\n", cdl_page[page->cdlp].name);
+}
+CDL_PAGE_ATTR_RO(page);
+
+static struct attribute *sd_cdl_page_attrs[] = {
+	&cdl_page_page_entry.attr,
+	NULL,
+};
+
+static const struct attribute_group sd_cdl_page_group = {
+	.attrs = sd_cdl_page_attrs,
+};
+__ATTRIBUTE_GROUPS(sd_cdl_page);
+
+static ssize_t sd_cdl_page_sysfs_show(struct kobject *kobj,
+				      struct attribute *attr, char *buf)
+{
+	struct sd_cdl_page_sysfs_entry *entry =
+		container_of(attr, struct sd_cdl_page_sysfs_entry, attr);
+	struct sd_cdl_page *page = container_of(kobj, struct sd_cdl_page, kobj);
+
+	return entry->show(page, buf);
+}
+
+static const struct sysfs_ops sd_cdl_page_sysfs_ops = {
+	.show	= sd_cdl_page_sysfs_show,
+};
+
+static struct kobj_type sd_cdl_page_ktype = {
+	.sysfs_ops	= &sd_cdl_page_sysfs_ops,
+	.default_groups	= sd_cdl_page_groups,
+	.release	= sd_cdl_sysfs_nop_release,
+};
+
+static void sd_cdl_sysfs_unregister_page(struct sd_cdl_page *page)
+{
+	int i;
+
+	for (i = 0; i < SD_CDL_MAX_DESC; i++) {
+		if (page->sysfs_registered)
+			kobject_del(&page->descs[i].kobj);
+		kobject_put(&page->descs[i].kobj);
+	}
+	if (page->sysfs_registered)
+		kobject_del(&page->kobj);
+	kobject_put(&page->kobj);
+
+	page->cdlp = SD_CDLP_NONE;
+	page->sysfs_registered = false;
+}
+
+static int sd_cdl_sysfs_register_page(struct scsi_disk *sdkp,
+				      struct sd_cdl_page *page)
+{
+	int i, ret;
+
+	/*
+	 * If the page is already registered, the updated page descriptors
+	 * are already exported.
+	 */
+	if (page->sysfs_registered)
+		return 0;
+
+	ret = kobject_add(&page->kobj, &sdkp->cdl->kobj,
+			  "%s", page->rw ? "write" : "read");
+	if (ret) {
+		kobject_put(&page->kobj);
+		return ret;
+	}
+
+	for (i = 0; i < SD_CDL_MAX_DESC; i++) {
+		ret = kobject_add(&page->descs[i].kobj, &page->kobj,
+				  "%d", i + 1);
+		if (ret) {
+			int j;
+
+			kobject_put(&page->descs[i].kobj);
+			for (j = 0; j < SD_CDL_MAX_DESC; j++) {
+				if (j < i)
+					kobject_del(&page->descs[j].kobj);
+				kobject_put(&page->descs[j].kobj);
+			}
+			kobject_del(&page->kobj);
+			kobject_put(&page->kobj);
+			return ret;
+		}
+	}
+
+	page->sysfs_registered = true;
+
+	return 0;
+}
+
+/*
+ * Command duration limits sysfs plumbing, top level (duration limits directory
+ * under the "device" sysfs directory).
+ */
+struct sd_cdl_sysfs_entry {
+	struct attribute attr;
+	ssize_t (*show)(struct sd_cdl *cdl, char *buf);
+};
+
+#define CDL_ATTR_RO(_name)	\
+	static struct sd_cdl_sysfs_entry cdl_##_name##_entry = {	\
+		.attr	= { .name = __stringify(_name), .mode = 0444 },	\
+		.show	= cdl_##_name##_show,				\
+	}
+
+static ssize_t cdl_perf_vs_duration_guideline_show(struct sd_cdl *cdl,
+						   char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  sd_cdl_perf_name(cdl->perf_vs_duration_guideline));
+}
+CDL_ATTR_RO(perf_vs_duration_guideline);
+
+static struct attribute *sd_cdl_attrs[] = {
+	&cdl_perf_vs_duration_guideline_entry.attr,
+	NULL,
+};
+
+static umode_t sd_cdl_attr_visible(struct kobject *kobj,
+				   struct attribute *attr, int n)
+{
+	struct sd_cdl *cdl = container_of(kobj, struct sd_cdl, kobj);
+
+	/* perf_vs_duration_guideline exists only if page T2A is supported */
+	if (attr == &cdl_perf_vs_duration_guideline_entry.attr &&
+	    cdl->pages[SD_CDL_READ].cdlp != SD_CDLP_T2A &&
+	    cdl->pages[SD_CDL_WRITE].cdlp != SD_CDLP_T2A)
+		return 0;
+
+	return attr->mode;
+}
+
+static const struct attribute_group sd_cdl_group = {
+	.attrs		= sd_cdl_attrs,
+	.is_visible	= sd_cdl_attr_visible,
+};
+__ATTRIBUTE_GROUPS(sd_cdl);
+
+static ssize_t sd_cdl_sysfs_show(struct kobject *kobj,
+				 struct attribute *attr, char *page)
+{
+	struct sd_cdl_sysfs_entry *entry =
+		container_of(attr, struct sd_cdl_sysfs_entry, attr);
+	struct sd_cdl *cdl = container_of(kobj, struct sd_cdl, kobj);
+
+	return entry->show(cdl, page);
+}
+
+static const struct sysfs_ops sd_cdl_sysfs_ops = {
+	.show	= sd_cdl_sysfs_show,
+};
+
+static void sd_cdl_sysfs_release(struct kobject *kobj)
+{
+	struct sd_cdl *cdl = container_of(kobj, struct sd_cdl, kobj);
+
+	kfree(cdl);
+}
+
+static struct kobj_type sd_cdl_ktype = {
+	.sysfs_ops	= &sd_cdl_sysfs_ops,
+	.default_groups	= sd_cdl_groups,
+	.release	= sd_cdl_sysfs_release,
+};
+
+static void sd_cdl_sysfs_unregister(struct scsi_disk *sdkp)
+{
+	struct sd_cdl *cdl = NULL;
+	int i;
+
+	swap(sdkp->cdl, cdl);
+	if (!cdl)
+		return;
+
+	if (!cdl->sysfs_registered) {
+		kfree(cdl);
+		return;
+	}
+
+	for (i = 0; i < SD_CDL_RW; i++) {
+		if (cdl->pages[i].sysfs_registered)
+			sd_cdl_sysfs_unregister_page(&cdl->pages[i]);
+	}
+
+	kobject_del(&cdl->kobj);
+	kobject_put(&cdl->kobj);
+}
+
+static void sd_cdl_sysfs_register(struct scsi_disk *sdkp)
+{
+	struct scsi_device *sdev = sdkp->device;
+	struct sd_cdl *cdl = sdkp->cdl;
+	struct sd_cdl_page *page;
+	int i, ret;
+
+	if (!cdl->sysfs_registered) {
+		ret = kobject_add(&cdl->kobj, &sdev->sdev_gendev.kobj,
+				  "duration_limits");
+		if (ret) {
+			kobject_put(&cdl->kobj);
+			goto unregister;
+		}
+
+		cdl->sysfs_registered = true;
+	}
+
+	/* Check if the pages changed */
+	for (i = 0; i < SD_CDL_RW; i++) {
+		page = &cdl->pages[i];
+		if (page->cdlp == SD_CDLP_NONE) {
+			sd_cdl_sysfs_unregister_page(page);
+			continue;
+		}
+
+		ret = sd_cdl_sysfs_register_page(sdkp, page);
+		if (ret) {
+			page->cdlp = SD_CDLP_NONE;
+			goto unregister;
+		}
+	}
+
+	return;
+
+unregister:
+	sd_cdl_sysfs_unregister(sdkp);
+}
+
+/*
+ * CDL pages A and B time limits in microseconds.
+ */
+static u64 sd_cdl_time(u8 *buf, u8 cdlunit)
+{
+	u64 val = get_unaligned_be16(buf);
+	u64 factor;
+
+	switch (cdlunit) {
+	case 0x00:
+		return 0;
+	case 0x04:
+		/* 1 microsecond */
+		factor = 1;
+		break;
+	case 0x05:
+		/* 10 milliseconds */
+		factor = 10ULL * USEC_PER_MSEC;
+		break;
+	case 0x06:
+		/* 500 milliseconds */
+		factor = 500ULL * USEC_PER_MSEC;
+		break;
+	default:
+		return 0;
+	}
+
+	return val * factor;
+}
+
+/*
+ * CDL pages T2A and T2B time limits in microseconds.
+ */
+static u64 sd_cdl_t2time(u8 *buf, u8 t2cdlunits)
+{
+	u64 val = get_unaligned_be16(buf);
+	u64 factor;
+
+	switch (t2cdlunits) {
+	case 0x00:
+		return 0;
+	case 0x06:
+		/* 500 nanoseconds */
+		factor = 500;
+		break;
+	case 0x08:
+		/* 1 microsecond */
+		factor = NSEC_PER_USEC;
+		break;
+	case 0x0A:
+		/* 10 milliseconds */
+		factor = 10ULL * NSEC_PER_MSEC;
+		break;
+	case 0x0E:
+		/* 500 milliseconds */
+		factor = 500ULL * NSEC_PER_MSEC;
+		break;
+	default:
+		return 0;
+	}
+
+	val *= factor;
+	do_div(val, NSEC_PER_USEC);
+
+	return val;
+}
+
+static int sd_cdl_read_page(struct scsi_disk *sdkp, struct sd_cdl_page *page,
+			    unsigned char *buf)
+{
+	struct sd_cdl *cdl = sdkp->cdl;
+	struct sd_cdl_desc *desc = &page->descs[0];
+	u8 cdlp = page->cdlp;
+	struct scsi_mode_data data;
+	int i, ret;
+
+	ret = scsi_mode_sense(sdkp->device, 0x08, 0x0a, cdl_page[cdlp].subpage,
+			      buf, SD_BUF_SIZE, SD_TIMEOUT, sdkp->max_retries,
+			      &data, NULL);
+	if (ret) {
+		sd_printk(KERN_ERR, sdkp,
+			  "Command duration limits: read CDL page %s failed\n",
+			  cdl_page[cdlp].name);
+		return ret;
+	}
+	buf += data.header_length + data.block_descriptor_length;
+
+	if (cdlp == SD_CDLP_A || cdlp == SD_CDLP_B) {
+		buf += 8;
+
+		for (i = 0; i < SD_CDL_MAX_DESC; i++, buf += 4, desc++) {
+			u8 cdlunit = (buf[0] & 0xe0) >> 5;
+
+			desc->duration = sd_cdl_time(&buf[2], cdlunit);
+			desc->cdlp = cdlp;
+		}
+	} else {
+		/* T2A and T2B limits page */
+		if (cdlp == SD_CDLP_T2A)
+			cdl->perf_vs_duration_guideline = buf[7] >> 4;
+
+		buf += 8;
+
+		for (i = 0; i < SD_CDL_MAX_DESC; i++, buf += 32, desc++) {
+			u8 t2cdlunits = buf[0] & 0x0f;
+
+			desc->max_inactive_time =
+				sd_cdl_t2time(&buf[2], t2cdlunits);
+			desc->max_active_time =
+				sd_cdl_t2time(&buf[4], t2cdlunits);
+			desc->duration =
+				sd_cdl_t2time(&buf[10], t2cdlunits);
+			desc->max_inactive_policy =  (buf[6] >> 4) & 0x0f;
+			desc->max_active_policy = buf[6] & 0x0f;
+			desc->duration_policy = buf[14] & 0x0f;
+			desc->cdlp = cdlp;
+		}
+	}
+
+	return 0;
+}
+
+static int sd_cdl_read_pages(struct scsi_disk *sdkp, enum sd_cdlp *rw_cdlp,
+			     unsigned char *buf)
+{
+	struct sd_cdl *cdl = sdkp->cdl;
+	struct sd_cdl_page *page;
+	int i, ret;
+
+	/* Read supported pages */
+	for (i = 0; i < SD_CDL_RW; i++) {
+		page = &cdl->pages[i];
+		page->cdlp = rw_cdlp[i];
+		if (page->cdlp == SD_CDLP_NONE)
+			continue;
+
+		ret = sd_cdl_read_page(sdkp, page, buf);
+		if (ret) {
+			page->cdlp = SD_CDLP_NONE;
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+static u8 sd_cdl_check_cmd_support(struct scsi_disk *sdkp,
+				   enum sd_cdl_cmd cmd, unsigned char *buf)
+{
+	u8 opcode = cdl_cmd[cmd].opcode;
+	u16 sa = cdl_cmd[cmd].sa;
+	u8 cdlp;
+
+	/*
+	 * READ 32 and WRITE 32 are used only for disks that also support
+	 * type 2 data protection. If the disk does not have such feature,
+	 * ignore these commands.
+	 */
+	if ((sa == READ_32 || sa == WRITE_32) &&
+	    sdkp->protection_type != T10_PI_TYPE2_PROTECTION)
+		return SD_CDLP_NONE;
+
+	/* Check operation code */
+	if (scsi_report_opcode(sdkp->device, buf, SD_BUF_SIZE, opcode, sa) < 0)
+		return SD_CDLP_NONE;
+
+	if ((buf[1] & 0x03) != 0x03)
+		return SD_CDLP_NONE;
+
+	/* See SPC-6, one command format of REPORT SUPPORTED OPERATION CODES */
+	cdlp = (buf[1] & 0x18) >> 3;
+	if (buf[0] & 0x01) {
+		/* rwcdlp == 1 */
+		switch (cdlp) {
+		case 0x01:
+			return SD_CDLP_T2A;
+		case 0x02:
+			return SD_CDLP_T2B;
+		}
+	} else {
+		/* rwcdlp == 0 */
+		switch (cdlp) {
+		case 0x01:
+			return SD_CDLP_A;
+		case 0x02:
+			return SD_CDLP_B;
+		}
+	}
+
+	return SD_CDLP_NONE;
+}
+
+static bool sd_cdl_supported(struct scsi_disk *sdkp, enum sd_cdlp *rw_cdlp,
+			     unsigned char *buf)
+{
+	enum sd_cdlp cmd_cdlp[SD_CDL_CMD_MAX];
+	int i;
+
+	/*
+	 * Command duration limits is supported for READ 16, WRITE 16,
+	 * READ 32 and WRITE 32. Go through all these commands one at a time
+	 * and check if any support duration limits.
+	 */
+	for (i = 0; i < SD_CDL_CMD_MAX; i++)
+		cmd_cdlp[i] = sd_cdl_check_cmd_support(sdkp, i, buf);
+
+	/*
+	 * Allow support only for drives that report the same CDL page for the
+	 * read 16 and 32 variants and the same page for the write 16 and 32
+	 * variants.
+	 */
+	if (cmd_cdlp[SD_CDL_READ_32] != SD_CDLP_NONE &&
+	    cmd_cdlp[SD_CDL_READ_16] != SD_CDLP_NONE) {
+		if (cmd_cdlp[SD_CDL_READ_32] != cmd_cdlp[SD_CDL_READ_16])
+			rw_cdlp[SD_CDL_READ] = SD_CDLP_NONE;
+		else
+			rw_cdlp[SD_CDL_READ] = cmd_cdlp[SD_CDL_READ_16];
+	} else {
+		rw_cdlp[SD_CDL_READ] = cmd_cdlp[SD_CDL_READ_16];
+	}
+
+	if (cmd_cdlp[SD_CDL_WRITE_32] != SD_CDLP_NONE &&
+	    cmd_cdlp[SD_CDL_WRITE_16] != SD_CDLP_NONE) {
+		if (cmd_cdlp[SD_CDL_WRITE_32] != cmd_cdlp[SD_CDL_WRITE_16])
+			rw_cdlp[SD_CDL_WRITE] = SD_CDLP_NONE;
+		else
+			rw_cdlp[SD_CDL_WRITE] = cmd_cdlp[SD_CDL_WRITE_16];
+	} else {
+		rw_cdlp[SD_CDL_WRITE] = cmd_cdlp[SD_CDL_WRITE_16];
+	}
+
+	return rw_cdlp[SD_CDL_READ] != SD_CDLP_NONE ||
+		rw_cdlp[SD_CDL_WRITE] != SD_CDLP_NONE;
+}
+
+static struct sd_cdl *sd_cdl_alloc(void)
+{
+	struct sd_cdl *cdl;
+	struct sd_cdl_page *page;
+	int i, j;
+
+	cdl = kzalloc(sizeof(struct sd_cdl), GFP_KERNEL);
+	if (!cdl)
+		return NULL;
+
+	kobject_init(&cdl->kobj, &sd_cdl_ktype);
+	for (i = 0; i < SD_CDL_RW; i++) {
+		page = &cdl->pages[i];
+		kobject_init(&page->kobj, &sd_cdl_page_ktype);
+		page->rw = i;
+		page->cdlp = SD_CDLP_NONE;
+		for (j = 0; j < SD_CDL_MAX_DESC; j++)
+			kobject_init(&page->descs[j].kobj, &sd_cdl_desc_ktype);
+	}
+
+	return cdl;
+}
+
+void sd_read_cdl(struct scsi_disk *sdkp, unsigned char *buf)
+{
+	struct sd_cdl *cdl = sdkp->cdl;
+	enum sd_cdlp rw_cdlp[SD_CDL_RW];
+
+	/*
+	 * Check for CDL support. If the disk does not support duration limits,
+	 * clear any support information that was previously registered.
+	 */
+	if (!sd_cdl_supported(sdkp, rw_cdlp, buf))
+		goto unregister;
+
+	if (!cdl) {
+		cdl = sd_cdl_alloc();
+		if (!cdl)
+			return;
+	}
+
+	/*
+	 * We have CDL support: force the use of READ16/WRITE16.
+	 * READ32 and WRITE32 will be used automatically for disks with
+	 * T10_PI_TYPE2_PROTECTION support.
+	 */
+	sdkp->device->use_16_for_rw = 1;
+	sdkp->device->use_10_for_rw = 0;
+
+	if (!sdkp->cdl) {
+		sd_printk(KERN_NOTICE, sdkp,
+			"Command duration limits supported, reads: %s, writes: %s\n",
+			cdl_page[rw_cdlp[SD_CDL_READ]].name,
+			cdl_page[rw_cdlp[SD_CDL_WRITE]].name);
+		sdkp->cdl = cdl;
+	}
+
+	/* Update duration limits descriptor pages */
+	if (sd_cdl_read_pages(sdkp, rw_cdlp, buf))
+		goto unregister;
+
+	sd_cdl_sysfs_register(sdkp);
+
+	return;
+
+unregister:
+	sd_cdl_sysfs_unregister(sdkp);
+}
+
+void sd_cdl_release(struct scsi_disk *sdkp)
+{
+	sd_cdl_sysfs_unregister(sdkp);
+}
-- 
2.39.1



* [PATCH v3 08/18] scsi: sd: set read/write commands CDL index
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (6 preceding siblings ...)
  2023-01-24 19:02 ` [PATCH v3 07/18] scsi: sd: detect support for command duration limits Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-27 15:30   ` Hannes Reinecke
  2023-01-24 19:02 ` [PATCH v3 09/18] scsi: sd: handle read/write CDL timeout failures Niklas Cassel
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC (permalink / raw)
  To: James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block, Niklas Cassel

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Introduce the command duration limits helper function
sd_cdl_cmd_limit() to retrieve and set the DLD bits of the
READ/WRITE 16 and READ/WRITE 32 commands to indicate to the device
the command duration limit descriptor to apply to the command.

When command duration limits are enabled, sd_cdl_cmd_limit() obtains the
index of the descriptor to apply to the command for requests that have
the IOPRIO_CLASS_DL priority class with a priority data specifying a
valid descriptor index (1 to 7).
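
As a standalone illustration of where the DLD bits land in the CDBs, the
following sketch mirrors the bit placement used by this patch in
sd_setup_rw16_cmnd() and sd_setup_rw32_cmnd(). The helper names are
hypothetical and exist only for this example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: set the 3-bit DLD index in a READ 16 or WRITE 16
 * CDB. DLD bit 2 goes to bit 0 of byte 1, DLD bits 1:0 go to bits 7:6
 * of byte 14. Other bits of both bytes are preserved.
 */
void cdl_set_dld_rw16(uint8_t cdb[16], unsigned int dld)
{
	cdb[1] = (uint8_t)((cdb[1] & 0xfe) | ((dld >> 2) & 0x01));
	cdb[14] = (uint8_t)((cdb[14] & 0x3f) | ((dld & 0x03) << 6));
}

/*
 * Hypothetical helper: set the DLD index in a READ 32 or WRITE 32 CDB,
 * where it occupies the low 3 bits of byte 11.
 */
void cdl_set_dld_rw32(uint8_t cdb[32], unsigned int dld)
{
	cdb[11] = (uint8_t)(dld & 0x07);
}
```

With descriptor index 5 (binary 101), a 16-byte CDB gets bit 0 of byte 1 set
and byte 14 equal to 0x40. An index of 0 leaves all DLD bits clear, meaning no
limit is applied.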

The read-write sysfs attribute "enable" is introduced to control
setting the command duration limits indexes. If this attribute is set
to 0 (default), command duration limits specified by the user are
ignored. The user must set this attribute to 1 for command duration
limits to be set. Enabling and disabling the command duration limits
feature for ATA devices must be done using the ATA feature sub-page of
the control mode page. The sd_cdl_enable() function is introduced to
check if this mode page is supported by the device and if it is, use
it to enable/disable CDL.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 drivers/scsi/sd.c     |  16 +++--
 drivers/scsi/sd.h     |  10 ++++
 drivers/scsi/sd_cdl.c | 134 +++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 152 insertions(+), 8 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7879a5470773..d2eb01337943 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1045,13 +1045,14 @@ static blk_status_t sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
 
 static blk_status_t sd_setup_rw32_cmnd(struct scsi_cmnd *cmd, bool write,
 				       sector_t lba, unsigned int nr_blocks,
-				       unsigned char flags)
+				       unsigned char flags, unsigned int dld)
 {
 	cmd->cmd_len = SD_EXT_CDB_SIZE;
 	cmd->cmnd[0]  = VARIABLE_LENGTH_CMD;
 	cmd->cmnd[7]  = 0x18; /* Additional CDB len */
 	cmd->cmnd[9]  = write ? WRITE_32 : READ_32;
 	cmd->cmnd[10] = flags;
+	cmd->cmnd[11] = dld & 0x07;
 	put_unaligned_be64(lba, &cmd->cmnd[12]);
 	put_unaligned_be32(lba, &cmd->cmnd[20]); /* Expected Indirect LBA */
 	put_unaligned_be32(nr_blocks, &cmd->cmnd[28]);
@@ -1061,12 +1062,12 @@ static blk_status_t sd_setup_rw32_cmnd(struct scsi_cmnd *cmd, bool write,
 
 static blk_status_t sd_setup_rw16_cmnd(struct scsi_cmnd *cmd, bool write,
 				       sector_t lba, unsigned int nr_blocks,
-				       unsigned char flags)
+				       unsigned char flags, unsigned int dld)
 {
 	cmd->cmd_len  = 16;
 	cmd->cmnd[0]  = write ? WRITE_16 : READ_16;
-	cmd->cmnd[1]  = flags;
-	cmd->cmnd[14] = 0;
+	cmd->cmnd[1]  = flags | ((dld >> 2) & 0x01);
+	cmd->cmnd[14] = (dld & 0x03) << 6;
 	cmd->cmnd[15] = 0;
 	put_unaligned_be64(lba, &cmd->cmnd[2]);
 	put_unaligned_be32(nr_blocks, &cmd->cmnd[10]);
@@ -1129,6 +1130,7 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	unsigned int mask = logical_to_sectors(sdp, 1) - 1;
 	bool write = rq_data_dir(rq) == WRITE;
 	unsigned char protect, fua;
+	unsigned int dld = 0;
 	blk_status_t ret;
 	unsigned int dif;
 	bool dix;
@@ -1178,6 +1180,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	fua = rq->cmd_flags & REQ_FUA ? 0x8 : 0;
 	dix = scsi_prot_sg_count(cmd);
 	dif = scsi_host_dif_capable(cmd->device->host, sdkp->protection_type);
+	if (sd_cdl_enabled(sdkp))
+		dld = sd_cdl_dld(sdkp, cmd);
 
 	if (dif || dix)
 		protect = sd_setup_protect_cmnd(cmd, dix, dif);
@@ -1186,10 +1190,10 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 
 	if (protect && sdkp->protection_type == T10_PI_TYPE2_PROTECTION) {
 		ret = sd_setup_rw32_cmnd(cmd, write, lba, nr_blocks,
-					 protect | fua);
+					 protect | fua, dld);
 	} else if (sdp->use_16_for_rw || (nr_blocks > 0xffff)) {
 		ret = sd_setup_rw16_cmnd(cmd, write, lba, nr_blocks,
-					 protect | fua);
+					 protect | fua, dld);
 	} else if ((nr_blocks > 0xff) || (lba > 0x1fffff) ||
 		   sdp->use_10_for_rw || protect) {
 		ret = sd_setup_rw10_cmnd(cmd, write, lba, nr_blocks,
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index e60d33bd222a..5b6b6dc4b92d 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -130,8 +130,11 @@ struct sd_cdl_page {
 	struct sd_cdl_desc      descs[SD_CDL_MAX_DESC];
 };
 
+struct scsi_disk;
+
 struct sd_cdl {
 	struct kobject		kobj;
+	struct scsi_disk	*sdkp;
 	bool			sysfs_registered;
 	u8			perf_vs_duration_guideline;
 	struct sd_cdl_page	pages[SD_CDL_RW];
@@ -188,6 +191,7 @@ struct scsi_disk {
 	u8		zeroing_mode;
 	u8		nr_actuators;		/* Number of actuators */
 	struct sd_cdl	*cdl;
+	unsigned	cdl_enabled : 1;
 	unsigned	ATO : 1;	/* state of disk ATO bit */
 	unsigned	cache_override : 1; /* temp override of WCE,RCD */
 	unsigned	WCE : 1;	/* state of disk WCE bit */
@@ -355,5 +359,11 @@ void sd_print_result(const struct scsi_disk *sdkp, const char *msg, int result);
 /* Command duration limits support (in sd_cdl.c) */
 void sd_read_cdl(struct scsi_disk *sdkp, unsigned char *buf);
 void sd_cdl_release(struct scsi_disk *sdkp);
+int sd_cdl_dld(struct scsi_disk *sdkp, struct scsi_cmnd *scmd);
+
+static inline bool sd_cdl_enabled(struct scsi_disk *sdkp)
+{
+	return sdkp->cdl && sdkp->cdl_enabled;
+}
 
 #endif /* _SCSI_DISK_H */
diff --git a/drivers/scsi/sd_cdl.c b/drivers/scsi/sd_cdl.c
index 513cd989f19a..59d02dbb5ea1 100644
--- a/drivers/scsi/sd_cdl.c
+++ b/drivers/scsi/sd_cdl.c
@@ -93,6 +93,63 @@ static const char *sd_cdl_policy_name(u8 policy)
 	}
 }
 
+/*
+ * Enable/disable CDL.
+ */
+static int sd_cdl_enable(struct scsi_disk *sdkp, bool enable)
+{
+	struct scsi_device *sdp = sdkp->device;
+	struct scsi_mode_data data;
+	struct scsi_sense_hdr sshdr;
+	struct scsi_vpd *vpd;
+	bool is_ata = false;
+	char buf[64];
+	int ret;
+
+	rcu_read_lock();
+	vpd = rcu_dereference(sdp->vpd_pg89);
+	if (vpd)
+		is_ata = true;
+	rcu_read_unlock();
+
+	/*
+	 * For ATA devices, CDL needs to be enabled with a SET FEATURES command.
+	 */
+	if (is_ata) {
+		char *buf_data;
+		int len;
+
+		ret = scsi_mode_sense(sdp, 0x08, 0x0a, 0xf2, buf, sizeof(buf),
+				      SD_TIMEOUT, sdkp->max_retries, &data,
+				      NULL);
+		if (ret)
+			return -EINVAL;
+
+		/* Enable CDL using the ATA feature page */
+		len = min_t(size_t, sizeof(buf),
+			    data.length - data.header_length -
+			    data.block_descriptor_length);
+		buf_data = buf + data.header_length +
+			data.block_descriptor_length;
+		if (enable)
+			buf_data[4] = 0x02;
+		else
+			buf_data[4] = 0;
+
+		ret = scsi_mode_select(sdp, 1, 0, buf_data, len, SD_TIMEOUT,
+				       sdkp->max_retries, &data, &sshdr);
+		if (ret) {
+			if (scsi_sense_valid(&sshdr))
+				sd_print_sense_hdr(sdkp, &sshdr);
+			return -EINVAL;
+		}
+	}
+
+	sdkp->cdl_enabled = enable;
+
+	return 0;
+}
+
 /*
  * Command duration limits descriptors sysfs plumbing.
  */
@@ -324,6 +381,7 @@ static int sd_cdl_sysfs_register_page(struct scsi_disk *sdkp,
 struct sd_cdl_sysfs_entry {
 	struct attribute attr;
 	ssize_t (*show)(struct sd_cdl *cdl, char *buf);
+	ssize_t (*store)(struct sd_cdl *cdl, const char *buf, size_t length);
 };
 
 #define CDL_ATTR_RO(_name)	\
@@ -332,6 +390,13 @@ struct sd_cdl_sysfs_entry {
 		.show	= cdl_##_name##_show,				\
 	}
 
+#define CDL_ATTR_RW(_name)	\
+	static struct sd_cdl_sysfs_entry cdl_##_name##_entry = {	\
+		.attr	= { .name = __stringify(_name), .mode = 0644 },	\
+		.show	= cdl_##_name##_show,				\
+		.store	= cdl_##_name##_store,				\
+	}
+
 static ssize_t cdl_perf_vs_duration_guideline_show(struct sd_cdl *cdl,
 						   char *buf)
 {
@@ -340,8 +405,31 @@ static ssize_t cdl_perf_vs_duration_guideline_show(struct sd_cdl *cdl,
 }
 CDL_ATTR_RO(perf_vs_duration_guideline);
 
+static ssize_t cdl_enable_show(struct sd_cdl *cdl, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", (int)cdl->sdkp->cdl_enabled);
+}
+
+static ssize_t cdl_enable_store(struct sd_cdl *cdl,
+				const char *buf, size_t count)
+{
+	int ret;
+	bool v;
+
+	if (kstrtobool(buf, &v))
+		return -EINVAL;
+
+	ret = sd_cdl_enable(cdl->sdkp, v);
+	if (ret)
+		return ret;
+
+	return count;
+}
+CDL_ATTR_RW(enable);
+
 static struct attribute *sd_cdl_attrs[] = {
 	&cdl_perf_vs_duration_guideline_entry.attr,
+	&cdl_enable_entry.attr,
 	NULL,
 };
 
@@ -375,8 +463,25 @@ static ssize_t sd_cdl_sysfs_show(struct kobject *kobj,
 	return entry->show(cdl, page);
 }
 
+static ssize_t sd_cdl_sysfs_store(struct kobject *kobj, struct attribute *attr,
+				  const char *buf, size_t length)
+{
+	struct sd_cdl_sysfs_entry *entry =
+		container_of(attr, struct sd_cdl_sysfs_entry, attr);
+	struct sd_cdl *cdl = container_of(kobj, struct sd_cdl, kobj);
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (!entry->store)
+		return -EIO;
+
+	return entry->store(cdl, buf, length);
+}
+
 static const struct sysfs_ops sd_cdl_sysfs_ops = {
 	.show	= sd_cdl_sysfs_show,
+	.store	= sd_cdl_sysfs_store,
 };
 
 static void sd_cdl_sysfs_release(struct kobject *kobj)
@@ -411,6 +516,7 @@ static void sd_cdl_sysfs_unregister(struct scsi_disk *sdkp)
 			sd_cdl_sysfs_unregister_page(&cdl->pages[i]);
 	}
 
+	cdl->sdkp->cdl_enabled = 0;
 	kobject_del(&cdl->kobj);
 	kobject_put(&cdl->kobj);
 }
@@ -689,7 +795,7 @@ static bool sd_cdl_supported(struct scsi_disk *sdkp, enum sd_cdlp *rw_cdlp,
 		rw_cdlp[SD_CDL_WRITE] != SD_CDLP_NONE;
 }
 
-static struct sd_cdl *sd_cdl_alloc(void)
+static struct sd_cdl *sd_cdl_alloc(struct scsi_disk *sdkp)
 {
 	struct sd_cdl *cdl;
 	struct sd_cdl_page *page;
@@ -699,6 +805,7 @@ static struct sd_cdl *sd_cdl_alloc(void)
 	if (!cdl)
 		return NULL;
 
+	cdl->sdkp = sdkp;
 	kobject_init(&cdl->kobj, &sd_cdl_ktype);
 	for (i = 0; i < SD_CDL_RW; i++) {
 		page = &cdl->pages[i];
@@ -725,7 +832,7 @@ void sd_read_cdl(struct scsi_disk *sdkp, unsigned char *buf)
 		goto unregister;
 
 	if (!cdl) {
-		cdl = sd_cdl_alloc();
+		cdl = sd_cdl_alloc(sdkp);
 		if (!cdl)
 			return;
 	}
@@ -762,3 +869,26 @@ void sd_cdl_release(struct scsi_disk *sdkp)
 {
 	sd_cdl_sysfs_unregister(sdkp);
 }
+
+/*
+ * Check if a command has a duration limit set. If it does, return the
+ * descriptor index to use and 0 if the command has no limit set.
+ */
+int sd_cdl_dld(struct scsi_disk *sdkp, struct scsi_cmnd *scmd)
+{
+	unsigned int ioprio = req_get_ioprio(scsi_cmd_to_rq(scmd));
+	unsigned int dld;
+
+	/*
+	 * Use "no limit" if the request ioprio class is not IOPRIO_CLASS_DL
+	 * or if the user specified an invalid CDL descriptor index.
+	 */
+	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_DL)
+		return 0;
+
+	dld = IOPRIO_PRIO_DATA(ioprio);
+	if (dld > SD_CDL_MAX_DESC)
+		return 0;
+
+	return dld;
+}
-- 
2.39.1



* [PATCH v3 09/18] scsi: sd: handle read/write CDL timeout failures
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (7 preceding siblings ...)
  2023-01-24 19:02 ` [PATCH v3 08/18] scsi: sd: set read/write commands CDL index Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-27 15:34   ` Hannes Reinecke
  2023-01-24 19:02 ` [PATCH v3 10/18] ata: libata-scsi: remove unnecessary !cmd checks Niklas Cassel
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC (permalink / raw)
  To: James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block, Niklas Cassel

Commands using a duration limit descriptor that has limit policies set
to a value other than 0x0 may be failed by the device if one of the
limits is exceeded. For such commands, since the failure is the result
of the user duration limit configuration and workload, the commands
should not be retried and should be terminated immediately. Furthermore,
to allow the user to differentiate these "soft" failures from hard errors
due to hardware problems, an error code other than EIO should be returned.

There are 2 cases to consider:
(1) The failure is due to a limit policy failing the command with a
CHECK CONDITION status, that is, any limit policy other than 0xD.
For this case, scsi_check_sense() is modified to detect failures with
the ABORTED COMMAND sense key and the COMMAND TIMEOUT BEFORE PROCESSING
or COMMAND TIMEOUT DURING PROCESSING or COMMAND TIMEOUT DURING
PROCESSING DUE TO ERROR RECOVERY additional sense code. For these
failures, a SUCCESS disposition is returned so that
scsi_finish_command() is called to terminate the command.

(2) The failure is due to a limit policy set to 0xD, which results in the
command being terminated with a GOOD status, COMPLETED sense key, and
DATA CURRENTLY UNAVAILABLE additional sense code. To handle this case,
scsi_check_sense() is modified to return a SUCCESS disposition so
that scsi_finish_command() is called to terminate the command.
In addition, scsi_decide_disposition() has to be modified to check whether
a command being terminated with GOOD status has sense data.
Returning sense data with GOOD status is defined in SCSI Primary
Commands - 6 (SPC-6), so this is according to spec, even though commands
completing with GOOD status were not checked for sense data before.
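
The two cases above boil down to a small classification on the sense key and
the additional sense code / qualifier pair. The following standalone sketch
shows that check in isolation; the ex_ names are hypothetical, and the numeric
values are the ones used in the diff below (2Eh/01h-03h and 55h/0Ah) together
with the standard sense key codes 0Bh and 0Fh.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sense key values; the names are local to this sketch. */
enum {
	EX_SK_ABORTED_COMMAND	= 0x0b,
	EX_SK_COMPLETED		= 0x0f,
};

/* Return true if the sense data describes a command duration limit timeout. */
static bool ex_sense_is_cdl_timeout(uint8_t sense_key, uint8_t asc,
				    uint8_t ascq)
{
	switch (sense_key) {
	case EX_SK_ABORTED_COMMAND:
		/* Case (1): COMMAND TIMEOUT ... variants, 2Eh/01h to 2Eh/03h */
		return asc == 0x2e && ascq >= 0x01 && ascq <= 0x03;
	case EX_SK_COMPLETED:
		/* Case (2): DATA CURRENTLY UNAVAILABLE, 55h/0Ah (policy 0xD) */
		return asc == 0x55 && ascq == 0x0a;
	default:
		return false;
	}
}
```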

If scsi_check_sense() detects sense data representing a duration limit,
scsi_check_sense() will set the newly introduced SCSI ML byte
SCSIML_STAT_DL_TIMEOUT. This SCSI ML byte is checked in
scsi_noretry_cmd(), so that a command that failed because of a CDL
timeout cannot be retried. The SCSI ML byte is also checked in
scsi_result_to_blk_status() to complete the command request with the
BLK_STS_DURATION_LIMIT status, which results in the user seeing ETIME
errors for the failed commands.
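
From the application side, this means a duration limit timeout can be told
apart from a hard error by looking at errno. A minimal sketch of such a check,
assuming the caller passes in the errno value of a failed read or write (the
ex_ names and the three-way split are illustrative, not part of this series):

```c
#include <errno.h>

enum ex_io_outcome {
	EX_IO_OK,
	EX_IO_DURATION_LIMIT,	/* "soft" failure: the limit was exceeded */
	EX_IO_HARD_ERROR,	/* anything else, e.g. EIO */
};

/* Classify the errno-style result of an I/O issued with a duration limit. */
static enum ex_io_outcome ex_classify_io(int err)
{
	if (err == 0)
		return EX_IO_OK;
	if (err == ETIME)
		return EX_IO_DURATION_LIMIT;
	return EX_IO_HARD_ERROR;
}
```

What to do on EX_IO_DURATION_LIMIT (reissue without a limit, read from a
replica, and so on) is a policy decision left to the application.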

Co-developed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 drivers/scsi/scsi_error.c | 46 +++++++++++++++++++++++++++++++++++++++
 drivers/scsi/scsi_lib.c   |  4 ++++
 drivers/scsi/scsi_priv.h  |  1 +
 3 files changed, 51 insertions(+)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index cf5ec5f5f4f6..9988539bc348 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -536,6 +536,7 @@ static inline void set_scsi_ml_byte(struct scsi_cmnd *cmd, u8 status)
  */
 enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
 {
+	struct request *req = scsi_cmd_to_rq(scmd);
 	struct scsi_device *sdev = scmd->device;
 	struct scsi_sense_hdr sshdr;
 
@@ -595,6 +596,22 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
 		if (sshdr.asc == 0x10) /* DIF */
 			return SUCCESS;
 
+		/*
+		 * Check aborts due to command duration limit policy:
+		 * ABORTED COMMAND additional sense code with the
+		 * COMMAND TIMEOUT BEFORE PROCESSING or
+		 * COMMAND TIMEOUT DURING PROCESSING or
+		 * COMMAND TIMEOUT DURING PROCESSING DUE TO ERROR RECOVERY
+		 * additional sense code qualifiers.
+		 */
+		if (sshdr.asc == 0x2e &&
+		    sshdr.ascq >= 0x01 && sshdr.ascq <= 0x03) {
+			set_scsi_ml_byte(scmd, SCSIML_STAT_DL_TIMEOUT);
+			req->cmd_flags |= REQ_FAILFAST_DEV;
+			req->rq_flags |= RQF_QUIET;
+			return SUCCESS;
+		}
+
 		if (sshdr.asc == 0x44 && sdev->sdev_bflags & BLIST_RETRY_ITF)
 			return ADD_TO_MLQUEUE;
 		if (sshdr.asc == 0xc1 && sshdr.ascq == 0x01 &&
@@ -691,6 +708,15 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
 		}
 		return SUCCESS;
 
+	case COMPLETED:
+		if (sshdr.asc == 0x55 && sshdr.ascq == 0x0a) {
+			set_scsi_ml_byte(scmd, SCSIML_STAT_DL_TIMEOUT);
+			req->cmd_flags |= REQ_FAILFAST_DEV;
+			req->rq_flags |= RQF_QUIET;
+			return SUCCESS;
+		}
+		return SUCCESS;
+
 	default:
 		return SUCCESS;
 	}
@@ -785,6 +811,14 @@ static enum scsi_disposition scsi_eh_completed_normally(struct scsi_cmnd *scmd)
 	switch (get_status_byte(scmd)) {
 	case SAM_STAT_GOOD:
 		scsi_handle_queue_ramp_up(scmd->device);
+		if (scmd->sense_buffer && SCSI_SENSE_VALID(scmd))
+			/*
+			 * If we have sense data, call scsi_check_sense() in
+			 * order to set the correct SCSI ML byte (if any).
+			 * No point in checking the return value, since the
+			 * command has already completed successfully.
+			 */
+			scsi_check_sense(scmd);
 		fallthrough;
 	case SAM_STAT_COMMAND_TERMINATED:
 		return SUCCESS;
@@ -1807,6 +1841,10 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
 		return !!(req->cmd_flags & REQ_FAILFAST_DRIVER);
 	}
 
+	/* Never retry commands aborted due to a duration limit timeout */
+	if (scsi_ml_byte(scmd->result) == SCSIML_STAT_DL_TIMEOUT)
+		return true;
+
 	if (!scsi_status_is_check_condition(scmd->result))
 		return false;
 
@@ -1966,6 +2004,14 @@ enum scsi_disposition scsi_decide_disposition(struct scsi_cmnd *scmd)
 		if (scmd->cmnd[0] == REPORT_LUNS)
 			scmd->device->sdev_target->expecting_lun_change = 0;
 		scsi_handle_queue_ramp_up(scmd->device);
+		if (scmd->sense_buffer && SCSI_SENSE_VALID(scmd))
+			/*
+			 * If we have sense data, call scsi_check_sense() in
+			 * order to set the correct SCSI ML byte (if any).
+			 * No point in checking the return value, since the
+			 * command has already completed successfully.
+			 */
+			scsi_check_sense(scmd);
 		fallthrough;
 	case SAM_STAT_COMMAND_TERMINATED:
 		return SUCCESS;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e1a021dd4da2..406952e72a68 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -600,6 +600,8 @@ static blk_status_t scsi_result_to_blk_status(int result)
 		return BLK_STS_MEDIUM;
 	case SCSIML_STAT_TGT_FAILURE:
 		return BLK_STS_TARGET;
+	case SCSIML_STAT_DL_TIMEOUT:
+		return BLK_STS_DURATION_LIMIT;
 	}
 
 	switch (host_byte(result)) {
@@ -797,6 +799,8 @@ static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result)
 				blk_stat = BLK_STS_ZONE_OPEN_RESOURCE;
 			}
 			break;
+		case COMPLETED:
+			fallthrough;
 		default:
 			action = ACTION_FAIL;
 			break;
diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
index 74324fba4281..f42388ecb024 100644
--- a/drivers/scsi/scsi_priv.h
+++ b/drivers/scsi/scsi_priv.h
@@ -27,6 +27,7 @@ enum scsi_ml_status {
 	SCSIML_STAT_NOSPC		= 0x02,	/* Space allocation on the dev failed */
 	SCSIML_STAT_MED_ERROR		= 0x03,	/* Medium error */
 	SCSIML_STAT_TGT_FAILURE		= 0x04,	/* Permanent target failure */
+	SCSIML_STAT_DL_TIMEOUT		= 0x05, /* Command Duration Limit timeout */
 };
 
 static inline u8 scsi_ml_byte(int result)
-- 
2.39.1



* [PATCH v3 10/18] ata: libata-scsi: remove unnecessary !cmd checks
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (8 preceding siblings ...)
  2023-01-24 19:02 ` [PATCH v3 09/18] scsi: sd: handle read/write CDL timeout failures Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-27 15:35   ` Hannes Reinecke
  2023-01-24 19:02 ` [PATCH v3 11/18] ata: libata: change ata_eh_request_sense() to not set CHECK_CONDITION Niklas Cassel
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block, Niklas Cassel

There is no need to check if !cmd as this can only happen for
ATA internal commands, which use the ATA internal tag (32).

Most users of ata_scsi_set_sense() are from _xlat functions that
translate a scsicmd to an ATA command. These obviously have a qc->scsicmd.

ata_scsi_qc_complete() can also call ata_scsi_set_sense() via
ata_gen_passthru_sense() / ata_gen_ata_sense(). This callback is only
called for translated commands, so it also has a qc->scsicmd.

ata_eh_analyze_ncq_error(): the NCQ error log can only contain a 0-31
value, so it will never be able to get the ATA internal tag (32).

ata_eh_request_sense(): only called by ata_eh_analyze_tf(), which
is only called when iterating the QCs using ata_qc_for_each_raw(),
which does not include the internal tag.

Since there is no existing call site where cmd can be NULL, remove the
!cmd check from ata_scsi_set_sense() and ata_scsi_set_sense_information().

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 drivers/ata/libata-scsi.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index e093c7a7deeb..26746609bf76 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -209,9 +209,6 @@ void ata_scsi_set_sense(struct ata_device *dev, struct scsi_cmnd *cmd,
 {
 	bool d_sense = (dev->flags & ATA_DFLAG_D_SENSE);
 
-	if (!cmd)
-		return;
-
 	scsi_build_sense(cmd, d_sense, sk, asc, ascq);
 }
 
@@ -221,9 +218,6 @@ void ata_scsi_set_sense_information(struct ata_device *dev,
 {
 	u64 information;
 
-	if (!cmd)
-		return;
-
 	information = ata_tf_read_block(tf, dev);
 	if (information == U64_MAX)
 		return;
-- 
2.39.1



* [PATCH v3 11/18] ata: libata: change ata_eh_request_sense() to not set CHECK_CONDITION
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (9 preceding siblings ...)
  2023-01-24 19:02 ` [PATCH v3 10/18] ata: libata-scsi: remove unnecessary !cmd checks Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-27 15:36   ` Hannes Reinecke
  2023-01-24 19:02 ` [PATCH v3 12/18] ata: libata: detect support for command duration limits Niklas Cassel
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block, Niklas Cassel

Currently, ata_eh_request_sense() unconditionally sets the scsicmd->result
to SAM_STAT_CHECK_CONDITION.

For Command Duration Limits policy 0xD:
The device shall complete the command without error (SAM_STAT_GOOD)
with the additional sense code set to DATA CURRENTLY UNAVAILABLE.

It is perfectly fine to have sense data for a command that returned
completion without error.

In order to support CDL policy 0xD, we have to remove this
assumption that having sense data means that the command failed
(SAM_STAT_CHECK_CONDITION).

Change ata_eh_request_sense() to not set SAM_STAT_CHECK_CONDITION,
and instead move the setting of SAM_STAT_CHECK_CONDITION to the single
caller that wants SAM_STAT_CHECK_CONDITION set. That way,
ata_eh_request_sense() can be reused in a follow-up patch that adds
support for CDL policy 0xD.
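
The shape of this refactoring can be shown with a small standalone model: the
helper only fetches sense data and reports whether it succeeded, and the
error-path caller is the one that decides sense data implies CHECK CONDITION.
Everything below (struct ex_cmd, the ex_ functions, passing the device's
sense triple as arguments) is a simplified, hypothetical stand-in for the
libata code, not a copy of it.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_SAM_STAT_GOOD		0x00
#define EX_SAM_STAT_CHECK_CONDITION	0x02

struct ex_cmd {
	uint8_t	status;		/* SCSI status byte */
	bool	sense_valid;
	uint8_t	sk, asc, ascq;	/* fetched sense data */
};

/* Fetch sense data; report success but leave cmd->status untouched. */
static bool ex_request_sense(struct ex_cmd *cmd, bool dev_has_sense,
			     uint8_t sk, uint8_t asc, uint8_t ascq)
{
	if (!dev_has_sense)
		return false;

	cmd->sk = sk;
	cmd->asc = asc;
	cmd->ascq = ascq;
	cmd->sense_valid = true;
	return true;
}

/* Error path: this caller wants sense data to mean CHECK CONDITION. */
static void ex_analyze_error(struct ex_cmd *cmd, bool dev_has_sense,
			     uint8_t sk, uint8_t asc, uint8_t ascq)
{
	if (!cmd->sense_valid &&
	    ex_request_sense(cmd, dev_has_sense, sk, asc, ascq))
		cmd->status = EX_SAM_STAT_CHECK_CONDITION;
}
```

A policy 0xD completion path could call ex_request_sense() directly and keep
the status at EX_SAM_STAT_GOOD, which is the reuse this patch prepares for.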

The only caller of ata_eh_request_sense() is protected by:
if (!(qc->flags & ATA_QCFLAG_SENSE_VALID)), so we can remove this
duplicated check from ata_eh_request_sense() itself.

Additionally, ata_eh_request_sense() is only called from
ata_eh_analyze_tf(), which is only called when iterating the QCs using
ata_qc_for_each_raw(), which does not include the internal tag,
so cmd can never be NULL (all non-internal commands have qc->scsicmd set),
so remove the !cmd check as well.

Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 drivers/ata/libata-eh.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index a6c901811802..598ae07195b6 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1401,8 +1401,11 @@ unsigned int atapi_eh_tur(struct ata_device *dev, u8 *r_sense_key)
  *
  *	LOCKING:
  *	Kernel thread context (may sleep).
+ *
+ *	RETURNS:
+ *	true if sense data could be fetched, false otherwise.
  */
-static void ata_eh_request_sense(struct ata_queued_cmd *qc)
+static bool ata_eh_request_sense(struct ata_queued_cmd *qc)
 {
 	struct scsi_cmnd *cmd = qc->scsicmd;
 	struct ata_device *dev = qc->dev;
@@ -1411,15 +1414,12 @@ static void ata_eh_request_sense(struct ata_queued_cmd *qc)
 
 	if (ata_port_is_frozen(qc->ap)) {
 		ata_dev_warn(dev, "sense data available but port frozen\n");
-		return;
+		return false;
 	}
 
-	if (!cmd || qc->flags & ATA_QCFLAG_SENSE_VALID)
-		return;
-
 	if (!ata_id_sense_reporting_enabled(dev->id)) {
 		ata_dev_warn(qc->dev, "sense data reporting disabled\n");
-		return;
+		return false;
 	}
 
 	ata_tf_init(dev, &tf);
@@ -1432,13 +1432,19 @@ static void ata_eh_request_sense(struct ata_queued_cmd *qc)
 	/* Ignore err_mask; ATA_ERR might be set */
 	if (tf.status & ATA_SENSE) {
 		if (ata_scsi_sense_is_valid(tf.lbah, tf.lbam, tf.lbal)) {
-			ata_scsi_set_sense(dev, cmd, tf.lbah, tf.lbam, tf.lbal);
+			/* Set sense without also setting scsicmd->result */
+			scsi_build_sense_buffer(dev->flags & ATA_DFLAG_D_SENSE,
+						cmd->sense_buffer, tf.lbah,
+						tf.lbam, tf.lbal);
 			qc->flags |= ATA_QCFLAG_SENSE_VALID;
+			return true;
 		}
 	} else {
 		ata_dev_warn(dev, "request sense failed stat %02x emask %x\n",
 			     tf.status, err_mask);
 	}
+
+	return false;
 }
 
 /**
@@ -1588,8 +1594,9 @@ static unsigned int ata_eh_analyze_tf(struct ata_queued_cmd *qc)
 		 *  was not included in the NCQ command error log
 		 *  (i.e. NCQ autosense is not supported by the device).
 		 */
-		if (!(qc->flags & ATA_QCFLAG_SENSE_VALID) && (stat & ATA_SENSE))
-			ata_eh_request_sense(qc);
+		if (!(qc->flags & ATA_QCFLAG_SENSE_VALID) &&
+		    (stat & ATA_SENSE) && ata_eh_request_sense(qc))
+			set_status_byte(qc->scsicmd, SAM_STAT_CHECK_CONDITION);
 		if (err & ATA_ICRC)
 			qc->err_mask |= AC_ERR_ATA_BUS;
 		if (err & (ATA_UNC | ATA_AMNF))
-- 
2.39.1



* [PATCH v3 12/18] ata: libata: detect support for command duration limits
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (10 preceding siblings ...)
  2023-01-24 19:02 ` [PATCH v3 11/18] ata: libata: change ata_eh_request_sense() to not set CHECK_CONDITION Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-24 19:02 ` [PATCH v3 13/18] ata: libata-scsi: handle CDL bits in ata_scsiop_maint_in() Niklas Cassel
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block, Niklas Cassel

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Use the supported capabilities identify device data log page to detect
if a device supports the command duration limits feature. For devices
supporting this feature, set the device flag ATA_DFLAG_CDL. To support
SCSI-ATA translation, retrieve the command duration limits log page 18h
and cache the content of this page in the cdl array added to the
ata_device data structure.
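
The detection step can be sketched outside the kernel as a parse of the
512-byte supported capabilities page. The byte offset (168) and the bit
positions (63, 0 and 1) are taken from the diff below; reading bit 63 as a
"this qword is valid" marker is my interpretation of that check, and the ex_
names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_CDL_QWORD_OFFSET	168

struct ex_cdl_caps {
	bool supported;		/* command duration limits supported */
	bool guideline;		/* command duration guideline supported */
};

/* Read a little-endian 64-bit value from a possibly unaligned buffer. */
static uint64_t ex_get_le64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Parse the CDL bits out of a supported capabilities log page. */
static struct ex_cdl_caps ex_parse_cdl_caps(const uint8_t page[512])
{
	uint64_t val = ex_get_le64(&page[EX_CDL_QWORD_OFFSET]);
	struct ex_cdl_caps caps = { false, false };

	/* Both bit 63 and bit 0 must be set for CDL to count as supported. */
	if (((val >> 63) & 1) && (val & 1)) {
		caps.supported = true;
		caps.guideline = (val >> 1) & 1;
	}
	return caps;
}
```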

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 drivers/ata/libata-core.c | 52 ++++++++++++++++++++++++++++++++++++++-
 drivers/ata/libata-scsi.c | 17 ++++++-------
 include/linux/ata.h       |  5 +++-
 include/linux/libata.h    | 29 +++++++++++++---------
 4 files changed, 80 insertions(+), 23 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 36c1aca310e9..17e32b0a6364 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -2367,6 +2367,54 @@ static void ata_dev_config_trusted(struct ata_device *dev)
 		dev->flags |= ATA_DFLAG_TRUSTED;
 }
 
+static void ata_dev_config_cdl(struct ata_device *dev)
+{
+	struct ata_port *ap = dev->link->ap;
+	unsigned int err_mask;
+	u64 val;
+
+	if (ata_id_major_version(dev->id) < 12)
+		goto not_supported;
+
+	if (!ata_log_supported(dev, ATA_LOG_IDENTIFY_DEVICE) ||
+	    !ata_identify_page_supported(dev, ATA_LOG_SUPPORTED_CAPABILITIES))
+		goto not_supported;
+
+	err_mask = ata_read_log_page(dev, ATA_LOG_IDENTIFY_DEVICE,
+				     ATA_LOG_SUPPORTED_CAPABILITIES,
+				     ap->sector_buf, 1);
+	if (err_mask)
+		goto not_supported;
+
+	/* Check Command Duration Limit Supported bits */
+	val = get_unaligned_le64(&ap->sector_buf[168]);
+	if (!(val & BIT_ULL(63)) || !(val & BIT_ULL(0)))
+		goto not_supported;
+
+	/* Warn the user if command duration guideline is not supported */
+	if (!(val & BIT_ULL(1)))
+		ata_dev_warn(dev,
+			"Command duration guideline is not supported\n");
+
+	/*
+	 * Command duration limits is supported: cache the CDL log page 18h
+	 * (command duration descriptors).
+	 */
+	err_mask = ata_read_log_page(dev, ATA_LOG_CDL, 0, ap->sector_buf, 1);
+	if (err_mask) {
+		ata_dev_warn(dev, "Read Command Duration Limits log failed\n");
+		goto not_supported;
+	}
+
+	memcpy(dev->cdl, ap->sector_buf, ATA_LOG_CDL_SIZE);
+	dev->flags |= ATA_DFLAG_CDL;
+
+	return;
+
+not_supported:
+	dev->flags &= ~ATA_DFLAG_CDL;
+}
+
 static int ata_dev_config_lba(struct ata_device *dev)
 {
 	const u16 *id = dev->id;
@@ -2534,13 +2582,14 @@ static void ata_dev_print_features(struct ata_device *dev)
 		return;
 
 	ata_dev_info(dev,
-		     "Features:%s%s%s%s%s%s%s\n",
+		     "Features:%s%s%s%s%s%s%s%s\n",
 		     dev->flags & ATA_DFLAG_FUA ? " FUA" : "",
 		     dev->flags & ATA_DFLAG_TRUSTED ? " Trust" : "",
 		     dev->flags & ATA_DFLAG_DA ? " Dev-Attention" : "",
 		     dev->flags & ATA_DFLAG_DEVSLP ? " Dev-Sleep" : "",
 		     dev->flags & ATA_DFLAG_NCQ_SEND_RECV ? " NCQ-sndrcv" : "",
 		     dev->flags & ATA_DFLAG_NCQ_PRIO ? " NCQ-prio" : "",
+		     dev->flags & ATA_DFLAG_CDL ? " CDL" : "",
 		     dev->cpr_log ? " CPR" : "");
 }
 
@@ -2702,6 +2751,7 @@ int ata_dev_configure(struct ata_device *dev)
 		ata_dev_config_zac(dev);
 		ata_dev_config_trusted(dev);
 		ata_dev_config_cpr(dev);
+		ata_dev_config_cdl(dev);
 		dev->cdb_len = 32;
 
 		if (print_info)
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 26746609bf76..716c33af999c 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -47,15 +47,14 @@ typedef unsigned int (*ata_xlat_func_t)(struct ata_queued_cmd *qc);
 static struct ata_device *__ata_scsi_find_dev(struct ata_port *ap,
 					const struct scsi_device *scsidev);
 
-#define RW_RECOVERY_MPAGE 0x1
-#define RW_RECOVERY_MPAGE_LEN 12
-#define CACHE_MPAGE 0x8
-#define CACHE_MPAGE_LEN 20
-#define CONTROL_MPAGE 0xa
-#define CONTROL_MPAGE_LEN 12
-#define ALL_MPAGES 0x3f
-#define ALL_SUB_MPAGES 0xff
-
+#define RW_RECOVERY_MPAGE		0x1
+#define RW_RECOVERY_MPAGE_LEN		12
+#define CACHE_MPAGE			0x8
+#define CACHE_MPAGE_LEN			20
+#define CONTROL_MPAGE			0xa
+#define CONTROL_MPAGE_LEN		12
+#define ALL_MPAGES			0x3f
+#define ALL_SUB_MPAGES			0xff
 
 static const u8 def_rw_recovery_mpage[RW_RECOVERY_MPAGE_LEN] = {
 	RW_RECOVERY_MPAGE,
diff --git a/include/linux/ata.h b/include/linux/ata.h
index 0c18499f60b6..b01e2cebe1fe 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -323,15 +323,18 @@ enum {
 	ATA_LOG_SATA_NCQ	= 0x10,
 	ATA_LOG_NCQ_NON_DATA	= 0x12,
 	ATA_LOG_NCQ_SEND_RECV	= 0x13,
+	ATA_LOG_CDL		= 0x18,
+	ATA_LOG_CDL_SIZE	= ATA_SECT_SIZE,
 	ATA_LOG_IDENTIFY_DEVICE	= 0x30,
 	ATA_LOG_CONCURRENT_POSITIONING_RANGES = 0x47,
 
 	/* Identify device log pages: */
+	ATA_LOG_SUPPORTED_CAPABILITIES	= 0x03,
 	ATA_LOG_SECURITY	  = 0x06,
 	ATA_LOG_SATA_SETTINGS	  = 0x08,
 	ATA_LOG_ZONED_INFORMATION = 0x09,
 
-	/* Identify device SATA settings log:*/
+	/* Identify device SATA settings log: */
 	ATA_LOG_DEVSLP_OFFSET	  = 0x30,
 	ATA_LOG_DEVSLP_SIZE	  = 0x08,
 	ATA_LOG_DEVSLP_MDAT	  = 0x00,
diff --git a/include/linux/libata.h b/include/linux/libata.h
index a759dfbdcc91..2b17d6c99a37 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -94,17 +94,18 @@ enum {
 	ATA_DFLAG_DMADIR	= (1 << 10), /* device requires DMADIR */
 	ATA_DFLAG_NCQ_SEND_RECV = (1 << 11), /* device supports NCQ SEND and RECV */
 	ATA_DFLAG_NCQ_PRIO	= (1 << 12), /* device supports NCQ priority */
-	ATA_DFLAG_CFG_MASK	= (1 << 13) - 1,
-
-	ATA_DFLAG_PIO		= (1 << 13), /* device limited to PIO mode */
-	ATA_DFLAG_NCQ_OFF	= (1 << 14), /* device limited to non-NCQ mode */
-	ATA_DFLAG_SLEEPING	= (1 << 15), /* device is sleeping */
-	ATA_DFLAG_DUBIOUS_XFER	= (1 << 16), /* data transfer not verified */
-	ATA_DFLAG_NO_UNLOAD	= (1 << 17), /* device doesn't support unload */
-	ATA_DFLAG_UNLOCK_HPA	= (1 << 18), /* unlock HPA */
-	ATA_DFLAG_INIT_MASK	= (1 << 19) - 1,
-
-	ATA_DFLAG_NCQ_PRIO_ENABLED = (1 << 19), /* Priority cmds sent to dev */
+	ATA_DFLAG_CDL		= (1 << 13), /* supports cmd duration limits */
+	ATA_DFLAG_CFG_MASK	= (1 << 14) - 1,
+
+	ATA_DFLAG_PIO		= (1 << 14), /* device limited to PIO mode */
+	ATA_DFLAG_NCQ_OFF	= (1 << 15), /* device limited to non-NCQ mode */
+	ATA_DFLAG_SLEEPING	= (1 << 16), /* device is sleeping */
+	ATA_DFLAG_DUBIOUS_XFER	= (1 << 17), /* data transfer not verified */
+	ATA_DFLAG_NO_UNLOAD	= (1 << 18), /* device doesn't support unload */
+	ATA_DFLAG_UNLOCK_HPA	= (1 << 19), /* unlock HPA */
+	ATA_DFLAG_INIT_MASK	= (1 << 20) - 1,
+
+	ATA_DFLAG_NCQ_PRIO_ENABLED = (1 << 20), /* Priority cmds sent to dev */
 	ATA_DFLAG_DETACH	= (1 << 24),
 	ATA_DFLAG_DETACHED	= (1 << 25),
 	ATA_DFLAG_DA		= (1 << 26), /* device supports Device Attention */
@@ -115,7 +116,8 @@ enum {
 
 	ATA_DFLAG_FEATURES_MASK	= (ATA_DFLAG_TRUSTED | ATA_DFLAG_DA |	\
 				   ATA_DFLAG_DEVSLP | ATA_DFLAG_NCQ_SEND_RECV | \
-				   ATA_DFLAG_NCQ_PRIO | ATA_DFLAG_FUA),
+				   ATA_DFLAG_NCQ_PRIO | ATA_DFLAG_FUA | \
+				   ATA_DFLAG_CDL),
 
 	ATA_DEV_UNKNOWN		= 0,	/* unknown device */
 	ATA_DEV_ATA		= 1,	/* ATA device */
@@ -709,6 +711,9 @@ struct ata_device {
 	/* Concurrent positioning ranges */
 	struct ata_cpr_log	*cpr_log;
 
+	/* Command Duration Limits log support */
+	u8			cdl[ATA_LOG_CDL_SIZE];
+
 	/* error history */
 	int			spdn_cnt;
 	/* ering is CLEAR_END, read comment above CLEAR_END */
-- 
2.39.1



* [PATCH v3 13/18] ata: libata-scsi: handle CDL bits in ata_scsiop_maint_in()
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (11 preceding siblings ...)
  2023-01-24 19:02 ` [PATCH v3 12/18] ata: libata: detect support for command duration limits Niklas Cassel
@ 2023-01-24 19:02 ` Niklas Cassel
  2023-01-27 15:37   ` Hannes Reinecke
  2023-01-24 19:03 ` [PATCH v3 14/18] ata: libata-scsi: add support for CDL pages mode sense Niklas Cassel
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:02 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block, Niklas Cassel

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

For a SCSI MAINTENANCE_IN/MI_REPORT_SUPPORTED_OPERATION_CODES operation,
add the translation of the rwcdlp and cdlp bits for the READ16 and
WRITE16 commands. If the ATA device does not support command duration
limits, these bits are always 0. If the ATA device supports command
duration limits, the rwcdlp bit is set to 1 for READ16 and WRITE16 and
the cdlp bits are set to 0x1 for READ16 and 0x2 for WRITE16. These
correspond to the T2A mode page containing the read descriptors and
to the T2B mode page containing the write descriptors, as defined in
SAT-5.
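
As an illustration only, the resulting encoding of the first two bytes of
the one-command format can be sketched in standalone C. The function name
fill_rsoc_cdl_bits() and the OP_* macros are hypothetical and are not part
of this patch; the bit positions follow the diff below (rwcdlp in byte 0,
cdlp shifted left by 3 in byte 1, next to the 3-bit support field).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* SCSI opcodes for READ(16) and WRITE(16) */
#define OP_READ_16	0x88
#define OP_WRITE_16	0x8a

/*
 * Hypothetical helper: fill bytes 0 and 1 of the one-command parameter
 * data for a supported opcode, given whether the device supports CDL.
 */
static void fill_rsoc_cdl_bits(uint8_t *rbuf, uint8_t opcode, bool dev_has_cdl)
{
	uint8_t supported = 3;	/* same "supported" value as in the patch */
	uint8_t rwcdlp = 0, cdlp = 0;

	assert(rbuf);

	if (dev_has_cdl && opcode == OP_READ_16) {
		rwcdlp = 0x01;
		cdlp = 0x01;	/* read descriptors: T2A page */
	} else if (dev_has_cdl && opcode == OP_WRITE_16) {
		rwcdlp = 0x01;
		cdlp = 0x02;	/* write descriptors: T2B page */
	}

	rbuf[0] = rwcdlp;
	rbuf[1] = (uint8_t)(cdlp << 3) | supported;
}
```

With these assumptions, READ16 on a CDL capable device gives bytes
01h/0Bh, WRITE16 gives 01h/13h, and any command on a device without CDL
gives 00h/03h.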

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 drivers/ata/libata-scsi.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 716c33af999c..2a0a04c9e658 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -3235,7 +3235,7 @@ static unsigned int ata_scsiop_maint_in(struct ata_scsi_args *args, u8 *rbuf)
 {
 	struct ata_device *dev = args->dev;
 	u8 *cdb = args->cmd->cmnd;
-	u8 supported = 0;
+	u8 supported = 0, cdlp = 0, rwcdlp = 0;
 	unsigned int err = 0;
 
 	if (cdb[2] != 1 && cdb[2] != 3) {
@@ -3262,10 +3262,8 @@ static unsigned int ata_scsiop_maint_in(struct ata_scsi_args *args, u8 *rbuf)
 	case MAINTENANCE_IN:
 	case READ_6:
 	case READ_10:
-	case READ_16:
 	case WRITE_6:
 	case WRITE_10:
-	case WRITE_16:
 	case ATA_12:
 	case ATA_16:
 	case VERIFY:
@@ -3275,6 +3273,28 @@ static unsigned int ata_scsiop_maint_in(struct ata_scsi_args *args, u8 *rbuf)
 	case START_STOP:
 		supported = 3;
 		break;
+	case READ_16:
+		supported = 3;
+		if (dev->flags & ATA_DFLAG_CDL) {
+			/*
+			 * CDL read descriptors map to the T2A page, that is,
+			 * rwcdlp = 0x01 and cdlp = 0x01
+			 */
+			rwcdlp = 0x01;
+			cdlp = 0x01 << 3;
+		}
+		break;
+	case WRITE_16:
+		supported = 3;
+		if (dev->flags & ATA_DFLAG_CDL) {
+			/*
+			 * CDL write descriptors map to the T2B page, that is,
+			 * rwcdlp = 0x01 and cdlp = 0x02
+			 */
+			rwcdlp = 0x01;
+			cdlp = 0x02 << 3;
+		}
+		break;
 	case ZBC_IN:
 	case ZBC_OUT:
 		if (ata_id_zoned_cap(dev->id) ||
@@ -3290,7 +3310,9 @@ static unsigned int ata_scsiop_maint_in(struct ata_scsi_args *args, u8 *rbuf)
 		break;
 	}
 out:
-	rbuf[1] = supported; /* supported */
+	/* One command format */
+	rbuf[0] = rwcdlp;
+	rbuf[1] = cdlp | supported;
 	return err;
 }
 
-- 
2.39.1



* [PATCH v3 14/18] ata: libata-scsi: add support for CDL pages mode sense
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (12 preceding siblings ...)
  2023-01-24 19:02 ` [PATCH v3 13/18] ata: libata-scsi: handle CDL bits in ata_scsiop_maint_in() Niklas Cassel
@ 2023-01-24 19:03 ` Niklas Cassel
  2023-01-27 15:38   ` Hannes Reinecke
  2023-01-24 19:03 ` [PATCH v3 15/18] ata: libata: add ATA feature control sub-page translation Niklas Cassel
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:03 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block, Niklas Cassel

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Modify ata_scsiop_mode_sense() and ata_msense_control() to support mode
sense access to the T2A and T2B sub-pages of the control mode page.
ata_msense_control() is modified to support sub-pages. The T2A sub-page
is generated using the read descriptors of the command duration limits
log page 18h. The T2B sub-page is generated using the write descriptors
of the same log page. With the addition of these sub-pages, getting all
sub-pages of the control mode page is also supported by increasing the
value of ATA_SCSI_RBUF_SIZE from 576B up to 2048B to ensure that all
sub-pages fit in the fill buffer.
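
The limit translation used when filling the T2A/T2B descriptors can be
sketched outside of the kernel as follows. This is a minimal sketch, not
the patch code: load_le32() and cdl_limit_us_to_10ms() are hypothetical
names standing in for get_unaligned_le32() and ata_xlat_cdl_limit(). It
shows the division by 10000 (microseconds to 10ms units) and the
saturation at the 16-bit maximum.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's get_unaligned_le32() */
static uint32_t load_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Convert a 4-byte little-endian ATA limit in microseconds to a SCSI
 * limit in 10ms units, saturating at 65535 since the SCSI field is
 * only 2 bytes wide.
 */
static uint16_t cdl_limit_us_to_10ms(const uint8_t *field)
{
	uint32_t units;

	assert(field);
	units = load_le32(field) / 10000;

	return units > UINT16_MAX ? UINT16_MAX : (uint16_t)units;
}
```

Note that limits below 10ms round down to 0 with this unit, and that any
ATA limit above roughly 655 seconds is reported as 65535.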

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 drivers/ata/libata-scsi.c | 150 ++++++++++++++++++++++++++++++++------
 1 file changed, 128 insertions(+), 22 deletions(-)

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 2a0a04c9e658..9315a4c01276 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -37,7 +37,7 @@
 #include "libata.h"
 #include "libata-transport.h"
 
-#define ATA_SCSI_RBUF_SIZE	576
+#define ATA_SCSI_RBUF_SIZE	2048
 
 static DEFINE_SPINLOCK(ata_scsi_rbuf_lock);
 static u8 ata_scsi_rbuf[ATA_SCSI_RBUF_SIZE];
@@ -55,6 +55,9 @@ static struct ata_device *__ata_scsi_find_dev(struct ata_port *ap,
 #define CONTROL_MPAGE_LEN		12
 #define ALL_MPAGES			0x3f
 #define ALL_SUB_MPAGES			0xff
+#define CDL_T2A_SUB_MPAGE		0x07
+#define CDL_T2B_SUB_MPAGE		0x08
+#define CDL_T2_SUB_MPAGE_LEN		232
 
 static const u8 def_rw_recovery_mpage[RW_RECOVERY_MPAGE_LEN] = {
 	RW_RECOVERY_MPAGE,
@@ -2196,10 +2199,98 @@ static unsigned int ata_msense_caching(u16 *id, u8 *buf, bool changeable)
 	return sizeof(def_cache_mpage);
 }
 
+/*
+ * Simulate MODE SENSE control mode page, sub-page 0.
+ */
+static unsigned int ata_msense_control_spg0(struct ata_device *dev, u8 *buf,
+					    bool changeable)
+{
+	modecpy(buf, def_control_mpage,
+		sizeof(def_control_mpage), changeable);
+	if (changeable) {
+		/* ata_mselect_control() */
+		buf[2] |= (1 << 2);
+	} else {
+		bool d_sense = (dev->flags & ATA_DFLAG_D_SENSE);
+
+		/* descriptor format sense data */
+		buf[2] |= (d_sense << 2);
+	}
+
+	return sizeof(def_control_mpage);
+}
+
+/*
+ * Translate an ATA duration limit in microseconds to a SCSI duration limit
+ * using the t2cdlunits 0xa (10ms). Since the SCSI duration limits are 2-bytes
+ * only, take care of overflows.
+ */
+static inline u16 ata_xlat_cdl_limit(u8 *buf)
+{
+	u32 limit = get_unaligned_le32(buf);
+
+	return min_t(u32, limit / 10000, 65535);
+}
+
+/*
+ * Simulate MODE SENSE control mode page, sub-pages 07h and 08h
+ * (command duration limits T2A and T2B mode pages).
+ */
+static unsigned int ata_msense_control_spgt2(struct ata_device *dev, u8 *buf,
+					     u8 spg)
+{
+	u8 *b, *cdl = dev->cdl, *desc;
+	u32 policy;
+	int i;
+
+	/*
+	 * Fill the subpage. The first four bytes of the T2A/T2B mode pages
+	 * are a header. The PAGE LENGTH field is the size of the page
+	 * excluding the header.
+	 */
+	buf[0] = CONTROL_MPAGE;
+	buf[1] = spg;
+	put_unaligned_be16(CDL_T2_SUB_MPAGE_LEN - 4, &buf[2]);
+	if (spg == CDL_T2A_SUB_MPAGE) {
+		/*
+		 * Read descriptors map to the T2A page:
+		 * set perf_vs_duration_guideline.
+		 */
+		buf[7] = (cdl[0] & 0x03) << 4;
+		desc = cdl + 64;
+	} else {
+		/* Write descriptors map to the T2B page */
+		desc = cdl + 288;
+	}
+
+	/* Fill the T2 page descriptors */
+	b = &buf[8];
+	policy = get_unaligned_le32(&cdl[0]);
+	for (i = 0; i < 7; i++, b += 32, desc += 32) {
+		/* t2cdlunits: fixed to 10ms */
+		b[0] = 0x0a;
+
+		/* Max inactive time and its policy */
+		put_unaligned_be16(ata_xlat_cdl_limit(&desc[8]), &b[2]);
+		b[6] = ((policy >> 8) & 0x0f) << 4;
+
+		/* Max active time and its policy */
+		put_unaligned_be16(ata_xlat_cdl_limit(&desc[4]), &b[4]);
+		b[6] |= (policy >> 4) & 0x0f;
+
+		/* Command duration guideline and its policy */
+		put_unaligned_be16(ata_xlat_cdl_limit(&desc[16]), &b[10]);
+		b[14] = policy & 0x0f;
+	}
+
+	return CDL_T2_SUB_MPAGE_LEN;
+}
+
 /**
  *	ata_msense_control - Simulate MODE SENSE control mode page
  *	@dev: ATA device of interest
  *	@buf: output buffer
+ *	@spg: sub-page code
  *	@changeable: whether changeable parameters are requested
  *
  *	Generate a generic MODE SENSE control mode page.
@@ -2208,17 +2299,24 @@ static unsigned int ata_msense_caching(u16 *id, u8 *buf, bool changeable)
  *	None.
  */
 static unsigned int ata_msense_control(struct ata_device *dev, u8 *buf,
-					bool changeable)
+				       u8 spg, bool changeable)
 {
-	modecpy(buf, def_control_mpage, sizeof(def_control_mpage), changeable);
-	if (changeable) {
-		buf[2] |= (1 << 2);	/* ata_mselect_control() */
-	} else {
-		bool d_sense = (dev->flags & ATA_DFLAG_D_SENSE);
-
-		buf[2] |= (d_sense << 2);	/* descriptor format sense data */
+	unsigned int n;
+
+	switch (spg) {
+	case 0:
+		return ata_msense_control_spg0(dev, buf, changeable);
+	case CDL_T2A_SUB_MPAGE:
+	case CDL_T2B_SUB_MPAGE:
+		return ata_msense_control_spgt2(dev, buf, spg);
+	case ALL_SUB_MPAGES:
+		n = ata_msense_control_spg0(dev, buf, changeable);
+		n += ata_msense_control_spgt2(dev, buf + n, CDL_T2A_SUB_MPAGE);
+		n += ata_msense_control_spgt2(dev, buf + n, CDL_T2B_SUB_MPAGE);
+		return n;
+	default:
+		return 0;
 	}
-	return sizeof(def_control_mpage);
 }
 
 /**
@@ -2291,13 +2389,24 @@ static unsigned int ata_scsiop_mode_sense(struct ata_scsi_args *args, u8 *rbuf)
 
 	pg = scsicmd[2] & 0x3f;
 	spg = scsicmd[3];
+
 	/*
-	 * No mode subpages supported (yet) but asking for _all_
-	 * subpages may be valid
+	 * Supported subpages: all subpages and sub-pages 07h and 08h of
+	 * the control page.
 	 */
-	if (spg && (spg != ALL_SUB_MPAGES)) {
-		fp = 3;
-		goto invalid_fld;
+	if (spg) {
+		switch (spg) {
+		case ALL_SUB_MPAGES:
+			break;
+		case CDL_T2A_SUB_MPAGE:
+		case CDL_T2B_SUB_MPAGE:
+			if (dev->flags & ATA_DFLAG_CDL && pg == CONTROL_MPAGE)
+				break;
+			fallthrough;
+		default:
+			fp = 3;
+			goto invalid_fld;
+		}
 	}
 
 	switch(pg) {
@@ -2310,13 +2419,13 @@ static unsigned int ata_scsiop_mode_sense(struct ata_scsi_args *args, u8 *rbuf)
 		break;
 
 	case CONTROL_MPAGE:
-		p += ata_msense_control(args->dev, p, page_control == 1);
+		p += ata_msense_control(args->dev, p, spg, page_control == 1);
 		break;
 
 	case ALL_MPAGES:
 		p += ata_msense_rw_recovery(p, page_control == 1);
 		p += ata_msense_caching(args->id, p, page_control == 1);
-		p += ata_msense_control(args->dev, p, page_control == 1);
+		p += ata_msense_control(args->dev, p, spg, page_control == 1);
 		break;
 
 	default:		/* invalid page code */
@@ -2335,10 +2444,7 @@ static unsigned int ata_scsiop_mode_sense(struct ata_scsi_args *args, u8 *rbuf)
 			memcpy(rbuf + 4, sat_blk_desc, sizeof(sat_blk_desc));
 		}
 	} else {
-		unsigned int output_len = p - rbuf - 2;
-
-		rbuf[0] = output_len >> 8;
-		rbuf[1] = output_len;
+		put_unaligned_be16(p - rbuf - 2, &rbuf[0]);
 		rbuf[3] |= dpofua;
 		if (ebd) {
 			rbuf[7] = sizeof(sat_blk_desc);
@@ -3637,7 +3743,7 @@ static int ata_mselect_control(struct ata_queued_cmd *qc,
 	/*
 	 * Check that read-only bits are not modified.
 	 */
-	ata_msense_control(dev, mpage, false);
+	ata_msense_control_spg0(dev, mpage, false);
 	for (i = 0; i < CONTROL_MPAGE_LEN - 2; i++) {
 		if (i == 0)
 			continue;
-- 
2.39.1



* [PATCH v3 15/18] ata: libata: add ATA feature control sub-page translation
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (13 preceding siblings ...)
  2023-01-24 19:03 ` [PATCH v3 14/18] ata: libata-scsi: add support for CDL pages mode sense Niklas Cassel
@ 2023-01-24 19:03 ` Niklas Cassel
  2023-01-27 15:40   ` Hannes Reinecke
  2023-01-24 19:03 ` [PATCH v3 16/18] ata: libata: set read/write commands CDL index Niklas Cassel
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:03 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block, Niklas Cassel

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Add support for the ATA feature control sub-page of the control mode
page to enable/disable the command duration limits feature using the
cdl_ctrl field of the ATA feature control sub-page.

Both mode sense and mode select translation are supported. For mode
sense, the ATA device flag ATA_DFLAG_CDL_ENABLED is used to cache the
status of the command duration limits feature. Enabling this feature is
done using a SET FEATURES command with a cdl action set to 1 when the
page cdl_ctrl field value is 0x2 (T2A and T2B pages supported). If this
field is 0, CDL is disabled using the SET FEATURES command with a cdl
action set to 0.
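
The cdl_ctrl to SET FEATURES mapping described above can be sketched as a
small standalone function. The name cdl_ctrl_to_action() and the -1
return convention are assumptions made for this illustration; the patch
itself returns -EINVAL and sets the failed field pointer instead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: map the cdl_ctrl field (the lower 2 bits of the
 * first byte after the sub-page header) to the SET FEATURES cdl action.
 * Returns 0 (disable), 1 (enable) or -1 for unsupported values.
 */
static int cdl_ctrl_to_action(uint8_t byte0)
{
	switch (byte0 & 0x03) {
	case 0x00:
		return 0;	/* CDL disabled */
	case 0x02:
		return 1;	/* T2A and T2B pages: enable CDL */
	default:
		return -1;	/* other values are rejected */
	}
}
```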

Since the CDL and NCQ priority features of a device should not be used
simultaneously, ata_mselect_control_ata_feature() returns an error when
attempting to enable CDL with the device priority feature enabled.
Conversely, the function ata_ncq_prio_enable_store() used to enable the
use of the device NCQ priority feature through sysfs is modified to
return an error if the device CDL feature is enabled.
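
The mutual exclusion between the two features amounts to the following
check in each enable path. This is a simplified userspace sketch: the
function names and the bare flags word are hypothetical, and only the two
bit positions are taken from this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Bit positions as defined in include/linux/libata.h by this series */
#define DFLAG_NCQ_PRIO_ENABLED	(1u << 20)
#define DFLAG_CDL_ENABLED	(1u << 21)

/* Hypothetical: enabling CDL fails while NCQ priority is enabled */
static int try_enable_cdl(uint32_t *flags)
{
	assert(flags);
	if (*flags & DFLAG_NCQ_PRIO_ENABLED)
		return -EINVAL;
	*flags |= DFLAG_CDL_ENABLED;
	return 0;
}

/* Hypothetical: enabling NCQ priority fails while CDL is enabled */
static int try_enable_ncq_prio(uint32_t *flags)
{
	assert(flags);
	if (*flags & DFLAG_CDL_ENABLED)
		return -EINVAL;
	*flags |= DFLAG_NCQ_PRIO_ENABLED;
	return 0;
}
```

Whichever feature is enabled first wins; the other one must be disabled
explicitly before switching.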

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 drivers/ata/libata-core.c |  40 ++++++++-
 drivers/ata/libata-sata.c |  11 ++-
 drivers/ata/libata-scsi.c | 167 ++++++++++++++++++++++++++++++++------
 include/linux/ata.h       |   3 +
 include/linux/libata.h    |   1 +
 5 files changed, 193 insertions(+), 29 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 17e32b0a6364..9aa49eab2b95 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -2371,13 +2371,15 @@ static void ata_dev_config_cdl(struct ata_device *dev)
 {
 	struct ata_port *ap = dev->link->ap;
 	unsigned int err_mask;
+	bool cdl_enabled;
 	u64 val;
 
 	if (ata_id_major_version(dev->id) < 12)
 		goto not_supported;
 
 	if (!ata_log_supported(dev, ATA_LOG_IDENTIFY_DEVICE) ||
-	    !ata_identify_page_supported(dev, ATA_LOG_SUPPORTED_CAPABILITIES))
+	    !ata_identify_page_supported(dev, ATA_LOG_SUPPORTED_CAPABILITIES) ||
+	    !ata_identify_page_supported(dev, ATA_LOG_CURRENT_SETTINGS))
 		goto not_supported;
 
 	err_mask = ata_read_log_page(dev, ATA_LOG_IDENTIFY_DEVICE,
@@ -2396,6 +2398,40 @@ static void ata_dev_config_cdl(struct ata_device *dev)
 		ata_dev_warn(dev,
 			"Command duration guideline is not supported\n");
 
+	/*
+	 * If CDL is marked as enabled, make sure the feature is enabled too.
+	 * Conversely, if CDL is disabled, make sure the feature is turned off.
+	 */
+	err_mask = ata_read_log_page(dev, ATA_LOG_IDENTIFY_DEVICE,
+				     ATA_LOG_CURRENT_SETTINGS,
+				     ap->sector_buf, 1);
+	if (err_mask)
+		goto not_supported;
+
+	val = get_unaligned_le64(&ap->sector_buf[8]);
+	cdl_enabled = val & BIT_ULL(63) && val & BIT_ULL(21);
+	if (dev->flags & ATA_DFLAG_CDL_ENABLED) {
+		if (!cdl_enabled) {
+			/* Enable CDL on the device */
+			err_mask = ata_dev_set_feature(dev, SETFEATURES_CDL, 1);
+			if (err_mask) {
+				ata_dev_err(dev,
+					    "Enable CDL feature failed\n");
+				goto not_supported;
+			}
+		}
+	} else {
+		if (cdl_enabled) {
+			/* Disable CDL on the device */
+			err_mask = ata_dev_set_feature(dev, SETFEATURES_CDL, 0);
+			if (err_mask) {
+				ata_dev_err(dev,
+					    "Disable CDL feature failed\n");
+				goto not_supported;
+			}
+		}
+	}
+
 	/*
 	 * Command duration limits is supported: cache the CDL log page 18h
 	 * (command duration descriptors).
@@ -2412,7 +2448,7 @@ static void ata_dev_config_cdl(struct ata_device *dev)
 	return;
 
 not_supported:
-	dev->flags &= ~ATA_DFLAG_CDL;
+	dev->flags &= ~(ATA_DFLAG_CDL | ATA_DFLAG_CDL_ENABLED);
 }
 
 static int ata_dev_config_lba(struct ata_device *dev)
diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
index f3e7396e3191..57cb33060c9d 100644
--- a/drivers/ata/libata-sata.c
+++ b/drivers/ata/libata-sata.c
@@ -907,10 +907,17 @@ static ssize_t ata_ncq_prio_enable_store(struct device *device,
 		goto unlock;
 	}
 
-	if (input)
+	if (input) {
+		if (dev->flags & ATA_DFLAG_CDL_ENABLED) {
+			ata_dev_err(dev,
+				"CDL must be disabled to enable NCQ priority\n");
+			rc = -EINVAL;
+			goto unlock;
+		}
 		dev->flags |= ATA_DFLAG_NCQ_PRIO_ENABLED;
-	else
+	} else {
 		dev->flags &= ~ATA_DFLAG_NCQ_PRIO_ENABLED;
+	}
 
 unlock:
 	spin_unlock_irq(ap->lock);
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 9315a4c01276..8dde1cede5ca 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -58,6 +58,8 @@ static struct ata_device *__ata_scsi_find_dev(struct ata_port *ap,
 #define CDL_T2A_SUB_MPAGE		0x07
 #define CDL_T2B_SUB_MPAGE		0x08
 #define CDL_T2_SUB_MPAGE_LEN		232
+#define ATA_FEATURE_SUB_MPAGE		0xf2
+#define ATA_FEATURE_SUB_MPAGE_LEN	16
 
 static const u8 def_rw_recovery_mpage[RW_RECOVERY_MPAGE_LEN] = {
 	RW_RECOVERY_MPAGE,
@@ -2286,6 +2288,31 @@ static unsigned int ata_msense_control_spgt2(struct ata_device *dev, u8 *buf,
 	return CDL_T2_SUB_MPAGE_LEN;
 }
 
+/*
+ * Simulate MODE SENSE control mode page, sub-page f2h
+ * (ATA feature control mode page).
+ */
+static unsigned int ata_msense_control_ata_feature(struct ata_device *dev,
+						   u8 *buf)
+{
+	/* PS=0, SPF=1 */
+	buf[0] = CONTROL_MPAGE | (1 << 6);
+	buf[1] = ATA_FEATURE_SUB_MPAGE;
+
+	/*
+	 * The first four bytes of ATA Feature Control mode page are a header.
+	 * The PAGE LENGTH field is the size of the page excluding the header.
+	 */
+	put_unaligned_be16(ATA_FEATURE_SUB_MPAGE_LEN - 4, &buf[2]);
+
+	if (dev->flags & ATA_DFLAG_CDL)
+		buf[4] = 0x02; /* Support T2A and T2B pages */
+	else
+		buf[4] = 0;
+
+	return ATA_FEATURE_SUB_MPAGE_LEN;
+}
+
 /**
  *	ata_msense_control - Simulate MODE SENSE control mode page
  *	@dev: ATA device of interest
@@ -2309,10 +2336,13 @@ static unsigned int ata_msense_control(struct ata_device *dev, u8 *buf,
 	case CDL_T2A_SUB_MPAGE:
 	case CDL_T2B_SUB_MPAGE:
 		return ata_msense_control_spgt2(dev, buf, spg);
+	case ATA_FEATURE_SUB_MPAGE:
+		return ata_msense_control_ata_feature(dev, buf);
 	case ALL_SUB_MPAGES:
 		n = ata_msense_control_spg0(dev, buf, changeable);
 		n += ata_msense_control_spgt2(dev, buf + n, CDL_T2A_SUB_MPAGE);
 		n += ata_msense_control_spgt2(dev, buf + n, CDL_T2B_SUB_MPAGE);
+		n += ata_msense_control_ata_feature(dev, buf + n);
 		return n;
 	default:
 		return 0;
@@ -2391,7 +2421,7 @@ static unsigned int ata_scsiop_mode_sense(struct ata_scsi_args *args, u8 *rbuf)
 	spg = scsicmd[3];
 
 	/*
-	 * Supported subpages: all subpages and sub-pages 07h and 08h of
+	 * Supported subpages: all subpages and sub-pages 07h, 08h and f2h of
 	 * the control page.
 	 */
 	if (spg) {
@@ -2400,6 +2430,7 @@ static unsigned int ata_scsiop_mode_sense(struct ata_scsi_args *args, u8 *rbuf)
 			break;
 		case CDL_T2A_SUB_MPAGE:
 		case CDL_T2B_SUB_MPAGE:
+		case ATA_FEATURE_SUB_MPAGE:
 			if (dev->flags & ATA_DFLAG_CDL && pg == CONTROL_MPAGE)
 				break;
 			fallthrough;
@@ -3708,20 +3739,11 @@ static int ata_mselect_caching(struct ata_queued_cmd *qc,
 	return 0;
 }
 
-/**
- *	ata_mselect_control - Simulate MODE SELECT for control page
- *	@qc: Storage for translated ATA taskfile
- *	@buf: input buffer
- *	@len: number of valid bytes in the input buffer
- *	@fp: out parameter for the failed field on error
- *
- *	Prepare a taskfile to modify caching information for the device.
- *
- *	LOCKING:
- *	None.
+/*
+ * Simulate MODE SELECT control mode page, sub-page 0.
  */
-static int ata_mselect_control(struct ata_queued_cmd *qc,
-			       const u8 *buf, int len, u16 *fp)
+static int ata_mselect_control_spg0(struct ata_queued_cmd *qc,
+				    const u8 *buf, int len, u16 *fp)
 {
 	struct ata_device *dev = qc->dev;
 	u8 mpage[CONTROL_MPAGE_LEN];
@@ -3759,6 +3781,83 @@ static int ata_mselect_control(struct ata_queued_cmd *qc,
 	return 0;
 }
 
+/*
+ * Translate MODE SELECT control mode page, sub-pages f2h (ATA feature mode
+ * page) into a SET FEATURES command.
+ */
+static unsigned int ata_mselect_control_ata_feature(struct ata_queued_cmd *qc,
+						    const u8 *buf, int len,
+						    u16 *fp)
+{
+	struct ata_device *dev = qc->dev;
+	struct ata_taskfile *tf = &qc->tf;
+	u8 cdl_action;
+
+	/*
+	 * The first four bytes of ATA Feature Control mode page are a header,
+	 * so offsets in mpage are off by 4 compared to buf.  Same for len.
+	 */
+	if (len != ATA_FEATURE_SUB_MPAGE_LEN - 4) {
+		*fp = min(len, ATA_FEATURE_SUB_MPAGE_LEN - 4);
+		return -EINVAL;
+	}
+
+	/* Check cdl_ctrl */
+	switch (buf[0] & 0x03) {
+	case 0:
+		/* Disable CDL */
+		cdl_action = 0;
+		dev->flags &= ~ATA_DFLAG_CDL_ENABLED;
+		break;
+	case 0x02:
+		/* Enable CDL T2A/T2B: NCQ priority must be disabled */
+		if (dev->flags & ATA_DFLAG_NCQ_PRIO_ENABLED) {
+			ata_dev_err(dev,
+				"NCQ priority must be disabled to enable CDL\n");
+			return -EINVAL;
+		}
+		cdl_action = 1;
+		dev->flags |= ATA_DFLAG_CDL_ENABLED;
+		break;
+	default:
+		*fp = 0;
+		return -EINVAL;
+	}
+
+	tf->flags |= ATA_TFLAG_DEVICE | ATA_TFLAG_ISADDR;
+	tf->protocol = ATA_PROT_NODATA;
+	tf->command = ATA_CMD_SET_FEATURES;
+	tf->feature = SETFEATURES_CDL;
+	tf->nsect = cdl_action;
+
+	return 1;
+}
+
+/**
+ *	ata_mselect_control - Simulate MODE SELECT for control page
+ *	@qc: Storage for translated ATA taskfile
+ *	@buf: input buffer
+ *	@len: number of valid bytes in the input buffer
+ *	@fp: out parameter for the failed field on error
+ *
+ *	Prepare a taskfile to modify caching information for the device.
+ *
+ *	LOCKING:
+ *	None.
+ */
+static int ata_mselect_control(struct ata_queued_cmd *qc, u8 spg,
+			       const u8 *buf, int len, u16 *fp)
+{
+	switch (spg) {
+	case 0:
+		return ata_mselect_control_spg0(qc, buf, len, fp);
+	case ATA_FEATURE_SUB_MPAGE:
+		return ata_mselect_control_ata_feature(qc, buf, len, fp);
+	default:
+		return -EINVAL;
+	}
+}
+
 /**
  *	ata_scsi_mode_select_xlat - Simulate MODE SELECT 6, 10 commands
  *	@qc: Storage for translated ATA taskfile
@@ -3776,7 +3875,7 @@ static unsigned int ata_scsi_mode_select_xlat(struct ata_queued_cmd *qc)
 	const u8 *cdb = scmd->cmnd;
 	u8 pg, spg;
 	unsigned six_byte, pg_len, hdr_len, bd_len;
-	int len;
+	int len, ret;
 	u16 fp = (u16)-1;
 	u8 bp = 0xff;
 	u8 buffer[64];
@@ -3861,13 +3960,29 @@ static unsigned int ata_scsi_mode_select_xlat(struct ata_queued_cmd *qc)
 	}
 
 	/*
-	 * No mode subpages supported (yet) but asking for _all_
-	 * subpages may be valid
+	 * Supported subpages: all subpages and ATA feature sub-page f2h of
+	 * the control page.
 	 */
-	if (spg && (spg != ALL_SUB_MPAGES)) {
-		fp = (p[0] & 0x40) ? 1 : 0;
-		fp += hdr_len + bd_len;
-		goto invalid_param;
+	if (spg) {
+		switch (spg) {
+		case ALL_SUB_MPAGES:
+			/* All subpages is not supported for the control page */
+			if (pg == CONTROL_MPAGE) {
+				fp = (p[0] & 0x40) ? 1 : 0;
+				fp += hdr_len + bd_len;
+				goto invalid_param;
+			}
+			break;
+		case ATA_FEATURE_SUB_MPAGE:
+			if (qc->dev->flags & ATA_DFLAG_CDL &&
+			    pg == CONTROL_MPAGE)
+				break;
+			fallthrough;
+		default:
+			fp = (p[0] & 0x40) ? 1 : 0;
+			fp += hdr_len + bd_len;
+			goto invalid_param;
+		}
 	}
 	if (pg_len > len)
 		goto invalid_param_len;
@@ -3880,14 +3995,16 @@ static unsigned int ata_scsi_mode_select_xlat(struct ata_queued_cmd *qc)
 		}
 		break;
 	case CONTROL_MPAGE:
-		if (ata_mselect_control(qc, p, pg_len, &fp) < 0) {
+		ret = ata_mselect_control(qc, spg, p, pg_len, &fp);
+		if (ret < 0) {
 			fp += hdr_len + bd_len;
 			goto invalid_param;
-		} else {
-			goto skip; /* No ATA command to send */
 		}
+		if (!ret)
+			goto skip; /* No ATA command to send */
 		break;
-	default:		/* invalid page code */
+	default:
+		/* Invalid page code */
 		fp = bd_len + hdr_len;
 		goto invalid_param;
 	}
diff --git a/include/linux/ata.h b/include/linux/ata.h
index b01e2cebe1fe..a59b17d6ad11 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -330,6 +330,7 @@ enum {
 
 	/* Identify device log pages: */
 	ATA_LOG_SUPPORTED_CAPABILITIES	= 0x03,
+	ATA_LOG_CURRENT_SETTINGS  = 0x04,
 	ATA_LOG_SECURITY	  = 0x06,
 	ATA_LOG_SATA_SETTINGS	  = 0x08,
 	ATA_LOG_ZONED_INFORMATION = 0x09,
@@ -419,6 +420,8 @@ enum {
 	SETFEATURES_SATA_ENABLE = 0x10, /* Enable use of SATA feature */
 	SETFEATURES_SATA_DISABLE = 0x90, /* Disable use of SATA feature */
 
+	SETFEATURES_CDL		= 0x0d, /* Enable/disable cmd duration limits */
+
 	/* SETFEATURE Sector counts for SATA features */
 	SATA_FPDMA_OFFSET	= 0x01,	/* FPDMA non-zero buffer offsets */
 	SATA_FPDMA_AA		= 0x02, /* FPDMA Setup FIS Auto-Activate */
diff --git a/include/linux/libata.h b/include/linux/libata.h
index 2b17d6c99a37..d7fe735e6322 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -106,6 +106,7 @@ enum {
 	ATA_DFLAG_INIT_MASK	= (1 << 20) - 1,
 
 	ATA_DFLAG_NCQ_PRIO_ENABLED = (1 << 20), /* Priority cmds sent to dev */
+	ATA_DFLAG_CDL_ENABLED	= (1 << 21), /* cmd duration limits is enabled */
 	ATA_DFLAG_DETACH	= (1 << 24),
 	ATA_DFLAG_DETACHED	= (1 << 25),
 	ATA_DFLAG_DA		= (1 << 26), /* device supports Device Attention */
-- 
2.39.1



* [PATCH v3 16/18] ata: libata: set read/write commands CDL index
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (14 preceding siblings ...)
  2023-01-24 19:03 ` [PATCH v3 15/18] ata: libata: add ATA feature control sub-page translation Niklas Cassel
@ 2023-01-24 19:03 ` Niklas Cassel
  2023-01-27 15:41   ` Hannes Reinecke
  2023-01-24 19:03 ` [PATCH v3 17/18] ata: libata: handle completion of CDL commands using policy 0xD Niklas Cassel
  2023-01-24 19:03 ` [PATCH v3 18/18] Documentation: sysfs-block-device: document command duration limits Niklas Cassel
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:03 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block, Niklas Cassel

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

For devices supporting the command duration limits feature, when a read
or write operation has the IOPRIO_CLASS_DL priority class and the
command duration limits feature is enabled, set the command duration
limit index field of the command to the priority level.

For unqueued read and write operations, the command duration limit index
is set as the lower 3 bits of the feature field. For queued NCQ
read/write commands, the index is set as the lower 3 bits of the
auxiliary field.
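
The index placement can be sketched in standalone C as follows. All
SKETCH_* names and struct sketch_tf are hypothetical. The priority
encoding (class above bit 13, level below) mirrors the kernel ioprio
macros, and the numeric value used for the duration-limits class is an
assumption, since that class is introduced by patch 01 of this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Priority encoding: class in the upper bits, level in the lower 13 */
#define SKETCH_CLASS_SHIFT	13
#define SKETCH_PRIO_CLASS(p)	((p) >> SKETCH_CLASS_SHIFT)
#define SKETCH_PRIO_DATA(p)	((p) & ((1 << SKETCH_CLASS_SHIFT) - 1))
#define SKETCH_CLASS_RT		1
#define SKETCH_CLASS_DL		4	/* assumed value of the new class */

/* Reduced taskfile: only the fields touched by the CDL index */
struct sketch_tf {
	bool ncq;		/* true for queued (NCQ) commands */
	uint8_t feature;
	uint32_t auxiliary;
};

/*
 * Set the CDL index from the I/O priority. Returns true if an index
 * was set, which corresponds to marking the command as having a CDL.
 */
static bool sketch_set_cdl_index(struct sketch_tf *tf, int ioprio)
{
	int cdl;

	assert(tf);
	if (SKETCH_PRIO_CLASS(ioprio) != SKETCH_CLASS_DL)
		return false;

	cdl = SKETCH_PRIO_DATA(ioprio) & 0x07;
	if (!cdl)
		return false;	/* index 0 means no descriptor */

	if (tf->ncq)
		tf->auxiliary |= (uint32_t)cdl;
	else
		tf->feature |= (uint8_t)cdl;

	return true;
}
```

A priority level of 0 leaves the command untouched, matching the rule
that DLD index 0 means no limit.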

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 drivers/ata/libata-core.c | 43 ++++++++++++++++++++++++++++++++++-----
 drivers/ata/libata-scsi.c |  3 +--
 drivers/ata/libata.h      |  2 +-
 include/linux/libata.h    |  1 +
 4 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 9aa49eab2b95..2c1531ef169d 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -665,13 +665,37 @@ u64 ata_tf_read_block(const struct ata_taskfile *tf, struct ata_device *dev)
 	return block;
 }
 
+/*
+ * Set a taskfile CDL index.
+ */
+static inline void ata_set_tf_cdl(struct ata_queued_cmd *qc, int ioprio)
+{
+	struct ata_taskfile *tf = &qc->tf;
+	int cdl;
+
+	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_DL)
+		return;
+
+	cdl = IOPRIO_PRIO_DATA(ioprio) & 0x07;
+	if (!cdl)
+		return;
+
+	if (tf->protocol == ATA_PROT_NCQ)
+		tf->auxiliary |= cdl;
+	else
+		tf->feature |= cdl;
+
+	/* Mark this command as having a CDL */
+	qc->flags |= ATA_QCFLAG_HAS_CDL;
+}
+
 /**
  *	ata_build_rw_tf - Build ATA taskfile for given read/write request
  *	@qc: Metadata associated with the taskfile to build
  *	@block: Block address
  *	@n_block: Number of blocks
  *	@tf_flags: RW/FUA etc...
- *	@class: IO priority class
+ *	@ioprio: IO priority class and level
  *
  *	LOCKING:
  *	None.
@@ -685,7 +709,7 @@ u64 ata_tf_read_block(const struct ata_taskfile *tf, struct ata_device *dev)
  *	-EINVAL if the request is invalid.
  */
 int ata_build_rw_tf(struct ata_queued_cmd *qc, u64 block, u32 n_block,
-		    unsigned int tf_flags, int class)
+		    unsigned int tf_flags, int ioprio)
 {
 	struct ata_taskfile *tf = &qc->tf;
 	struct ata_device *dev = qc->dev;
@@ -722,13 +746,22 @@ int ata_build_rw_tf(struct ata_queued_cmd *qc, u64 block, u32 n_block,
 			tf->device |= 1 << 7;
 
 		if (dev->flags & ATA_DFLAG_NCQ_PRIO_ENABLED &&
-		    class == IOPRIO_CLASS_RT)
+		    IOPRIO_PRIO_CLASS(ioprio) == IOPRIO_CLASS_RT)
 			tf->hob_nsect |= ATA_PRIO_HIGH << ATA_SHIFT_PRIO;
+
+		if (dev->flags & ATA_DFLAG_CDL_ENABLED)
+			ata_set_tf_cdl(qc, ioprio);
+
 	} else if (dev->flags & ATA_DFLAG_LBA) {
 		tf->flags |= ATA_TFLAG_LBA;
 
-		/* We need LBA48 for FUA writes */
-		if (!(tf->flags & ATA_TFLAG_FUA) && lba_28_ok(block, n_block)) {
+		if (dev->flags & ATA_DFLAG_CDL_ENABLED)
+			ata_set_tf_cdl(qc, ioprio);
+
+		/* Both FUA writes and a CDL index require 48-bit commands */
+		if (!(tf->flags & ATA_TFLAG_FUA) &&
+		    !(qc->flags & ATA_QCFLAG_HAS_CDL) &&
+		    lba_28_ok(block, n_block)) {
 			/* use LBA28 */
 			tf->device |= (block >> 24) & 0xf;
 		} else if (lba_48_ok(block, n_block)) {
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 8dde1cede5ca..ce5c6a49a098 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1546,7 +1546,6 @@ static unsigned int ata_scsi_rw_xlat(struct ata_queued_cmd *qc)
 	struct scsi_cmnd *scmd = qc->scsicmd;
 	const u8 *cdb = scmd->cmnd;
 	struct request *rq = scsi_cmd_to_rq(scmd);
-	int class = IOPRIO_PRIO_CLASS(req_get_ioprio(rq));
 	unsigned int tf_flags = 0;
 	u64 block;
 	u32 n_block;
@@ -1622,7 +1621,7 @@ static unsigned int ata_scsi_rw_xlat(struct ata_queued_cmd *qc)
 	qc->flags |= ATA_QCFLAG_IO;
 	qc->nbytes = n_block * scmd->device->sector_size;
 
-	rc = ata_build_rw_tf(qc, block, n_block, tf_flags, class);
+	rc = ata_build_rw_tf(qc, block, n_block, tf_flags, req_get_ioprio(rq));
 	if (likely(rc == 0))
 		return 0;
 
diff --git a/drivers/ata/libata.h b/drivers/ata/libata.h
index 2cd6124a01e8..26aa777a2ad0 100644
--- a/drivers/ata/libata.h
+++ b/drivers/ata/libata.h
@@ -45,7 +45,7 @@ static inline void ata_force_cbl(struct ata_port *ap) { }
 extern u64 ata_tf_to_lba(const struct ata_taskfile *tf);
 extern u64 ata_tf_to_lba48(const struct ata_taskfile *tf);
 extern int ata_build_rw_tf(struct ata_queued_cmd *qc, u64 block, u32 n_block,
-			   unsigned int tf_flags, int class);
+			   unsigned int tf_flags, int ioprio);
 extern u64 ata_tf_read_block(const struct ata_taskfile *tf,
 			     struct ata_device *dev);
 extern unsigned ata_exec_internal(struct ata_device *dev,
diff --git a/include/linux/libata.h b/include/linux/libata.h
index d7fe735e6322..ab8b62036c12 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -209,6 +209,7 @@ enum {
 	ATA_QCFLAG_CLEAR_EXCL	= (1 << 5), /* clear excl_link on completion */
 	ATA_QCFLAG_QUIET	= (1 << 6), /* don't report device error */
 	ATA_QCFLAG_RETRY	= (1 << 7), /* retry after failure */
+	ATA_QCFLAG_HAS_CDL	= (1 << 8), /* qc has a CDL descriptor set */
 
 	ATA_QCFLAG_EH		= (1 << 16), /* cmd aborted and owned by EH */
 	ATA_QCFLAG_SENSE_VALID	= (1 << 17), /* sense data valid */
-- 
2.39.1
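
The switch from passing "class" to passing the whole ioprio value matters because the class and the level live in the same integer. A minimal userspace sketch of that encoding follows. The shift and mask mirror the uapi ioprio layout; the numeric value of IOPRIO_CLASS_DL and the helpers ioprio_value() and cdl_index_from_ioprio() are assumptions made for illustration, not code from this patch.

```c
#include <assert.h>

/* Mirror of the uapi ioprio layout: class in the top bits, data below. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_MASK	((1U << IOPRIO_CLASS_SHIFT) - 1)
#define IOPRIO_PRIO_CLASS(p)	((p) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(p)	((p) & IOPRIO_PRIO_MASK)

enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
	IOPRIO_CLASS_DL,	/* assumed value for the class added by this series */
};

/* Hypothetical helper: pack a class and its data into one ioprio value. */
static unsigned int ioprio_value(unsigned int class, unsigned int data)
{
	assert(data <= IOPRIO_PRIO_MASK);	/* data must fit below the class bits */
	return (class << IOPRIO_CLASS_SHIFT) | data;
}

/*
 * Hypothetical helper: return the DLD index (1-7) carried by an ioprio
 * value, or 0 (no descriptor, no limit) if the class is not the
 * duration-limits class.
 */
static unsigned int cdl_index_from_ioprio(unsigned int ioprio)
{
	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_DL)
		return 0;
	return IOPRIO_PRIO_DATA(ioprio) & 0x7;
}
```

With only the class available, a callee could tell that a command wants a duration limit but not which descriptor to use; with the full value it can recover both.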


* [PATCH v3 17/18] ata: libata: handle completion of CDL commands using policy 0xD
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (15 preceding siblings ...)
  2023-01-24 19:03 ` [PATCH v3 16/18] ata: libata: set read/write commands CDL index Niklas Cassel
@ 2023-01-24 19:03 ` Niklas Cassel
  2023-01-27 15:43   ` Hannes Reinecke
  2023-01-24 19:03 ` [PATCH v3 18/18] Documentation: sysfs-block-device: document command duration limits Niklas Cassel
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:03 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block, Niklas Cassel

A CDL timeout for policy 0xF is defined as an NCQ error, just with a CDL
specific sk/asc/ascq in the sense data. Therefore, the existing code in
libata does not need to be modified to handle a policy 0xF CDL timeout.

For Command Duration Limits policy 0xD:
The device shall complete the command without error with the additional
sense code set to DATA CURRENTLY UNAVAILABLE.

Since a CDL timeout for policy 0xD is not an error, we cannot use the
NCQ Command Error log (10h).

Instead, we need to read the Sense Data for Successful NCQ Commands
log (0Fh).

In the success case, just like in the error case, we cannot simply read
a log page from the interrupt handler itself, since reading a log page
involves sending a READ LOG DMA EXT or READ LOG EXT command.

Therefore, we add a new EH action ATA_EH_GET_SUCCESS_SENSE.
When a command completes without error, and when the ATA_SENSE bit
is set, this new action is set as pending, and EH is scheduled.

This way, similar to the NCQ error case, the log page will be read
from EH context.

An alternative would have been to add a new kthread or workqueue to
handle this. However, extending EH can be done with minimal changes
and avoids the need to synchronize a new kthread/workqueue with EH.

Co-developed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 drivers/ata/libata-core.c |  88 +++++++++++++++++++++++++++++++-
 drivers/ata/libata-eh.c   | 105 +++++++++++++++++++++++++++++++++++++-
 drivers/ata/libata-sata.c |  92 +++++++++++++++++++++++++++++++++
 include/linux/ata.h       |   3 ++
 include/linux/libata.h    |  11 +++-
 5 files changed, 295 insertions(+), 4 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 2c1531ef169d..b4761c3c4b91 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -685,8 +685,12 @@ static inline void ata_set_tf_cdl(struct ata_queued_cmd *qc, int ioprio)
 	else
 		tf->feature |= cdl;
 
-	/* Mark this command as having a CDL */
-	qc->flags |= ATA_QCFLAG_HAS_CDL;
+	/*
+	 * Mark this command as having a CDL and request the result
+	 * task file so that we can inspect the sense data available
+	 * bit on completion.
+	 */
+	qc->flags |= ATA_QCFLAG_HAS_CDL | ATA_QCFLAG_RESULT_TF;
 }
 
 /**
@@ -2431,6 +2435,24 @@ static void ata_dev_config_cdl(struct ata_device *dev)
 		ata_dev_warn(dev,
 			"Command duration guideline is not supported\n");
 
+	/*
+	 * We must have support for the sense data for successful NCQ commands
+	 * log indicated by the successful NCQ command sense data supported bit.
+	 */
+	val = get_unaligned_le64(&ap->sector_buf[8]);
+	if (!(val & BIT_ULL(63)) || !(val & BIT_ULL(47))) {
+		ata_dev_warn(dev,
+			"CDL supported but Successful NCQ Command Sense Data is not supported\n");
+		goto not_supported;
+	}
+
+	/* Without NCQ autosense, the successful NCQ commands log is useless. */
+	if (!ata_id_has_ncq_autosense(dev->id)) {
+		ata_dev_warn(dev,
+			"CDL supported but NCQ autosense is not supported\n");
+		goto not_supported;
+	}
+
 	/*
 	 * If CDL is marked as enabled, make sure the feature is enabled too.
 	 * Conversely, if CDL is disabled, make sure the feature is turned off.
@@ -2465,6 +2487,35 @@ static void ata_dev_config_cdl(struct ata_device *dev)
 		}
 	}
 
+	/*
+	 * While CDL itself has to be enabled using sysfs, CDL requires that
+	 * sense data for successful NCQ commands is enabled to work properly.
+	 * Just like ata_dev_config_sense_reporting(), enable it unconditionally
+	 * if supported.
+	 */
+	if (!(val & BIT_ULL(63)) || !(val & BIT_ULL(18))) {
+		err_mask = ata_dev_set_feature(dev,
+					SETFEATURE_SENSE_DATA_SUCC_NCQ, 0x1);
+		if (err_mask) {
+			ata_dev_warn(dev,
+				     "failed to enable Sense Data for successful NCQ commands, Emask 0x%x\n",
+				     err_mask);
+			goto not_supported;
+		}
+	}
+
+	/*
+	 * Allocate a buffer to handle reading the sense data for successful
+	 * NCQ Commands log page for commands using a CDL with one of the limit
+	 * policies set to 0xD (successful completion with sense data available
+	 * bit set).
+	 */
+	if (!ap->ncq_sense_buf) {
+		ap->ncq_sense_buf = kmalloc(ATA_LOG_SENSE_NCQ_SIZE, GFP_KERNEL);
+		if (!ap->ncq_sense_buf)
+			goto not_supported;
+	}
+
 	/*
 	 * Command duration limits is supported: cache the CDL log page 18h
 	 * (command duration descriptors).
@@ -2482,6 +2533,8 @@ static void ata_dev_config_cdl(struct ata_device *dev)
 
 not_supported:
 	dev->flags &= ~(ATA_DFLAG_CDL | ATA_DFLAG_CDL_ENABLED);
+	kfree(ap->ncq_sense_buf);
+	ap->ncq_sense_buf = NULL;
 }
 
 static int ata_dev_config_lba(struct ata_device *dev)
@@ -4882,6 +4935,36 @@ void ata_qc_complete(struct ata_queued_cmd *qc)
 			fill_result_tf(qc);
 
 		trace_ata_qc_complete_done(qc);
+
+		/*
+		 * For CDL commands that completed without an error, check if
+		 * we have sense data (ATA_SENSE is set). If we do, then the
+		 * command may have been aborted by the device due to a limit
+		 * timeout using the policy 0xD. For these commands, invoke EH
+		 * to get the command sense data.
+		 */
+		if (qc->result_tf.status & ATA_SENSE &&
+		    ((ata_is_ncq(qc->tf.protocol) &&
+		      dev->flags & ATA_DFLAG_CDL_ENABLED) ||
+		     (!ata_is_ncq(qc->tf.protocol) &&
+		      ata_id_sense_reporting_enabled(dev->id)))) {
+			/*
+			 * Tell SCSI EH to not overwrite scmd->result even if
+			 * this command is finished with result SAM_STAT_GOOD.
+			 */
+			qc->scsicmd->flags |= SCMD_FORCE_EH_SUCCESS;
+			qc->flags |= ATA_QCFLAG_EH_SUCCESS_CMD;
+			ehi->dev_action[dev->devno] |= ATA_EH_GET_SUCCESS_SENSE;
+
+			/*
+			 * set pending so that ata_qc_schedule_eh() does not
+			 * trigger fast drain, and freeze the port.
+			 */
+			ap->pflags |= ATA_PFLAG_EH_PENDING;
+			ata_qc_schedule_eh(qc);
+			return;
+		}
+
 		/* Some commands need post-processing after successful
 		 * completion.
 		 */
@@ -5514,6 +5597,7 @@ static void ata_host_release(struct kref *kref)
 
 		kfree(ap->pmp_link);
 		kfree(ap->slave_link);
+		kfree(ap->ncq_sense_buf);
 		kfree(ap);
 		host->ports[i] = NULL;
 	}
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index 598ae07195b6..05af292eb8ce 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -1917,6 +1917,99 @@ static inline bool ata_eh_quiet(struct ata_queued_cmd *qc)
 	return qc->flags & ATA_QCFLAG_QUIET;
 }
 
+static int ata_eh_read_sense_success_non_ncq(struct ata_link *link)
+{
+	struct ata_port *ap = link->ap;
+	struct ata_queued_cmd *qc;
+
+	qc = __ata_qc_from_tag(ap, link->active_tag);
+	if (!qc)
+		return -EIO;
+
+	if (!(qc->flags & ATA_QCFLAG_EH) ||
+	    !(qc->flags & ATA_QCFLAG_EH_SUCCESS_CMD) ||
+	    qc->err_mask)
+		return -EIO;
+
+	if (!ata_eh_request_sense(qc))
+		return -EIO;
+
+	/*
+	 * If we have sense data, call scsi_check_sense() in order to set the
+	 * correct SCSI ML byte (if any). No point in checking the return value,
+	 * since the command has already completed successfully.
+	 */
+	scsi_check_sense(qc->scsicmd);
+
+	return 0;
+}
+
+static void ata_eh_get_success_sense(struct ata_link *link)
+{
+	struct ata_eh_context *ehc = &link->eh_context;
+	struct ata_device *dev = link->device;
+	struct ata_port *ap = link->ap;
+	struct ata_queued_cmd *qc;
+	int tag, ret = 0;
+
+	if (!(ehc->i.dev_action[dev->devno] & ATA_EH_GET_SUCCESS_SENSE))
+		return;
+
+	/* if frozen, we can't do much */
+	if (ata_port_is_frozen(ap)) {
+		ata_dev_warn(dev,
+			"successful sense data available but port frozen\n");
+		goto out;
+	}
+
+	/*
+	 * If the link has sactive set, then we have outstanding NCQ commands
+	 * and have to read the Successful NCQ Commands log to get the sense
+	 * data. Otherwise, we are dealing with a non-NCQ command and use
+	 * request sense ext command to retrieve the sense data.
+	 */
+	if (link->sactive)
+		ret = ata_eh_read_sense_success_ncq_log(link);
+	else
+		ret = ata_eh_read_sense_success_non_ncq(link);
+	if (ret)
+		goto out;
+
+	ata_eh_done(link, dev, ATA_EH_GET_SUCCESS_SENSE);
+	return;
+
+out:
+	/*
+	 * If we failed to get sense data for a successful command that ought to
+	 * have sense data, we cannot simply return BLK_STS_OK to user space.
+	 * This is because we can't know if the sense data that we couldn't get
+	 * was actually "DATA CURRENTLY UNAVAILABLE". Reporting such a command
+	 * as success to user space would result in a silent data corruption.
+	 * Thus, add a bogus ABORTED_COMMAND sense data to such commands, such
+	 * that SCSI will report these commands as BLK_STS_IOERR to user space.
+	 */
+	ata_qc_for_each_raw(ap, qc, tag) {
+		if (!(qc->flags & ATA_QCFLAG_EH) ||
+		    !(qc->flags & ATA_QCFLAG_EH_SUCCESS_CMD) ||
+		    qc->err_mask ||
+		    ata_dev_phys_link(qc->dev) != link)
+			continue;
+
+		/* We managed to get sense for this success command, skip. */
+		if (qc->flags & ATA_QCFLAG_SENSE_VALID)
+			continue;
+
+		/* This success command did not have any sense data, skip. */
+		if (!(qc->result_tf.status & ATA_SENSE))
+			continue;
+
+		/* This success command had sense data, but we failed to get. */
+		ata_scsi_set_sense(dev, qc->scsicmd, ABORTED_COMMAND, 0, 0);
+		qc->flags |= ATA_QCFLAG_SENSE_VALID;
+	}
+	ata_eh_done(link, dev, ATA_EH_GET_SUCCESS_SENSE);
+}
+
 /**
  *	ata_eh_link_autopsy - analyze error and determine recovery action
  *	@link: host link to perform autopsy on
@@ -1957,6 +2050,14 @@ static void ata_eh_link_autopsy(struct ata_link *link)
 	/* analyze NCQ failure */
 	ata_eh_analyze_ncq_error(link);
 
+	/*
+	 * Check if this was a successful command that simply needs sense data.
+	 * Since the sense data is not part of the completion, we need to fetch
+	 * it using an additional command. Since this can't be done from irq
+	 * context, the sense data for successful commands are fetched by EH.
+	 */
+	ata_eh_get_success_sense(link);
+
 	/* any real error trumps AC_ERR_OTHER */
 	if (ehc->i.err_mask & ~AC_ERR_OTHER)
 		ehc->i.err_mask &= ~AC_ERR_OTHER;
@@ -1966,6 +2067,7 @@ static void ata_eh_link_autopsy(struct ata_link *link)
 	ata_qc_for_each_raw(ap, qc, tag) {
 		if (!(qc->flags & ATA_QCFLAG_EH) ||
 		    qc->flags & ATA_QCFLAG_RETRY ||
+		    qc->flags & ATA_QCFLAG_EH_SUCCESS_CMD ||
 		    ata_dev_phys_link(qc->dev) != link)
 			continue;
 
@@ -3825,7 +3927,8 @@ void ata_eh_finish(struct ata_port *ap)
 			else
 				ata_eh_qc_complete(qc);
 		} else {
-			if (qc->flags & ATA_QCFLAG_SENSE_VALID) {
+			if (qc->flags & ATA_QCFLAG_SENSE_VALID ||
+			    qc->flags & ATA_QCFLAG_EH_SUCCESS_CMD) {
 				ata_eh_qc_complete(qc);
 			} else {
 				/* feed zero TF to sense generation */
diff --git a/drivers/ata/libata-sata.c b/drivers/ata/libata-sata.c
index 57cb33060c9d..7de4d8901fac 100644
--- a/drivers/ata/libata-sata.c
+++ b/drivers/ata/libata-sata.c
@@ -11,7 +11,9 @@
 #include <linux/module.h>
 #include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_device.h>
+#include <scsi/scsi_eh.h>
 #include <linux/libata.h>
+#include <asm/unaligned.h>
 
 #include "libata.h"
 #include "libata-transport.h"
@@ -1408,6 +1410,95 @@ static int ata_eh_read_log_10h(struct ata_device *dev,
 	return 0;
 }
 
+/**
+ *	ata_eh_read_sense_success_ncq_log - Read the sense data for successful
+ *					    NCQ commands log
+ *	@link: ATA link to get sense data for
+ *
+ *	Read the sense data for successful NCQ commands log page to obtain
+ *	sense data for all NCQ commands that completed successfully with
+ *	the sense data available bit set.
+ *
+ *	LOCKING:
+ *	Kernel thread context (may sleep).
+ *
+ *	RETURNS:
+ *	0 on success, -errno otherwise.
+ */
+int ata_eh_read_sense_success_ncq_log(struct ata_link *link)
+{
+	struct ata_device *dev = link->device;
+	struct ata_port *ap = dev->link->ap;
+	u8 *buf = ap->ncq_sense_buf;
+	struct ata_queued_cmd *qc;
+	unsigned int err_mask, tag;
+	u8 *sense, sk = 0, asc = 0, ascq = 0;
+	u64 sense_valid, val;
+	int ret = 0;
+
+	err_mask = ata_read_log_page(dev, ATA_LOG_SENSE_NCQ, 0, buf, 2);
+	if (err_mask) {
+		ata_dev_err(dev,
+			"Failed to read Sense Data for Successful NCQ Commands log\n");
+		return -EIO;
+	}
+
+	/* Check the log header */
+	val = get_unaligned_le64(&buf[0]);
+	if ((val & 0xffff) != 1 || ((val >> 16) & 0xff) != 0x0f) {
+		ata_dev_err(dev,
+			"Invalid Sense Data for Successful NCQ Commands log\n");
+		return -EIO;
+	}
+
+	sense_valid = (u64)buf[8] | ((u64)buf[9] << 8) |
+		((u64)buf[10] << 16) | ((u64)buf[11] << 24);
+
+	ata_qc_for_each_raw(ap, qc, tag) {
+		if (!(qc->flags & ATA_QCFLAG_EH) ||
+		    !(qc->flags & ATA_QCFLAG_EH_SUCCESS_CMD) ||
+		    qc->err_mask ||
+		    ata_dev_phys_link(qc->dev) != link)
+			continue;
+
+		/*
+		 * If the command does not have any sense data, clear ATA_SENSE.
+		 * Keep ATA_QCFLAG_EH_SUCCESS_CMD so that command is finished.
+		 */
+		if (!(sense_valid & (1ULL << tag))) {
+			qc->result_tf.status &= ~ATA_SENSE;
+			continue;
+		}
+
+		sense = &buf[32 + 24 * tag];
+		sk = sense[0];
+		asc = sense[1];
+		ascq = sense[2];
+
+		if (!ata_scsi_sense_is_valid(sk, asc, ascq)) {
+			ret = -EIO;
+			continue;
+		}
+
+		/* Set sense without also setting scsicmd->result */
+		scsi_build_sense_buffer(dev->flags & ATA_DFLAG_D_SENSE,
+					qc->scsicmd->sense_buffer, sk,
+					asc, ascq);
+		qc->flags |= ATA_QCFLAG_SENSE_VALID;
+
+		/*
+		 * If we have sense data, call scsi_check_sense() in order to
+		 * set the correct SCSI ML byte (if any). No point in checking
+		 * the return value, since the command has already completed
+		 * successfully.
+		 */
+		scsi_check_sense(qc->scsicmd);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ata_eh_read_sense_success_ncq_log);
+
 /**
  *	ata_eh_analyze_ncq_error - analyze NCQ error
  *	@link: ATA link to analyze NCQ error for
@@ -1488,6 +1579,7 @@ void ata_eh_analyze_ncq_error(struct ata_link *link)
 
 	ata_qc_for_each_raw(ap, qc, tag) {
 		if (!(qc->flags & ATA_QCFLAG_EH) ||
+		    qc->flags & ATA_QCFLAG_EH_SUCCESS_CMD ||
 		    ata_dev_phys_link(qc->dev) != link)
 			continue;
 
diff --git a/include/linux/ata.h b/include/linux/ata.h
index a59b17d6ad11..2e2e22362096 100644
--- a/include/linux/ata.h
+++ b/include/linux/ata.h
@@ -326,6 +326,8 @@ enum {
 	ATA_LOG_CDL		= 0x18,
 	ATA_LOG_CDL_SIZE	= ATA_SECT_SIZE,
 	ATA_LOG_IDENTIFY_DEVICE	= 0x30,
+	ATA_LOG_SENSE_NCQ	= 0x0F,
+	ATA_LOG_SENSE_NCQ_SIZE	= ATA_SECT_SIZE * 2,
 	ATA_LOG_CONCURRENT_POSITIONING_RANGES = 0x47,
 
 	/* Identify device log pages: */
@@ -432,6 +434,7 @@ enum {
 	SATA_DEVSLP		= 0x09,	/* Device Sleep */
 
 	SETFEATURE_SENSE_DATA	= 0xC3, /* Sense Data Reporting feature */
+	SETFEATURE_SENSE_DATA_SUCC_NCQ = 0xC4, /* Sense Data for successful NCQ commands */
 
 	/* feature values for SET_MAX */
 	ATA_SET_MAX_ADDR	= 0x00,
diff --git a/include/linux/libata.h b/include/linux/libata.h
index ab8b62036c12..70ac635fe5d7 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -214,6 +214,7 @@ enum {
 	ATA_QCFLAG_EH		= (1 << 16), /* cmd aborted and owned by EH */
 	ATA_QCFLAG_SENSE_VALID	= (1 << 17), /* sense data valid */
 	ATA_QCFLAG_EH_SCHEDULED = (1 << 18), /* EH scheduled (obsolete) */
+	ATA_QCFLAG_EH_SUCCESS_CMD = (1 << 19), /* EH should fetch sense for this successful cmd */
 
 	/* host set flags */
 	ATA_HOST_SIMPLEX	= (1 << 0),	/* Host is simplex, one DMA channel per host only */
@@ -312,8 +313,10 @@ enum {
 	ATA_EH_RESET		= ATA_EH_SOFTRESET | ATA_EH_HARDRESET,
 	ATA_EH_ENABLE_LINK	= (1 << 3),
 	ATA_EH_PARK		= (1 << 5), /* unload heads and stop I/O */
+	ATA_EH_GET_SUCCESS_SENSE = (1 << 6), /* Get sense data for successful cmd */
 
-	ATA_EH_PERDEV_MASK	= ATA_EH_REVALIDATE | ATA_EH_PARK,
+	ATA_EH_PERDEV_MASK	= ATA_EH_REVALIDATE | ATA_EH_PARK |
+				  ATA_EH_GET_SUCCESS_SENSE,
 	ATA_EH_ALL_ACTIONS	= ATA_EH_REVALIDATE | ATA_EH_RESET |
 				  ATA_EH_ENABLE_LINK,
 
@@ -867,6 +870,7 @@ struct ata_port {
 	struct ata_acpi_gtm	__acpi_init_gtm; /* use ata_acpi_init_gtm() */
 #endif
 	/* owned by EH */
+	u8			*ncq_sense_buf;
 	u8			sector_buf[ATA_SECT_SIZE] ____cacheline_aligned;
 };
 
@@ -1185,6 +1189,7 @@ extern int sata_link_hardreset(struct ata_link *link,
 			bool *online, int (*check_ready)(struct ata_link *));
 extern int sata_link_resume(struct ata_link *link, const unsigned long *params,
 			    unsigned long deadline);
+extern int ata_eh_read_sense_success_ncq_log(struct ata_link *link);
 extern void ata_eh_analyze_ncq_error(struct ata_link *link);
 #else
 static inline const unsigned long *
@@ -1222,6 +1227,10 @@ static inline int sata_link_resume(struct ata_link *link,
 {
 	return -EOPNOTSUPP;
 }
+static inline int ata_eh_read_sense_success_ncq_log(struct ata_link *link)
+{
+	return -EOPNOTSUPP;
+}
 static inline void ata_eh_analyze_ncq_error(struct ata_link *link) { }
 #endif
 extern int sata_link_debounce(struct ata_link *link,
-- 
2.39.1
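
The parsing steps in ata_eh_read_sense_success_ncq_log() above can be sketched outside the kernel as follows. This is a minimal userspace illustration, assuming only the layout the patch itself relies on: a header in bytes 0-2, a 32-bit valid bitmap starting at byte 8, and 24-byte descriptors starting at byte 32 whose first three bytes are sk/asc/ascq. The names struct ncq_sense and parse_success_sense_log() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SENSE_NCQ_LOG_SIZE	1024	/* two 512-byte pages, as in the patch */
#define SENSE_NCQ_LOG_ADDR	0x0f
#define SENSE_DESC_OFFSET	32
#define SENSE_DESC_SIZE		24
#define NCQ_MAX_TAGS		32

struct ncq_sense {
	bool valid;		/* bit for this tag set in the valid bitmap */
	uint8_t sk, asc, ascq;
};

/* Returns 0 on success, -1 if the buffer is too short or the header is bad. */
static int parse_success_sense_log(const uint8_t *buf, size_t len,
				   struct ncq_sense out[NCQ_MAX_TAGS])
{
	uint32_t sense_valid;
	unsigned int tag;

	assert(buf && out);
	if (len < SENSE_NCQ_LOG_SIZE)
		return -1;

	/* Header: little-endian version 1 in bytes 0-1, log address in byte 2. */
	if ((buf[0] | (buf[1] << 8)) != 1 || buf[2] != SENSE_NCQ_LOG_ADDR)
		return -1;

	/* One bit per NCQ tag, little-endian, starting at byte 8. */
	sense_valid = (uint32_t)buf[8] | ((uint32_t)buf[9] << 8) |
		      ((uint32_t)buf[10] << 16) | ((uint32_t)buf[11] << 24);

	for (tag = 0; tag < NCQ_MAX_TAGS; tag++) {
		const uint8_t *desc = buf + SENSE_DESC_OFFSET +
				      SENSE_DESC_SIZE * tag;

		out[tag].valid = (sense_valid >> tag) & 1;
		out[tag].sk = out[tag].valid ? desc[0] : 0;
		out[tag].asc = out[tag].valid ? desc[1] : 0;
		out[tag].ascq = out[tag].valid ? desc[2] : 0;
	}
	return 0;
}
```

Unlike the kernel function, this sketch does not match tags against in-flight commands; it only decodes the log into a per-tag table.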


* [PATCH v3 18/18] Documentation: sysfs-block-device: document command duration limits
  2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
                   ` (16 preceding siblings ...)
  2023-01-24 19:03 ` [PATCH v3 17/18] ata: libata: handle completion of CDL commands using policy 0xD Niklas Cassel
@ 2023-01-24 19:03 ` Niklas Cassel
  2023-01-27 15:43   ` Hannes Reinecke
  17 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-24 19:03 UTC (permalink / raw)
  To: linux-kernel
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block, Niklas Cassel

From: Damien Le Moal <damien.lemoal@opensource.wdc.com>

Document in ABI/testing/sysfs-block-device the sysfs attributes present
under /sys/block/*/device/duration_limits for ATA and SCSI devices
supporting the command duration limits feature.

Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
---
 Documentation/ABI/testing/sysfs-block-device | 150 +++++++++++++++++++
 1 file changed, 150 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-block-device b/Documentation/ABI/testing/sysfs-block-device
index 7ac7b19b2f72..3a32c86942f5 100644
--- a/Documentation/ABI/testing/sysfs-block-device
+++ b/Documentation/ABI/testing/sysfs-block-device
@@ -95,3 +95,153 @@ Description:
 		This file does not exist if the HBA driver does not implement
 		support for the SATA NCQ priority feature, regardless of the
 		device support for this feature.
+
+
+What:		/sys/block/*/device/duration_limits/enable
+Date:		Dec, 2022
+KernelVersion:	v6.3
+Contact:	linux-scsi@vger.kernel.org
+Description:
+		(RW) For ATA and SCSI devices supporting the command duration
+		limits feature, write to the file to turn on or off the
+		feature. By default this feature is turned off. If the device
+		does not support the command duration limits feature, this
+		attribute does not exist (the directory
+		"/sys/block/\*/device/duration_limits" does not exist).
+		Writing "1" to this file enables the use of command duration
+		limits for read and write commands in the kernel and turns on
+		the feature on the device. Writing "0" disables the feature.
+
+
+What:		/sys/block/*/device/duration_limits/read/[1-7]/*
+Date:		Dec, 2022
+KernelVersion:	v6.3
+Contact:	linux-scsi@vger.kernel.org
+Description:
+		(RO) For ATA and SCSI devices supporting the command duration
+		limits feature, this shows the set of 7 command duration limits
+		descriptors for read commands currently set on the device. For
+		each of the 7 descriptors, the following read-only attributes
+		are present:
+
+		  - duration_guideline: specifies the preferred length of time
+		    in microseconds for the completion of a command.
+
+		  - duration_guideline_policy: specifies the policy action
+		    taken if the duration_guideline attribute specifies a
+		    non-zero command duration guideline that the device is
+		    unable to achieve for a command.
+
+		    Possible values are:
+
+		      - 0x0: The device will complete the command at the
+			earliest possible time consistent with the specified
+			command duration guideline.
+
+		      - 0x1: If the specified command duration guideline has not
+			been achieved and the command duration guideline policy
+			field is not in the seventh command duration limits
+			descriptor, then the device continues processing that
+			command using the command duration limits descriptor
+			that has the next higher number.
+
+		      - 0x2: The device will continue processing the command as
+			with no command duration limits descriptor being used.
+
+		      - 0xD: The device will complete the command and an IO
+			failure will be reported to the user with the ETIME
+			error code.
+
+		      - 0xF: Same as 0xD.
+
+		  - max_active_time: specifies an upper limit in microseconds
+		    on the time that elapses from the time at which the device
+		    initiates actions to access, transfer, or act upon the
+		    specified data until the time the device returns status for
+		    the command.
+
+		  - max_active_time_policy: specifies the policy action taken
+		    if the time used to process a command exceeds a non-zero
+		    time specified by the max_active_time attribute.
+
+		    Possible values are:
+
+		      - 0x0: The device will complete the command at the
+			earliest possible time (i.e., do nothing based on the max
+			time limit not being met).
+
+		      - 0xD: The device will complete the command and an IO
+			failure will be reported to the user with the ETIME
+			error code.
+
+		      - 0xE: Same as 0xD.
+
+		      - 0xF: Same as 0xD.
+
+		  - max_inactive_time: specifies an upper limit in microseconds
+		    on the time that elapses from the time at which the device
+		    receives the command until the time at which the device
+		    initiates actions to access, transfer, or act upon the
+		    specified data.
+
+		  - max_inactive_time_policy: specifies the policy action taken
+		    if a non-zero max_inactive_time limit is not met.
+
+		    Possible values are:
+
+		      - 0x0: The device will complete the command at the
+			earliest possible time (i.e., do nothing based on the max
+			time limit not being met).
+
+		      - 0xD: The device will complete the command and an IO
+			failure will be reported to the user with the ETIME
+			error code.
+
+		      - 0xF: Same as 0xD.
+
+
+What:		/sys/block/*/device/duration_limits/read/page
+Date:		Dec, 2022
+KernelVersion:	v6.3
+Contact:	linux-scsi@vger.kernel.org
+Description:
+		(RO) For ATA and SCSI devices supporting the command duration
+		limits feature, this shows the name of the device VPD page
+		specifying the set of 7 command duration limits descriptors for
+		read commands. Possible values are "T2A" and "T2B".
+
+
+What:		/sys/block/*/device/duration_limits/write/[1-7]/*
+Date:		Dec, 2022
+KernelVersion:	v6.3
+Contact:	linux-scsi@vger.kernel.org
+Description:
+		(RO) For ATA and SCSI devices supporting the command duration
+		limits feature, this shows the set of 7 command duration limits
+		descriptors for write commands currently set on the device. For
+		each of the 7 descriptors, the same set of read-only attributes
+		as for read commands is present.
+
+
+What:		/sys/block/*/device/duration_limits/write/page
+Date:		Dec, 2022
+KernelVersion:	v6.3
+Contact:	linux-scsi@vger.kernel.org
+Description:
+		(RO) For ATA and SCSI devices supporting the command duration
+		limits feature, this shows the name of the device VPD page
+		specifying the set of 7 command duration limits descriptors for
+		write commands. Possible values are "T2A" and "T2B".
+
+
+What:		/sys/block/*/device/duration_limits/perf_vs_duration_guideline
+Date:		Dec, 2022
+KernelVersion:	v6.3
+Contact:	linux-scsi@vger.kernel.org
+Description:
+		(RO) For ATA and SCSI devices supporting the command duration
+		limits feature, this specifies the maximum percentage increase
+		in average command completion times (reduction in IOPS) that
+		is allowed for the device to perform actions based on the
+		contents of the duration guideline field in every command
+		duration limit descriptor for both read and write commands.
-- 
2.39.1
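
For readers who want to consume the attributes documented above from a program, the directory layout can be sketched as a small path builder. This is only an illustration of the documented hierarchy (duration_limits/{read,write}/[1-7]/<attribute>); the helper cdl_attr_path() is hypothetical and is not part of this series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build the sysfs path of one descriptor attribute,
 * e.g. /sys/block/sda/device/duration_limits/read/1/max_active_time.
 * Returns 0 on success, -1 on invalid arguments or if buf is too small.
 */
static int cdl_attr_path(char *buf, size_t size, const char *disk,
			 const char *dir, int index, const char *attr)
{
	int n;

	assert(buf && disk && dir && attr);

	/* Descriptors are numbered 1-7; index 0 means "no limit". */
	if (index < 1 || index > 7)
		return -1;
	if (strcmp(dir, "read") != 0 && strcmp(dir, "write") != 0)
		return -1;

	n = snprintf(buf, size,
		     "/sys/block/%s/device/duration_limits/%s/%d/%s",
		     disk, dir, index, attr);
	if (n < 0 || (size_t)n >= size)
		return -1;
	return 0;
}
```

The resulting path can then be opened and read like any other read-only sysfs attribute; the enable file sits one level up, directly under duration_limits.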


* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-24 19:02 ` [PATCH v3 01/18] block: introduce duration-limits priority class Niklas Cassel
@ 2023-01-24 19:27   ` Bart Van Assche
  2023-01-24 20:36     ` Bart Van Assche
  2023-01-24 21:29     ` Damien Le Moal
  2023-01-27 12:43   ` Hannes Reinecke
  1 sibling, 2 replies; 82+ messages in thread
From: Bart Van Assche @ 2023-01-24 19:27 UTC (permalink / raw)
  To: Niklas Cassel, Paolo Valente, Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block

On 1/24/23 11:02, Niklas Cassel wrote:
> Introduce the IOPRIO_CLASS_DL priority class to indicate that IOs should
> be executed using duration-limits targets. The duration target to apply
> to a command is indicated using the priority level. Up to 8 levels are
> supported, with level 0 indicating "no limit".
> 
> This priority class has effect only if the target device supports the
> command duration limits feature and this feature is enabled by the user.
> 
> While it is recommended to not use an ioscheduler when using the
> IOPRIO_CLASS_DL priority class, if using the BFQ or mq-deadline scheduler,
> IOPRIO_CLASS_DL is mapped to IOPRIO_CLASS_RT.
> 
> The reason for this is twofold:
> 1) Each priority level for the IOPRIO_CLASS_DL priority class represents a
> duration limit descriptor (DLD) inside the device. Users can configure
> these limits themselves using passthrough commands, so from a block layer
> perspective, Linux has no idea of how each DLD is actually configured.
> 
> By mapping a command to IOPRIO_CLASS_RT, the chance that a command exceeds
> its duration limit (because it was held too long in the scheduler) is
> decreased. It is still possible to use the IOPRIO_CLASS_DL priority class
> for "low priority" IOs by configuring a large limit in the respective DLD.
> 
> 2) On ATA drives, IOPRIO_CLASS_DL commands and NCQ priority commands
> (IOPRIO_CLASS_RT) cannot be used together. A mix of CDL and high priority
> commands cannot be sent to a device. By mapping IOPRIO_CLASS_DL to
> IOPRIO_CLASS_RT, we ensure that a device will never receive a mix of these
> two incompatible priority classes.

Implementing duration limit support using the I/O priority mechanism 
makes it impossible to configure the I/O priority for commands that have 
a duration limit. Shouldn't the duration limit be independent of the I/O 
priority? Am I perhaps missing something?

Thanks,

Bart.


* Re: [PATCH v3 02/18] block: introduce BLK_STS_DURATION_LIMIT
  2023-01-24 19:02 ` [PATCH v3 02/18] block: introduce BLK_STS_DURATION_LIMIT Niklas Cassel
@ 2023-01-24 19:29   ` Bart Van Assche
  2023-01-24 19:59     ` Keith Busch
  2023-01-24 21:34     ` Damien Le Moal
  0 siblings, 2 replies; 82+ messages in thread
From: Bart Van Assche @ 2023-01-24 19:29 UTC (permalink / raw)
  To: Niklas Cassel, Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block

On 1/24/23 11:02, Niklas Cassel wrote:
> Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
> report commands that failed due to a command duration limit being
> exceeded. This new status is mapped to the ETIME error code to allow
> users to differentiate "soft" duration limit failures from other more
> serious hardware related errors.

What makes exceeding the duration limit different from an I/O timeout 
(BLK_STS_TIMEOUT)? Why is it important to tell the difference between an 
I/O timeout and exceeding the command duration limit?

Thanks,

Bart.
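
For reference, the distinction drawn in the quoted commit message would be visible to an application roughly as sketched below. This assumes a Linux system, where ETIME and ETIMEDOUT are distinct errno values, and assumes that BLK_STS_TIMEOUT surfaces as ETIMEDOUT while the proposed status surfaces as ETIME as the commit message states. The helper classify_io_error() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper: classify the errno of a failed read/write.
 * ETIME is what the proposed BLK_STS_DURATION_LIMIT maps to; ETIMEDOUT
 * is assumed here to be what a regular I/O timeout reports.
 */
static const char *classify_io_error(int err)
{
	assert(err >= 0);	/* expects a positive errno value, or 0 */

	switch (err) {
	case 0:
		return "ok";
	case ETIME:
		return "duration-limit";	/* "soft" failure, device is fine */
	case ETIMEDOUT:
		return "timeout";		/* command timed out in the stack */
	default:
		return "other";
	}
}
```

An application using duration limits could, for example, retry a "duration-limit" failure on another replica while treating "timeout" as a sign of trouble with the device or link.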


* Re: [PATCH v3 04/18] scsi: rename and move get_scsi_ml_byte()
  2023-01-24 19:02 ` [PATCH v3 04/18] scsi: rename and move get_scsi_ml_byte() Niklas Cassel
@ 2023-01-24 19:32   ` Bart Van Assche
  0 siblings, 0 replies; 82+ messages in thread
From: Bart Van Assche @ 2023-01-24 19:32 UTC (permalink / raw)
  To: Niklas Cassel, James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block, Mike Christie

On 1/24/23 11:02, Niklas Cassel wrote:
> SCSI has two different getters:
> - get_XXX_byte() (in scsi_cmnd.h) which takes a struct scsi_cmnd *, and
> - XXX_byte() (in scsi.h) which takes a scmd->result.
> The proper name for get_scsi_ml_byte() should thus be without the get_
> prefix, as it takes a scmd->result. Rename the function to rectify this.
> (This change was suggested by Mike Christie.)
> 
> Additionally, move get_scsi_ml_byte() to scsi_priv.h since both scsi_lib.c
> and scsi_error.c will need to use this helper in a follow-up patch.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>

* Re: [PATCH v3 05/18] scsi: support retrieving sub-pages of mode pages
  2023-01-24 19:02 ` [PATCH v3 05/18] scsi: support retrieving sub-pages of mode pages Niklas Cassel
@ 2023-01-24 19:34   ` Bart Van Assche
  0 siblings, 0 replies; 82+ messages in thread
From: Bart Van Assche @ 2023-01-24 19:34 UTC (permalink / raw)
  To: Niklas Cassel, James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block

On 1/24/23 11:02, Niklas Cassel wrote:
> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> 
> Allow scsi_mode_sense() to retrieve sub-pages of mode pages by adding
> the subpage argument. Change all the current caller sites to specify
> the subpage 0.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
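
As an illustration of where the new subpage argument ends up, here is a minimal sketch of a MODE SENSE(10) CDB with a subpage code. The field positions follow the SPC layout (page code in byte 2, subpage code in byte 3, allocation length in bytes 7-8); build_mode_sense10_cdb() is a hypothetical userspace helper, not the kernel's scsi_mode_sense().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODE_SENSE_10	0x5a

/* Hypothetical helper: fill a 10-byte MODE SENSE(10) CDB. */
static void build_mode_sense10_cdb(uint8_t cdb[10], bool dbd, uint8_t page,
				   uint8_t subpage, uint16_t alloc_len)
{
	assert(page <= 0x3f);		/* the page code is a 6-bit field */

	memset(cdb, 0, 10);
	cdb[0] = MODE_SENSE_10;
	cdb[1] = dbd ? 0x08 : 0x00;	/* DBD: disable block descriptors */
	cdb[2] = page;			/* PC = 0 (current values) | page code */
	cdb[3] = subpage;		/* 0 selects the page without a subpage */
	cdb[7] = alloc_len >> 8;	/* allocation length, big-endian */
	cdb[8] = alloc_len & 0xff;
}
```

Passing subpage 0 produces the same CDB that callers got before the argument existed, which is why existing call sites can simply specify 0.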

* Re: [PATCH v3 06/18] scsi: support service action in scsi_report_opcode()
  2023-01-24 19:02 ` [PATCH v3 06/18] scsi: support service action in scsi_report_opcode() Niklas Cassel
@ 2023-01-24 19:36   ` Bart Van Assche
  0 siblings, 0 replies; 82+ messages in thread
From: Bart Van Assche @ 2023-01-24 19:36 UTC (permalink / raw)
  To: Niklas Cassel, James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block

On 1/24/23 11:02, Niklas Cassel wrote:
> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> 
> The REPORT_SUPPORTED_OPERATION_CODES command allows checking for support
> of commands that have the same opcode but different service actions,
> such as READ 32 and WRITE 32. However, the current implementation of
> scsi_report_opcode() only allows checking an operation code without a
> service action differentiation.
> 
> Add the "sa" argument to scsi_report_opcode() to allow passing a service
> action. If a non-zero service action is specified, the reporting
> options field value is set to 3 to have the service action field taken
> into account by the device. If no service action field is specified
> (zero), the reporting options field is set to 1 as before.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
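
The reporting-options rule described in the quoted commit message can be sketched as a CDB builder. This is a minimal userspace illustration, assuming the SPC layout of REPORT SUPPORTED OPERATION CODES (MAINTENANCE IN with service action 0x0c); build_rsoc_cdb() is a hypothetical name and not the kernel's scsi_report_opcode().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAINTENANCE_IN				0xa3
#define MI_REPORT_SUPPORTED_OPERATION_CODES	0x0c

/*
 * Hypothetical helper: fill a 12-byte REPORT SUPPORTED OPERATION CODES CDB.
 * A non-zero service action selects reporting options 3 (opcode and
 * service action); zero selects reporting options 1 (opcode only).
 */
static void build_rsoc_cdb(uint8_t cdb[12], uint8_t opcode, uint16_t sa,
			   uint32_t alloc_len)
{
	assert(cdb);

	memset(cdb, 0, 12);
	cdb[0] = MAINTENANCE_IN;
	cdb[1] = MI_REPORT_SUPPORTED_OPERATION_CODES;
	cdb[2] = sa ? 3 : 1;		/* reporting options */
	cdb[3] = opcode;		/* requested operation code */
	cdb[4] = sa >> 8;		/* requested service action, big-endian */
	cdb[5] = sa & 0xff;
	cdb[6] = alloc_len >> 24;	/* allocation length, big-endian */
	cdb[7] = (alloc_len >> 16) & 0xff;
	cdb[8] = (alloc_len >> 8) & 0xff;
	cdb[9] = alloc_len & 0xff;
}
```

For example, checking for READ 32 would pass the variable-length opcode 0x7f together with its service action, while a plain READ 16 check passes opcode 0x88 with a service action of 0.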


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 07/18] scsi: sd: detect support for command duration limits
  2023-01-24 19:02 ` [PATCH v3 07/18] scsi: sd: detect support for command duration limits Niklas Cassel
@ 2023-01-24 19:39   ` Bart Van Assche
  2023-01-27 13:00   ` Hannes Reinecke
  1 sibling, 0 replies; 82+ messages in thread
From: Bart Van Assche @ 2023-01-24 19:39 UTC (permalink / raw)
  To: Niklas Cassel, James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block

On 1/24/23 11:02, Niklas Cassel wrote:
> +static const char *sd_cdl_perf_name(u8 val)
> +{
> +	switch (val) {
> +	case 0x00:
> +		return "0";
> +	case 0x01:
> +		return "0.5";
> +	case 0x02:
> +		return "1.0";
> +	case 0x03:
> +		return "1.5";
> +	case 0x04:
> +		return "2.0";
> +	case 0x05:
> +		return "2.5";
> +	case 0x06:
> +		return "3";
> +	case 0x07:
> +		return "4";
> +	case 0x08:
> +		return "5";
> +	case 0x09:
> +		return "8";
> +	case 0x0A:
> +		return "10";
> +	case 0x0B:
> +		return "15";
> +	case 0x0C:
> +		return "20";
> +	default:
> +		return "?";
> +	}
> +}
> +
> +static const char *sd_cdl_policy_name(u8 policy)
> +{
> +	switch (policy) {
> +	case 0x00:
> +		return "complete-earliest";
> +	case 0x01:
> +		return "continue-next-limit";
> +	case 0x02:
> +		return "continue-no-limit";
> +	case 0x0d:
> +		return "complete-unavailable";
> +	case 0x0e:
> +		return "abort-recovery";
> +	case 0x0f:
> +		return "abort";
> +	default:
> +		return "?";
> +	}
> +}

I think that the above two functions can be made shorter by using 
look-up arrays and designated initializers.
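
Something along these lines, as a stand-alone sketch only (the array and
function names are made up here, and the in-kernel version would use
ARRAY_SIZE() and u8 instead of the local macro and unsigned char):

```c
#include <stddef.h>

#define SKETCH_ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Indexed by the perf value; entries past 0x0c do not exist. */
static const char *const cdl_perf_names[] = {
	[0x00] = "0",   [0x01] = "0.5", [0x02] = "1.0", [0x03] = "1.5",
	[0x04] = "2.0", [0x05] = "2.5", [0x06] = "3",   [0x07] = "4",
	[0x08] = "5",   [0x09] = "8",   [0x0a] = "10",  [0x0b] = "15",
	[0x0c] = "20",
};

/* Indexed by the policy value; unlisted slots are implicitly NULL. */
static const char *const cdl_policy_names[] = {
	[0x00] = "complete-earliest",
	[0x01] = "continue-next-limit",
	[0x02] = "continue-no-limit",
	[0x0d] = "complete-unavailable",
	[0x0e] = "abort-recovery",
	[0x0f] = "abort",
};

static const char *cdl_perf_name(unsigned char val)
{
	if (val < SKETCH_ARRAY_LEN(cdl_perf_names) && cdl_perf_names[val])
		return cdl_perf_names[val];
	return "?";
}

static const char *cdl_policy_name(unsigned char policy)
{
	if (policy < SKETCH_ARRAY_LEN(cdl_policy_names) &&
	    cdl_policy_names[policy])
		return cdl_policy_names[policy];
	return "?";
}
```

The NULL check is what keeps the "?" fallback for the holes between 0x02 and
0x0d in the policy table.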

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 02/18] block: introduce BLK_STS_DURATION_LIMIT
  2023-01-24 19:29   ` Bart Van Assche
@ 2023-01-24 19:59     ` Keith Busch
  2023-01-24 20:32       ` Bart Van Assche
  2023-01-24 21:36       ` Damien Le Moal
  2023-01-24 21:34     ` Damien Le Moal
  1 sibling, 2 replies; 82+ messages in thread
From: Keith Busch @ 2023-01-24 19:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Niklas Cassel, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	Damien Le Moal, linux-scsi, linux-ide, linux-block

On Tue, Jan 24, 2023 at 11:29:10AM -0800, Bart Van Assche wrote:
> On 1/24/23 11:02, Niklas Cassel wrote:
> > Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
> > report command that failed due to a command duration limit being
> > exceeded. This new status is mapped to the ETIME error code to allow
> > users to differentiate "soft" duration limit failures from other more
> > serious hardware related errors.
> 
> What makes exceeding the duration limit different from an I/O timeout
> (BLK_STS_TIMEOUT)? Why is it important to tell the difference between an I/O
> timeout and exceeding the command duration limit?

BLK_STS_TIMEOUT should be used if the target device doesn't provide any
response to the command. The DURATION_LIMIT status is used when the device
completes a command with that status.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 02/18] block: introduce BLK_STS_DURATION_LIMIT
  2023-01-24 19:59     ` Keith Busch
@ 2023-01-24 20:32       ` Bart Van Assche
  2023-01-24 21:39         ` Damien Le Moal
  2023-01-24 21:36       ` Damien Le Moal
  1 sibling, 1 reply; 82+ messages in thread
From: Bart Van Assche @ 2023-01-24 20:32 UTC (permalink / raw)
  To: Keith Busch
  Cc: Niklas Cassel, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	Damien Le Moal, linux-scsi, linux-ide, linux-block

On 1/24/23 11:59, Keith Busch wrote:
> On Tue, Jan 24, 2023 at 11:29:10AM -0800, Bart Van Assche wrote:
>> On 1/24/23 11:02, Niklas Cassel wrote:
>>> Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
>>> report command that failed due to a command duration limit being
>>> exceeded. This new status is mapped to the ETIME error code to allow
>>> users to differentiate "soft" duration limit failures from other more
>>> serious hardware related errors.
>>
>> What makes exceeding the duration limit different from an I/O timeout
>> (BLK_STS_TIMEOUT)? Why is it important to tell the difference between an I/O
>> timeout and exceeding the command duration limit?
> 
> BLK_STS_TIMEOUT should be used if the target device doesn't provide any
> response to the command. The DURATION_LIMIT status is used when the device
> completes a command with that status.

Hi Keith,

 From SPC-6: "The MAX ACTIVE TIME field specifies an upper limit on the 
time that elapses from the time at which the device server initiates 
actions to access, transfer, or act upon the specified data until the 
time the device server returns status for the command."

My interpretation of the above text is that the SCSI command duration 
limit specifies a hard limit, the same type of limit reported by the 
status code BLK_STS_TIMEOUT. It is not clear to me from the patch 
description why a new status code is needed for reporting that the 
command duration limit has been exceeded.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-24 19:27   ` Bart Van Assche
@ 2023-01-24 20:36     ` Bart Van Assche
  2023-01-24 21:48       ` Damien Le Moal
  2023-01-24 21:29     ` Damien Le Moal
  1 sibling, 1 reply; 82+ messages in thread
From: Bart Van Assche @ 2023-01-24 20:36 UTC (permalink / raw)
  To: Niklas Cassel, Paolo Valente, Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, Damien Le Moal, linux-scsi,
	linux-ide, linux-block

On 1/24/23 11:27, Bart Van Assche wrote:
> Implementing duration limit support using the I/O priority mechanism 
> makes it impossible to configure the I/O priority for commands that have 
> a duration limit. Shouldn't the duration limit be independent of the I/O 
> priority? Am I perhaps missing something?

(replying to my own e-mail)

In SAM-6 I found the following: "The device server may use the duration 
expiration time to determine the order of processing commands with
the SIMPLE task attribute within the task set. A difference in duration 
expiration time between commands may override other scheduling 
considerations (e.g., different times to access different logical block 
addresses or vendor specific scheduling considerations). Processing of a 
collection of commands with different command duration limit settings 
should cause a command with an earlier duration expiration time to 
complete with status sooner than a command with a later duration 
expiration time."

Do I understand correctly that it is optional for a SCSI device to 
interpret the command duration as a priority and that this is not mandatory?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-24 19:27   ` Bart Van Assche
  2023-01-24 20:36     ` Bart Van Assche
@ 2023-01-24 21:29     ` Damien Le Moal
  2023-01-24 22:43       ` Bart Van Assche
  1 sibling, 1 reply; 82+ messages in thread
From: Damien Le Moal @ 2023-01-24 21:29 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel, Paolo Valente, Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide, linux-block

On 1/25/23 04:27, Bart Van Assche wrote:
> On 1/24/23 11:02, Niklas Cassel wrote:
>> Introduce the IOPRIO_CLASS_DL priority class to indicate that IOs should
>> be executed using duration-limits targets. The duration target to apply
>> to a command is indicated using the priority level. Up to 8 levels are
>> supported, with level 0 indicating "no limit".
>>
>> This priority class has effect only if the target device supports the
>> command duration limits feature and this feature is enabled by the user.
>>
>> While it is recommended to not use an ioscheduler when using the
>> IOPRIO_CLASS_DL priority class, if using the BFQ or mq-deadline scheduler,
>> IOPRIO_CLASS_DL is mapped to IOPRIO_CLASS_RT.
>>
>> The reason for this is twofold:
>> 1) Each priority level for the IOPRIO_CLASS_DL priority class represents a
>> duration limit descriptor (DLD) inside the device. Users can configure
>> these limits themselves using passthrough commands, so from a block layer
>> perspective, Linux has no idea of how each DLD is actually configured.
>>
>> By mapping a command to IOPRIO_CLASS_RT, the chance that a command exceeds
>> its duration limit (because it was held too long in the scheduler) is
>> decreased. It is still possible to use the IOPRIO_CLASS_DL priority class
>> for "low priority" IOs by configuring a large limit in the respective DLD.
>>
>> 2) On ATA drives, IOPRIO_CLASS_DL commands and NCQ priority commands
>> (IOPRIO_CLASS_RT) cannot be used together. A mix of CDL and high priority
>> commands cannot be sent to a device. By mapping IOPRIO_CLASS_DL to
>> IOPRIO_CLASS_RT, we ensure that a device will never receive a mix of these
>> two incompatible priority classes.
> 
> Implementing duration limit support using the I/O priority mechanism 
> makes it impossible to configure the I/O priority for commands that have 
> a duration limit. Shouldn't the duration limit be independent of the I/O 
> priority? Am I perhaps missing something?

I/O priority at the device level does not exist with SAS, and with SATA,
the ACS specification mandates that NCQ I/O priority and CDL cannot be
mixed. So from the device point of view, I/O priority and CDL are
mutually exclusive. No issues.

Now, if you are talking about the host level I/O priority scheduling done
by mq-deadline and bfq, the CDL priority class maps to the RT class. They
are the same, as they should be. There is nothing more real-time than CDL in
my opinion :)

Furthermore, if we do not reuse the I/O priority interface, we will have
to add another field to BIOs & requests to propagate the cdl index from
user space down to the device LLD, almost exactly in the manner of I/O
priorities, including all the controls with merging etc. That would be a
lot of overhead to achieve the possibility of prioritized CDL commands.

CDL in and of itself allows the user to define "prioritized" commands by
defining CDLs on the drive that are sorted in increasing time limit order,
i.e. with low CDL index numbers having low limits, and higher priority
within the class (as CDL index == prio level). With that, schedulers can
still do the right thing as they do now, with the additional benefit that
they can even be improved to base their scheduling decisions on a known
time limit for the command execution. But such optimization is not
implemented by this series.
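
To make the "CDL index == prio level" idea concrete, here is a small
stand-alone sketch of how such an ioprio value could be composed and how a
scheduler could fold the class into RT. The 13-bit class shift and the RT
class number match the existing ioprio UAPI; the numeric value used for the
DL class and all SKETCH_/cdl_ names are assumptions for illustration only.

```c
#define SKETCH_IOPRIO_CLASS_SHIFT 13	/* same shift as the ioprio UAPI */
#define SKETCH_IOPRIO_CLASS_RT    1	/* existing real-time class */
#define SKETCH_IOPRIO_CLASS_DL    4	/* assumed value for the new class */
#define SKETCH_CDL_MAX_LEVEL      7	/* levels 0-7, 0 meaning "no limit" */

/* Compose an ioprio value selecting CDL descriptor 'level'; -1 if invalid. */
static int cdl_ioprio(unsigned int level)
{
	if (level > SKETCH_CDL_MAX_LEVEL)
		return -1;
	return (SKETCH_IOPRIO_CLASS_DL << SKETCH_IOPRIO_CLASS_SHIFT) |
	       (int)level;
}

static unsigned int ioprio_class_of(int ioprio)
{
	return (unsigned int)ioprio >> SKETCH_IOPRIO_CLASS_SHIFT;
}

static unsigned int ioprio_level_of(int ioprio)
{
	return (unsigned int)ioprio & ((1u << SKETCH_IOPRIO_CLASS_SHIFT) - 1);
}

/* Class as seen by an I/O scheduler: DL is treated like RT. */
static unsigned int sched_class_of(int ioprio)
{
	unsigned int cls = ioprio_class_of(ioprio);

	return cls == SKETCH_IOPRIO_CLASS_DL ? SKETCH_IOPRIO_CLASS_RT : cls;
}
```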

> 
> Thanks,
> 
> Bart.
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 02/18] block: introduce BLK_STS_DURATION_LIMIT
  2023-01-24 19:29   ` Bart Van Assche
  2023-01-24 19:59     ` Keith Busch
@ 2023-01-24 21:34     ` Damien Le Moal
  1 sibling, 0 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-24 21:34 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel, Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide, linux-block

On 1/25/23 04:29, Bart Van Assche wrote:
> On 1/24/23 11:02, Niklas Cassel wrote:
>> Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
>> report command that failed due to a command duration limit being
>> exceeded. This new status is mapped to the ETIME error code to allow
>> users to differentiate "soft" duration limit failures from other more
>> serious hardware related errors.
> 
> What makes exceeding the duration limit different from an I/O timeout 
> (BLK_STS_TIMEOUT)? Why is it important to tell the difference between an 
> I/O timeout and exceeding the command duration limit?

If the device fails to execute a command in time, it will either
1) Fail the command with an error and sense data set (policy 0xf for the
time limit)
2) Return a success status for the command with sense data set telling the
host "data not available". This (weird) case is in essence equivalent to
(1) but was defined to avoid the penalty of a queue abort with SATA drives
(NCQ command errors always result in all on-going commands being aborted).

In both cases, the drive is still responsive and operational.
BLK_STS_TIMEOUT is used if a command timed out, indicating that the drive
is *not* responding. BLK_STS_TIMEOUT thus generally means "something is
wrong" (not always, but most of the time).

So we certainly do not want to overload BLK_STS_TIMEOUT to indicate failed
CDL IOs, as that would not allow the user to distinguish them from more
serious hardware issues.


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 02/18] block: introduce BLK_STS_DURATION_LIMIT
  2023-01-24 19:59     ` Keith Busch
  2023-01-24 20:32       ` Bart Van Assche
@ 2023-01-24 21:36       ` Damien Le Moal
  1 sibling, 0 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-24 21:36 UTC (permalink / raw)
  To: Keith Busch, Bart Van Assche
  Cc: Niklas Cassel, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-ide, linux-block

On 1/25/23 04:59, Keith Busch wrote:
> On Tue, Jan 24, 2023 at 11:29:10AM -0800, Bart Van Assche wrote:
>> On 1/24/23 11:02, Niklas Cassel wrote:
>>> Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
>>> report command that failed due to a command duration limit being
>>> exceeded. This new status is mapped to the ETIME error code to allow
>>> users to differentiate "soft" duration limit failures from other more
>>> serious hardware related errors.
>>
>> What makes exceeding the duration limit different from an I/O timeout
>> (BLK_STS_TIMEOUT)? Why is it important to tell the difference between an I/O
>> timeout and exceeding the command duration limit?
> 
> BLK_STS_TIMEOUT should be used if the target device doesn't provide any
> response to the command. The DURATION_LIMIT status is used when the device
> completes a command with that status.

Yes, exactly :)


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 02/18] block: introduce BLK_STS_DURATION_LIMIT
  2023-01-24 20:32       ` Bart Van Assche
@ 2023-01-24 21:39         ` Damien Le Moal
  0 siblings, 0 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-24 21:39 UTC (permalink / raw)
  To: Bart Van Assche, Keith Busch
  Cc: Niklas Cassel, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-ide, linux-block

On 1/25/23 05:32, Bart Van Assche wrote:
> On 1/24/23 11:59, Keith Busch wrote:
>> On Tue, Jan 24, 2023 at 11:29:10AM -0800, Bart Van Assche wrote:
>>> On 1/24/23 11:02, Niklas Cassel wrote:
>>>> Introduce the new block IO status BLK_STS_DURATION_LIMIT for LLDDs to
>>>> report command that failed due to a command duration limit being
>>>> exceeded. This new status is mapped to the ETIME error code to allow
>>>> users to differentiate "soft" duration limit failures from other more
>>>> serious hardware related errors.
>>>
>>> What makes exceeding the duration limit different from an I/O timeout
>>> (BLK_STS_TIMEOUT)? Why is it important to tell the difference between an I/O
>>> timeout and exceeding the command duration limit?
>>
>> BLK_STS_TIMEOUT should be used if the target device doesn't provide any
>> response to the command. The DURATION_LIMIT status is used when the device
>> completes a command with that status.
> 
> Hi Keith,
> 
>  From SPC-6: "The MAX ACTIVE TIME field specifies an upper limit on the 
> time that elapses from the time at which the device server initiates 
> actions to access, transfer, or act upon the specified data until the 
> time the device server returns status for the command."
> 
> My interpretation of the above text is that the SCSI command duration 
> limit specifies a hard limit, the same type of limit reported by the 
> status code BLK_STS_TIMEOUT. It is not clear to me from the patch 
> description why a new status code is needed for reporting that the 
> command duration limit has been exceeded.

As explained, this allows differentiating the "drive gave a response"
(BLK_STS_DURATION_LIMIT) from the "drive is not responding" case with
BLK_STS_TIMEOUT. We took care of mapping BLK_STS_DURATION_LIMIT to ETIME
(timer expired) for user space too, to not overload ETIMEDOUT used with
BLK_STS_TIMEOUT.
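
From the user side, the distinction could be consumed roughly like this.
This is only a sketch: the enum and the function name are invented for the
example, and it assumes the errno mapping described above (ETIME for a
duration limit, ETIMEDOUT for a command timeout).

```c
#include <errno.h>

enum io_failure_kind {
	IO_FAILURE_NONE,
	IO_FAILURE_DURATION_LIMIT,	/* drive answered: limit exceeded */
	IO_FAILURE_TIMEOUT,		/* drive did not answer in time */
	IO_FAILURE_OTHER,
};

/* Classify the errno value reported for a completed I/O. */
static enum io_failure_kind classify_io_errno(int err)
{
	switch (err) {
	case 0:
		return IO_FAILURE_NONE;
	case ETIME:
		return IO_FAILURE_DURATION_LIMIT;
	case ETIMEDOUT:
		return IO_FAILURE_TIMEOUT;
	default:
		return IO_FAILURE_OTHER;
	}
}
```

An application could then retry or redirect an IO_FAILURE_DURATION_LIMIT
request (e.g. to another replica) without treating the drive as suspect.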

We can certainly improve the commit message to describe all of this in
more detail.

> 
> Thanks,
> 
> Bart.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-24 20:36     ` Bart Van Assche
@ 2023-01-24 21:48       ` Damien Le Moal
  0 siblings, 0 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-24 21:48 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel, Paolo Valente, Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide, linux-block

On 1/25/23 05:36, Bart Van Assche wrote:
> On 1/24/23 11:27, Bart Van Assche wrote:
>> Implementing duration limit support using the I/O priority mechanism 
>> makes it impossible to configure the I/O priority for commands that have 
>> a duration limit. Shouldn't the duration limit be independent of the I/O 
>> priority? Am I perhaps missing something?
> 
> (replying to my own e-mail)
> 
> In SAM-6 I found the following: "The device server may use the duration 
> expiration time to determine the order of processing commands with
> the SIMPLE task attribute within the task set. A difference in duration 
> expiration time between commands may override other scheduling 
> considerations (e.g., different times to access different logical block 
> addresses or vendor specific scheduling considerations). Processing of a 
> collection of commands with different command duration limit settings 
> should cause a command with an earlier duration expiration time to 
> complete with status sooner than a command with a later duration 
> expiration time."
> 
> Do I understand correctly that it is optional for a SCSI device to 
> interpret the command duration as a priority and that this is not mandatory?

This describes the expected behavior from the drive in terms of command
execution ordering when CDL is used. The text is a little "soft" and sounds
as if this behavior is optional because CDL is a combination of time
limits AND a policy for each time limit. The policy of a CDL indicates
what the drive behavior should be if a command misses its time limit. In
short, there are 2 policies:
1) best-effort: the time limit is a hint of sorts. If the drive fails to
execute the command before the limit expires, the command is not aborted
and execution continues.
2) fast-fail: If the drive fails to execute the command before the time
limit expires, the command must be completed with an error immediately.
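
Put as a toy model (all names are invented for the example, and treating a
limit of 0 as "no limit" is an assumption of the sketch, not something taken
from the specifications):

```c
enum cdl_policy_kind {
	CDL_BEST_EFFORT,	/* limit is a hint, execution continues */
	CDL_FAST_FAIL,		/* limit is enforced, command fails */
};

enum cdl_outcome {
	CDL_IN_TIME,
	CDL_CONTINUE_PAST_LIMIT,
	CDL_COMPLETE_WITH_ERROR,
};

/* What the drive does with a command after 'elapsed_us' of execution. */
static enum cdl_outcome cdl_outcome_at(enum cdl_policy_kind kind,
				       unsigned int elapsed_us,
				       unsigned int limit_us)
{
	/* Assumption of this sketch: a zero limit means "no limit". */
	if (limit_us == 0 || elapsed_us <= limit_us)
		return CDL_IN_TIME;

	return kind == CDL_FAST_FAIL ? CDL_COMPLETE_WITH_ERROR
				     : CDL_CONTINUE_PAST_LIMIT;
}
```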

And CDL also has a field, settable by the user, that describes an allowed
performance degradation to achieve CDL scheduling in time. That is most
important for the best-effort case, to indicate how "serious" the user is
about the CDL limit "hint".

So, as I think you can see, the SAM-6 text is vague because of the many
possible variations in scheduling policies that need to be implemented by
a drive.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-24 21:29     ` Damien Le Moal
@ 2023-01-24 22:43       ` Bart Van Assche
  2023-01-24 22:59         ` Damien Le Moal
  2023-01-25  6:33         ` Christoph Hellwig
  0 siblings, 2 replies; 82+ messages in thread
From: Bart Van Assche @ 2023-01-24 22:43 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel, Paolo Valente, Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide, linux-block

On 1/24/23 13:29, Damien Le Moal wrote:
> I/O priority at the device level does not exist with SAS, and with SATA,
> the ACS specification mandates that NCQ I/O priority and CDL cannot be
> mixed. So from the device point of view, I/O priority and CDL are
> mutually exclusive. No issues.
> 
> Now, if you are talking about the host level I/O priority scheduling done
> by mq-deadline and bfq, the CDL priority class maps to the RT class. They
> are the same, as they should be. There is nothing more real-time than CDL in
> my opinion :)
> 
> Furthermore, if we do not reuse the I/O priority interface, we will have
> to add another field to BIOs & requests to propagate the cdl index from
> user space down to the device LLD, almost exactly in the manner of I/O
> priorities, including all the controls with merging etc. That would be a
> lot of overhead to achieve the possibility of prioritized CDL commands.
> 
> CDL in and of itself allows the user to define "prioritized" commands by
> defining CDLs on the drive that are sorted in increasing time limit order,
> i.e. with low CDL index numbers having low limits, and higher priority
> within the class (as CDL index == prio level). With that, schedulers can
> still do the right thing as they do now, with the additional benefit that
> they can even be improved to base their scheduling decisions on a known
> time limit for the command execution. But such optimization is not
> implemented by this series.

Hi Damien,

What if a device that supports I/O priorities (e.g. an NVMe device that 
supports configuring the SQ priority) and a device that supports command 
duration limits (e.g. a SATA hard disk) are combined via the device 
mapper into a single block device? Should I/O be submitted to the dm 
device with one of the existing I/O priority classes (not supported by 
SATA hard disks) or with I/O priority class IOPRIO_CLASS_DL (not 
supported by NVMe devices)?

Shouldn't the ATA core translate the existing I/O priority levels into a 
command duration limit instead of introducing a new I/O priority class 
that is only supported by ATA devices?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-24 22:43       ` Bart Van Assche
@ 2023-01-24 22:59         ` Damien Le Moal
  2023-01-25  0:05           ` Bart Van Assche
  2023-01-25  6:33         ` Christoph Hellwig
  1 sibling, 1 reply; 82+ messages in thread
From: Damien Le Moal @ 2023-01-24 22:59 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel, Paolo Valente, Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide, linux-block

On 1/25/23 07:43, Bart Van Assche wrote:
> On 1/24/23 13:29, Damien Le Moal wrote:
>> I/O priority at the device level does not exist with SAS, and with SATA,
>> the ACS specification mandates that NCQ I/O priority and CDL cannot be
>> mixed. So from the device point of view, I/O priority and CDL are
>> mutually exclusive. No issues.
>>
>> Now, if you are talking about the host level I/O priority scheduling done
>> by mq-deadline and bfq, the CDL priority class maps to the RT class. They
>> are the same, as they should be. There is nothing more real-time than CDL in
>> my opinion :)
>>
>> Furthermore, if we do not reuse the I/O priority interface, we will have
>> to add another field to BIOs & requests to propagate the cdl index from
>> user space down to the device LLD, almost exactly in the manner of I/O
>> priorities, including all the controls with merging etc. That would be a
>> lot of overhead to achieve the possibility of prioritized CDL commands.
>>
>> CDL in and of itself allows the user to define "prioritized" commands by
>> defining CDLs on the drive that are sorted in increasing time limit order,
>> i.e. with low CDL index numbers having low limits, and higher priority
>> within the class (as CDL index == prio level). With that, schedulers can
>> still do the right thing as they do now, with the additional benefit that
>> they can even be improved to base their scheduling decisions on a known
>> time limit for the command execution. But such optimization is not
>> implemented by this series.
> 
> Hi Damien,
> 
> What if a device that supports I/O priorities (e.g. an NVMe device that 
> supports configuring the SQ priority) and a device that supports command 
> duration limits (e.g. a SATA hard disk) are combined via the device 
> mapper into a single block device? Should I/O be submitted to the dm 
> device with one of the existing I/O priority classes (not supported by 
> SATA hard disks) or with I/O priority class IOPRIO_CLASS_DL (not 
> supported by NVMe devices)?

That is not a use case we considered. My gut feeling is that this is
something the target driver should handle when processing a user IO.
Note that I was not aware that the Linux NVMe driver supported queue priorities...

> Shouldn't the ATA core translate the existing I/O priority levels into a 
> command duration limit instead of introducing a new I/O priority class 
> that is only supported by ATA devices?

There is only one priority class that ATA understands: RT (the level is
irrelevant and ignored). All RT class IOs are mapped to high priority NCQ
commands. All other classes map to normal priority (no priority bit set)
commands.
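
In pseudo-driver terms, the current ATA behavior boils down to the following
(a sketch with made-up names, not the libata code; the class numbers mirror
the existing ioprio UAPI values):

```c
#include <stdbool.h>

#define SKETCH_IOPRIO_CLASS_NONE 0
#define SKETCH_IOPRIO_CLASS_RT   1
#define SKETCH_IOPRIO_CLASS_BE   2
#define SKETCH_IOPRIO_CLASS_IDLE 3

/*
 * Should the NCQ command carry the high priority bit?
 * Only the class matters; the level within the class is ignored.
 */
static bool ata_wants_high_prio(unsigned int ioprio_class,
				unsigned int ioprio_level)
{
	(void)ioprio_level;	/* irrelevant for ATA NCQ priority */
	return ioprio_class == SKETCH_IOPRIO_CLASS_RT;
}
```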

And sure, we could map the level of RT class IOs to a CDL index, as we do
for the CDL class, but what would be the point? The user should use the
CDL class in that case.

Furthermore, there is one additional thing that we do not yet support but
will later: CDL descriptor 0 can be used to set a target time limit for
high priority NCQ commands. Without this new feature introduced with CDL,
the drive is free to schedule high priority NCQ commands as it wants, and
that is hard coded in FW. So you can end up with very aggressive scheduling
leading to a significant overall IOPS drop and long tail latency for low
priority commands. See pages 11 and 20 of this presentation for an example:

https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf

For a drive that supports both CDL and NCQ priority, with CDL feature
turned off, CDL descriptor 0 defines the time limit hint for high priority
NCQ commands. Again, CDL and NCQ high priority are mutually exclusive.

So for clarity, I really would prefer separating CDL and RT classes as we
did. We could integrate CDL support reusing the RT class + level for CDL
index, but I think this may be very confusing for users, especially
considering that the CDLs on a drive can be defined in any order the user
wants, resulting in indexes/levels that do not have any particular
order, making it impossible for the host to correctly schedule commands.
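
For example, a host that wanted to treat the CDL index as a priority level
would first have to verify something like the following, which nothing
guarantees. This is a sketch only; the function name and the representation
of the limits as a plain array of microseconds are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if the descriptor limits are in non-decreasing order,
 * i.e. a lower CDL index never has a larger limit than a higher one.
 * limit_us[0] corresponds to descriptor 1.
 */
static bool cdl_limits_ordered(const unsigned int *limit_us, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if (limit_us[i] < limit_us[i - 1])
			return false;
	}
	return true;
}
```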


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-24 22:59         ` Damien Le Moal
@ 2023-01-25  0:05           ` Bart Van Assche
  2023-01-25  1:19             ` Damien Le Moal
  0 siblings, 1 reply; 82+ messages in thread
From: Bart Van Assche @ 2023-01-25  0:05 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel, Paolo Valente, Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide, linux-block

On 1/24/23 14:59, Damien Le Moal wrote:
> There is only one priority class that ATA understands: RT (the level is
> irrelevant and ignored). All RT class IOs are mapped to high priority NCQ
> commands. All other classes map to normal priority (no priority bit set)
> commands.
> 
> And sure, we could map the level of RT class IOs to a CDL index, as we do
> for the CDL class, but what would be the point? The user should use the
> CDL class in that case.
> 
> Furthermore, there is one additional thing that we do not yet support but
> will later: CDL descriptor 0 can be used to set a target time limit for
> high priority NCQ commands. Without this new feature introduced with CDL,
> the drive is free to schedule high priority NCQ commands as it wants, and
> that is hard coded in FW. So you can end up with very aggressive scheduling
> leading to a significant overall IOPS drop and long tail latency for low
> priority commands. See pages 11 and 20 of this presentation for an example:
> 
> https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf
> 
> For a drive that supports both CDL and NCQ priority, with CDL feature
> turned off, CDL descriptor 0 defines the time limit hint for high priority
> NCQ commands. Again, CDL and NCQ high priority are mutually exclusive.
> 
> So for clarity, I really would prefer separating CDL and RT classes as we
> did. We could integrate CDL support reusing the RT class + level for CDL
> index, but I think this may be very confusing for users, especially
> considering that the CDLs on a drive can be defined in any order the user
> wants, resulting in indexes/levels that do not have any particular
> order, making it impossible for the host to correctly schedule commands.

Hi Damien,

Thanks again for the detailed reply. Your replies are very informative 
and help me understand the context better.

However, I'm still less than enthusiastic about the introduction of the 
I/O priority class IOPRIO_CLASS_DL. To me command duration limits (CDL) 
is a mechanism that is supported by one storage standard (SCSI) and of 
which it is not sure that it will be integrated in other storage 
standards (NVMe, ...). Isn't the purpose of the block layer to provide 
an interface that is independent of the specifics of a single storage 
standard? This is why I'm in favor of letting the ATA core translate one 
of the existing I/O priority classes into a CDL instead of introducing a 
new I/O priority class (IOPRIO_CLASS_DL) in the block layer.

Others may have a different opinion.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-25  0:05           ` Bart Van Assche
@ 2023-01-25  1:19             ` Damien Le Moal
  2023-01-25 18:37               ` Bart Van Assche
  2023-01-25 23:11               ` Keith Busch
  0 siblings, 2 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-25  1:19 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel, Paolo Valente, Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide, linux-block

On 1/25/23 09:05, Bart Van Assche wrote:
> On 1/24/23 14:59, Damien Le Moal wrote:
>> There is only one priority class that ATA understands: RT (the level is
>> irrelevant and ignored). All RT class IOs are mapped to high priority NCQ
>> commands. All other classes map to normal priority (no priority bit set)
>> commands.
>>
>> And sure, we could map the level of RT class IOs to a CDL index, as we do
>> for the CDL class, but what would be the point? The user should use the
>> CDL class in that case.
>>
>> Furthermore, there is one additional thing that we do not yet support but
>> will later: CDL descriptor 0 can be used to set a target time limit for
>> high priority NCQ commands. Without this new feature introduced with CDL,
>> the drive is free to schedule high priority NCQ commands as it wants, and
>> that is hard coded in FW. So you can end up with very aggressive scheduling
>> leading to a significant overall IOPS drop and long tail latency for low
>> priority commands. See pages 11 and 20 of this presentation for an example:
>>
>> https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf
>>
>> For a drive that supports both CDL and NCQ priority, with CDL feature
>> turned off, CDL descriptor 0 defines the time limit hint for high priority
>> NCQ commands. Again, CDL and NCQ high priority are mutually exclusive.
>>
>> So for clarity, I really would prefer separating CDL and RT classes as we
>> did. We could integrate CDL support reusing the RT class + level for CDL
>> index, but I think this may be very confusing for users, especially
>> considering that the CDLs on a drive can be defined in any order the user
>> wants, resulting in indexes/levels that do not have any particular
>> order, making it impossible for the host to correctly schedule commands.
> 
> Hi Damien,
> 
> Thanks again for the detailed reply. Your replies are very informative 
> and help me understand the context better.
> 
> However, I'm still less than enthusiastic about the introduction of the 
> I/O priority class IOPRIO_CLASS_DL. To me command duration limits (CDL) 
> is a mechanism that is supported by one storage standard (SCSI) and of 

And ATA (ACS) too. Not just SCSI. This is actually an improvement over IO
priority (command priority) that is supported only by ATA NCQ and does not
exist with SCSI/SBC.

> which it is not sure that it will be integrated in other storage 
> standards (NVMe, ...). Isn't the purpose of the block layer to provide 
> an interface that is independent of the specifics of a single storage 
> standard? This is why I'm in favor of letting the ATA core translate one 
> of the existing I/O priority classes into a CDL instead of introducing a 
> new I/O priority class (IOPRIO_CLASS_DL) in the block layer.

We discussed CDL with Hannes in the context of NVMe over fabrics. There
may be interesting extensions to consider for NVMe in that context (the
value for a local PCI attached NVMe drive is more limited at best).

I would argue that IO priority is the same: that is not supported by all
device classes either, and for those that support it, the semantic is not
identical (ATA vs NVMe). Yet, we have the RT class that maps to high
priority for ATA, and nothing else as far as I know.

CDL at least covers SCSI *and* ATA, and as mentioned above, could be used
by NVMe-oF host drivers to do fancy link selection for a multipath setup
based on the link speed for instance.

We could overload the RT class with a mapping to CDL feature on scsi and
ata, but I think this is more confusing/messy than a separate class as we
implemented.

> 
> Others may have a different opinion.
> 
> Thanks,
> 
> Bart.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-24 22:43       ` Bart Van Assche
  2023-01-24 22:59         ` Damien Le Moal
@ 2023-01-25  6:33         ` Christoph Hellwig
  1 sibling, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2023-01-25  6:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Damien Le Moal, Niklas Cassel, Paolo Valente, Jens Axboe,
	Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block

On Tue, Jan 24, 2023 at 02:43:24PM -0800, Bart Van Assche wrote:
> What if a device that supports I/O priorities (e.g. an NVMe device that 
> supports configuring the SQ priority)

NVMe does not have any I/O priorities, it only has a very limited
priority scheme for queue arbitration.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-25  1:19             ` Damien Le Moal
@ 2023-01-25 18:37               ` Bart Van Assche
  2023-01-25 21:23                 ` Niklas Cassel
  2023-01-25 23:11               ` Keith Busch
  1 sibling, 1 reply; 82+ messages in thread
From: Bart Van Assche @ 2023-01-25 18:37 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel, Paolo Valente, Jens Axboe
  Cc: Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide, linux-block

On 1/24/23 17:19, Damien Le Moal wrote:
> On 1/25/23 09:05, Bart Van Assche wrote:
>> Thanks again for the detailed reply. Your replies are very informative
>> and help me understand the context better.
>>
>> However, I'm still less than enthusiastic about the introduction of the
>> I/O priority class IOPRIO_CLASS_DL. To me command duration limits (CDL)
>> is a mechanism that is supported by one storage standard (SCSI) and of
> 
> And ATA (ACS) too. Not just SCSI. This is actually an improvement over IO
> priority (command priority) that is supported only by ATA NCQ and does not
> exist with SCSI/SBC.
> 
>> which it is not sure that it will be integrated in other storage
>> standards (NVMe, ...). Isn't the purpose of the block layer to provide
>> an interface that is independent of the specifics of a single storage
>> standard? This is why I'm in favor of letting the ATA core translate one
>> of the existing I/O priority classes into a CDL instead of introducing a
>> new I/O priority class (IOPRIO_CLASS_DL) in the block layer.
> 
> We discussed CDL with Hannes in the context of NVMe over fabrics. There
> may be interesting extensions to consider for NVMe in that context (the
> value for local PCI attached NVMe drive is more limited at best).
> 
> I would argue that IO priority is the same: that is not supported by all
> device classes either, and for those that support it, the semantic is not
> identical (ATA vs NVMe). Yet, we have the RT class that maps to high
> priority for ATA, and nothing else as far as I know.
> 
> CDL at least covers SCSI *and* ATA, and as mentioned above, could be used
> by NVMe-of host drivers to do fancy link selection for a multipath setup
> based on the link speed for instance.
> 
> We could overload the RT class with a mapping to CDL feature on scsi and
> ata, but I think this is more confusing/messy than a separate class as we
> implemented.

Hi Damien,

The more I think about this, the more I'm convinced that it would be wrong
to introduce IOPRIO_CLASS_DL. Datacenters will have a mix of drives that
support CDL and drives that do not support CDL. It seems wrong to me to
make user space software responsible for figuring out whether or not the
drive supports CDL before it can be decided which I/O priority class should
be used. This is something the kernel should do instead of user space
software.

If we were to ask Android storage vendors to implement CDL, then IOPRIO_CLASS_DL
would cause trouble too. Android has for a long time supported giving the
foreground application a higher I/O priority than background applications.
The cgroup settings for foreground and background applications come from the
task_profiles.json file (see also
https://android.googlesource.com/platform/system/core/+/master/libprocessgroup/profiles/task_profiles.json).
As one can see all the settings in that file are independent of the features
of the storage device. Introducing IOPRIO_CLASS_DL in the kernel and using it
in task_profiles.json would imply that the storage device type has to be
determined before it can be decided whether or not IOPRIO_CLASS_DL can be used.
This seems wrong to me.

I downloaded the patch series in its entirety and applied it on a local kernel
branch. I verified which changes would be needed to replace IOPRIO_CLASS_DL
with IOPRIO_CLASS_RT. Can you help me with verifying the patch below?

Regarding the BFQ changes in the patch below, is an I/O scheduler useful at all
if CDL is used since a comment in the BFQ driver says that the disk should do
the scheduling instead of BFQ if CDL is used?

Thanks,

Bart.


diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 7add9346c585..815b884d6c5a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5545,14 +5545,6 @@ bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
  		bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE;
  		bfqq->new_ioprio = 7;
  		break;
-	case IOPRIO_CLASS_DL:
-		/*
-		 * For the duration-limits class, we want the disk to do the
-		 * scheduling. So map all levels to the highest RT level.
-		 */
-		bfqq->new_ioprio = 0;
-		bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
-		break;
  	}

  	if (bfqq->new_ioprio >= IOPRIO_NR_LEVELS) {
@@ -5681,8 +5673,6 @@ static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
  		return &bfqg->async_bfqq[1][ioprio][act_idx];
  	case IOPRIO_CLASS_IDLE:
  		return &bfqg->async_idle_bfqq[act_idx];
-	case IOPRIO_CLASS_DL:
-		return &bfqg->async_bfqq[0][0][act_idx];
  	default:
  		return NULL;
  	}
diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
index dfb5c3f447f4..8bb6b8eba4ce 100644
--- a/block/blk-ioprio.c
+++ b/block/blk-ioprio.c
@@ -27,7 +27,6 @@
   * @POLICY_RESTRICT_TO_BE: modify IOPRIO_CLASS_NONE and IOPRIO_CLASS_RT into
   *		IOPRIO_CLASS_BE.
   * @POLICY_ALL_TO_IDLE: change the I/O priority class into IOPRIO_CLASS_IDLE.
- * @POLICY_ALL_TO_DL: change the I/O priority class into IOPRIO_CLASS_DL.
   *
   * See also <linux/ioprio.h>.
   */
@@ -36,7 +35,6 @@ enum prio_policy {
  	POLICY_NONE_TO_RT	= 1,
  	POLICY_RESTRICT_TO_BE	= 2,
  	POLICY_ALL_TO_IDLE	= 3,
-	POLICY_ALL_TO_DL	= 4,
  };

  static const char *policy_name[] = {
@@ -44,7 +42,6 @@ static const char *policy_name[] = {
  	[POLICY_NONE_TO_RT]	= "none-to-rt",
  	[POLICY_RESTRICT_TO_BE]	= "restrict-to-be",
  	[POLICY_ALL_TO_IDLE]	= "idle",
-	[POLICY_ALL_TO_DL]	= "duration-limits",
  };

  static struct blkcg_policy ioprio_policy;
diff --git a/block/ioprio.c b/block/ioprio.c
index 1b3a9da82597..32a456b45804 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -37,7 +37,6 @@ int ioprio_check_cap(int ioprio)

  	switch (class) {
  		case IOPRIO_CLASS_RT:
-		case IOPRIO_CLASS_DL:
  			/*
  			 * Originally this only checked for CAP_SYS_ADMIN,
  			 * which was implicitly allowed for pid 0 by security
@@ -48,7 +47,7 @@ int ioprio_check_cap(int ioprio)
  			if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_NICE))
  				return -EPERM;
  			fallthrough;
-			/* RT and DL have prio field too */
+			/* rt has prio field too */
  		case IOPRIO_CLASS_BE:
  			if (data >= IOPRIO_NR_LEVELS || data < 0)
  				return -EINVAL;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 526d0ea4dbf9..f10c2a0d18d4 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -113,7 +113,6 @@ static const enum dd_prio ioprio_class_to_prio[] = {
  	[IOPRIO_CLASS_RT]	= DD_RT_PRIO,
  	[IOPRIO_CLASS_BE]	= DD_BE_PRIO,
  	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
-	[IOPRIO_CLASS_DL]	= DD_RT_PRIO,
  };

  static inline struct rb_root *
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index b4761c3c4b91..3065b632e6ae 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -673,7 +673,7 @@ static inline void ata_set_tf_cdl(struct ata_queued_cmd *qc, int ioprio)
  	struct ata_taskfile *tf = &qc->tf;
  	int cdl;

-	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_DL)
+	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_RT)
  		return;

  	cdl = IOPRIO_PRIO_DATA(ioprio) & 0x07;
diff --git a/drivers/scsi/sd_cdl.c b/drivers/scsi/sd_cdl.c
index 59d02dbb5ea1..c5286f5ddae4 100644
--- a/drivers/scsi/sd_cdl.c
+++ b/drivers/scsi/sd_cdl.c
@@ -880,10 +880,10 @@ int sd_cdl_dld(struct scsi_disk *sdkp, struct scsi_cmnd *scmd)
  	unsigned int dld;

  	/*
-	 * Use "no limit" if the request ioprio class is not IOPRIO_CLASS_DL
+	 * Use "no limit" if the request ioprio class is not IOPRIO_CLASS_RT
  	 * or if the user specified an invalid CDL descriptor index.
  	 */
-	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_DL)
+	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_RT)
  		return 0;

  	dld = IOPRIO_PRIO_DATA(ioprio);
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 2f3fc2fbd668..7578d4f6a969 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -20,7 +20,7 @@ static inline bool ioprio_valid(unsigned short ioprio)
  {
  	unsigned short class = IOPRIO_PRIO_CLASS(ioprio);

-	return class > IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_DL;
+	return class > IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_IDLE;
  }

  /*
diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index 15908b9e9d8c..f70f2596a6bf 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -29,7 +29,6 @@ enum {
  	IOPRIO_CLASS_RT,
  	IOPRIO_CLASS_BE,
  	IOPRIO_CLASS_IDLE,
-	IOPRIO_CLASS_DL,
  };

  /*
@@ -38,12 +37,6 @@ enum {
  #define IOPRIO_NR_LEVELS	8
  #define IOPRIO_BE_NR		IOPRIO_NR_LEVELS

-/*
- * The Duration limits class allows 8 levels: level 0 for "no limit" and levels
- * 1 to 7, each corresponding to a read or write limit descriptor.
- */
-#define IOPRIO_DL_NR_LEVELS	8
-
  enum {
  	IOPRIO_WHO_PROCESS = 1,
  	IOPRIO_WHO_PGRP,


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-25 18:37               ` Bart Van Assche
@ 2023-01-25 21:23                 ` Niklas Cassel
  2023-01-26  0:24                   ` Damien Le Moal
  0 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-25 21:23 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Damien Le Moal, Paolo Valente, Jens Axboe, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, linux-ide, linux-block

On Wed, Jan 25, 2023 at 10:37:52AM -0800, Bart Van Assche wrote:

(snip)

> Hi Damien,
> 
> The more I think about this, the more I'm convinced that it would be wrong
> to introduce IOPRIO_CLASS_DL. Datacenters will have a mix of drives that
> support CDL and drives that do not support CDL. It seems wrong to me to
> make user space software responsible for figuring out whether or not the
> drive supports CDL before it can be decided which I/O priority class should
> be used. This is something the kernel should do instead of user space
> software.

Well, take NCQ priority as an example, as that is probably
the only device side I/O priority feature currently supported by the
kernel.

If you want to make use of NCQ priority, you need to first enable
/sys/block/sdX/device/ncq_prio_enable
and then submit I/O using IOPRIO_CLASS_RT, so I would argue the user
already needs to know that a device supports device side I/O priority,
if he wants to make use of it.
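
As a rough sketch of the second step (submitting I/O with IOPRIO_CLASS_RT
once ncq_prio_enable has been set), a process could set its own I/O priority
along these lines. The macro values mirror include/uapi/linux/ioprio.h; the
helper names ioprio_value() and set_self_rt_ioprio() are made up for this
illustration, and the call is expected to fail with EPERM without
CAP_SYS_NICE or CAP_SYS_ADMIN, as per ioprio_check_cap().

```c
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values mirroring include/uapi/linux/ioprio.h. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_CLASS_RT		1
#define IOPRIO_NR_LEVELS	8
#define IOPRIO_WHO_PROCESS	1

/* Hypothetical helper: pack a priority class and a level into an ioprio value. */
static int ioprio_value(int prio_class, int level)
{
	assert(level >= 0 && level < IOPRIO_NR_LEVELS);
	return (prio_class << IOPRIO_CLASS_SHIFT) | level;
}

/*
 * Hypothetical helper: give the calling process the RT class at @level.
 * Returns 0 on success, -1 with errno set otherwise (e.g. EPERM).
 */
static int set_self_rt_ioprio(int level)
{
	return (int)syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
			    ioprio_value(IOPRIO_CLASS_RT, level));
}
```

With ncq_prio_enable set, any RT level would end up as a high priority NCQ
command; the level itself only matters to a host side scheduler.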


For CDL there are 7 different limits for reads and 7 different
limits for writes, these limits can be configured by the user.
So users who want to get the most performance out of their drive
will most likely analyze their workloads, and set the limits depending
on what their workload actually looks like.

Bottom line is that heavy users of CDL will absolutely know how the CDL
limits are configured in user space, as they will pick the correct CDL
index (prio level) for the descriptor that they want to use for the
specific I/O that they are doing. An ioscheduler will most likely be
disabled.
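
To make "CDL index (prio level)" a bit more concrete, here is a small sketch
of how the class data would be interpreted, modelled on the ata_set_tf_cdl()
hunk quoted earlier in this thread. IOPRIO_CLASS_DL == 4 is assumed from its
position in the enum in this series, and cdl_index_from_ioprio() is a
hypothetical name, not a function from the patches (sd_cdl_dld() additionally
rejects invalid descriptor indexes, which is not modelled here).

```c
#include <assert.h>

/* Macros mirroring include/uapi/linux/ioprio.h. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_MASK	((1U << IOPRIO_CLASS_SHIFT) - 1)
#define IOPRIO_PRIO_CLASS(ioprio)	((ioprio) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(ioprio)	((ioprio) & IOPRIO_PRIO_MASK)
#define IOPRIO_PRIO_VALUE(c, d)	(((c) << IOPRIO_CLASS_SHIFT) | (d))

enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
	IOPRIO_CLASS_DL,	/* as proposed by this series */
};

static_assert(IOPRIO_CLASS_DL == 4, "DL follows IDLE in the enum");

/*
 * Hypothetical helper: return the CDL descriptor index (0 = no limit,
 * 1-7 = descriptor 1-7) carried by @ioprio.
 */
static unsigned int cdl_index_from_ioprio(unsigned int ioprio)
{
	/* Only the duration-limits class carries a descriptor index. */
	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_DL)
		return 0;

	return IOPRIO_PRIO_DATA(ioprio) & 0x07;
}
```

So an application that wants descriptor 3 for a given I/O would use
IOPRIO_PRIO_VALUE(IOPRIO_CLASS_DL, 3), and any other class means "no limit".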

(For CDL, the limit is from the time the command is submitted to the device,
so from the device's PoV, it does not really matter if a command is queued
for a long time in a scheduler or not, but from an application PoV, it does
not make sense to hold back a command for long if it e.g. has a short limit.)


If we were to reuse IOPRIO_CLASS_RT, then I guess the best option would be
to have something like:

$ cat /sys/block/sdX/device/rt_prio_backend
[none] ncq-prio cdl

Devices that do not support ncq-prio or cdl,
e.g. currently NVMe, would just have none
(i.e. RT simply means higher host side priority (if a scheduler is used)).

SCSI would then have none and cdl
(for SCSI devices supporting CDL.)

ATA would have none, ncq-prio and cdl.
(for ATA devices supporting CDL.)

That would theoretically avoid another ioprio class, but like I've just
explained, a user space application making use of CDL would for sure know
what the descriptors look like anyway, so I'm not sure if there is an actual
benefit of doing it this way over simply having an IOPRIO_CLASS_DL.

I guess the only benefit would be that we would avoid introducing another
I/O priority class (at the expense of additional complexity elsewhere).


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-25  1:19             ` Damien Le Moal
  2023-01-25 18:37               ` Bart Van Assche
@ 2023-01-25 23:11               ` Keith Busch
  2023-01-26  0:08                 ` Damien Le Moal
  2023-01-26  5:26                 ` Christoph Hellwig
  1 sibling, 2 replies; 82+ messages in thread
From: Keith Busch @ 2023-01-25 23:11 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Niklas Cassel, Paolo Valente, Jens Axboe,
	Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block

On Wed, Jan 25, 2023 at 10:19:45AM +0900, Damien Le Moal wrote:
> On 1/25/23 09:05, Bart Van Assche wrote:
> 
> > which it is not sure that it will be integrated in other storage 
> > standards (NVMe, ...). Isn't the purpose of the block layer to provide 
> > an interface that is independent of the specifics of a single storage 
> > standard? This is why I'm in favor of letting the ATA core translate one 
> > of the existing I/O priority classes into a CDL instead of introducing a 
> > new I/O priority class (IOPRIO_CLASS_DL) in the block layer.
> 
> We discussed CDL with Hannes in the context of NVMe over fabrics. There
> may be interesting extensions to consider for NVMe in that context (the
> value for local PCI attached NVMe drive is more limited at best).

I wouldn't necessarily rule out CDL for PCI attached in some future TP. NVMe
does allow rotating media, and they'll want feature parity if CDL is considered
useful in other protocols.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-25 23:11               ` Keith Busch
@ 2023-01-26  0:08                 ` Damien Le Moal
  2023-01-26  5:26                 ` Christoph Hellwig
  1 sibling, 0 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-26  0:08 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bart Van Assche, Niklas Cassel, Paolo Valente, Jens Axboe,
	Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block

On 2023/01/26 8:11, Keith Busch wrote:
> On Wed, Jan 25, 2023 at 10:19:45AM +0900, Damien Le Moal wrote:
>> On 1/25/23 09:05, Bart Van Assche wrote:
>>
>>> which it is not sure that it will be integrated in other storage 
>>> standards (NVMe, ...). Isn't the purpose of the block layer to provide 
>>> an interface that is independent of the specifics of a single storage 
>>> standard? This is why I'm in favor of letting the ATA core translate one 
>>> of the existing I/O priority classes into a CDL instead of introducing a 
>>> new I/O priority class (IOPRIO_CLASS_DL) in the block layer.
>>
>> We discussed CDL with Hannes in the context of NVMe over fabrics. There
>> may be interesting extensions to consider for NVMe in that context (the
>> value for local PCI attached NVMe drive is more limited at best).
> 
> I wouldn't necessarily rule out CDL for PCI attached in some future TP. NVMe
> does allow rotating media, and they'll want feature parity if CDL is considered
> useful in other protocols.

True. If NVMe HDDs come to market, we'll definitely want a CDL feature too.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-25 21:23                 ` Niklas Cassel
@ 2023-01-26  0:24                   ` Damien Le Moal
  2023-01-26 13:53                     ` Niklas Cassel
  0 siblings, 1 reply; 82+ messages in thread
From: Damien Le Moal @ 2023-01-26  0:24 UTC (permalink / raw)
  To: Niklas Cassel, Bart Van Assche
  Cc: Paolo Valente, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-ide, linux-block

On 2023/01/26 6:23, Niklas Cassel wrote:
> On Wed, Jan 25, 2023 at 10:37:52AM -0800, Bart Van Assche wrote:
> 
> (snip)
> 
>> Hi Damien,
>>
>> The more I think about this, the more I'm convinced that it would be wrong
>> to introduce IOPRIO_CLASS_DL. Datacenters will have a mix of drives that
>> support CDL and drives that do not support CDL. It seems wrong to me to
>> make user space software responsible for figuring out whether or not the
>> drive supports CDL before it can be decided which I/O priority class should
>> be used. This is something the kernel should do instead of user space
>> software.
> 
> Well, if we take e.g. NCQ priority as an example, as that is probably
> the only device side I/O priority feature currently supported by the
> kernel.
> 
> If you want to make use of NCQ priority, you need to first enable
> /sys/block/sdX/device/ncq_prio_enable
> and then submit I/O using IOPRIO_CLASS_RT, so I would argue the user
> already needs to know that a device supports device side I/O priority,
> if he wants to make use of it.

Yes, absolutely. In addition to this, NCQ high priority feature is optional. The
host-level RT class scheduling works the same way regardless of a SATA drive
supporting NCQ high priority or not. If ncq_prio_enable is not enabled (or not
supported), the scheduler still works as before. If ncq_prio_enable is set for a
drive that supports NCQ high prio, then the user gets the additional benefit of
*also* having the drive prioritize the commands from high-priority user IOs.

> For CDL there are 7 different limits for reads and 7 different
> limits for writes, these limits can be configured by the user.
> So the users that want to get most performance out of their drive
> will most likely analyze their workloads, and set the limits depending
> on how their workload actually looks like.
> 
> Bottom line is that heavy users of CDL will absolutely know how the CDL
> limits are configured in user space, as they will pick the correct CDL
> index (prio level) for the descriptor that they want to use for the
> specific I/O that they are doing. An ioscheduler will most likely be
> disabled.

Yes. And for cases where we still need an IO scheduler (e.g. SMR with
mq-deadline), we really cannot use the priority level (CDL index) as a
meaningful information to make request scheduling decisions because I think it
is simply impossible to reliably define a "priority" order for the 7 read and
write descriptors. We cannot map a set of 14 descriptors with a very large
possible number of variations to sorted array of priority-like levels.

> (For CDL, the limit is from the time the command is submitted to the device,
> so from the device's PoV, it does not really matter if a command is queued
> for a long time in a scheduler or not, but from an application PoV, it does
> not make sense to hold back a command for long if it e.g. has a short limit.)
> 
> 
> If we were to reuse IOPRIO_CLASS_RT, then I guess the best option would be
> to have something like:
> 
> $ cat /sys/block/sdX/device/rt_prio_backend
> [none] ncq-prio cdl

No need for this. We can keep the existing ncq_prio_enable and the proposed
duration_limits/enable sysfs attributes. The user cannot enable both at the same
time with our patches. So if the user enables ncq_prio_enable, then it will get
high priority NCQ commands mapping for any level of the RT class. If
duration_limits/enable is set, then the user will get CDL scheduling of commands
on the drive.
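
To illustrate, the device side meaning of an RT request under such an
overloading could be sketched as below. This is purely hypothetical:
struct dev_prio_cfg, struct dev_hint and rt_device_hint() do not exist in
the series, they only model the two mutually exclusive sysfs attributes.

```c
#include <assert.h>
#include <stdbool.h>

/* Macros mirroring include/uapi/linux/ioprio.h. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_CLASS_RT		1
#define IOPRIO_PRIO_CLASS(ioprio)	((ioprio) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(ioprio)	((ioprio) & ((1U << IOPRIO_CLASS_SHIFT) - 1))

/* Hypothetical model of the per-device sysfs state. */
struct dev_prio_cfg {
	bool ncq_prio_enable;	/* ncq_prio_enable */
	bool cdl_enable;	/* duration_limits/enable */
};

/* Hypothetical summary of what the command sent to the drive carries. */
struct dev_hint {
	bool ncq_high_prio;	/* NCQ high priority bit */
	unsigned int cdl_index;	/* 0 means "no limit" */
};

static struct dev_hint rt_device_hint(const struct dev_prio_cfg *cfg,
				      unsigned int ioprio)
{
	struct dev_hint hint = { false, 0 };

	/* The two features cannot be enabled at the same time. */
	assert(!(cfg->ncq_prio_enable && cfg->cdl_enable));

	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_RT)
		return hint;

	if (cfg->ncq_prio_enable)
		hint.ncq_high_prio = true;	/* any RT level */
	else if (cfg->cdl_enable)
		hint.cdl_index = IOPRIO_PRIO_DATA(ioprio) & 0x07;

	return hint;
}
```

Note that in the CDL case the RT level stops being a priority and becomes a
descriptor index, which is exactly what a level-based host scheduler cannot
know.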

But again, the difficulty with this overloading is that we *cannot* implement a
solid level-based scheduling in IO schedulers because ordering the CDLs in a
meaningful way is impossible. So BFQ handling of the RT class would likely not
result in the most ideal scheduling (that would depend heavily on how the CDL
descriptors are defined on the drive). Hence my reluctance to overload the RT
class for CDL.

> Devices that do not support ncq-prio or cdl,
> e.g. currently NVMe, would just have none
> (i.e. RT simply means higher host side priority (if a scheduler is used)).

Yes. Exactly.

> SCSI would then have none and cdl
> (for SCSI devices supporting CDL.)
> 
> ATA would have none, ncq-prio and cdl.
> (for ATA devices supporting CDL.)
> 
> That would theoretically avoid another ioprio class, but like I've just
> explained, a user space application making use of CDL would for sure know
> what the descriptors look like anyway, so I'm not sure if there is an actual
> benefit of doing it this way over simply having an IOPRIO_CLASS_DL.

Agree. And as explained above, I think that reusing the RT class creates more
problems than it solves; the simplification is only apparent.

> I guess the only benefit would be that we would avoid introducing another
> I/O priority class (at the expense of additional complexity elsewhere).

Yes. And I think that the added complexity to correctly handle the overloaded RT
class is too much. RT class has been around for a long time for host-level IO
priority scheduling. Let's not break it in weird ways.

We certainly can work on improving handling of IOPRIO_CLASS_DL in IO schedulers.
But in my opinion, that can be done later, after this initial series introducing
CDL support is applied.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-25 23:11               ` Keith Busch
  2023-01-26  0:08                 ` Damien Le Moal
@ 2023-01-26  5:26                 ` Christoph Hellwig
  1 sibling, 0 replies; 82+ messages in thread
From: Christoph Hellwig @ 2023-01-26  5:26 UTC (permalink / raw)
  To: Keith Busch
  Cc: Damien Le Moal, Bart Van Assche, Niklas Cassel, Paolo Valente,
	Jens Axboe, Christoph Hellwig, Hannes Reinecke, linux-scsi,
	linux-ide, linux-block

On Wed, Jan 25, 2023 at 04:11:41PM -0700, Keith Busch wrote:
> I wouldn't necessarily rule out CDL for PCI attached in some future TP. NVMe
> does allow rotating media, and they'll want feature parity if CDL is considered
> useful in other protocols.

NVMe has a TP for CDL that is technically active, although it doesn't
seem to be actively worked on right now.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-26  0:24                   ` Damien Le Moal
@ 2023-01-26 13:53                     ` Niklas Cassel
  2023-01-26 17:33                       ` Bart Van Assche
  0 siblings, 1 reply; 82+ messages in thread
From: Niklas Cassel @ 2023-01-26 13:53 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Paolo Valente, Jens Axboe, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, linux-ide, linux-block

On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote:
> On 2023/01/26 6:23, Niklas Cassel wrote:
> > On Wed, Jan 25, 2023 at 10:37:52AM -0800, Bart Van Assche wrote:

(snip)

> > If we were to reuse IOPRIO_CLASS_RT, then I guess the best option would be
> > to have something like:
> > 
> > $ cat /sys/block/sdX/device/rt_prio_backend
> > [none] ncq-prio cdl
> 
> No need for this. We can keep the existing ncq_prio_enable and the proposed
> duration_limits/enable sysfs attributes. The user cannot enable both at the same
> time with our patches. So if the user enables ncq_prio_enable, then it will get
> high priority NCQ commands mapping for any level of the RT class. If
> duration_limits/enable is set, then the user will get CDL scheduling of commands
> on the drive.
> 
> But again, the difficulty with this overloading is that we *cannot* implement a
> solid level-based scheduling in IO schedulers because ordering the CDLs in a
> meaningful way is impossible. So BFQ handling of the RT class would likely not
> result in the most ideal scheduling (that would depend heavily on how the CDL
> descriptors are defined on the drive). Hence my reluctance to overload the RT
> class for CDL.

Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either have to
disable the IO scheduler, so that lower classdata levels wouldn't be prioritized
over higher classdata levels, or simply use an IO scheduler that does not care
about the classdata level, e.g. mq-deadline.

From ionice man page:

-n, --classdata level
Specify the scheduling class data. This only has an effect if the class accepts
an argument. For realtime and best-effort, 0-7 are valid data (priority levels),
and 0 represents the highest priority level.


I guess it kind of made sense for NCQ priority to piggyback on IOPRIO_CLASS_RT,
since the only thing that libata has to do is to set the single high prio bit
(so the classdata could still be used for prioritizing IOs on the host side).

However, for CDL, things are not as simple as setting a single bit in the
command, because of all the different descriptors, so we must let the classdata
represent the device side priority level, and not the host side priority level
(as we cannot have both, and I agree with you, it is very hard to define an order
between the descriptors, e.g. should a 20 ms policy 0xf descriptor be ranked
higher or lower than a 20 ms policy 0xd descriptor?).
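
A small sketch of the problem: if descriptors are ranked by their time limit
alone, the two descriptors from the example compare as equal, and nothing in
the descriptor says which policy should win. struct dld and
dld_cmp_by_limit() are illustrative only, not the layout used by the drive
or by the patches.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative descriptor: a limit and a policy, nothing more. */
struct dld {
	uint32_t limit_us;	/* duration limit, in microseconds */
	uint8_t policy;		/* e.g. 0xf (abort), 0xd (complete-unavailable) */
};

/*
 * Rank by limit only: negative if @a is tighter, positive if @b is
 * tighter, 0 on a tie. The policy is deliberately ignored, since there is
 * no obvious way to order policies.
 */
static int dld_cmp_by_limit(const struct dld *a, const struct dld *b)
{
	assert(a && b);
	return (a->limit_us > b->limit_us) - (a->limit_us < b->limit_us);
}
```

Breaking the tie would require the host to decide that "abort" is more or
less urgent than "complete-unavailable", and that is a workload decision, not
something a scheduler can derive.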

It's best to let the definition for IOPRIO_CLASS_RT stay the way it always has,
0 represents the highest priority level, 7 the lowest priority level (and we
wouldn't be able to change how the schedulers handle IOPRIO_CLASS_RT anyway).

If we update the man page with an IOPRIO_CLASS_DL entry, we could clearly state
that IO schedulers do not care about the classdata at all for IOPRIO_CLASS_DL
(and that the classdata is solely used to convey a priority state/index to the
device).

So I think this patch is good as it is.

Bart, do you agree? Any chance for a Reviewed-by?


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-26 13:53                     ` Niklas Cassel
@ 2023-01-26 17:33                       ` Bart Van Assche
  2023-01-27  0:18                         ` Damien Le Moal
  0 siblings, 1 reply; 82+ messages in thread
From: Bart Van Assche @ 2023-01-26 17:33 UTC (permalink / raw)
  To: Niklas Cassel, Damien Le Moal
  Cc: Paolo Valente, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-ide, linux-block

On 1/26/23 05:53, Niklas Cassel wrote:
> On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote:
>> But again, the difficulty with this overloading is that we *cannot* implement a
>> solid level-based scheduling in IO schedulers because ordering the CDLs in a
>> meaningful way is impossible. So BFQ handling of the RT class would likely not
>> result in the most ideal scheduling (that would depend heavily on how the CDL
>> descriptors are defined on the drive). Hence my reluctance to overload the RT
>> class for CDL.
> 
> Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either have to
> disable the IO scheduler, so that lower classdata levels wouldn't be prioritized
> over higher classdata levels, or simply use an IO scheduler that does not care
> about the classdata level, e.g. mq-deadline.

How about making the information about whether or not CDL has been 
enabled available to the scheduler such that the scheduler can include 
that information in its decisions?

> However, for CDL, things are not as simple as setting a single bit in the
> command, because of all the different descriptors, so we must let the classdata
> represent the device side priority level, and not the host side priority level
> (as we cannot have both, and I agree with you, it is very hard to define an order
> between the descriptors, e.g. should a 20 ms policy 0xf descriptor be ranked
> higher or lower than a 20 ms policy 0xd descriptor?).

How about only supporting a subset of the standard such that it becomes 
easy to map CDLs to host side priority levels?

If users really need the ability to use all standardized CDL features 
and if there is no easy way to map CDL levels to an I/O priority, is the 
I/O priority mechanism really the best basis for a user space interface 
for CDLs?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-26 17:33                       ` Bart Van Assche
@ 2023-01-27  0:18                         ` Damien Le Moal
  2023-01-27  1:40                           ` Damien Le Moal
  0 siblings, 1 reply; 82+ messages in thread
From: Damien Le Moal @ 2023-01-27  0:18 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel
  Cc: Paolo Valente, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-ide, linux-block

On 1/27/23 02:33, Bart Van Assche wrote:
> On 1/26/23 05:53, Niklas Cassel wrote:
>> On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote:
>>> But again, the difficulty with this overloading is that we *cannot* implement a
>>> solid level-based scheduling in IO schedulers because ordering the CDLs in a
>>> meaningful way is impossible. So BFQ handling of the RT class would likely not
>>> result in the most ideal scheduling (that would depend heavily on how the CDL
>>> descriptors are defined on the drive). Hence my reluctance to overload the RT
>>> class for CDL.
>>
>> Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either have to
>> disable the IO scheduler, so that lower classdata levels wouldn't be prioritized
>> over higher classdata levels, or simply use an IO scheduler that does not care
>> about the classdata level, e.g. mq-deadline.
> 
> How about making the information about whether or not CDL has been 
> enabled available to the scheduler such that the scheduler can include 
> that information in its decisions?

Sure, that is easy to do. But as I mentioned before, I think that is
something we can do after this initial support series.

>> However, for CDL, things are not as simple as setting a single bit in the
>> command, because of all the different descriptors, so we must let the classdata
>> represent the device side priority level, and not the host side priority level
>> (as we cannot have both, and I agree with you, it is very hard to define an order
>> between the descriptors.. e.g. should a 20 ms policy 0xf descriptor be ranked
>> higher or lower than a 20 ms policy 0xd descriptor?).
> 
> How about only supporting a subset of the standard such that it becomes 
> easy to map CDLs to host side priority levels?

I am opposed to this, for several reasons:

1) We are seeing different use cases from users that cover a wide range of
use of CDL descriptors with various definitions.

2) Passthrough commands can be used by a user to change a drive CDL
descriptors without the kernel knowing about it, unless we spend our time
revalidating the CDL descriptor log page(s)...

3) The CDL standard, as is, is actually very sensible and not overloaded with
stuff that is only useful in niche use cases. For each CDL descriptor, you
have:
 * The active time limit, which is a clean way to specify how much time
you allow a drive to deal with bad sectors (mostly read case). A typical
HDD will try very hard to recover data from a sector, always. As a result,
the HDD may spend up to several seconds reading a sector again and again
applying different signal processing techniques until it gets the sector
ECC checked to return valid data. That of course can hugely increase an IO
latency seen by the host. In applications such as erasure coded
distributed object stores, maximum latency for an object access can thus
be kept low using this limit without compromising the data since the
object can always be rebuilt from the erasure codes if one HDD is slow to
respond. This limit is also interesting for video streaming/playback to
avoid video buffer underflow (at the expense of maybe some block noise
depending on the codec).
 * The inactive time limit can be used to tell the drive how long it is
allowed to let a command stand in the drive internal queue before
processing. This is thus a parameter that allows a host to tune the drive
RPO optimization (rotational positioning optimization, e.g. HDD internal
command scheduling based on angular sector position on tracks with the
head current position). This is a neat way to control max IOPS vs tail
latency since drives tend to privilege maximizing IOPS over lowering max
tail latency.
 * The duration guideline limit defines an overall time limit for a
command without distinguishing between active and inactive time. It is the
easiest to use (the easiest one to understand from a beginner user point
of view). This is a neat way to define an intelligent IO prioritization in
fact, way better than RT class scheduling on the host or the use of ATA
NCQ high priority, as it provides more information to the drive about the
urgency of a particular command. That allows the drive to still perform
RPO to maximize IOPS without long tail latencies. Chaining such limit with
an active+inactive time limit descriptor using the "next limit" policy
(0x1 policy) can also finely define what the drive should do if the guideline
limit is exceeded (as the next descriptor can define what to do based on
the reason for the limit being exceeded: long internal queueing vs bad
sector long access time).

> If users really need the ability to use all standardized CDL features 
> and if there is no easy way to map CDL levels to an I/O priority, is the 
> I/O priority mechanism really the best basis for a user space interface 
> for CDLs?

As you can see above, yes, we need everything and should not attempt
restricting CDL use. The IO priority interface is a perfect fit for CDL in
the sense that all we need to pass along from user to device is one
number: the CDL index to use for a command. So creating a different
interface for this while the IO priority interface exactly does that
sounds silly to me.

One compromise we could do is: have the IO schedulers completely ignore
CDL prio class for now, that is, have them assume that no IO prio
class/level was specified. Given that they are not tuned to handle CDL
well anyway, this is probably the best thing to do for now.

We still need to have the block layer prevent merging of requests with
different CDL descriptors though, which is another reason to reuse the IO
prio interface as the block layer already does this. Less code, which is
always a good thing.
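
To make the "one number" point concrete, here is a minimal userspace sketch
of how a CDL index could ride in an ioprio value. The 13-bit class shift
matches the existing ioprio uapi; the numeric value used for the new class
and all helper names are assumptions made for illustration, not something
taken from the patch.

```c
#include <stdint.h>

/* Same layout as the existing ioprio uapi: class above bit 13, data below. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_MASK	((1U << IOPRIO_CLASS_SHIFT) - 1)

/* Assumption: illustrative value only, the real one is set by the series. */
#define EXAMPLE_IOPRIO_CLASS_DL	4U

/* Build an ioprio value carrying a CDL descriptor index (0 means no limit). */
static inline uint16_t example_cdl_ioprio(unsigned int cdl_index)
{
	return (uint16_t)((EXAMPLE_IOPRIO_CLASS_DL << IOPRIO_CLASS_SHIFT) |
			  (cdl_index & 0x7));
}

static inline unsigned int example_ioprio_class(uint16_t ioprio)
{
	return ioprio >> IOPRIO_CLASS_SHIFT;
}

static inline unsigned int example_ioprio_level(uint16_t ioprio)
{
	return ioprio & IOPRIO_PRIO_MASK;
}

/* Hypothetical merge rule: only requests with equal ioprio may be merged. */
static inline int example_can_merge(uint16_t a, uint16_t b)
{
	return a == b;
}
```

With this encoding, two requests that carry different CDL indexes also
carry different ioprio values, which is the property a merge check based on
ioprio equality relies on.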

> 
> Thanks,
> 
> Bart.

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-27  0:18                         ` Damien Le Moal
@ 2023-01-27  1:40                           ` Damien Le Moal
  2023-01-27 17:23                             ` Bart Van Assche
  0 siblings, 1 reply; 82+ messages in thread
From: Damien Le Moal @ 2023-01-27  1:40 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel
  Cc: Paolo Valente, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-ide, linux-block

On 1/27/23 09:18, Damien Le Moal wrote:
> On 1/27/23 02:33, Bart Van Assche wrote:
>> On 1/26/23 05:53, Niklas Cassel wrote:
>>> On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote:
>>>> But again, the difficulty with this overloading is that we *cannot* implement a
>>>> solid level-based scheduling in IO schedulers because ordering the CDLs in a
>>>> meaningful way is impossible. So BFQ handling of the RT class would likely not
>>>> result in the most ideal scheduling (that would depend heavily on how the CDL
>>>> descriptors are defined on the drive). Hence my reluctance to overload the RT
>>>> class for CDL.
>>>
>>> Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either have to
>>> disable the IO scheduler, so that lower classdata levels wouldn't be prioritized
>>> over higher classdata levels, or simply use an IO scheduler that does not care
>>> about the classdata level, e.g. mq-deadline.
>>
>> How about making the information about whether or not CDL has been 
>> enabled available to the scheduler such that the scheduler can include 
>> that information in its decisions?
> 
> Sure, that is easy to do. But as I mentioned before, I think that is
> something we can do after this initial support series.
> 
>>> However, for CDL, things are not as simple as setting a single bit in the
>>> command, because of all the different descriptors, so we must let the classdata
>>> represent the device side priority level, and not the host side priority level
>>> (as we cannot have both, and I agree with you, it is very hard to define an order
>>> between the descriptors.. e.g. should a 20 ms policy 0xf descriptor be ranked
>>> higher or lower than a 20 ms policy 0xd descriptor?).
>>
>> How about only supporting a subset of the standard such that it becomes 
>> easy to map CDLs to host side priority levels?
> 
> I am opposed to this, for several reasons:
> 
> 1) We are seeing different use cases from users that cover a wide range of
> use of CDL descriptors with various definitions.
> 
> 2) Passthrough commands can be used by a user to change a drive CDL
> descriptors without the kernel knowing about it, unless we spend our time
> revalidating the CDL descriptor log page(s)...
> 
> 3) The CDL standard, as is, is actually very sensible and not overloaded with
> stuff that is only useful in niche use cases. For each CDL descriptor, you
> have:
>  * The active time limit, which is a clean way to specify how much time
> you allow a drive to deal with bad sectors (mostly read case). A typical
> HDD will try very hard to recover data from a sector, always. As a result,
> the HDD may spend up to several seconds reading a sector again and again
> applying different signal processing techniques until it gets the sector
> ECC checked to return valid data. That of course can hugely increase an IO
> latency seen by the host. In applications such as erasure coded
> distributed object stores, maximum latency for an object access can thus
> be kept low using this limit without compromising the data since the
> object can always be rebuilt from the erasure codes if one HDD is slow to
> respond. This limit is also interesting for video streaming/playback to
> avoid video buffer underflow (at the expense of maybe some block noise
> depending on the codec).
>  * The inactive time limit can be used to tell the drive how long it is
> allowed to let a command stand in the drive internal queue before
> processing. This is thus a parameter that allows a host to tune the drive
> RPO optimization (rotational positioning optimization, e.g. HDD internal
> command scheduling based on angular sector position on tracks with the
> head current position). This is a neat way to control max IOPS vs tail
> latency since drives tend to privilege maximizing IOPS over lowering max
> tail latency.
>  * The duration guideline limit defines an overall time limit for a
> command without distinguishing between active and inactive time. It is the
> easiest to use (the easiest one to understand from a beginner user point
> of view). This is a neat way to define an intelligent IO prioritization in
> fact, way better than RT class scheduling on the host or the use of ATA
> NCQ high priority, as it provides more information to the drive about the
> urgency of a particular command. That allows the drive to still perform
> RPO to maximize IOPS without long tail latencies. Chaining such limit with
> an active+inactive time limit descriptor using the "next limit" policy
> (0x1 policy) can also finely define what the drive should do if the guideline
> limit is exceeded (as the next descriptor can define what to do based on
> the reason for the limit being exceeded: long internal queueing vs bad
> sector long access time).

Note that all 3 limits can be used in a single CDL descriptor to precisely
define how a command should be processed by the device. That is why it is
nearly impossible to come up with a meaningful ordering of CDL descriptors
as an increasing set of priority levels.
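
A rough sketch of why no total order exists, under stated assumptions: the
field names below mirror the sysfs attributes of patch 07, while the struct
itself, the microsecond unit, the "0 means no limit" convention and the
helper names are hypothetical. Policies are deliberately left out of the
comparison, and they would only make things less comparable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory view of one CDL descriptor (not the kernel's). */
struct example_cdl_desc {
	uint32_t max_inactive_time_us;	/* assumption: 0 means no limit */
	uint8_t  max_inactive_time_policy;
	uint32_t max_active_time_us;	/* assumption: 0 means no limit */
	uint8_t  max_active_time_policy;
	uint32_t duration_guideline_us;	/* assumption: 0 means no limit */
	uint8_t  duration_guideline_policy;
};

/* True if limit a is at least as tight as limit b. */
static bool example_limit_tighter_or_equal(uint32_t a, uint32_t b)
{
	if (b == 0)
		return true;	/* b is unlimited, so anything is as tight */
	if (a == 0)
		return false;	/* a is unlimited but b is not */
	return a <= b;
}

/* True only if a is at least as tight as b on all three limits. */
static bool example_desc_dominates(const struct example_cdl_desc *a,
				   const struct example_cdl_desc *b)
{
	return example_limit_tighter_or_equal(a->max_inactive_time_us,
					      b->max_inactive_time_us) &&
	       example_limit_tighter_or_equal(a->max_active_time_us,
					      b->max_active_time_us) &&
	       example_limit_tighter_or_equal(a->duration_guideline_us,
					      b->duration_guideline_us);
}

/* Two descriptors can be ranked only if one dominates the other. */
static bool example_desc_comparable(const struct example_cdl_desc *a,
				    const struct example_cdl_desc *b)
{
	return example_desc_dominates(a, b) || example_desc_dominates(b, a);
}
```

A descriptor that only sets a 10 ms inactive time limit and one that only
sets a 20 ms active time limit are not comparable under this rule: each is
tighter than the other on one axis.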

> 
>> If users really need the ability to use all standardized CDL features 
>> and if there is no easy way to map CDL levels to an I/O priority, is the 
>> I/O priority mechanism really the best basis for a user space interface 
>> for CDLs?
> 
> As you can see above, yes, we need everything and should not attempt
> restricting CDL use. The IO priority interface is a perfect fit for CDL in
> the sense that all we need to pass along from user to device is one
> number: the CDL index to use for a command. So creating a different
> interface for this while the IO priority interface exactly does that
> sounds silly to me.
> 
> One compromise we could do is: have the IO schedulers completely ignore
> CDL prio class for now, that is, have them assume that no IO prio
> class/level was specified. Given that they are not tuned to handle CDL
> well anyway, this is probably the best thing to do for now.
> 
> We still need to have the block layer prevent merging of requests with
> different CDL descriptors though, which is another reason to reuse the IO
> prio interface as the block layer already does this. Less code, which is
> always a good thing.
> 
>>
>> Thanks,
>>
>> Bart.
> 

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-24 19:02 ` [PATCH v3 01/18] block: introduce duration-limits priority class Niklas Cassel
  2023-01-24 19:27   ` Bart Van Assche
@ 2023-01-27 12:43   ` Hannes Reinecke
  1 sibling, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-27 12:43 UTC (permalink / raw)
  To: Niklas Cassel, Paolo Valente, Jens Axboe
  Cc: Christoph Hellwig, Damien Le Moal, linux-scsi, linux-ide, linux-block

On 1/24/23 20:02, Niklas Cassel wrote:
> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> 
> Introduce the IOPRIO_CLASS_DL priority class to indicate that IOs should
> be executed using duration-limits targets. The duration target to apply
> to a command is indicated using the priority level. Up to 8 levels are
> supported, with level 0 indicating "no limit".
> 
> This priority class has effect only if the target device supports the
> command duration limits feature and this feature is enabled by the user.
> 
> While it is recommended to not use an ioscheduler when using the
> IOPRIO_CLASS_DL priority class, if using the BFQ or mq-deadline scheduler,
> IOPRIO_CLASS_DL is mapped to IOPRIO_CLASS_RT.
> 
> The reason for this is twofold:
> 1) Each priority level for the IOPRIO_CLASS_DL priority class represents a
> duration limit descriptor (DLD) inside the device. Users can configure
> these limits themselves using passthrough commands, so from a block layer
> perspective, Linux has no idea of how each DLD is actually configured.
> 
> By mapping a command to IOPRIO_CLASS_RT, the chance that a command exceeds
> its duration limit (because it was held too long in the scheduler) is
> decreased. It is still possible to use the IOPRIO_CLASS_DL priority class
> for "low priority" IOs by configuring a large limit in the respective DLD.
> 
> 2) On ATA drives, IOPRIO_CLASS_DL commands and NCQ priority commands
> (IOPRIO_CLASS_RT) cannot be used together. A mix of CDL and high priority
> commands cannot be sent to a device. By mapping IOPRIO_CLASS_DL to
> IOPRIO_CLASS_RT, we ensure that a device will never receive a mix of these
> two incompatible priority classes.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   block/bfq-iosched.c         | 10 ++++++++++
>   block/blk-ioprio.c          |  3 +++
>   block/ioprio.c              |  3 ++-
>   block/mq-deadline.c         |  1 +
>   include/linux/ioprio.h      |  2 +-
>   include/uapi/linux/ioprio.h |  7 +++++++
>   6 files changed, 24 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes


* Re: [PATCH v3 07/18] scsi: sd: detect support for command duration limits
  2023-01-24 19:02 ` [PATCH v3 07/18] scsi: sd: detect support for command duration limits Niklas Cassel
  2023-01-24 19:39   ` Bart Van Assche
@ 2023-01-27 13:00   ` Hannes Reinecke
  2023-01-28  0:51     ` Damien Le Moal
  1 sibling, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-27 13:00 UTC (permalink / raw)
  To: Niklas Cassel, James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Damien Le Moal, linux-scsi, linux-ide, linux-block

On 1/24/23 20:02, Niklas Cassel wrote:
> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> 
> Detect if a disk supports command duration limits. Support for
> the READ 16, WRITE 16, READ 32 and WRITE 32 commands is tested using
> the function scsi_report_opcode(). For a disk supporting command
> duration limits, the mode page indicating the command duration limits
> descriptors that apply to the command is indicated using the rwcdlp
> and cdlp bits.
> 
> Support for duration limits is advertised through sysfs using the new
> "duration_limits" sysfs sub-directory of the generic device directory,
> that is, /sys/block/sdX/device/duration_limits. Within this new
> directory, the limit descriptors that apply to read and write operations
> are exposed within the read and write directories, with descriptor
> attributes grouped together in directories. The overall sysfs structure
> created is:
> 
> /sys/block/sde/device/duration_limits/
> ├── perf_vs_duration_guideline
> ├── read
> │   ├── 1
> │   │   ├── duration_guideline
> │   │   ├── duration_guideline_policy
> │   │   ├── max_active_time
> │   │   ├── max_active_time_policy
> │   │   ├── max_inactive_time
> │   │   └── max_inactive_time_policy
> │   ├── 2
> │   │   ├── duration_guideline
> ...
> │   └── page
> └── write
>      ├── 1
>      │   ├── duration_guideline
>      │   ├── duration_guideline_policy
> ...
> 
> For each of the read and write descriptor directories, the page
> attribute file indicates the command duration limit page providing the
> descriptors. The possible values for the page attribute are "A", "B",
> "T2A" and "T2B".
> 
> The new "duration_limits" attributes directory is added only for disks
> that support command duration limits.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   drivers/scsi/Makefile |   2 +-
>   drivers/scsi/sd.c     |   2 +
>   drivers/scsi/sd.h     |  61 ++++
>   drivers/scsi/sd_cdl.c | 764 ++++++++++++++++++++++++++++++++++++++++++
>   4 files changed, 828 insertions(+), 1 deletion(-)
>   create mode 100644 drivers/scsi/sd_cdl.c
> 
I'm not particularly happy with having sysfs reflect user settings, but 
every other place I can think of is even more convoluted.
So there.

> diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
> index f055bfd54a68..0e48cb6d21d6 100644
> --- a/drivers/scsi/Makefile
> +++ b/drivers/scsi/Makefile
> @@ -170,7 +170,7 @@ scsi_mod-$(CONFIG_BLK_DEV_BSG)	+= scsi_bsg.o
>   
>   hv_storvsc-y			:= storvsc_drv.o
>   
> -sd_mod-objs	:= sd.o
> +sd_mod-objs	:= sd.o sd_cdl.o
>   sd_mod-$(CONFIG_BLK_DEV_INTEGRITY) += sd_dif.o
>   sd_mod-$(CONFIG_BLK_DEV_ZONED) += sd_zbc.o
>   
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 45945bfeee92..7879a5470773 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3326,6 +3326,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
>   		sd_read_write_same(sdkp, buffer);
>   		sd_read_security(sdkp, buffer);
>   		sd_config_protection(sdkp);
> +		sd_read_cdl(sdkp, buffer);
>   	}
>   
>   	/*
> @@ -3646,6 +3647,7 @@ static void scsi_disk_release(struct device *dev)
>   
>   	ida_free(&sd_index_ida, sdkp->index);
>   	sd_zbc_free_zone_info(sdkp);
> +	sd_cdl_release(sdkp);
>   	put_device(&sdkp->device->sdev_gendev);
>   	free_opal_dev(sdkp->opal_dev);
>   
Hmm. Calling this during revalidate() makes sense, but how can we ensure 
that we call revalidate() when the user issues a MODE_SELECT command?

Other than that:

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes

* Re: [PATCH v3 08/18] scsi: sd: set read/write commands CDL index
  2023-01-24 19:02 ` [PATCH v3 08/18] scsi: sd: set read/write commands CDL index Niklas Cassel
@ 2023-01-27 15:30   ` Hannes Reinecke
  2023-01-28  0:03     ` Damien Le Moal
  0 siblings, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-27 15:30 UTC (permalink / raw)
  To: Niklas Cassel, James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Damien Le Moal, linux-scsi, linux-ide, linux-block

On 1/24/23 20:02, Niklas Cassel wrote:
> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> 
> Introduce the command duration limits helper function
> sd_cdl_cmd_limit() to retrieve and set the DLD bits of the
> READ/WRITE 16 and READ/WRITE 32 commands to indicate to the device
> the command duration limit descriptor to apply to the command.
> 
> When command duration limits are enabled, sd_cdl_cmd_limit() obtains the
> index of the descriptor to apply to the command for requests that have
> the IOPRIO_CLASS_DL priority class with a priority data specifying a
> valid descriptor index (1 to 7).
> 
> The read-write sysfs attribute "enable" is introduced to control
> setting the command duration limits indexes. If this attribute is set
> to 0 (default), command duration limits specified by the user are
> ignored. The user must set this attribute to 1 for command duration
> limits to be set. Enabling and disabling the command duration limits
> feature for ATA devices must be done using the ATA feature sub-page of
> the control mode page. The sd_cdl_enable() function is introduced to
> check if this mode page is supported by the device and if it is, use
> it to enable/disable CDL.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   drivers/scsi/sd.c     |  16 +++--
>   drivers/scsi/sd.h     |  10 ++++
>   drivers/scsi/sd_cdl.c | 134 +++++++++++++++++++++++++++++++++++++++++-
>   3 files changed, 152 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 7879a5470773..d2eb01337943 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1045,13 +1045,14 @@ static blk_status_t sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
>   
>   static blk_status_t sd_setup_rw32_cmnd(struct scsi_cmnd *cmd, bool write,
>   				       sector_t lba, unsigned int nr_blocks,
> -				       unsigned char flags)
> +				       unsigned char flags, unsigned int dld)
>   {
>   	cmd->cmd_len = SD_EXT_CDB_SIZE;
>   	cmd->cmnd[0]  = VARIABLE_LENGTH_CMD;
>   	cmd->cmnd[7]  = 0x18; /* Additional CDB len */
>   	cmd->cmnd[9]  = write ? WRITE_32 : READ_32;
>   	cmd->cmnd[10] = flags;
> +	cmd->cmnd[11] = dld & 0x07;
>   	put_unaligned_be64(lba, &cmd->cmnd[12]);
>   	put_unaligned_be32(lba, &cmd->cmnd[20]); /* Expected Indirect LBA */
>   	put_unaligned_be32(nr_blocks, &cmd->cmnd[28]);
> @@ -1061,12 +1062,12 @@ static blk_status_t sd_setup_rw32_cmnd(struct scsi_cmnd *cmd, bool write,
>   
>   static blk_status_t sd_setup_rw16_cmnd(struct scsi_cmnd *cmd, bool write,
>   				       sector_t lba, unsigned int nr_blocks,
> -				       unsigned char flags)
> +				       unsigned char flags, unsigned int dld)
>   {
>   	cmd->cmd_len  = 16;
>   	cmd->cmnd[0]  = write ? WRITE_16 : READ_16;
> -	cmd->cmnd[1]  = flags;
> -	cmd->cmnd[14] = 0;
> +	cmd->cmnd[1]  = flags | ((dld >> 2) & 0x01);
> +	cmd->cmnd[14] = (dld & 0x03) << 6;
>   	cmd->cmnd[15] = 0;
>   	put_unaligned_be64(lba, &cmd->cmnd[2]);
>   	put_unaligned_be32(nr_blocks, &cmd->cmnd[10]);
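
For readers following the bit placement in this hunk: the 3-bit DLD index is
split across two CDB bytes, with bit 2 going to bit 0 of byte 1 and bits 1:0
going to bits 7:6 of byte 14. A standalone sketch of that packing follows.
The helper names are hypothetical, and unlike the patch the setter preserves
the other bits already present in both bytes.

```c
#include <stdint.h>

/* Pack a 3-bit DLD index into a READ/WRITE 16 CDB (hypothetical helper). */
static void example_rw16_set_dld(uint8_t cdb[16], unsigned int dld)
{
	/* DLD bit 2 lives in bit 0 of byte 1, next to the flag bits. */
	cdb[1] = (uint8_t)((cdb[1] & ~0x01u) | ((dld >> 2) & 0x01u));
	/* DLD bits 1:0 live in bits 7:6 of byte 14. */
	cdb[14] = (uint8_t)((cdb[14] & ~0xc0u) | ((dld & 0x03u) << 6));
}

/* Recover the DLD index from a READ/WRITE 16 CDB (hypothetical helper). */
static unsigned int example_rw16_get_dld(const uint8_t cdb[16])
{
	return ((cdb[1] & 0x01u) << 2) | (unsigned int)(cdb[14] >> 6);
}
```

For example, index 5 (binary 101) ends up as cdb[1] bit 0 set and cdb[14]
equal to 0x40, while a FUA flag (0x08) already in cdb[1] is left untouched.
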
> @@ -1129,6 +1130,7 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>   	unsigned int mask = logical_to_sectors(sdp, 1) - 1;
>   	bool write = rq_data_dir(rq) == WRITE;
>   	unsigned char protect, fua;
> +	unsigned int dld = 0;
>   	blk_status_t ret;
>   	unsigned int dif;
>   	bool dix;
> @@ -1178,6 +1180,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>   	fua = rq->cmd_flags & REQ_FUA ? 0x8 : 0;
>   	dix = scsi_prot_sg_count(cmd);
>   	dif = scsi_host_dif_capable(cmd->device->host, sdkp->protection_type);
> +	if (sd_cdl_enabled(sdkp))
> +		dld = sd_cdl_dld(sdkp, cmd);
>   
>   	if (dif || dix)
>   		protect = sd_setup_protect_cmnd(cmd, dix, dif);
> @@ -1186,10 +1190,10 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>   
>   	if (protect && sdkp->protection_type == T10_PI_TYPE2_PROTECTION) {
>   		ret = sd_setup_rw32_cmnd(cmd, write, lba, nr_blocks,
> -					 protect | fua);
> +					 protect | fua, dld);
>   	} else if (sdp->use_16_for_rw || (nr_blocks > 0xffff)) {
>   		ret = sd_setup_rw16_cmnd(cmd, write, lba, nr_blocks,
> -					 protect | fua);
> +					 protect | fua, dld);
>   	} else if ((nr_blocks > 0xff) || (lba > 0x1fffff) ||
>   		   sdp->use_10_for_rw || protect) {
>   		ret = sd_setup_rw10_cmnd(cmd, write, lba, nr_blocks,
> diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
> index e60d33bd222a..5b6b6dc4b92d 100644
> --- a/drivers/scsi/sd.h
> +++ b/drivers/scsi/sd.h
> @@ -130,8 +130,11 @@ struct sd_cdl_page {
>   	struct sd_cdl_desc      descs[SD_CDL_MAX_DESC];
>   };
>   
> +struct scsi_disk;
> +
>   struct sd_cdl {
>   	struct kobject		kobj;
> +	struct scsi_disk	*sdkp;
>   	bool			sysfs_registered;
>   	u8			perf_vs_duration_guideline;
>   	struct sd_cdl_page	pages[SD_CDL_RW];
> @@ -188,6 +191,7 @@ struct scsi_disk {
>   	u8		zeroing_mode;
>   	u8		nr_actuators;		/* Number of actuators */
>   	struct sd_cdl	*cdl;
> +	unsigned	cdl_enabled : 1;
>   	unsigned	ATO : 1;	/* state of disk ATO bit */
>   	unsigned	cache_override : 1; /* temp override of WCE,RCD */
>   	unsigned	WCE : 1;	/* state of disk WCE bit */
> @@ -355,5 +359,11 @@ void sd_print_result(const struct scsi_disk *sdkp, const char *msg, int result);
>   /* Command duration limits support (in sd_cdl.c) */
>   void sd_read_cdl(struct scsi_disk *sdkp, unsigned char *buf);
>   void sd_cdl_release(struct scsi_disk *sdkp);
> +int sd_cdl_dld(struct scsi_disk *sdkp, struct scsi_cmnd *scmd);
> +
> +static inline bool sd_cdl_enabled(struct scsi_disk *sdkp)
> +{
> +	return sdkp->cdl && sdkp->cdl_enabled;
> +}
>   
>   #endif /* _SCSI_DISK_H */
> diff --git a/drivers/scsi/sd_cdl.c b/drivers/scsi/sd_cdl.c
> index 513cd989f19a..59d02dbb5ea1 100644
> --- a/drivers/scsi/sd_cdl.c
> +++ b/drivers/scsi/sd_cdl.c
> @@ -93,6 +93,63 @@ static const char *sd_cdl_policy_name(u8 policy)
>   	}
>   }
>   
> +/*
> + * Enable/disable CDL.
> + */
> +static int sd_cdl_enable(struct scsi_disk *sdkp, bool enable)
> +{
> +	struct scsi_device *sdp = sdkp->device;
> +	struct scsi_mode_data data;
> +	struct scsi_sense_hdr sshdr;
> +	struct scsi_vpd *vpd;
> +	bool is_ata = false;
> +	char buf[64];
> +	int ret;
> +
> +	rcu_read_lock();
> +	vpd = rcu_dereference(sdp->vpd_pg89);
> +	if (vpd)
> +		is_ata = true;
> +	rcu_read_unlock();
> +
> +	/*
> +	 * For ATA devices, CDL needs to be enabled with a SET FEATURES command.
> +	 */
> +	if (is_ata) {
> +		char *buf_data;
> +		int len;
> +
> +		ret = scsi_mode_sense(sdp, 0x08, 0x0a, 0xf2, buf, sizeof(buf),
> +				      SD_TIMEOUT, sdkp->max_retries, &data,
> +				      NULL);
> +		if (ret)
> +			return -EINVAL;
> +
That is a tad odd.
Is CDL always enabled for 'normal' SCSI?

Cheers,

Hannes

* Re: [PATCH v3 09/18] scsi: sd: handle read/write CDL timeout failures
  2023-01-24 19:02 ` [PATCH v3 09/18] scsi: sd: handle read/write CDL timeout failures Niklas Cassel
@ 2023-01-27 15:34   ` Hannes Reinecke
  2023-01-28  0:06     ` Damien Le Moal
  2023-02-03 16:49     ` Niklas Cassel
  0 siblings, 2 replies; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-27 15:34 UTC (permalink / raw)
  To: Niklas Cassel, James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, Damien Le Moal, linux-scsi, linux-ide, linux-block

On 1/24/23 20:02, Niklas Cassel wrote:
> Commands using a duration limit descriptor that has limit policies set
> to a value other than 0x0 may be failed by the device if one of the
> limits is exceeded. For such commands, since the failure is the result
> of the user duration limit configuration and workload, the commands
> should be terminated immediately without retry. Furthermore, to allow
> the user to differentiate these "soft" failures from hard errors due to
> hardware problems, an error code other than EIO should be returned.
> 
> There are 2 cases to consider:
> (1) The failure is due to a limit policy failing the command with a
> check condition sense key, that is, any limit policy other than 0xD.
> For this case, scsi_check_sense() is modified to detect failures with
> the ABORTED COMMAND sense key and the COMMAND TIMEOUT BEFORE PROCESSING
> or COMMAND TIMEOUT DURING PROCESSING or COMMAND TIMEOUT DURING
> PROCESSING DUE TO ERROR RECOVERY additional sense code. For these
> failures, a SUCCESS disposition is returned so that
> scsi_finish_command() is called to terminate the command.
> 
> (2) The failure is due to a limit policy set to 0xD, which results in the
> command being terminated with a GOOD status, COMPLETED sense key, and
> DATA CURRENTLY UNAVAILABLE additional sense code. To handle this case,
> scsi_check_sense() is modified to return a SUCCESS disposition so
> that scsi_finish_command() is called to terminate the command.
> In addition, scsi_decide_disposition() has to be modified to see if a
> command being terminated with GOOD status has sense data.
> This is as defined in SCSI Primary Commands - 6 (SPC-6), so all
> according to spec, even if GOOD status commands were not checked before.
> 
> If scsi_check_sense() detects sense data representing a duration limit,
> scsi_check_sense() will set the newly introduced SCSI ML byte
> SCSIML_STAT_DL_TIMEOUT. This SCSI ML byte is checked in
> scsi_noretry_cmd(), so that a command that failed because of a CDL
> timeout cannot be retried. The SCSI ML byte is also checked in
> scsi_result_to_blk_status() to complete the command request with the
> BLK_STS_DURATION_LIMIT status, which results in the user seeing ETIME
> errors for the failed commands.
> 
> Co-developed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   drivers/scsi/scsi_error.c | 46 +++++++++++++++++++++++++++++++++++++++
>   drivers/scsi/scsi_lib.c   |  4 ++++
>   drivers/scsi/scsi_priv.h  |  1 +
>   3 files changed, 51 insertions(+)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index cf5ec5f5f4f6..9988539bc348 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -536,6 +536,7 @@ static inline void set_scsi_ml_byte(struct scsi_cmnd *cmd, u8 status)
>    */
>   enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
>   {
> +	struct request *req = scsi_cmd_to_rq(scmd);
>   	struct scsi_device *sdev = scmd->device;
>   	struct scsi_sense_hdr sshdr;
>   
> @@ -595,6 +596,22 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
>   		if (sshdr.asc == 0x10) /* DIF */
>   			return SUCCESS;
>   
> +		/*
> +		 * Check aborts due to command duration limit policy:
> +		 * ABORTED COMMAND additional sense code with the
> +		 * COMMAND TIMEOUT BEFORE PROCESSING or
> +		 * COMMAND TIMEOUT DURING PROCESSING or
> +		 * COMMAND TIMEOUT DURING PROCESSING DUE TO ERROR RECOVERY
> +		 * additional sense code qualifiers.
> +		 */
> +		if (sshdr.asc == 0x2e &&
> +		    sshdr.ascq >= 0x01 && sshdr.ascq <= 0x03) {
> +			set_scsi_ml_byte(scmd, SCSIML_STAT_DL_TIMEOUT);
> +			req->cmd_flags |= REQ_FAILFAST_DEV;
> +			req->rq_flags |= RQF_QUIET;
> +			return SUCCESS;
> +		}
> +
>   		if (sshdr.asc == 0x44 && sdev->sdev_bflags & BLIST_RETRY_ITF)
>   			return ADD_TO_MLQUEUE;
>   		if (sshdr.asc == 0xc1 && sshdr.ascq == 0x01 &&
> @@ -691,6 +708,15 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
>   		}
>   		return SUCCESS;
>   
> +	case COMPLETED:
> +		if (sshdr.asc == 0x55 && sshdr.ascq == 0x0a) {
> +			set_scsi_ml_byte(scmd, SCSIML_STAT_DL_TIMEOUT);
> +			req->cmd_flags |= REQ_FAILFAST_DEV;
> +			req->rq_flags |= RQF_QUIET;
> +			return SUCCESS;

You can kill this line, will be done anyway.

> +		}
> +		return SUCCESS;
> +
>   	default:
>   		return SUCCESS;
>   	}
> @@ -785,6 +811,14 @@ static enum scsi_disposition scsi_eh_completed_normally(struct scsi_cmnd *scmd)
>   	switch (get_status_byte(scmd)) {
>   	case SAM_STAT_GOOD:
>   		scsi_handle_queue_ramp_up(scmd->device);
> +		if (scmd->sense_buffer && SCSI_SENSE_VALID(scmd))
> +			/*
> +			 * If we have sense data, call scsi_check_sense() in
> +			 * order to set the correct SCSI ML byte (if any).
> +			 * No point in checking the return value, since the
> +			 * command has already completed successfully.
> +			 */
> +			scsi_check_sense(scmd);

I am ever so slightly nervous here.
We never checked the sense code for GOOD status, so heaven knows if 
there are devices out there which return something here.
And you have checked that we've cleared the sense code before submitting 
(or retrying, even), right?
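
For reference while reviewing, the two sense-data matches added to
scsi_check_sense() can be sketched as standalone C. This is only an
illustration of the checks in the hunks above: struct sense_fields and
is_cdl_timeout() are hypothetical names that do not exist in the kernel, and
the sense key values are the SPC ones (Bh for ABORTED COMMAND, Fh for
COMPLETED).

```c
#include <stdbool.h>
#include <stdint.h>

/* Sense key values as defined by SPC. */
#define SK_ABORTED_COMMAND	0x0b
#define SK_COMPLETED		0x0f

/* Hypothetical stand-in for the decoded fields of struct scsi_sense_hdr. */
struct sense_fields {
	uint8_t sense_key;
	uint8_t asc;
	uint8_t ascq;
};

/* Return true if the sense data describes a command duration limit timeout. */
static bool is_cdl_timeout(const struct sense_fields *s)
{
	switch (s->sense_key) {
	case SK_ABORTED_COMMAND:
		/* COMMAND TIMEOUT BEFORE/DURING PROCESSING: ASC 0x2e, ASCQ 1-3 */
		return s->asc == 0x2e && s->ascq >= 0x01 && s->ascq <= 0x03;
	case SK_COMPLETED:
		/* DATA CURRENTLY UNAVAILABLE: ASC 0x55, ASCQ 0x0a */
		return s->asc == 0x55 && s->ascq == 0x0a;
	default:
		return false;
	}
}
```

The concern above is about the second case: with a GOOD status, a stale or
vendor-specific sense buffer that happens to match would now change how the
command completes.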

>   		fallthrough;
>   	case SAM_STAT_COMMAND_TERMINATED:
>   		return SUCCESS;
> @@ -1807,6 +1841,10 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
>   		return !!(req->cmd_flags & REQ_FAILFAST_DRIVER);
>   	}
>   
> +	/* Never retry commands aborted due to a duration limit timeout */
> +	if (scsi_ml_byte(scmd->result) == SCSIML_STAT_DL_TIMEOUT)
> +		return true;
> +
>   	if (!scsi_status_is_check_condition(scmd->result))
>   		return false;
>   
> @@ -1966,6 +2004,14 @@ enum scsi_disposition scsi_decide_disposition(struct scsi_cmnd *scmd)
>   		if (scmd->cmnd[0] == REPORT_LUNS)
>   			scmd->device->sdev_target->expecting_lun_change = 0;
>   		scsi_handle_queue_ramp_up(scmd->device);
> +		if (scmd->sense_buffer && SCSI_SENSE_VALID(scmd))
> +			/*
> +			 * If we have sense data, call scsi_check_sense() in
> +			 * order to set the correct SCSI ML byte (if any).
> +			 * No point in checking the return value, since the
> +			 * command has already completed successfully.
> +			 */
> +			scsi_check_sense(scmd);
>   		fallthrough;
>   	case SAM_STAT_COMMAND_TERMINATED:
>   		return SUCCESS;
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index e1a021dd4da2..406952e72a68 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -600,6 +600,8 @@ static blk_status_t scsi_result_to_blk_status(int result)
>   		return BLK_STS_MEDIUM;
>   	case SCSIML_STAT_TGT_FAILURE:
>   		return BLK_STS_TARGET;
> +	case SCSIML_STAT_DL_TIMEOUT:
> +		return BLK_STS_DURATION_LIMIT;
>   	}
>   
>   	switch (host_byte(result)) {
> @@ -797,6 +799,8 @@ static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result)
>   				blk_stat = BLK_STS_ZONE_OPEN_RESOURCE;
>   			}
>   			break;
> +		case COMPLETED:
> +			fallthrough;
>   		default:
>   			action = ACTION_FAIL;
>   			break;
> diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
> index 74324fba4281..f42388ecb024 100644
> --- a/drivers/scsi/scsi_priv.h
> +++ b/drivers/scsi/scsi_priv.h
> @@ -27,6 +27,7 @@ enum scsi_ml_status {
>   	SCSIML_STAT_NOSPC		= 0x02,	/* Space allocation on the dev failed */
>   	SCSIML_STAT_MED_ERROR		= 0x03,	/* Medium error */
>   	SCSIML_STAT_TGT_FAILURE		= 0x04,	/* Permanent target failure */
> +	SCSIML_STAT_DL_TIMEOUT		= 0x05, /* Command Duration Limit timeout */
>   };
>   
>   static inline u8 scsi_ml_byte(int result)

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 10/18] ata: libata-scsi: remove unnecessary !cmd checks
  2023-01-24 19:02 ` [PATCH v3 10/18] ata: libata-scsi: remove unnecessary !cmd checks Niklas Cassel
@ 2023-01-27 15:35   ` Hannes Reinecke
  0 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-27 15:35 UTC (permalink / raw)
  To: Niklas Cassel, Damien Le Moal
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/24/23 20:02, Niklas Cassel wrote:
> There is no need to check if !cmd as this can only happen for
> ATA internal commands which use the ATA internal tag (32).
> 
> Most users of ata_scsi_set_sense() are from _xlat functions that
> translate a scsicmd to an ATA command. These obviously have a qc->scsicmd.
> 
> ata_scsi_qc_complete() can also call ata_scsi_set_sense() via
> ata_gen_passthru_sense() / ata_gen_ata_sense(), called via
> ata_scsi_qc_complete(). This callback is only called for translated
> commands, so it also has a qc->scsicmd.
> 
> ata_eh_analyze_ncq_error(): the NCQ error log can only contain a 0-31
> value, so it will never be able to get the ATA internal tag (32).
> 
> ata_eh_request_sense(): only called by ata_eh_analyze_tf(), which
> is only called when iterating the QCs using ata_qc_for_each_raw(),
> which does not include the internal tag.
> 
> Since there is no existing call site where cmd can be NULL, remove the
> !cmd check from ata_scsi_set_sense() and ata_scsi_set_sense_information().
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   drivers/ata/libata-scsi.c | 6 ------
>   1 file changed, 6 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 11/18] ata: libata: change ata_eh_request_sense() to not set CHECK_CONDITION
  2023-01-24 19:02 ` [PATCH v3 11/18] ata: libata: change ata_eh_request_sense() to not set CHECK_CONDITION Niklas Cassel
@ 2023-01-27 15:36   ` Hannes Reinecke
  0 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-27 15:36 UTC (permalink / raw)
  To: Niklas Cassel, Damien Le Moal
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/24/23 20:02, Niklas Cassel wrote:
> Currently, ata_eh_request_sense() unconditionally sets the scsicmd->result
> to SAM_STAT_CHECK_CONDITION.
> 
> For Command Duration Limits policy 0xD:
> The device shall complete the command without error (SAM_STAT_GOOD)
> with the additional sense code set to DATA CURRENTLY UNAVAILABLE.
> 
> It is perfectly fine to have sense data for a command that returned
> completion without error.
> 
> In order to support CDL policy 0xD, we have to remove this
> assumption that having sense data means that the command failed
> (SAM_STAT_CHECK_CONDITION).
> 
> Change ata_eh_request_sense() to not set SAM_STAT_CHECK_CONDITION,
> and instead move the setting of SAM_STAT_CHECK_CONDITION to the single
> caller that wants SAM_STAT_CHECK_CONDITION set, that way
> ata_eh_request_sense() can be reused in a follow-up patch that adds
> support for CDL policy 0xD.
> 
> The only caller of ata_eh_request_sense() is protected by:
> if (!(qc->flags & ATA_QCFLAG_SENSE_VALID)), so we can remove this
> duplicated check from ata_eh_request_sense() itself.
> 
> Additionally, ata_eh_request_sense() is only called from
> ata_eh_analyze_tf(), which is only called when iterating the QCs using
> ata_qc_for_each_raw(), which does not include the internal tag,
> so cmd can never be NULL (all non-internal commands have qc->scsicmd set),
> so remove the !cmd check as well.
> 
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   drivers/ata/libata-eh.c | 25 ++++++++++++++++---------
>   1 file changed, 16 insertions(+), 9 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 13/18] ata: libata-scsi: handle CDL bits in ata_scsiop_maint_in()
  2023-01-24 19:02 ` [PATCH v3 13/18] ata: libata-scsi: handle CDL bits in ata_scsiop_maint_in() Niklas Cassel
@ 2023-01-27 15:37   ` Hannes Reinecke
  0 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-27 15:37 UTC (permalink / raw)
  To: Niklas Cassel, Damien Le Moal
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/24/23 20:02, Niklas Cassel wrote:
> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> 
> For a scsi MAINTENANCE_IN/MI_REPORT_SUPPORTED_OPERATION_CODES operation,
> add the translation of the rwcdlp and cdlp bits for the READ16 and
> WRITE16 commands. If the ATA device does not support command duration
> limits, these bits are always 0. If the ATA device supports command
> duration limits, the rwcdlp bit is set to 1 for READ16 and WRITE16 and
> the cdlp bits are set to 0x1 for READ16 and 0x2 for WRITE16. These
> correspond to the T2A mode page containing the read descriptors and
> to the T2B mode page containing the write descriptors, as defined in
> SAT-5.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   drivers/ata/libata-scsi.c | 30 ++++++++++++++++++++++++++----
>   1 file changed, 26 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 14/18] ata: libata-scsi: add support for CDL pages mode sense
  2023-01-24 19:03 ` [PATCH v3 14/18] ata: libata-scsi: add support for CDL pages mode sense Niklas Cassel
@ 2023-01-27 15:38   ` Hannes Reinecke
  0 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-27 15:38 UTC (permalink / raw)
  To: Niklas Cassel, Damien Le Moal
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/24/23 20:03, Niklas Cassel wrote:
> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> 
> Modify ata_scsiop_mode_sense() and ata_msense_control() to support mode
> sense access to the T2A and T2B sub-pages of the control mode page.
> ata_msense_control() is modified to support sub-pages. The T2A sub-page
> is generated using the read descriptors of the command duration limits
> log page 18h. The T2B sub-page is generated using the write descriptors
> of the same log page. With the addition of these sub-pages, getting all
> sub-pages of the control mode page is also supported by increasing the
> value of ATA_SCSI_RBUF_SIZE from 576B up to 2048B to ensure that all
> sub-pages fit in the fill buffer.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   drivers/ata/libata-scsi.c | 150 ++++++++++++++++++++++++++++++++------
>   1 file changed, 128 insertions(+), 22 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 15/18] ata: libata: add ATA feature control sub-page translation
  2023-01-24 19:03 ` [PATCH v3 15/18] ata: libata: add ATA feature control sub-page translation Niklas Cassel
@ 2023-01-27 15:40   ` Hannes Reinecke
  0 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-27 15:40 UTC (permalink / raw)
  To: Niklas Cassel, Damien Le Moal
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/24/23 20:03, Niklas Cassel wrote:
> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> 
> Add support for the ATA feature control sub-page of the control mode
> page to enable/disable the command duration limits feature using the
> cdl_ctrl field of the ATA feature control sub-page.
> 
> Both mode sense and mode select translation are supported. For mode
> sense, the ata device flag ATA_DFLAG_CDL_ENABLED is used to cache the
> status of the command duration limits feature. Enabling this feature is
> done using a SET FEATURES command with a cdl action set to 1 when the
> page cdl_ctrl field value is 0x2 (T2A and T2B pages supported). If this
> field is 0, CDL is disabled using the SET FEATURES command with a cdl
> action set to 0.
> 
> Since a device CDL and NCQ priority features should not be used
> simultaneously, ata_mselect_control_ata_feature() returns an error when
> attempting to enable CDL with the device priority feature enabled.
> Conversely, the function ata_ncq_prio_enable_store() used to enable the
> use of the device NCQ priority feature through sysfs is modified to
> return an error if the device CDL feature is enabled.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   drivers/ata/libata-core.c |  40 ++++++++-
>   drivers/ata/libata-sata.c |  11 ++-
>   drivers/ata/libata-scsi.c | 167 ++++++++++++++++++++++++++++++++------
>   include/linux/ata.h       |   3 +
>   include/linux/libata.h    |   1 +
>   5 files changed, 193 insertions(+), 29 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 16/18] ata: libata: set read/write commands CDL index
  2023-01-24 19:03 ` [PATCH v3 16/18] ata: libata: set read/write commands CDL index Niklas Cassel
@ 2023-01-27 15:41   ` Hannes Reinecke
  0 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-27 15:41 UTC (permalink / raw)
  To: Niklas Cassel, Damien Le Moal
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/24/23 20:03, Niklas Cassel wrote:
> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> 
> For devices supporting the command duration limits feature, when a read
> or write operation has the IOPRIO_CLASS_DL priority class and the
> command duration limits feature is enabled, set the command duration
> limit index field of the command to the priority level.
> 
> For unqueued read and write operations, the command duration limit index
> is set as the lower 2 bits of the feature field. For queued NCQ
> read/write commands, the index is set as the lower 2 bits of the
> auxiliary field.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   drivers/ata/libata-core.c | 43 ++++++++++++++++++++++++++++++++++-----
>   drivers/ata/libata-scsi.c |  3 +--
>   drivers/ata/libata.h      |  2 +-
>   include/linux/libata.h    |  1 +
>   4 files changed, 41 insertions(+), 8 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 17/18] ata: libata: handle completion of CDL commands using policy 0xD
  2023-01-24 19:03 ` [PATCH v3 17/18] ata: libata: handle completion of CDL commands using policy 0xD Niklas Cassel
@ 2023-01-27 15:43   ` Hannes Reinecke
  0 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-27 15:43 UTC (permalink / raw)
  To: Niklas Cassel, Damien Le Moal
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/24/23 20:03, Niklas Cassel wrote:
> A CDL timeout for policy 0xF is defined as a NCQ error, just with a CDL
> specific sk/asc/ascq in the sense data. Therefore, the existing code in
> libata does not need to be modified to handle a policy 0xF CDL timeout.
> 
> For Command Duration Limits policy 0xD:
> The device shall complete the command without error with the additional
> sense code set to DATA CURRENTLY UNAVAILABLE.
> 
> Since a CDL timeout for policy 0xD is not an error, we cannot use the
> NCQ Command Error log (10h).
> 
> Instead, we need to read the Sense Data for Successful NCQ Commands
> log (0Fh).
> 
> In the success case, just like in the error case, we cannot simply read
> a log page from the interrupt handler itself, since reading a log page
> involves sending a READ LOG DMA EXT or READ LOG EXT command.
> 
> Therefore, we add a new EH action ATA_EH_GET_SUCCESS_SENSE.
> When a command completes without error, and when the ATA_SENSE bit
> is set, this new action is set as pending, and EH is scheduled.
> 
> This way, similar to the NCQ error case, the log page will be read
> from EH context.
> 
> An alternative would have been to add a new kthread or workqueue to
> handle this. However, extending EH can be done with minimal changes
> and avoids the need to synchronize a new kthread/workqueue with EH.
> 
> Co-developed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   drivers/ata/libata-core.c |  88 +++++++++++++++++++++++++++++++-
>   drivers/ata/libata-eh.c   | 105 +++++++++++++++++++++++++++++++++++++-
>   drivers/ata/libata-sata.c |  92 +++++++++++++++++++++++++++++++++
>   include/linux/ata.h       |   3 ++
>   include/linux/libata.h    |  11 +++-
>   5 files changed, 295 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 18/18] Documentation: sysfs-block-device: document command duration limits
  2023-01-24 19:03 ` [PATCH v3 18/18] Documentation: sysfs-block-device: document command duration limits Niklas Cassel
@ 2023-01-27 15:43   ` Hannes Reinecke
  0 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-27 15:43 UTC (permalink / raw)
  To: Niklas Cassel, linux-kernel
  Cc: Christoph Hellwig, Damien Le Moal, linux-scsi, linux-ide, linux-block

On 1/24/23 20:03, Niklas Cassel wrote:
> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> 
> Document in ABI/testing/sysfs-block-device the sysfs attributes present
> under /sys/block/*/device/duration_limits for ATA and SCSI devices
> supporting the command duration limits feature.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
> ---
>   Documentation/ABI/testing/sysfs-block-device | 150 +++++++++++++++++++
>   1 file changed, 150 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-27  1:40                           ` Damien Le Moal
@ 2023-01-27 17:23                             ` Bart Van Assche
  2023-01-28  0:40                               ` Damien Le Moal
  0 siblings, 1 reply; 82+ messages in thread
From: Bart Van Assche @ 2023-01-27 17:23 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel
  Cc: Paolo Valente, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-ide, linux-block

On 1/26/23 17:40, Damien Le Moal wrote:
> On 1/27/23 09:18, Damien Le Moal wrote:
>> On 1/27/23 02:33, Bart Van Assche wrote:
>>> How about only supporting a subset of the standard such that it becomes
>>> easy to map CDLs to host side priority levels?
>>
>> I am opposed to this, for several reasons:
>>
>> 1) We are seeing different use cases from users that cover a wide range of
>> use of CDL descriptors with various definitions.
>>
>> 2) Passthrough commands can be used by a user to change a drive CDL
>> descriptors without the kernel knowing about it, unless we spend our time
>> revalidating the CDL descriptor log page(s)...
>> 3) The CDL standard as it stands is actually very sensible and not overloaded with
>> stuff that is only useful in niche use cases. For each CDL descriptor, you
>> have:
>>   * The active time limit, which is a clean way to specify how much time
>> you allow a drive to deal with bad sectors (mostly read case). A typical
>> HDD will try very hard to recover data from a sector, always. As a result,
>> the HDD may spend up to several seconds reading a sector again and again
>> applying different signal processing techniques until it gets the sector
>> ECC checked to return valid data. That of course can hugely increase an IO
>> latency seen by the host. In applications such as erasure coded
>> distributed object stores, maximum latency for an object access can thus
>> be kept low using this limit without compromising the data since the
>> object can always be rebuilt from the erasure codes if one HDD is slow to
>> respond. This limit is also interesting for video streaming/playback to
>> avoid video buffer underflow (at the expense of maybe some block noise
>> depending on the codec).
>>   * The inactive time limit can be used to tell the drive how long it is
>> allowed to let a command stand in the drive internal queue before
>> processing. This is thus a parameter that allows a host to tune the drive
>> RPO optimization (rotational positioning optimization, e.g. HDD internal
>> command scheduling based on angular sector position on tracks with the
>> head current position). This is a neat way to control max IOPS vs tail
>> latency since drives tend to privilege maximizing IOPS over lowering max
>> tail latency.
>>   * The duration guideline limit defines an overall time limit for a
>> command without distinguishing between active and inactive time. It is the
>> easiest to use (the easiest one to understand from a beginner user point
>> of view). This is a neat way to define an intelligent IO prioritization in
>> fact, way better than RT class scheduling on the host or the use of ATA
>> NCQ high priority, as it provides more information to the drive about the
>> urgency of a particular command. That allows the drive to still perform
>> RPO to maximize IOPS without long tail latencies. Chaining such limit with
>> an active+inactive time limit descriptor using the "next limit" policy
>> (0x1 policy) can also finely define what the drive should do if the guideline
>> limit is exceeded (as the next descriptor can define what to do based on
>> the reason for the limit being exceeded: long internal queueing vs bad
>> sector long access time).
> 
> Note that all 3 limits can be used in a single CDL descriptor to precisely
> define how a command should be processed by the device. That is why it is
> nearly impossible to come up with a meaningful ordering of CDL descriptors
> as an increasing set of priority levels.

A summary of my concerns is as follows:
* The current I/O priority levels (RT, BE, IDLE) apply to all block 
devices. IOPRIO_CLASS_DL is only supported by certain block devices 
(some but not all SCSI harddisks). This forces applications to check the 
capabilities of the storage device before it can be decided whether or 
not IOPRIO_CLASS_DL can be used. This is not something applications 
should do but something the kernel should do. Additionally, if multiple 
dm devices are stacked on top of the block device driver, like in 
Android, it becomes even more cumbersome to check whether or not the 
block device supports CDL.
* For the RT, BE and IDLE classes, it is well defined which priority 
number represents a high priority and which priority number represents a 
low priority. For CDL, only the drive knows the priority details. I 
think that application software should be able to select a DL priority 
without having to read the CDL configuration first.
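
To make the second point concrete, here is a minimal sketch of the value an
application would have to build under the proposed API. The class shift of 13
is the existing ioprio encoding; the class number 4 for IOPRIO_CLASS_DL is an
assumption based on this series and is not a released UAPI value, and
example_dl_ioprio() is a hypothetical helper.

```c
#include <stdint.h>

/* Existing ioprio layout: class in the upper 3 bits, data in the lower 13. */
#define EXAMPLE_IOPRIO_CLASS_SHIFT	13

/* Assumption: class number proposed by this series, not a released value. */
#define EXAMPLE_IOPRIO_CLASS_DL		4

/*
 * Build an I/O priority value selecting a duration limit descriptor.
 * dld_index is the descriptor index (1-7); what that index means is only
 * known after reading the CDL configuration of the target drive.
 */
static inline uint16_t example_dl_ioprio(unsigned int dld_index)
{
	return (uint16_t)((EXAMPLE_IOPRIO_CLASS_DL << EXAMPLE_IOPRIO_CLASS_SHIFT) |
			  (dld_index & 0x7));
}
```

The resulting value would be passed to ioprio_set(2) or set per I/O. Nothing in
the number itself says whether index 3 is more or less urgent than index 5,
which is exactly the information an application has to fetch from the drive.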

I hope that I have made it clear that I think that the proposed user 
space API will be very painful to use for application developers.

Bart.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 08/18] scsi: sd: set read/write commands CDL index
  2023-01-27 15:30   ` Hannes Reinecke
@ 2023-01-28  0:03     ` Damien Le Moal
  2023-01-30 18:13       ` Hannes Reinecke
  0 siblings, 1 reply; 82+ messages in thread
From: Damien Le Moal @ 2023-01-28  0:03 UTC (permalink / raw)
  To: Hannes Reinecke, Niklas Cassel, James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/28/23 00:30, Hannes Reinecke wrote:
> On 1/24/23 20:02, Niklas Cassel wrote:
>> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>>
>> Introduce the command duration limits helper function
>> sd_cdl_cmd_limit() to retrieve and set the DLD bits of the
>> READ/WRITE 16 and READ/WRITE 32 commands to indicate to the device
>> the command duration limit descriptor to apply to the command.
>>
>> When command duration limits are enabled, sd_cdl_cmd_limit() obtains the
>> index of the descriptor to apply to the command for requests that have
>> the IOPRIO_CLASS_DL priority class with a priority data specifying a
>> valid descriptor index (1 to 7).
>>
>> The read-write sysfs attribute "enable" is introduced to control
>> setting the command duration limits indexes. If this attribute is set
>> to 0 (default), command duration limits specified by the user are
>> ignored. The user must set this attribute to 1 for command duration
>> limits to be set. Enabling and disabling the command duration limits
>> feature for ATA devices must be done using the ATA feature sub-page of
>> the control mode page. The sd_cdl_enable() function is introduced to
>> check if this mode page is supported by the device and if it is, use
>> it to enable/disable CDL.
>>
>> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>> Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
>> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
>> ---
>>   drivers/scsi/sd.c     |  16 +++--
>>   drivers/scsi/sd.h     |  10 ++++
>>   drivers/scsi/sd_cdl.c | 134 +++++++++++++++++++++++++++++++++++++++++-
>>   3 files changed, 152 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index 7879a5470773..d2eb01337943 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -1045,13 +1045,14 @@ static blk_status_t sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
>>   
>>   static blk_status_t sd_setup_rw32_cmnd(struct scsi_cmnd *cmd, bool write,
>>   				       sector_t lba, unsigned int nr_blocks,
>> -				       unsigned char flags)
>> +				       unsigned char flags, unsigned int dld)
>>   {
>>   	cmd->cmd_len = SD_EXT_CDB_SIZE;
>>   	cmd->cmnd[0]  = VARIABLE_LENGTH_CMD;
>>   	cmd->cmnd[7]  = 0x18; /* Additional CDB len */
>>   	cmd->cmnd[9]  = write ? WRITE_32 : READ_32;
>>   	cmd->cmnd[10] = flags;
>> +	cmd->cmnd[11] = dld & 0x07;
>>   	put_unaligned_be64(lba, &cmd->cmnd[12]);
>>   	put_unaligned_be32(lba, &cmd->cmnd[20]); /* Expected Indirect LBA */
>>   	put_unaligned_be32(nr_blocks, &cmd->cmnd[28]);
>> @@ -1061,12 +1062,12 @@ static blk_status_t sd_setup_rw32_cmnd(struct scsi_cmnd *cmd, bool write,
>>   
>>   static blk_status_t sd_setup_rw16_cmnd(struct scsi_cmnd *cmd, bool write,
>>   				       sector_t lba, unsigned int nr_blocks,
>> -				       unsigned char flags)
>> +				       unsigned char flags, unsigned int dld)
>>   {
>>   	cmd->cmd_len  = 16;
>>   	cmd->cmnd[0]  = write ? WRITE_16 : READ_16;
>> -	cmd->cmnd[1]  = flags;
>> -	cmd->cmnd[14] = 0;
>> +	cmd->cmnd[1]  = flags | ((dld >> 2) & 0x01);
>> +	cmd->cmnd[14] = (dld & 0x03) << 6;
>>   	cmd->cmnd[15] = 0;
>>   	put_unaligned_be64(lba, &cmd->cmnd[2]);
>>   	put_unaligned_be32(nr_blocks, &cmd->cmnd[10]);
>> @@ -1129,6 +1130,7 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>>   	unsigned int mask = logical_to_sectors(sdp, 1) - 1;
>>   	bool write = rq_data_dir(rq) == WRITE;
>>   	unsigned char protect, fua;
>> +	unsigned int dld = 0;
>>   	blk_status_t ret;
>>   	unsigned int dif;
>>   	bool dix;
>> @@ -1178,6 +1180,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>>   	fua = rq->cmd_flags & REQ_FUA ? 0x8 : 0;
>>   	dix = scsi_prot_sg_count(cmd);
>>   	dif = scsi_host_dif_capable(cmd->device->host, sdkp->protection_type);
>> +	if (sd_cdl_enabled(sdkp))
>> +		dld = sd_cdl_dld(sdkp, cmd);
>>   
>>   	if (dif || dix)
>>   		protect = sd_setup_protect_cmnd(cmd, dix, dif);
>> @@ -1186,10 +1190,10 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>>   
>>   	if (protect && sdkp->protection_type == T10_PI_TYPE2_PROTECTION) {
>>   		ret = sd_setup_rw32_cmnd(cmd, write, lba, nr_blocks,
>> -					 protect | fua);
>> +					 protect | fua, dld);
>>   	} else if (sdp->use_16_for_rw || (nr_blocks > 0xffff)) {
>>   		ret = sd_setup_rw16_cmnd(cmd, write, lba, nr_blocks,
>> -					 protect | fua);
>> +					 protect | fua, dld);
>>   	} else if ((nr_blocks > 0xff) || (lba > 0x1fffff) ||
>>   		   sdp->use_10_for_rw || protect) {
>>   		ret = sd_setup_rw10_cmnd(cmd, write, lba, nr_blocks,
>> diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
>> index e60d33bd222a..5b6b6dc4b92d 100644
>> --- a/drivers/scsi/sd.h
>> +++ b/drivers/scsi/sd.h
>> @@ -130,8 +130,11 @@ struct sd_cdl_page {
>>   	struct sd_cdl_desc      descs[SD_CDL_MAX_DESC];
>>   };
>>   
>> +struct scsi_disk;
>> +
>>   struct sd_cdl {
>>   	struct kobject		kobj;
>> +	struct scsi_disk	*sdkp;
>>   	bool			sysfs_registered;
>>   	u8			perf_vs_duration_guideline;
>>   	struct sd_cdl_page	pages[SD_CDL_RW];
>> @@ -188,6 +191,7 @@ struct scsi_disk {
>>   	u8		zeroing_mode;
>>   	u8		nr_actuators;		/* Number of actuators */
>>   	struct sd_cdl	*cdl;
>> +	unsigned	cdl_enabled : 1;
>>   	unsigned	ATO : 1;	/* state of disk ATO bit */
>>   	unsigned	cache_override : 1; /* temp override of WCE,RCD */
>>   	unsigned	WCE : 1;	/* state of disk WCE bit */
>> @@ -355,5 +359,11 @@ void sd_print_result(const struct scsi_disk *sdkp, const char *msg, int result);
>>   /* Command duration limits support (in sd_cdl.c) */
>>   void sd_read_cdl(struct scsi_disk *sdkp, unsigned char *buf);
>>   void sd_cdl_release(struct scsi_disk *sdkp);
>> +int sd_cdl_dld(struct scsi_disk *sdkp, struct scsi_cmnd *scmd);
>> +
>> +static inline bool sd_cdl_enabled(struct scsi_disk *sdkp)
>> +{
>> +	return sdkp->cdl && sdkp->cdl_enabled;
>> +}
>>   
>>   #endif /* _SCSI_DISK_H */
>> diff --git a/drivers/scsi/sd_cdl.c b/drivers/scsi/sd_cdl.c
>> index 513cd989f19a..59d02dbb5ea1 100644
>> --- a/drivers/scsi/sd_cdl.c
>> +++ b/drivers/scsi/sd_cdl.c
>> @@ -93,6 +93,63 @@ static const char *sd_cdl_policy_name(u8 policy)
>>   	}
>>   }
>>   
>> +/*
>> + * Enable/disable CDL.
>> + */
>> +static int sd_cdl_enable(struct scsi_disk *sdkp, bool enable)
>> +{
>> +	struct scsi_device *sdp = sdkp->device;
>> +	struct scsi_mode_data data;
>> +	struct scsi_sense_hdr sshdr;
>> +	struct scsi_vpd *vpd;
>> +	bool is_ata = false;
>> +	char buf[64];
>> +	int ret;
>> +
>> +	rcu_read_lock();
>> +	vpd = rcu_dereference(sdp->vpd_pg89);
>> +	if (vpd)
>> +		is_ata = true;
>> +	rcu_read_unlock();
>> +
>> +	/*
>> +	 * For ATA devices, CDL needs to be enabled with a SET FEATURES command.
>> +	 */
>> +	if (is_ata) {
>> +		char *buf_data;
>> +		int len;
>> +
>> +		ret = scsi_mode_sense(sdp, 0x08, 0x0a, 0xf2, buf, sizeof(buf),
>> +				      SD_TIMEOUT, sdkp->max_retries, &data,
>> +				      NULL);
>> +		if (ret)
>> +			return -EINVAL;
>> +
> That is a tad odd.
> Is CDL always enabled for 'normal' SCSI?

Yes, it is always enabled on the device side. There is no mode sense to turn it
on/off. Not sure why it was designed like that in the specs... The sysfs
duration_limits/enable attribute is a "soft" on/off switch and it is off by
default, even for drives that report CDL support.
Hence the "if (is_ata)" to do the mode sense to enable the feature on the device
side only for ATA devices. We need this to avoid having 2 different enable
paths with 2 different sysfs "enable" attributes. Doing it like this is a lot
less code.
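
As a rough illustration of the "soft" switch, here is a simplified standalone
sketch. The struct and function names are hypothetical stand-ins, not the
patch code: when the switch is off, the DLD index sent to the device is 0,
which means no descriptor and therefore no limit. The bit packing mirrors the
sd_setup_rw16_cmnd() hunk quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the CDL state in struct scsi_disk. */
struct example_disk {
	bool cdl_supported;	/* device reported CDL descriptors */
	bool cdl_enabled;	/* state of the duration_limits/enable attribute */
};

/* Return the DLD index to send: 0 (no limit) unless the soft switch is on. */
static unsigned int example_dld(const struct example_disk *d,
				unsigned int requested)
{
	if (!d->cdl_supported || !d->cdl_enabled)
		return 0;
	return requested <= 7 ? requested : 0;
}

/* Pack a 3-bit DLD index into a READ/WRITE 16 CDB, as in the quoted hunk. */
static void example_set_rw16_dld(uint8_t cdb[16], unsigned int dld)
{
	/* Bit 2 of the index goes into bit 0 of byte 1. */
	cdb[1] = (uint8_t)((cdb[1] & ~0x01) | ((dld >> 2) & 0x01));
	/* Bits 1-0 of the index go into bits 7-6 of byte 14. */
	cdb[14] = (uint8_t)((cdb[14] & ~0xc0) | ((dld & 0x03) << 6));
}
```

With the switch off, the CDB bits stay at 0 for both SCSI and ATA devices, so
the same attribute covers both cases.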

> 
> Cheers,
> 
> Hannes

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 09/18] scsi: sd: handle read/write CDL timeout failures
  2023-01-27 15:34   ` Hannes Reinecke
@ 2023-01-28  0:06     ` Damien Le Moal
  2023-02-03 16:49     ` Niklas Cassel
  1 sibling, 0 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-28  0:06 UTC (permalink / raw)
  To: Hannes Reinecke, Niklas Cassel, James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/28/23 00:34, Hannes Reinecke wrote:
> On 1/24/23 20:02, Niklas Cassel wrote:
>> Commands using a duration limit descriptor that has limit policies set
>> to a value other than 0x0 may be failed by the device if one of the
>> limits are exceeded. For such commands, since the failure is the result
>> of the user duration limit configuration and workload, the commands
>> should not be retried and terminated immediately. Furthermore, to allow
>> the user to differentiate these "soft" failures from hard errors due to
>> hardware problem, a different error code than EIO should be returned.
>>
>> There are 2 cases to consider:
>> (1) The failure is due to a limit policy failing the command with a
>> check condition sense key, that is, any limit policy other than 0xD.
>> For this case, scsi_check_sense() is modified to detect failures with
>> the ABORTED COMMAND sense key and the COMMAND TIMEOUT BEFORE PROCESSING
>> or COMMAND TIMEOUT DURING PROCESSING or COMMAND TIMEOUT DURING
>> PROCESSING DUE TO ERROR RECOVERY additional sense code. For these
>> failures, a SUCCESS disposition is returned so that
>> scsi_finish_command() is called to terminate the command.
>>
>> (2) The failure is due to a limit policy set to 0xD, which result in the
>> command being terminated with a GOOD status, COMPLETED sense key, and
>> DATA CURRENTLY UNAVAILABLE additional sense code. To handle this case,
>> the scsi_check_sense() is modified to return a SUCCESS disposition so
>> that scsi_finish_command() is called to terminate the command.
>> In addition, scsi_decide_disposition() has to be modified to see if a
>> command being terminated with GOOD status has sense data.
>> This is as defined in SCSI Primary Commands - 6 (SPC-6), so all
>> according to spec, even if GOOD status commands were not checked before.
>>
>> If scsi_check_sense() detects sense data representing a duration limit,
>> scsi_check_sense() will set the newly introduced SCSI ML byte
>> SCSIML_STAT_DL_TIMEOUT. This SCSI ML byte is checked in
>> scsi_noretry_cmd(), so that a command that failed because of a CDL
>> timeout cannot be retried. The SCSI ML byte is also checked in
>> scsi_result_to_blk_status() to complete the command request with the
>> BLK_STS_DURATION_LIMIT status, which result in the user seeing ETIME
>> errors for the failed commands.
>>
>> Co-developed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
>> ---
>>   drivers/scsi/scsi_error.c | 46 +++++++++++++++++++++++++++++++++++++++
>>   drivers/scsi/scsi_lib.c   |  4 ++++
>>   drivers/scsi/scsi_priv.h  |  1 +
>>   3 files changed, 51 insertions(+)
>>
>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>> index cf5ec5f5f4f6..9988539bc348 100644
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -536,6 +536,7 @@ static inline void set_scsi_ml_byte(struct scsi_cmnd *cmd, u8 status)
>>    */
>>   enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
>>   {
>> +	struct request *req = scsi_cmd_to_rq(scmd);
>>   	struct scsi_device *sdev = scmd->device;
>>   	struct scsi_sense_hdr sshdr;
>>   
>> @@ -595,6 +596,22 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
>>   		if (sshdr.asc == 0x10) /* DIF */
>>   			return SUCCESS;
>>   
>> +		/*
>> +		 * Check aborts due to command duration limit policy:
>> +		 * ABORTED COMMAND additional sense code with the
>> +		 * COMMAND TIMEOUT BEFORE PROCESSING or
>> +		 * COMMAND TIMEOUT DURING PROCESSING or
>> +		 * COMMAND TIMEOUT DURING PROCESSING DUE TO ERROR RECOVERY
>> +		 * additional sense code qualifiers.
>> +		 */
>> +		if (sshdr.asc == 0x2e &&
>> +		    sshdr.ascq >= 0x01 && sshdr.ascq <= 0x03) {
>> +			set_scsi_ml_byte(scmd, SCSIML_STAT_DL_TIMEOUT);
>> +			req->cmd_flags |= REQ_FAILFAST_DEV;
>> +			req->rq_flags |= RQF_QUIET;
>> +			return SUCCESS;
>> +		}
>> +
>>   		if (sshdr.asc == 0x44 && sdev->sdev_bflags & BLIST_RETRY_ITF)
>>   			return ADD_TO_MLQUEUE;
>>   		if (sshdr.asc == 0xc1 && sshdr.ascq == 0x01 &&
>> @@ -691,6 +708,15 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
>>   		}
>>   		return SUCCESS;
>>   
>> +	case COMPLETED:
>> +		if (sshdr.asc == 0x55 && sshdr.ascq == 0x0a) {
>> +			set_scsi_ml_byte(scmd, SCSIML_STAT_DL_TIMEOUT);
>> +			req->cmd_flags |= REQ_FAILFAST_DEV;
>> +			req->rq_flags |= RQF_QUIET;
>> +			return SUCCESS;
> 
> You can kill this line, will be done anyway.
> 
>> +		}
>> +		return SUCCESS;
>> +
>>   	default:
>>   		return SUCCESS;
>>   	}
>> @@ -785,6 +811,14 @@ static enum scsi_disposition scsi_eh_completed_normally(struct scsi_cmnd *scmd)
>>   	switch (get_status_byte(scmd)) {
>>   	case SAM_STAT_GOOD:
>>   		scsi_handle_queue_ramp_up(scmd->device);
>> +		if (scmd->sense_buffer && SCSI_SENSE_VALID(scmd))
>> +			/*
>> +			 * If we have sense data, call scsi_check_sense() in
>> +			 * order to set the correct SCSI ML byte (if any).
>> +			 * No point in checking the return value, since the
>> +			 * command has already completed successfully.
>> +			 */
>> +			scsi_check_sense(scmd);
> 
> I am every so slightly nervous here.
> We never checked the sense code for GOOD status, so heaven knows if 
> there are devices out there which return something here.
> And you have checked that we've cleared the sense code before submitting 
> (or retrying, even), right?

We'll double check that.
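
As an illustration of the two cases under discussion, the sense data that the
patch treats as a CDL timeout can be expressed as a small standalone predicate.
This is only a sketch: sense_is_cdl_timeout() is a hypothetical helper, not a
function from the series, and the sense key constants are redefined locally
with their SPC values so that the snippet builds outside the kernel tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sense key values as defined by SPC (redefined here for a standalone build). */
#define ABORTED_COMMAND	0x0b
#define COMPLETED	0x0f

/*
 * Hypothetical helper, not part of the patch: return true if the
 * sense key / ASC / ASCQ triple reports a command duration limit timeout.
 */
static bool sense_is_cdl_timeout(uint8_t sense_key, uint8_t asc, uint8_t ascq)
{
	switch (sense_key) {
	case ABORTED_COMMAND:
		/*
		 * Case (1): COMMAND TIMEOUT BEFORE PROCESSING, DURING
		 * PROCESSING, or DURING PROCESSING DUE TO ERROR RECOVERY
		 * (ASC 0x2e, ASCQ 0x01 to 0x03).
		 */
		return asc == 0x2e && ascq >= 0x01 && ascq <= 0x03;
	case COMPLETED:
		/*
		 * Case (2): policy 0xD, GOOD status with
		 * DATA CURRENTLY UNAVAILABLE (ASC 0x55, ASCQ 0x0a).
		 */
		return asc == 0x55 && ascq == 0x0a;
	default:
		return false;
	}
}
```

Since the COMPLETED case is evaluated for commands that return GOOD status, a
stale sense buffer matching it would presumably be misread as a duration limit
timeout, which is why the sense buffer clearing is worth verifying.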


-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-27 17:23                             ` Bart Van Assche
@ 2023-01-28  0:40                               ` Damien Le Moal
  2023-01-28  0:47                                 ` Bart Van Assche
  0 siblings, 1 reply; 82+ messages in thread
From: Damien Le Moal @ 2023-01-28  0:40 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel
  Cc: Paolo Valente, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-ide, linux-block

On 1/28/23 02:23, Bart Van Assche wrote:
> A summary of my concerns is as follows:
> * The current I/O priority levels (RT, BE, IDLE) apply to all block 
> devices. IOPRIO_CLASS_DL is only supported by certain block devices 
> (some but not all SCSI harddisks). This forces applications to check the 
> capabilities of the storage device before it can be decided whether or 
> not IOPRIO_CLASS_DL can be used. This is not something applications 
> should do but something the kernel should do. Additionally, if multiple 
> dm devices are stacked on top of the block device driver, like in 
> Android, it becomes even more cumbersome to check whether or not the 
> block device supports CDL.

Yes, RT, BE and IDLE apply to all block devices. And so does CDL in the sense
that if a user specifies the CDL class for IOs to a device that does not support
CDL, then nothing special will happen. There will be no differentiation of the
IOs. That is *exactly* what happens when using RT, BE or IDLE with the none
scheduler (e.g. default nvme setup). And the same remark applies to RT class
mapping to ATA NCQ priority feature: the user needs to check the device to know
if that will happen, *and* also needs to turn on that feature for that mapping
to be effective.

The levels of the CDL priority class are also very well defined: they map to the
CDL descriptors defined on the drive, which are consultable by the user through
sysfs (no special tools needed), so easily discoverable.

As for DM devices, these have no scheduler. So any processing of a priority
class by a DM target driver is that driver's responsibility. Initially, all that
happens is the block layer passing on that information through the stack with
the BIOs. That's it. Real action may happen once the physical block device is
reached with the IO scheduler for that device, if one is set.

At that level, the none scheduler is of no concern, nothing will happen. Kyber also
ignores priorities. We are left with only bfq and mq-deadline. The latter only
cares about the priority class, ignoring levels. bfq does act on both class and
level.

IOPRIO_CLASS_DL is equal to 4, so strictly speaking, is of lower priority than
the IDLE class if you want to consider it as part of that ordering. But we
defined it as a different class to allow *not* having to do that. IO schedulers
can be modified to ignore that priority class for now, mapping it to say the
default BE class for instance. Our current patch set maps the CDL class to the
RT class for the schedulers, as that made most sense given the time-sensitive
nature of CDL workloads. But we can change that to actually let the scheduler
decide if you want. There are no other changes in the block layer that have or
need special handling of the CDL class. All very clean in my opinion, no special
conditions for that feature. No additional "if" in the hot path, no overhead added.
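
To make the scheduler mapping described above concrete, here is a minimal
userspace sketch. It assumes the class/level encoding of
include/uapi/linux/ioprio.h (class stored in the bits above
IOPRIO_CLASS_SHIFT) and IOPRIO_CLASS_DL = 4 as proposed in this series.
sched_effective_class() is a hypothetical helper used for illustration only,
it is not code from the patches.

```c
/* Class/level encoding mirroring include/uapi/linux/ioprio.h. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_MASK	((1 << IOPRIO_CLASS_SHIFT) - 1)
#define IOPRIO_PRIO_CLASS(ioprio)	((ioprio) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(ioprio)	((ioprio) & IOPRIO_PRIO_MASK)
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))

enum {
	IOPRIO_CLASS_NONE,
	IOPRIO_CLASS_RT,
	IOPRIO_CLASS_BE,
	IOPRIO_CLASS_IDLE,
	IOPRIO_CLASS_DL,	/* = 4, proposed by this series */
};

/*
 * Hypothetical helper: return the priority class an IO scheduler would
 * act on. The DL class is replaced with dl_fallback (IOPRIO_CLASS_RT for
 * the mapping used in the current patch set, IOPRIO_CLASS_BE for the
 * alternative mentioned above); all other classes are passed through.
 */
static int sched_effective_class(int ioprio, int dl_fallback)
{
	int class = IOPRIO_PRIO_CLASS(ioprio);

	return class == IOPRIO_CLASS_DL ? dl_fallback : class;
}
```

With this encoding, class 4 with level 3 is the value 0x8003, and only the
class part is rewritten for the scheduler; the level, that is, the descriptor
index, is left for the device driver to interpret.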

> * For the RT, BE and IDLE classes, it is well defined which priority 
> number represents a high priority and which priority number represents a 
> low priority. For CDL, only the drive knows the priority details. I 
> think that application software should be able to select a DL priority 
> without having to read the CDL configuration first.

The levels of the CDL priority class are also very well defined: they map to the
CDL descriptors defined on the drive, which are consultable by the user through
sysfs (no special tools needed), so easily discoverable. And unless we restrict
how CDL descriptors can be defined, which I explained in my previous email is
not desirable at all, we cannot and should not try to order levels in some sort
of priority semantic. CDL semantics do not directly define a priority level,
only time limits, which may or may not be ordered, depending on the limit
definitions.

As Niklas pointed out, this is not a "generic" feature that any random
application can magically use without modifications. The application must be
aware of what CDL is and of how the descriptors are defined. And for 99.99% of
the use cases, the CDL descriptors will be defined in a way useful for that
application. There is no magic generic set of descriptors defined by default,
though a simple set of increasing time limits can be cleanly mapped to
priority levels. A system administrator is free to do that for the system drives
if that is what the running applications expect. CDL is a very flexible feature
that can cover a lot of use cases. Trying to shoehorn it into the legacy/classic
priority semantic framework would only restrict its usefulness.

> I hope that I have it made it clear that I think that the proposed user 
> space API will be very painful to use for application developers.

I completely disagree. Reusing the prio class/level API made it easy to allow
applications to use the feature. fio support for CDL requires exactly *one line*
change, to allow for the CDL class number 4. That's it. From there, one can use
the --cmdprio_class=4 and --cmdprio=idx options to exercise a drive. The value of
"idx" here of course depends on how the descriptors are set on the drive. But
back to the point above. This depends on the application goals and the
descriptors are set accordingly for that goal. There is no real discovery needed
by the application. The application expects a certain set of CDL limits for its
use case, and checking that this set is the one currently defined on the drive
is easy to do from an application with the sysfs interface we added.
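
A rough sketch of such a check, under a few assumptions: the directory layout
is the one from the patch 07 commit message
(/sys/block/sdX/device/duration_limits/{read,write}/<index>/<attribute>), and
the attribute files are assumed to contain a single decimal number, which this
thread does not spell out. Both function names are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Build "<base>/<dir>/<index>/<attr>", e.g. with base set to
 * /sys/block/sdX/device/duration_limits, dir "read", index 1 and
 * attr "max_active_time". Returns 0 on success, -1 if buf is too small.
 */
static int cdl_attr_path(char *buf, size_t len, const char *base,
			 const char *dir, int index, const char *attr)
{
	int n = snprintf(buf, len, "%s/%s/%d/%s", base, dir, index, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read one attribute file, assuming (not confirmed by the thread) that it
 * holds a single decimal number. Returns 0 on success, -1 if the file is
 * missing or does not start with a number.
 */
static int read_ull_file(const char *path, unsigned long long *val)
{
	unsigned long long tmp;
	char line[64];
	char *end;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(line, sizeof(line), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	tmp = strtoull(line, &end, 10);
	if (end == line)
		return -1;
	*val = tmp;
	return 0;
}
```

An application expecting a given limit in read descriptor 1 would build the
path for that descriptor's max_active_time attribute and compare the value
read against the one it was configured for, refusing to use CDL on a mismatch.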

Many users out there have deployed and are using applications taking advantage
of the ATA NCQ priority feature, using class RT for high priority IOs. The new
CDL class does not require many application changes to be enabled for next gen
drives that will have CDL.
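
On the application side, selecting a descriptor with the proposed class could
look like the following sketch. It relies on the ioprio_set(2) system call and
the class/level encoding from include/uapi/linux/ioprio.h. IOPRIO_CLASS_DL = 4
and the use of the level field as the descriptor index (1 to 7) are taken from
this series and are not in any released kernel header, and both helper names
are hypothetical.

```c
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_WHO_PROCESS	1
#define IOPRIO_CLASS_DL		4	/* proposed by this series, assumption */

/*
 * Hypothetical helper: encode a duration limit descriptor index (1 to 7)
 * as an ioprio value. Index 0 means "no limit" and is rejected here, as
 * is anything above 7. Returns -1 for an invalid index.
 */
static int cdl_ioprio_value(int dld_index)
{
	if (dld_index < 1 || dld_index > 7)
		return -1;
	return IOPRIO_PRIO_VALUE(IOPRIO_CLASS_DL, dld_index);
}

/*
 * Hypothetical helper: apply the descriptor index to the calling thread.
 * On a kernel that does not know class 4, ioprio_set() is expected to
 * fail and this returns -1 with errno set.
 */
static int set_thread_cdl(int dld_index)
{
	int ioprio = cdl_ioprio_value(dld_index);

	if (ioprio < 0)
		return -1;
	return (int)syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, ioprio);
}
```

An application already calling ioprio_set() with the RT class for NCQ priority
would only need to change the class and level it passes, which is the same
kind of change as the one-line fio modification mentioned above.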

> 
> Bart.
> 

-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-28  0:40                               ` Damien Le Moal
@ 2023-01-28  0:47                                 ` Bart Van Assche
  2023-01-28  0:59                                   ` Damien Le Moal
  0 siblings, 1 reply; 82+ messages in thread
From: Bart Van Assche @ 2023-01-28  0:47 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel
  Cc: Paolo Valente, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-ide, linux-block

On 1/27/23 16:40, Damien Le Moal wrote:
> On 1/28/23 02:23, Bart Van Assche wrote:
>> I hope that I have it made it clear that I think that the proposed user
>> space API will be very painful to use for application developers.
> 
> I completely disagree. Reusing the prio class/level API made it easy to allow
> applications to use the feature. fio support for CDL requires exactly *one line*
> change, to allow for the CDL class number 4. That's it. From there, one can use
> the --cmdprio_class=4 nd --cmdprio=idx options to exercise a drive. The value of
> "idx" here of course depends on how the descriptors are set on the drive. But
> back to the point above. This depends on the application goals and the
> descriptors are set accordingly for that goal. There is no real discovery needed
> by the application. The application expect a certain set of CDL limits for its
> use case, and checking that this set is the one currently defined on the drive
> is easy to do from an application with the sysfs interface we added.
> 
> Many users out there have deployed and using applications taking advantage of
> ATA NCQ priority feature, using class RT for high priority IOs. The new CDL
> class does not require many application changes to be enabled for next gen
> drives that will have CDL.
  As I mentioned before, the new I/O priority class IOPRIO_CLASS_DL 
makes it impossible to use a single I/O priority class across devices 
that support CDL and devices that do not support CDL. I'm surprised that 
you keep denying that IOPRIO_CLASS_DL is a royal pain for users who have 
to support devices that support CDL and devices that do not support CDL.

Bart.


* Re: [PATCH v3 07/18] scsi: sd: detect support for command duration limits
  2023-01-27 13:00   ` Hannes Reinecke
@ 2023-01-28  0:51     ` Damien Le Moal
  2023-01-28  2:52       ` Bart Van Assche
  0 siblings, 1 reply; 82+ messages in thread
From: Damien Le Moal @ 2023-01-28  0:51 UTC (permalink / raw)
  To: Hannes Reinecke, Niklas Cassel, James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/27/23 22:00, Hannes Reinecke wrote:
> On 1/24/23 20:02, Niklas Cassel wrote:
>> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>>
>> Detect if a disk supports command duration limits. Support for
>> the READ 16, WRITE 16, READ 32 and WRITE 32 commands is tested using
>> the function scsi_report_opcode(). For a disk supporting command
>> duration limits, the mode page indicating the command duration limits
>> descriptors that apply to the command is indicated using the rwcdlp
>> and cdlp bits.
>>
>> Support duration limits is advertizes through sysfs using the new
>> "duration_limits" sysfs sub-directory of the generic device directory,
>> that is, /sys/block/sdX/device/duration_limits. Within this new
>> directory, the limit descriptors that apply to read and write operations
>> are exposed within the read and write directories, with descriptor
>> attributes grouped together in directories. The overall sysfs structure
>> created is:
>>
>> /sys/block/sde/device/duration_limits/
>> ├── perf_vs_duration_guideline
>> ├── read
>> │   ├── 1
>> │   │   ├── duration_guideline
>> │   │   ├── duration_guideline_policy
>> │   │   ├── max_active_time
>> │   │   ├── max_active_time_policy
>> │   │   ├── max_inactive_time
>> │   │   └── max_inactive_time_policy
>> │   ├── 2
>> │   │   ├── duration_guideline
>> ...
>> │   └── page
>> └── write
>>      ├── 1
>>      │   ├── duration_guideline
>>      │   ├── duration_guideline_policy
>> ...
>>
>> For each of the read and write descriptor directories, the page
>> attribute file indicate the command duration limit page providing the
>> descriptors. The possible values for the page attribute are "A", "B",
>> "T2A" and "T2B".
>>
>> The new "duration_limits" attributes directory is added only for disks
>> that supports command duration limits.
>>
>> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
>> ---
>>   drivers/scsi/Makefile |   2 +-
>>   drivers/scsi/sd.c     |   2 +
>>   drivers/scsi/sd.h     |  61 ++++
>>   drivers/scsi/sd_cdl.c | 764 ++++++++++++++++++++++++++++++++++++++++++
>>   4 files changed, 828 insertions(+), 1 deletion(-)
>>   create mode 100644 drivers/scsi/sd_cdl.c
>>
> I'm not particularly happy with having sysfs reflect user settings, but 
> every other place I can think of is even more convoluted.
> So there.
> 
>> diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
>> index f055bfd54a68..0e48cb6d21d6 100644
>> --- a/drivers/scsi/Makefile
>> +++ b/drivers/scsi/Makefile
>> @@ -170,7 +170,7 @@ scsi_mod-$(CONFIG_BLK_DEV_BSG)	+= scsi_bsg.o
>>   
>>   hv_storvsc-y			:= storvsc_drv.o
>>   
>> -sd_mod-objs	:= sd.o
>> +sd_mod-objs	:= sd.o sd_cdl.o
>>   sd_mod-$(CONFIG_BLK_DEV_INTEGRITY) += sd_dif.o
>>   sd_mod-$(CONFIG_BLK_DEV_ZONED) += sd_zbc.o
>>   
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index 45945bfeee92..7879a5470773 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -3326,6 +3326,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
>>   		sd_read_write_same(sdkp, buffer);
>>   		sd_read_security(sdkp, buffer);
>>   		sd_config_protection(sdkp);
>> +		sd_read_cdl(sdkp, buffer);
>>   	}
>>   
>>   	/*
>> @@ -3646,6 +3647,7 @@ static void scsi_disk_release(struct device *dev)
>>   
>>   	ida_free(&sd_index_ida, sdkp->index);
>>   	sd_zbc_free_zone_info(sdkp);
>> +	sd_cdl_release(sdkp);
>>   	put_device(&sdkp->device->sdev_gendev);
>>   	free_opal_dev(sdkp->opal_dev);
>>   
> Hmm. Calling this during revalidate() makes sense, but how can we ensure 
> that we call revalidate() when the user issues a MODE_SELECT command?

Given that CDLs can be changed with a passthrough command, I do not think we can
do anything about that, unfortunately. But I think the same is true of many
things like that. E.g. "let's turn on/off the write cache without the kernel
noticing"... But given that on a normal system only privileged applications can
do passthrough, if that happens, then the system has been hacked or the user is
shooting himself in the foot.

The cdl-tools project (cdladm utility) uses passthrough but triggers a revalidate
after changing CDLs to make sure sysfs stays in sync.

As Christoph suggested, we could change all this to an ioctl(GET_CDL) for
applications... But sysfs is so much simpler in my opinion, not to mention that
it allows access to the information for any application written in a language
that does not have ioctl() or an equivalent.

cdl-tools has a test suite all written in bash scripts thanks to the sysfs
interface :)

> 
> Other than that:
> 
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> 
> Cheers,
> 
> Hannes

-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-28  0:47                                 ` Bart Van Assche
@ 2023-01-28  0:59                                   ` Damien Le Moal
  2023-01-28 20:25                                     ` Martin K. Petersen
  0 siblings, 1 reply; 82+ messages in thread
From: Damien Le Moal @ 2023-01-28  0:59 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel
  Cc: Paolo Valente, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-ide, linux-block

On 1/28/23 09:47, Bart Van Assche wrote:
> On 1/27/23 16:40, Damien Le Moal wrote:
>> On 1/28/23 02:23, Bart Van Assche wrote:
>>> I hope that I have it made it clear that I think that the proposed user
>>> space API will be very painful to use for application developers.
>>
>> I completely disagree. Reusing the prio class/level API made it easy to allow
>> applications to use the feature. fio support for CDL requires exactly *one line*
>> change, to allow for the CDL class number 4. That's it. From there, one can use
>> the --cmdprio_class=4 nd --cmdprio=idx options to exercise a drive. The value of
>> "idx" here of course depends on how the descriptors are set on the drive. But
>> back to the point above. This depends on the application goals and the
>> descriptors are set accordingly for that goal. There is no real discovery needed
>> by the application. The application expect a certain set of CDL limits for its
>> use case, and checking that this set is the one currently defined on the drive
>> is easy to do from an application with the sysfs interface we added.
>>
>> Many users out there have deployed and using applications taking advantage of
>> ATA NCQ priority feature, using class RT for high priority IOs. The new CDL
>> class does not require many application changes to be enabled for next gen
>> drives that will have CDL.
>   As I mentioned before, the new I/O priority class IOPRIO_CLASS_DL 
> makes it impossible to use a single I/O priority class across devices 
> that support CDL and devices that do not support CDL. I'm surprised that 
> you keep denying that IOPRIO_CLASS_DL is a royal pain for users who have 
> to support devices that support CDL and devices that do not support CDL.

I am not denying anything. I simply keep telling you that CDL is not a generic
feature for random applications to use, including those that already use
RT/BE/IDLE. It is for applications that know and expect it, and so have a setup
suited for CDL use down to the drive CDL descriptors. That includes DM setups.

Thinking about CDL in a generic setup for any random application to use is
nonsense. And even if that happens and a user not knowing about it still tries
it, then, as mentioned, nothing bad will happen. Using CDL in a setup that does
not support it is a NOP. That would be the same as not using it.

> 
> Bart.

-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH v3 07/18] scsi: sd: detect support for command duration limits
  2023-01-28  0:51     ` Damien Le Moal
@ 2023-01-28  2:52       ` Bart Van Assche
  2023-01-29  2:05         ` Damien Le Moal
  0 siblings, 1 reply; 82+ messages in thread
From: Bart Van Assche @ 2023-01-28  2:52 UTC (permalink / raw)
  To: Damien Le Moal, Hannes Reinecke, Niklas Cassel,
	James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/27/23 16:51, Damien Le Moal wrote:
> On 1/27/23 22:00, Hannes Reinecke wrote:
>> Hmm. Calling this during revalidate() makes sense, but how can we ensure
>> that we call revalidate() when the user issues a MODE_SELECT command?
> 
> Given that CDLs can be changed with a passthrough command, I do not think we can
> do anything about that, unfortunately. But I think the same is true of many
> things like that. E.g. "let's turn onf/off the write cache without the kernel
> noticing"... But given that on a normal system only privileged applications can
> do passthrough, if that happens, then the system has been hacked or the user is
> shooting himself in the foot.
> 
> cdl-tools project (cdladm utility) uses passtrhough but triggers a revalidate
> after changing CDLs to make sure sysfs stays in sync.
> 
> As Christoph suggested, we could change all this to an ioctl(GET_CDL) for
> applications... But sysfs is so much simpler in my opinion, not to mention that
> it allows access to the information for any application written in a language
> that does not have ioctl() or an equivalent.
> 
> cdl-tools has a test suite all written in bash scripts thanks to the sysfs
> interface :)

My understanding is that combining the sd driver with SCSI pass-through 
is not supported and also that there are no plans to support this 
combination.

Martin, please correct me if I got this wrong.

Thanks,

Bart.



* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-28  0:59                                   ` Damien Le Moal
@ 2023-01-28 20:25                                     ` Martin K. Petersen
  2023-01-29  3:52                                       ` Damien Le Moal
  2023-01-30 19:13                                       ` Bart Van Assche
  0 siblings, 2 replies; 82+ messages in thread
From: Martin K. Petersen @ 2023-01-28 20:25 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Niklas Cassel, Paolo Valente, Jens Axboe,
	Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block


Damien,

Finally had a window where I could sit down and read this extremely long
thread.

> I am not denying anything. I simply keep telling you that CDL is not a
> generic feature for random applications to use, including those that
> already use RT/BE/IDLE. It is for applications that know and expect
> it, and so have a setup suited for CDL use down to the drive CDL
> descriptors. That includes DM setups.
>
> Thinking about CDL in a generic setup for any random application to
> use is nonsense. And even if that happens and a user not knowing about
> it still tries it, than as mentioned, nothing bad will happen. Using
> CDL in a setup that does not support it is a NOP. That would be the
> same as not using it.

My observations:

 - Wrt. ioprio as conduit, I personally really dislike the idea of
   conflating priority (relative performance wrt. other I/O) with CDL
   (which is a QoS concept). I would really prefer those things to be
   separate. However, I do think that the ioprio *interface* is a good
   fit. A tool like ionice seems like a reasonable approach to letting
   generic applications set their CDL.

   If bio space wasn't a premium, I'd say just keep things separate.
   But given the inherent incompatibility between kernel I/O scheduling
   and CDL, it's probably not worth the hassle to separate them. As much
   as it pains me to mix two concepts which should be completely
   orthogonal.

   I wish we could let applications specify both a priority and a CDL at
   the same time, though. Even if the kernel plumbing in the short term
   ends up using bi_ioprio as conduit. It's much harder to make changes
   in the application interface at a later date.

 - Wrt. "CDL is not a generic feature", I think you are underestimating
   how many applications actually want something like this. We have
   many.

   I don't think we should design for "special interest only, needs
   custom device tweaking to be usable". We have been down that path
   before (streams, etc.). And with poor results.

   I/O hints also tanked but at least we tried to pre-define performance
   classes that made sense in an abstract fashion. And programmed the
   mode page on discovered devices so that the classes were identical
   across all disks, regardless of whether they were SSDs or million
   dollar arrays. This allowed filesystems to communicate "this is
   metadata" regardless of the device the I/O was going to. Instead of
   "index 5 on this device" but "index 42 on the mirror".

   As such, I don't like the "just customize your settings with
   cdltools" approach. I'd much rather see us try to define a few QoS
   classes that make sense that would apply to every app and use those
   to define the application interface. And then have the kernel program
   those CDL classes into SCSI/ATA devices by default.

   Having the kernel provide an abstract interface for bio QoS and
   configuring a new disk with a sane handful of classes does not
   prevent $CLOUD_VENDOR from overriding what Linux configured. But at
   least we'd have a generic approach to block QoS in Linux. Similar to
   the existing I/O priority infrastructure which is also not tied to
   any particular hardware feature.

   A generic implementation also allows us to do fancy things in the
   hypervisor where we would like to be able to do QoS across multiple
   devices as well. Without having ATA or SCSI with CDL involved. Or
   whatever things might look like in NVMe.

-- 
Martin K. Petersen	Oracle Linux Engineering


* Re: [PATCH v3 07/18] scsi: sd: detect support for command duration limits
  2023-01-28  2:52       ` Bart Van Assche
@ 2023-01-29  2:05         ` Damien Le Moal
  0 siblings, 0 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-29  2:05 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, Niklas Cassel,
	James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/28/23 11:52, Bart Van Assche wrote:
> On 1/27/23 16:51, Damien Le Moal wrote:
>> On 1/27/23 22:00, Hannes Reinecke wrote:
>>> Hmm. Calling this during revalidate() makes sense, but how can we ensure
>>> that we call revalidate() when the user issues a MODE_SELECT command?
>>
>> Given that CDLs can be changed with a passthrough command, I do not think we can
>> do anything about that, unfortunately. But I think the same is true of many
>> things like that. E.g. "let's turn onf/off the write cache without the kernel
>> noticing"... But given that on a normal system only privileged applications can
>> do passthrough, if that happens, then the system has been hacked or the user is
>> shooting himself in the foot.
>>
>> cdl-tools project (cdladm utility) uses passtrhough but triggers a revalidate
>> after changing CDLs to make sure sysfs stays in sync.
>>
>> As Christoph suggested, we could change all this to an ioctl(GET_CDL) for
>> applications... But sysfs is so much simpler in my opinion, not to mention that
>> it allows access to the information for any application written in a language
>> that does not have ioctl() or an equivalent.
>>
>> cdl-tools has a test suite all written in bash scripts thanks to the sysfs
>> interface :)
> 
> My understanding is that combining the sd driver with SCSI pass-through 
> is not supported and also that there are no plans to support this 
> combination.

Yes. Correct. Passthrough commands do not use sd. That is why cdl-tools triggers
a revalidate once it is done with changing the CDL descriptors using passthrough
commands.

> 
> Martin, please correct me if I got this wrong.
> 
> Thanks,
> 
> Bart.
> 

-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-28 20:25                                     ` Martin K. Petersen
@ 2023-01-29  3:52                                       ` Damien Le Moal
  2023-01-30 13:44                                         ` Hannes Reinecke
                                                           ` (3 more replies)
  2023-01-30 19:13                                       ` Bart Van Assche
  1 sibling, 4 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-29  3:52 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Bart Van Assche, Niklas Cassel, Paolo Valente, Jens Axboe,
	Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block, Bart Van Assche

On 1/29/23 05:25, Martin K. Petersen wrote:
> 
> Damien,
> 
> Finally had a window where I could sit down a read this extremely long
> thread.
> 
>> I am not denying anything. I simply keep telling you that CDL is not a
>> generic feature for random applications to use, including those that
>> already use RT/BE/IDLE. It is for applications that know and expect
>> it, and so have a setup suited for CDL use down to the drive CDL
>> descriptors. That includes DM setups.
>>
>> Thinking about CDL in a generic setup for any random application to
>> use is nonsense. And even if that happens and a user not knowing about
>> it still tries it, then as mentioned, nothing bad will happen. Using
>> CDL in a setup that does not support it is a NOP. That would be the
>> same as not using it.
> 
> My observations:
> 
>  - Wrt. ioprio as conduit, I personally really dislike the idea of
>    conflating priority (relative performance wrt. other I/O) with CDL
>    (which is a QoS concept). I would really prefer those things to be
>    separate. However, I do think that the ioprio *interface* is a good
>    fit. A tool like ionice seems like a reasonable approach to letting
>    generic applications set their CDL.

The definition of IOPRIO_CLASS_CDL was more about reusing the ioprio *interface*
rather than having CDL support defined as a fully functional IO priority class.
As I argued in this thread, and I think you agree, CDL semantics are more than
the simple priority class/level ordering.

>    If bio space wasn't a premium, I'd say just keep things separate.
>    But given the inherent incompatibility between kernel I/O scheduling
>    and CDL, it's probably not worth the hassle to separate them. As much
>    as it pains me to mix two concepts which should be completely
>    orthogonal.
> 
>    I wish we could let applications specify both a priority and a CDL at
>    the same time, though. Even if the kernel plumbing in the short term
>    ends up using bi_ioprio as conduit. It's much harder to make changes
>    in the application interface at a later date.

See below. There may be a solution for that.

>  - Wrt. "CDL is not a generic feature", I think you are underestimating
>    how many applications actually want something like this. We have
>    many.
> 
>    I don't think we should design for "special interest only, needs
>    custom device tweaking to be usable". We have been down that path
>    before (streams, etc.). And with poor results.

OK.

>    I/O hints also tanked but at least we tried to pre-define performance
>    classes that made sense in an abstract fashion. And programmed the
>    mode page on discovered devices so that the classes were identical
>    across all disks, regardless of whether they were SSDs or million
>    dollar arrays. This allowed filesystems to communicate "this is
>    metadata" regardless of the device the I/O was going to. Instead of
>    "index 5 on this device" but "index 42 on the mirror".
> 
>    As such, I don't like the "just customize your settings with
>    cdltools" approach. I'd much rather see us try to define a few QoS
>    classes that make sense that would apply to every app and use those
>    to define the application interface. And then have the kernel program
>    those CDL classes into SCSI/ATA devices by default.

Makes sense. Though I think it will be hard to define a set of QoS hints that
are useful for a wide range of applications, and even harder to convert the
defined hint classes to CDL descriptors. I fear that we may end up with the same
issues as IO hints/streams.

>    Having the kernel provide an abstract interface for bio QoS and
>    configuring a new disk with a sane handful of classes does not
>    prevent $CLOUD_VENDOR from overriding what Linux configured. But at
>    least we'd have a generic approach to block QoS in Linux. Similar to
>    the existing I/O priority infrastructure which is also not tied to
>    any particular hardware feature.

OK. See below about this.

>    A generic implementation also allows us to do fancy things in the
>    hypervisor where we would like to be able to do QoS across multiple
>    devices as well. Without having ATA or SCSI with CDL involved. Or
>    whatever things might look like in NVMe.

Fair point, especially given that virtio actually already forwards a guest
ioprio to the host through the virtio block command. Thinking of that particular
point together with what you said, I came up with the change shown below as a
replacement for this patch 1/18.

This changes the 13-bit ioprio data into a 3-bit QOS hint + a 3-bit IO priority
level. This is consistent with the IO prio interface since IO priority levels
have to be between 0 and 7 (otherwise, errors are returned). So in fact, the
upper 10 bits of the ioprio data are unused and we can safely use 3 of these
bits for an IO hint.

This hint applies to all priority classes and levels, that is, for the CDL case,
we can enrich any priority with a hint that specifies the CDL index to use for
an IO.
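
To make the bit layout concrete, here is a small user space sketch (not part of
the patch). It assumes the existing 16-bit ioprio layout with the class in the
top 3 bits, and copies the level and hint values from the diff below. The EX_
prefixed names and the ex_ioprio_value() helper are hypothetical, for
illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Existing ioprio layout: 3-bit class above 13 bits of data (local copy). */
#define EX_IOPRIO_CLASS_SHIFT		13
#define EX_IOPRIO_PRIO_CLASS(ioprio)	((ioprio) >> EX_IOPRIO_CLASS_SHIFT)

/* Values copied from the proposed change to include/uapi/linux/ioprio.h. */
#define EX_IOPRIO_NR_LEVELS		8
#define EX_IOPRIO_PRIO_LEVEL_MASK	(EX_IOPRIO_NR_LEVELS - 1)
#define EX_IOPRIO_QOS_HINT_SHIFT	10
#define EX_IOPRIO_NR_QOS_HINTS		8
#define EX_IOPRIO_QOS_HINT_MASK		(EX_IOPRIO_NR_QOS_HINTS - 1)
#define EX_IOPRIO_PRIO_LEVEL(ioprio)	((ioprio) & EX_IOPRIO_PRIO_LEVEL_MASK)
#define EX_IOPRIO_QOS_HINT(ioprio)	\
	(((ioprio) >> EX_IOPRIO_QOS_HINT_SHIFT) & EX_IOPRIO_QOS_HINT_MASK)

/* Hypothetical helper: pack a class, a QOS hint and a level into an ioprio. */
static inline uint16_t ex_ioprio_value(unsigned int prio_class,
				       unsigned int hint, unsigned int level)
{
	return (uint16_t)((prio_class << EX_IOPRIO_CLASS_SHIFT) |
			  ((hint & EX_IOPRIO_QOS_HINT_MASK) <<
			   EX_IOPRIO_QOS_HINT_SHIFT) |
			  (level & EX_IOPRIO_PRIO_LEVEL_MASK));
}
```

With class BE (2), hint 5 and level 4, ex_ioprio_value() gives 0x5404. An
application that only sets a class and a level leaves bits 10 to 12 at zero, so
it gets hint 0, which for the CDL case would mean no descriptor and no limit.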

This falls short of actually defining generic IO hints, but it has the
advantage of not breaking anything for current applications using IO priorities
and of not requiring any change to existing IO schedulers, while still allowing
CDL indexes for IOs to be passed down to the scsi & ATA layers (which for now
would be the only layers in the kernel acting on the ioprio qos hints).

I think that this approach still allows us to enable CDL support, and on top of
it, go further and define generic QOS hints that IO schedulers can use and that
also potentially map to CDL for scsi & ata (similarly to the RT class IOs
mapping to the NCQ priority feature if the user enabled that feature).

As mentioned above, I think that defining generic IO hint classes will be
difficult. But the change below is, I think, a good starting point that should
not prevent working on that.

Thoughts ?

Bart,

Given that you did not like the IOPRIO_CLASS_CDL, what do you think of this
approach ?


diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ccf2204477a5..9b3c8fb806f1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5378,11 +5378,11 @@ bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
 		bfqq->new_ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->new_ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
 		bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->new_ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
 		bfqq->new_ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
@@ -5671,7 +5671,7 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 				       struct bfq_io_cq *bic,
 				       bool respawn)
 {
-	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+	const int ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
 	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
 	struct bfq_queue **async_bfqq = NULL;
 	struct bfq_queue *bfqq;
diff --git a/block/ioprio.c b/block/ioprio.c
index 32a456b45804..33f327a10811 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -33,7 +33,7 @@
 int ioprio_check_cap(int ioprio)
 {
 	int class = IOPRIO_PRIO_CLASS(ioprio);
-	int data = IOPRIO_PRIO_DATA(ioprio);
+	int level = IOPRIO_PRIO_LEVEL(ioprio);

 	switch (class) {
 		case IOPRIO_CLASS_RT:
@@ -49,13 +49,13 @@ int ioprio_check_cap(int ioprio)
 			fallthrough;
 			/* rt has prio field too */
 		case IOPRIO_CLASS_BE:
-			if (data >= IOPRIO_NR_LEVELS || data < 0)
+			if (level >= IOPRIO_NR_LEVELS || level < 0)
 				return -EINVAL;
 			break;
 		case IOPRIO_CLASS_IDLE:
 			break;
 		case IOPRIO_CLASS_NONE:
-			if (data)
+			if (level)
 				return -EINVAL;
 			break;
 		default:
diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
index 83a366f3ee80..eba23e6a7bf6 100644
--- a/fs/f2fs/sysfs.c
+++ b/fs/f2fs/sysfs.c
@@ -314,7 +314,7 @@ static ssize_t f2fs_sbi_show(struct f2fs_attr *a,
 		struct ckpt_req_control *cprc = &sbi->cprc_info;
 		int len = 0;
 		int class = IOPRIO_PRIO_CLASS(cprc->ckpt_thread_ioprio);
-		int data = IOPRIO_PRIO_DATA(cprc->ckpt_thread_ioprio);
+		int level = IOPRIO_PRIO_LEVEL(cprc->ckpt_thread_ioprio);

 		if (class == IOPRIO_CLASS_RT)
 			len += scnprintf(buf + len, PAGE_SIZE - len, "rt,");
@@ -323,7 +323,7 @@ static ssize_t f2fs_sbi_show(struct f2fs_attr *a,
 		else
 			return -EINVAL;

-		len += scnprintf(buf + len, PAGE_SIZE - len, "%d\n", data);
+		len += scnprintf(buf + len, PAGE_SIZE - len, "%d\n", level);
 		return len;
 	}

diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index f70f2596a6bf..1d90349a19c9 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -37,6 +37,18 @@ enum {
 #define IOPRIO_NR_LEVELS	8
 #define IOPRIO_BE_NR		IOPRIO_NR_LEVELS

+/*
+ * The 13-bits of ioprio data for each class provide up to 8 QOS hints and
+ * up to 8 priority levels.
+ */
+#define IOPRIO_PRIO_LEVEL_MASK	(IOPRIO_NR_LEVELS - 1)
+#define IOPRIO_QOS_HINT_SHIFT	10
+#define IOPRIO_NR_QOS_HINTS	8
+#define IOPRIO_QOS_HINT_MASK	(IOPRIO_NR_QOS_HINTS - 1)
+#define IOPRIO_PRIO_LEVEL(ioprio)	((ioprio) & IOPRIO_PRIO_LEVEL_MASK)
+#define IOPRIO_QOS_HINT(ioprio)	\
+	(((ioprio) >> IOPRIO_QOS_HINT_SHIFT) & IOPRIO_QOS_HINT_MASK)
+
 enum {
 	IOPRIO_WHO_PROCESS = 1,
 	IOPRIO_WHO_PGRP,


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-29  3:52                                       ` Damien Le Moal
@ 2023-01-30 13:44                                         ` Hannes Reinecke
  2023-01-30 14:55                                           ` Damien Le Moal
  2023-01-30 19:24                                         ` Bart Van Assche
                                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-30 13:44 UTC (permalink / raw)
  To: Damien Le Moal, Martin K. Petersen
  Cc: Bart Van Assche, Niklas Cassel, Paolo Valente, Jens Axboe,
	Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/29/23 04:52, Damien Le Moal wrote:
> On 1/29/23 05:25, Martin K. Petersen wrote:
[ .. ]
>>
>>     As such, I don't like the "just customize your settings with
>>     cdltools" approach. I'd much rather see us try to define a few QoS
>>     classes that make sense that would apply to every app and use those
>>     to define the application interface. And then have the kernel program
>>     those CDL classes into SCSI/ATA devices by default.
> 
> Makes sense. Though I think it will be hard to define a set of QoS hints that
> are useful for a wide range of applications, and even harder to convert the
> defined hint classes to CDL descriptors. I fear that we may end up with the same
> issues as IO hints/streams.
> 
>>     Having the kernel provide an abstract interface for bio QoS and
>>     configuring a new disk with a sane handful of classes does not
>>     prevent $CLOUD_VENDOR from overriding what Linux configured. But at
>>     least we'd have a generic approach to block QoS in Linux. Similar to
>>     the existing I/O priority infrastructure which is also not tied to
>>     any particular hardware feature.
> 
> OK. See below about this.
> 
>>     A generic implementation also allows us to do fancy things in the
>>     hypervisor where we would like to be able to do QoS across multiple
>>     devices as well. Without having ATA or SCSI with CDL involved. Or
>>     whatever things might look like in NVMe.
> 
> Fair point, especially given that virtio actually already forwards a guest
> ioprio to the host through the virtio block command. Thinking of that particular
> point together with what you said, I came up with the change shown below as a
> replacement for this patch 1/18.
> 
> This changes the 13-bits ioprio data into a 3-bits QOS hint + 3-bits of IO prio
> level. This is consistent with the IO prio interface since IO priority levels
> have to be between 0 and 7 (otherwise, errors are returned). So in fact, the
> upper 10-bits of the ioprio data are ignored and we can safely use 3 of these
> bits for an IO hint.
> 
> This hint applies to all priority classes and levels, that is, for the CDL case,
> we can enrich any priority with a hint that specifies the CDL index to use for
> an IO.
> 
> This falls short of actually defining generic IO hints, but this has the
> advantage to not break anything for current applications using IO priorities,
> not require any change to existing IO schedulers, while still allowing to pass
> CDL indexes for IOs down to the scsi & ATA layers (which for now would be the
> only layers in the kernel acting on the ioprio qos hints).
> 
> I think that this approach still allows us to enable CDL support, and on top of
> it, go further and define generic QOS hints that IO scheduler can use and that
> also potentially map to CDL for scsi & ata (similarly to the RT class IOs
> mapping to the NCQ priority feature if the user enabled that feature).
> 
> As mentioned above, I think that defining generic IO hint classes will be
> difficult. But the change below is, I think, a good starting point that should
> not prevent working on that.
> 
> Thoughts ?
> 
I like the idea.
QoS is one of the recurring topics that always comes up sooner or later when
talking about storage networks, so having _some_ concept of QoS in the
Linux kernel (for storage) would be beneficial.

Maybe time for a topic at LSF?

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-30 13:44                                         ` Hannes Reinecke
@ 2023-01-30 14:55                                           ` Damien Le Moal
  0 siblings, 0 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-30 14:55 UTC (permalink / raw)
  To: Hannes Reinecke, Martin K. Petersen
  Cc: Bart Van Assche, Niklas Cassel, Paolo Valente, Jens Axboe,
	Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/30/23 22:44, Hannes Reinecke wrote:
> On 1/29/23 04:52, Damien Le Moal wrote:
>> On 1/29/23 05:25, Martin K. Petersen wrote:
> [ .. ]
>>>
>>>     As such, I don't like the "just customize your settings with
>>>     cdltools" approach. I'd much rather see us try to define a few QoS
>>>     classes that make sense that would apply to every app and use those
>>>     to define the application interface. And then have the kernel program
>>>     those CDL classes into SCSI/ATA devices by default.
>>
>> Makes sense. Though I think it will be hard to define a set of QoS hints that
>> are useful for a wide range of applications, and even harder to convert the
>> defined hint classes to CDL descriptors. I fear that we may end up with the same
>> issues as IO hints/streams.
>>
>>>     Having the kernel provide an abstract interface for bio QoS and
>>>     configuring a new disk with a sane handful of classes does not
>>>     prevent $CLOUD_VENDOR from overriding what Linux configured. But at
>>>     least we'd have a generic approach to block QoS in Linux. Similar to
>>>     the existing I/O priority infrastructure which is also not tied to
>>>     any particular hardware feature.
>>
>> OK. See below about this.
>>
>>>     A generic implementation also allows us to do fancy things in the
>>>     hypervisor where we would like to be able to do QoS across multiple
>>>     devices as well. Without having ATA or SCSI with CDL involved. Or
>>>     whatever things might look like in NVMe.
>>
>> Fair point, especially given that virtio actually already forwards a guest
>> ioprio to the host through the virtio block command. Thinking of that particular
>> point together with what you said, I came up with the change shown below as a
>> replacement for this patch 1/18.
>>
>> This changes the 13-bits ioprio data into a 3-bits QOS hint + 3-bits of IO prio
>> level. This is consistent with the IO prio interface since IO priority levels
>> have to be between 0 and 7 (otherwise, errors are returned). So in fact, the
>> upper 10-bits of the ioprio data are ignored and we can safely use 3 of these
>> bits for an IO hint.
>>
>> This hint applies to all priority classes and levels, that is, for the CDL case,
>> we can enrich any priority with a hint that specifies the CDL index to use for
>> an IO.
>>
>> This falls short of actually defining generic IO hints, but this has the
>> advantage to not break anything for current applications using IO priorities,
>> not require any change to existing IO schedulers, while still allowing to pass
>> CDL indexes for IOs down to the scsi & ATA layers (which for now would be the
>> only layers in the kernel acting on the ioprio qos hints).
>>
>> I think that this approach still allows us to enable CDL support, and on top of
>> it, go further and define generic QOS hints that IO scheduler can use and that
>> also potentially map to CDL for scsi & ata (similarly to the RT class IOs
>> mapping to the NCQ priority feature if the user enabled that feature).
>>
>> As mentioned above, I think that defining generic IO hint classes will be
>> difficult. But the change below is, I think, a good starting point that should
>> not prevent working on that.
>>
>> Thoughts ?
>>
> I like the idea.
> QoS is one of the recurring topics always coming up sooner or later when 
> talking of storage networks, so having _some_ concept of QoS in the 
> linux kernel (for storage) would be beneficial.
> 
> Maybe time for a topic at LSF?

Yes. I was hoping for a quicker resolution so that we can get the CDL
"mechanical" bits in, but without a nice API for it, we cannot :)
Trying to compile something with Niklas. So far, we are thinking of having
QOS flags + QOS data, the flags determining how (and if) the QOS data is used
and what it means.

Examples of things we could have:
* IOPRIO_QOS_FAILFAST: do not retry the IO if it fails the first time
* IOPRIO_QOS_DURATION_LIMIT: the QOS data then indicates the limit to use
(a number). That can be implemented in schedulers and can also map to CDL on
drives that support that feature.

That is the difficult part: what else ? For now, considering only our target of
adding scsi & ata CDL support, the above is enough. But is that enough in
general for most users/apps ?
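
A rough sketch of how "flags + data" could be decoded, to make the idea
concrete. Only the two flag names come from the list above (shortened here with
an EX_ prefix). The bit positions, the masks and the helper functions are
assumptions made up for this example; nothing about the layout is settled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout inside the 13 bits of ioprio data. */
#define EX_QOS_FLAGS_SHIFT	3	/* assumed: flags just above the level */
#define EX_QOS_FLAGS_MASK	0x3U
#define EX_QOS_FAILFAST		(1U << 0)	/* do not retry a failed IO */
#define EX_QOS_DURATION_LIMIT	(1U << 1)	/* QOS data is a limit number */
#define EX_QOS_DATA_SHIFT	10	/* assumed: data in the top 3 bits */
#define EX_QOS_DATA_MASK	0x7U

static inline unsigned int ex_qos_flags(uint16_t ioprio)
{
	return (ioprio >> EX_QOS_FLAGS_SHIFT) & EX_QOS_FLAGS_MASK;
}

/*
 * Return the duration limit number for an IO, or 0 (no limit) when the
 * DURATION_LIMIT flag is not set, whatever the QOS data bits contain.
 */
static inline unsigned int ex_qos_duration_limit(uint16_t ioprio)
{
	if (!(ex_qos_flags(ioprio) & EX_QOS_DURATION_LIMIT))
		return 0;
	return (ioprio >> EX_QOS_DATA_SHIFT) & EX_QOS_DATA_MASK;
}

static inline bool ex_qos_failfast(uint16_t ioprio)
{
	return (ex_qos_flags(ioprio) & EX_QOS_FAILFAST) != 0;
}
```

The point of the flags is that the QOS data is only interpreted when a flag
says how: without EX_QOS_DURATION_LIMIT set, the data bits are not treated as a
limit at all.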

> 
> Cheers,
> 
> Hannes
> 

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 08/18] scsi: sd: set read/write commands CDL index
  2023-01-28  0:03     ` Damien Le Moal
@ 2023-01-30 18:13       ` Hannes Reinecke
  0 siblings, 0 replies; 82+ messages in thread
From: Hannes Reinecke @ 2023-01-30 18:13 UTC (permalink / raw)
  To: Damien Le Moal, Niklas Cassel, James E.J. Bottomley, Martin K. Petersen
  Cc: Christoph Hellwig, linux-scsi, linux-ide, linux-block

On 1/28/23 01:03, Damien Le Moal wrote:
> On 1/28/23 00:30, Hannes Reinecke wrote:
>> On 1/24/23 20:02, Niklas Cassel wrote:
>>> From: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>>>
>>> Introduce the command duration limits helper function
>>> sd_cdl_cmd_limit() to retrieve and set the DLD bits of the
>>> READ/WRITE 16 and READ/WRITE 32 commands to indicate to the device
>>> the command duration limit descriptor to apply to the command.
>>>
>>> When command duration limits are enabled, sd_cdl_cmd_limit() obtains the
>>> index of the descriptor to apply to the command for requests that have
>>> the IOPRIO_CLASS_DL priority class with a priority data specifying a
>>> valid descriptor index (1 to 7).
>>>
>>> The read-write sysfs attribute "enable" is introduced to control
>>> setting the command duration limits indexes. If this attribute is set
>>> to 0 (default), command duration limits specified by the user are
>>> ignored. The user must set this attribute to 1 for command duration
>>> limits to be set. Enabling and disabling the command duration limits
>>> feature for ATA devices must be done using the ATA feature sub-page of
>>> the control mode page. The sd_cdl_enable() function is introduced to
>>> check if this mode page is supported by the device and if it is, use
>>> it to enable/disable CDL.
>>>
>>> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
>>> Co-developed-by: Niklas Cassel <niklas.cassel@wdc.com>
>>> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
>>> ---
>>>    drivers/scsi/sd.c     |  16 +++--
>>>    drivers/scsi/sd.h     |  10 ++++
>>>    drivers/scsi/sd_cdl.c | 134 +++++++++++++++++++++++++++++++++++++++++-
>>>    3 files changed, 152 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>>> index 7879a5470773..d2eb01337943 100644
>>> --- a/drivers/scsi/sd.c
>>> +++ b/drivers/scsi/sd.c
>>> @@ -1045,13 +1045,14 @@ static blk_status_t sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
>>>    
>>>    static blk_status_t sd_setup_rw32_cmnd(struct scsi_cmnd *cmd, bool write,
>>>    				       sector_t lba, unsigned int nr_blocks,
>>> -				       unsigned char flags)
>>> +				       unsigned char flags, unsigned int dld)
>>>    {
>>>    	cmd->cmd_len = SD_EXT_CDB_SIZE;
>>>    	cmd->cmnd[0]  = VARIABLE_LENGTH_CMD;
>>>    	cmd->cmnd[7]  = 0x18; /* Additional CDB len */
>>>    	cmd->cmnd[9]  = write ? WRITE_32 : READ_32;
>>>    	cmd->cmnd[10] = flags;
>>> +	cmd->cmnd[11] = dld & 0x07;
>>>    	put_unaligned_be64(lba, &cmd->cmnd[12]);
>>>    	put_unaligned_be32(lba, &cmd->cmnd[20]); /* Expected Indirect LBA */
>>>    	put_unaligned_be32(nr_blocks, &cmd->cmnd[28]);
>>> @@ -1061,12 +1062,12 @@ static blk_status_t sd_setup_rw32_cmnd(struct scsi_cmnd *cmd, bool write,
>>>    
>>>    static blk_status_t sd_setup_rw16_cmnd(struct scsi_cmnd *cmd, bool write,
>>>    				       sector_t lba, unsigned int nr_blocks,
>>> -				       unsigned char flags)
>>> +				       unsigned char flags, unsigned int dld)
>>>    {
>>>    	cmd->cmd_len  = 16;
>>>    	cmd->cmnd[0]  = write ? WRITE_16 : READ_16;
>>> -	cmd->cmnd[1]  = flags;
>>> -	cmd->cmnd[14] = 0;
>>> +	cmd->cmnd[1]  = flags | ((dld >> 2) & 0x01);
>>> +	cmd->cmnd[14] = (dld & 0x03) << 6;
>>>    	cmd->cmnd[15] = 0;
>>>    	put_unaligned_be64(lba, &cmd->cmnd[2]);
>>>    	put_unaligned_be32(nr_blocks, &cmd->cmnd[10]);
>>> @@ -1129,6 +1130,7 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>>>    	unsigned int mask = logical_to_sectors(sdp, 1) - 1;
>>>    	bool write = rq_data_dir(rq) == WRITE;
>>>    	unsigned char protect, fua;
>>> +	unsigned int dld = 0;
>>>    	blk_status_t ret;
>>>    	unsigned int dif;
>>>    	bool dix;
>>> @@ -1178,6 +1180,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>>>    	fua = rq->cmd_flags & REQ_FUA ? 0x8 : 0;
>>>    	dix = scsi_prot_sg_count(cmd);
>>>    	dif = scsi_host_dif_capable(cmd->device->host, sdkp->protection_type);
>>> +	if (sd_cdl_enabled(sdkp))
>>> +		dld = sd_cdl_dld(sdkp, cmd);
>>>    
>>>    	if (dif || dix)
>>>    		protect = sd_setup_protect_cmnd(cmd, dix, dif);
>>> @@ -1186,10 +1190,10 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>>>    
>>>    	if (protect && sdkp->protection_type == T10_PI_TYPE2_PROTECTION) {
>>>    		ret = sd_setup_rw32_cmnd(cmd, write, lba, nr_blocks,
>>> -					 protect | fua);
>>> +					 protect | fua, dld);
>>>    	} else if (sdp->use_16_for_rw || (nr_blocks > 0xffff)) {
>>>    		ret = sd_setup_rw16_cmnd(cmd, write, lba, nr_blocks,
>>> -					 protect | fua);
>>> +					 protect | fua, dld);
>>>    	} else if ((nr_blocks > 0xff) || (lba > 0x1fffff) ||
>>>    		   sdp->use_10_for_rw || protect) {
>>>    		ret = sd_setup_rw10_cmnd(cmd, write, lba, nr_blocks,
>>> diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
>>> index e60d33bd222a..5b6b6dc4b92d 100644
>>> --- a/drivers/scsi/sd.h
>>> +++ b/drivers/scsi/sd.h
>>> @@ -130,8 +130,11 @@ struct sd_cdl_page {
>>>    	struct sd_cdl_desc      descs[SD_CDL_MAX_DESC];
>>>    };
>>>    
>>> +struct scsi_disk;
>>> +
>>>    struct sd_cdl {
>>>    	struct kobject		kobj;
>>> +	struct scsi_disk	*sdkp;
>>>    	bool			sysfs_registered;
>>>    	u8			perf_vs_duration_guideline;
>>>    	struct sd_cdl_page	pages[SD_CDL_RW];
>>> @@ -188,6 +191,7 @@ struct scsi_disk {
>>>    	u8		zeroing_mode;
>>>    	u8		nr_actuators;		/* Number of actuators */
>>>    	struct sd_cdl	*cdl;
>>> +	unsigned	cdl_enabled : 1;
>>>    	unsigned	ATO : 1;	/* state of disk ATO bit */
>>>    	unsigned	cache_override : 1; /* temp override of WCE,RCD */
>>>    	unsigned	WCE : 1;	/* state of disk WCE bit */
>>> @@ -355,5 +359,11 @@ void sd_print_result(const struct scsi_disk *sdkp, const char *msg, int result);
>>>    /* Command duration limits support (in sd_cdl.c) */
>>>    void sd_read_cdl(struct scsi_disk *sdkp, unsigned char *buf);
>>>    void sd_cdl_release(struct scsi_disk *sdkp);
>>> +int sd_cdl_dld(struct scsi_disk *sdkp, struct scsi_cmnd *scmd);
>>> +
>>> +static inline bool sd_cdl_enabled(struct scsi_disk *sdkp)
>>> +{
>>> +	return sdkp->cdl && sdkp->cdl_enabled;
>>> +}
>>>    
>>>    #endif /* _SCSI_DISK_H */
>>> diff --git a/drivers/scsi/sd_cdl.c b/drivers/scsi/sd_cdl.c
>>> index 513cd989f19a..59d02dbb5ea1 100644
>>> --- a/drivers/scsi/sd_cdl.c
>>> +++ b/drivers/scsi/sd_cdl.c
>>> @@ -93,6 +93,63 @@ static const char *sd_cdl_policy_name(u8 policy)
>>>    	}
>>>    }
>>>    
>>> +/*
>>> + * Enable/disable CDL.
>>> + */
>>> +static int sd_cdl_enable(struct scsi_disk *sdkp, bool enable)
>>> +{
>>> +	struct scsi_device *sdp = sdkp->device;
>>> +	struct scsi_mode_data data;
>>> +	struct scsi_sense_hdr sshdr;
>>> +	struct scsi_vpd *vpd;
>>> +	bool is_ata = false;
>>> +	char buf[64];
>>> +	int ret;
>>> +
>>> +	rcu_read_lock();
>>> +	vpd = rcu_dereference(sdp->vpd_pg89);
>>> +	if (vpd)
>>> +		is_ata = true;
>>> +	rcu_read_unlock();
>>> +
>>> +	/*
>>> +	 * For ATA devices, CDL needs to be enabled with a SET FEATURES command.
>>> +	 */
>>> +	if (is_ata) {
>>> +		char *buf_data;
>>> +		int len;
>>> +
>>> +		ret = scsi_mode_sense(sdp, 0x08, 0x0a, 0xf2, buf, sizeof(buf),
>>> +				      SD_TIMEOUT, sdkp->max_retries, &data,
>>> +				      NULL);
>>> +		if (ret)
>>> +			return -EINVAL;
>>> +
>> That is a tad odd.
>> Is CDL always enabled for 'normal' SCSI?
> 
> Yes it is on the device side. There is no mode sense to turn it on/off. Not sure
> why it was designed like that in the specs... The sysfs duration_limits/enable
> attribute is a "soft" on/off switch and it is off by default, even for drives
> reporting supporting CDL.
> Hence the "if (is_ata)" to do the mode sense to enable the feature on the device
> side only for ATA devices. We need this to avoid having 2 different enable
> paths with 2 different sysfs "enable" attributes. Doing it like this is a lot
> less code.
> 
Thought as much.
No-one cares about 'real' SCSI devices anymore ;-)

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-28 20:25                                     ` Martin K. Petersen
  2023-01-29  3:52                                       ` Damien Le Moal
@ 2023-01-30 19:13                                       ` Bart Van Assche
  2023-01-31  2:58                                         ` Martin K. Petersen
  1 sibling, 1 reply; 82+ messages in thread
From: Bart Van Assche @ 2023-01-30 19:13 UTC (permalink / raw)
  To: Martin K. Petersen, Damien Le Moal
  Cc: Niklas Cassel, Paolo Valente, Jens Axboe, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, linux-ide, linux-block

On 1/28/23 12:25, Martin K. Petersen wrote:
>   - Wrt. ioprio as conduit, I personally really dislike the idea of
>     conflating priority (relative performance wrt. other I/O) with CDL
>     (which is a QoS concept). I would really prefer those things to be
>     separate. However, I do think that the ioprio *interface* is a good
>     fit. A tool like ionice seems like a reasonable approach to letting
>     generic applications set their CDL.

Hi Martin,

My understanding is that ionice uses the ioprio_set() system call and 
hence only affects foreground I/O but not page cache writeback. This is 
why I introduced the ioprio rq-qos policy (block/blk-ioprio.c). How 
about not adding CDL support in ioprio_set() and only supporting 
configuration of CDL via the v2 cgroup mechanism?
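
For reference, this is roughly what the ioprio_set() path looks like from user
space. It is a minimal sketch: the EX_ constants are local copies of the
kernel's values, and the two helper functions are hypothetical names used only
for this example.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copies of constants from the kernel's ioprio definitions. */
#define EX_IOPRIO_CLASS_SHIFT	13
#define EX_IOPRIO_CLASS_BE	2
#define EX_IOPRIO_WHO_PROCESS	1

/* Build an ioprio value from a class and a level (0 to 7). */
static inline int ex_ioprio_prio_value(int prio_class, int level)
{
	return (prio_class << EX_IOPRIO_CLASS_SHIFT) | level;
}

/*
 * Set the best-effort IO priority of the caller (who = 0). Returns 0 on
 * success and -1 with errno set on failure. glibc provides no wrapper for
 * ioprio_set(), hence the raw syscall().
 */
static inline int ex_set_self_be_ioprio(int level)
{
	return (int)syscall(SYS_ioprio_set, EX_IOPRIO_WHO_PROCESS, 0,
			    ex_ioprio_prio_value(EX_IOPRIO_CLASS_BE, level));
}
```

The priority set this way is attached to the submitting task, which is why it
does not follow dirty pages that are written back later by another context.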

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-29  3:52                                       ` Damien Le Moal
  2023-01-30 13:44                                         ` Hannes Reinecke
@ 2023-01-30 19:24                                         ` Bart Van Assche
  2023-01-30 20:40                                         ` Bart Van Assche
  2023-01-31  2:49                                         ` Martin K. Petersen
  3 siblings, 0 replies; 82+ messages in thread
From: Bart Van Assche @ 2023-01-30 19:24 UTC (permalink / raw)
  To: Damien Le Moal, Martin K. Petersen
  Cc: Niklas Cassel, Paolo Valente, Jens Axboe, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, linux-ide, linux-block

On 1/28/23 19:52, Damien Le Moal wrote:
> diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
> index f70f2596a6bf..1d90349a19c9 100644
> +/*
> + * The 13-bits of ioprio data for each class provide up to 8 QOS hints and
> + * up to 8 priority levels.
> + */
> +#define IOPRIO_PRIO_LEVEL_MASK	(IOPRIO_NR_LEVELS - 1)
> +#define IOPRIO_QOS_HINT_SHIFT	10
> +#define IOPRIO_NR_QOS_HINTS	8
> +#define IOPRIO_QOS_HINT_MASK	(IOPRIO_NR_QOS_HINTS - 1)
> +#define IOPRIO_PRIO_LEVEL(ioprio)	((ioprio) & IOPRIO_PRIO_LEVEL_MASK)
> +#define IOPRIO_QOS_HINT(ioprio)	\
> +	(((ioprio) >> IOPRIO_QOS_HINT_SHIFT) & IOPRIO_QOS_HINT_MASK)
> +

Hi Damien,

How about the following approach?
* Do not add QoS support to the ioprio_set() system call since that
   system call only affects foreground I/O.
* Configure QoS via the v2 cgroup mechanism such that QoS policies are
   applied to both foreground and background I/O.

This approach allows using a different binary representation for I/O 
priorities and QoS in the bio.bi_ioprio field than the one supported by 
the ioprio_set() system call. This approach also allows using names 
(strings) for QoS settings instead of numbers in the interface between 
user space and the kernel if that would be considered desirable.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-29  3:52                                       ` Damien Le Moal
  2023-01-30 13:44                                         ` Hannes Reinecke
  2023-01-30 19:24                                         ` Bart Van Assche
@ 2023-01-30 20:40                                         ` Bart Van Assche
  2023-01-31  2:49                                         ` Martin K. Petersen
  3 siblings, 0 replies; 82+ messages in thread
From: Bart Van Assche @ 2023-01-30 20:40 UTC (permalink / raw)
  To: Damien Le Moal, Martin K. Petersen
  Cc: Niklas Cassel, Paolo Valente, Jens Axboe, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, linux-ide, linux-block

On 1/28/23 19:52, Damien Le Moal wrote:
> +/*
> + * The 13-bits of ioprio data for each class provide up to 8 QOS hints and
> + * up to 8 priority levels.
> + */
> +#define IOPRIO_PRIO_LEVEL_MASK	(IOPRIO_NR_LEVELS - 1)
> +#define IOPRIO_QOS_HINT_SHIFT	10
> +#define IOPRIO_NR_QOS_HINTS	8
> +#define IOPRIO_QOS_HINT_MASK	(IOPRIO_NR_QOS_HINTS - 1)
> +#define IOPRIO_PRIO_LEVEL(ioprio)	((ioprio) & IOPRIO_PRIO_LEVEL_MASK)
> +#define IOPRIO_QOS_HINT(ioprio)	\
> +	(((ioprio) >> IOPRIO_QOS_HINT_SHIFT) & IOPRIO_QOS_HINT_MASK)

Does the QoS level really have to be encoded in bio.bi_ioprio? How about 
introducing a new field in the existing hole in struct bio? From the 
pahole output:

struct bio {
         struct bio *               bi_next;            /*     0     4 */
         struct block_device *      bi_bdev;            /*     4     4 */
         blk_opf_t                  bi_opf;             /*     8     4 */
         short unsigned int         bi_flags;           /*    12     2 */
         short unsigned int         bi_ioprio;          /*    14     2 */
         blk_status_t               bi_status;          /*    16     1 */

         /* XXX 3 bytes hole, try to pack */

         atomic_t                   __bi_remaining;     /*    20     4 */
         struct bvec_iter           bi_iter;            /*    24    20 */
         blk_qc_t                   bi_cookie;          /*    44     4 */
         bio_end_io_t *             bi_end_io;          /*    48     4 */
         void *                     bi_private;         /*    52     4 */
         struct bio_crypt_ctx *     bi_crypt_context;   /*    56     4 */
[ ... ]
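
To make the hole argument concrete, here is a minimal standalone sketch. It is
not the kernel's struct bio: the two structs below are simplified stand-ins that
only reproduce the members around bi_status, and bi_ioqoshint is a hypothetical
field name taken from this discussion. It assumes a common ABI where a 32-bit
integer is 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the start of struct bio (not the kernel definition). */
struct bio_head {
	uint32_t bi_opf;
	uint16_t bi_flags;
	uint16_t bi_ioprio;
	uint8_t  bi_status;
	/* 3 bytes of padding here: the next member needs 4-byte alignment */
	int32_t  bi_remaining;
};

/* Same layout with a hypothetical 16-bit QoS hint placed in the hole. */
struct bio_head_hint {
	uint32_t bi_opf;
	uint16_t bi_flags;
	uint16_t bi_ioprio;
	uint8_t  bi_status;
	uint16_t bi_ioqoshint;	/* hypothetical field, name assumed */
	int32_t  bi_remaining;
};

/* The extra field should not grow the structure on such an ABI. */
_Static_assert(sizeof(struct bio_head_hint) == sizeof(struct bio_head),
	       "hint field should fit in the existing hole");
```

With this layout the 16-bit field lands on the 2-byte boundary inside the
3-byte hole, so a separate field costs no memory; whether it is preferable to
reusing the free ioprio bits is a separate design question.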

Thanks,

Bart.

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-29  3:52                                       ` Damien Le Moal
                                                           ` (2 preceding siblings ...)
  2023-01-30 20:40                                         ` Bart Van Assche
@ 2023-01-31  2:49                                         ` Martin K. Petersen
  2023-01-31  3:10                                           ` Damien Le Moal
  3 siblings, 1 reply; 82+ messages in thread
From: Martin K. Petersen @ 2023-01-31  2:49 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Martin K. Petersen, Bart Van Assche, Niklas Cassel,
	Paolo Valente, Jens Axboe, Christoph Hellwig, Hannes Reinecke,
	linux-scsi, linux-ide, linux-block


Damien,

> Makes sense. Though I think it will be hard to define a set of QoS
> hints that are useful for a wide range of applications, and even
> harder to convert the defined hint classes to CDL descriptors. I fear
> that we may end up with the same issues as IO hints/streams.

Hints mainly failed because non-Linux OSes had very different
expectations about how this was going to work. So that left device
vendors in a situation where they had to essentially support 3 different
approaches all implemented using the same protocol.

The challenge of being a general purpose OS is to come up with concepts
that are applicable in a variety of situations. Twiddling protocol
fields is the easy part.

I have a couple of experienced CDL users that I'd like to talk to and
try to get a better idea of what a suitable set of defaults might look
like.

> This hint applies to all priority classes and levels, that is, for the
> CDL case, we can enrich any priority with a hint that specifies the
> CDL index to use for an IO.

Yeah, I like that approach better.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-30 19:13                                       ` Bart Van Assche
@ 2023-01-31  2:58                                         ` Martin K. Petersen
  2023-01-31  3:03                                           ` Damien Le Moal
  0 siblings, 1 reply; 82+ messages in thread
From: Martin K. Petersen @ 2023-01-31  2:58 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Martin K. Petersen, Damien Le Moal, Niklas Cassel, Paolo Valente,
	Jens Axboe, Christoph Hellwig, Hannes Reinecke, linux-scsi,
	linux-ide, linux-block


Hi Bart!

> My understanding is that ionice uses the ioprio_set() system call and
> hence only affects foreground I/O but not page cache writeback. This
> is why I introduced the ioprio rq-qos policy (block/blk-ioprio.c). How
> about not adding CDL support in ioprio_set() and only supporting
> configuration of CDL via the v2 cgroup mechanism?

I suspect applications that care about CDL would probably not go through
the page cache. But I don't have a problem supporting cgroups at all.

Longer term, for the applications that I'm aware of that care about
this, we'd probably want to be able to specify things on a per-I/O basis
via io_uring, though.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-31  2:58                                         ` Martin K. Petersen
@ 2023-01-31  3:03                                           ` Damien Le Moal
  0 siblings, 0 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-31  3:03 UTC (permalink / raw)
  To: Martin K. Petersen, Bart Van Assche
  Cc: Niklas Cassel, Paolo Valente, Jens Axboe, Christoph Hellwig,
	Hannes Reinecke, linux-scsi, linux-ide, linux-block

On 1/31/23 11:58, Martin K. Petersen wrote:
> 
> Hi Bart!
> 
>> My understanding is that ionice uses the ioprio_set() system call and
>> hence only affects foreground I/O but not page cache writeback. This
>> is why I introduced the ioprio rq-qos policy (block/blk-ioprio.c). How
>> about not adding CDL support in ioprio_set() and only supporting
>> configuration of CDL via the v2 cgroup mechanism?
> 
> I suspect applications that care about CDL would probably not go through
> the page cache. But I don't have a problem supporting cgroups at all.
> 
> Longer term, for the applications that I'm aware of that care about
> this, we'd probably want to be able to specify things on a per-I/O basis
> via io_uring, though.

Absolutely agree here. As with the legacy ioprio, we need to be able to specify
this per I/O (io_uring or libaio) and per context (ioprio_set() or cgroups). I
see the per-I/O and cgroups APIs as complementary.

Many of the use cases we are seeing for CDL are transitions from ATA NCQ priority,
which relies on the RT class set per I/O through the aio_reqprio field with libaio
(or the sqe ioprio field with io_uring). I would like to keep continuity with this
to facilitate application development. Having cgroups as the only API would rule
out per-I/O control in asynchronous I/O application engines.

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v3 01/18] block: introduce duration-limits priority class
  2023-01-31  2:49                                         ` Martin K. Petersen
@ 2023-01-31  3:10                                           ` Damien Le Moal
  0 siblings, 0 replies; 82+ messages in thread
From: Damien Le Moal @ 2023-01-31  3:10 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Bart Van Assche, Niklas Cassel, Paolo Valente, Jens Axboe,
	Christoph Hellwig, Hannes Reinecke, linux-scsi, linux-ide,
	linux-block

On 1/31/23 11:49, Martin K. Petersen wrote:
> 
> Damien,
> 
>> Makes sense. Though I think it will be hard to define a set of QoS
>> hints that are useful for a wide range of applications, and even
>> harder to convert the defined hint classes to CDL descriptors. I fear
>> that we may end up with the same issues as IO hints/streams.
> 
> Hints mainly failed because non-Linux OSes had very different
> expectations about how this was going to work. So that left device
> vendors in a situation where they had to essentially support 3 different
> approaches all implemented using the same protocol.
> 
> The challenge of being a general purpose OS is to come up with concepts
> that are applicable in a variety of situations. Twiddling protocol
> fields is the easy part.
> 
> I have a couple of experienced CDL users that I'd like to talk to and
> try to get a better idea of what a suitable set of defaults might look
> like.
> 
>> This hint applies to all priority classes and levels, that is, for the
>> CDL case, we can enrich any priority with a hint that specifies the
>> CDL index to use for an IO.
> 
> Yeah, I like that approach better.

Of note is that even though the IOPRIO_XXX macros in include/uapi/linux/ioprio.h
assume a 16-bit value for the priority class + data, of which only 6 bits are
used (3 for the class, 3 for the level), all syscall and kernel-internal
interfaces define ioprio as an int. So we in fact have 32 bits to play with.
We could keep the lower 16 bits for ioprio as they are and use the upper 16 bits
for QoS hints. That gives more room than the 10 bits between the prio class and
level.

The only place that would need changing is struct bio, since bi_ioprio is defined
as an unsigned short. To solve this, as Bart suggested, we could add another
unsigned short in the struct bio hole for the QoS hints (bi_iohint or bi_ioqoshint).

But if we can define a sensible set of hints that covers at least CDL with the
10 free bits we have in the current ioprio, that would be even better I think
(fewer changes needed in the block layer).
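
As a rough illustration of the "free bits" variant, the sketch below packs a
class, a QoS hint and a level into one 16-bit value. The hint macros are copied
from the proposal quoted earlier in this thread (hint in bits 10-12) and are not
a merged kernel API; the class shift of 13 matches the existing ioprio layout,
and ioprio_pack() is a hypothetical helper added only for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Existing layout: class in bits 13-15, level in bits 0-2. */
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_NR_LEVELS	8
#define IOPRIO_PRIO_LEVEL_MASK	(IOPRIO_NR_LEVELS - 1)

/* Proposed (not merged) hint field: bits 10-12, as in the quoted patch. */
#define IOPRIO_QOS_HINT_SHIFT	10
#define IOPRIO_NR_QOS_HINTS	8
#define IOPRIO_QOS_HINT_MASK	(IOPRIO_NR_QOS_HINTS - 1)

#define IOPRIO_PRIO_CLASS(ioprio)	(((ioprio) >> IOPRIO_CLASS_SHIFT) & 0x7)
#define IOPRIO_PRIO_LEVEL(ioprio)	((ioprio) & IOPRIO_PRIO_LEVEL_MASK)
#define IOPRIO_QOS_HINT(ioprio)	\
	(((ioprio) >> IOPRIO_QOS_HINT_SHIFT) & IOPRIO_QOS_HINT_MASK)

/* Hypothetical helper: pack class, QoS hint (e.g. a CDL index) and level. */
static inline uint16_t ioprio_pack(unsigned int prio_class, unsigned int hint,
				   unsigned int level)
{
	/* Each field is 3 bits wide; reject anything that would overflow it. */
	assert(prio_class < 8);
	assert(hint < IOPRIO_NR_QOS_HINTS);
	assert(level < IOPRIO_NR_LEVELS);

	return (uint16_t)((prio_class << IOPRIO_CLASS_SHIFT) |
			  (hint << IOPRIO_QOS_HINT_SHIFT) | level);
}
```

For example, ioprio_pack(1, 3, 4) yields 0x2C04, and the three accessor macros
recover 1, 3 and 4 from that value without touching struct bio.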

-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v3 09/18] scsi: sd: handle read/write CDL timeout failures
  2023-01-27 15:34   ` Hannes Reinecke
  2023-01-28  0:06     ` Damien Le Moal
@ 2023-02-03 16:49     ` Niklas Cassel
  1 sibling, 0 replies; 82+ messages in thread
From: Niklas Cassel @ 2023-02-03 16:49 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: James E.J. Bottomley, Martin K. Petersen, Christoph Hellwig,
	Damien Le Moal, linux-scsi, linux-ide, linux-block

Hello Hannes,

On Fri, Jan 27, 2023 at 04:34:59PM +0100, Hannes Reinecke wrote:
> > @@ -691,6 +708,15 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
> >   		}
> >   		return SUCCESS;
> > +	case COMPLETED:
> > +		if (sshdr.asc == 0x55 && sshdr.ascq == 0x0a) {
> > +			set_scsi_ml_byte(scmd, SCSIML_STAT_DL_TIMEOUT);
> > +			req->cmd_flags |= REQ_FAILFAST_DEV;
> > +			req->rq_flags |= RQF_QUIET;
> > +			return SUCCESS;
> 
> You can kill this line, will be done anyway.

Thanks, we will fix in next revision.

> 
> > +		}
> > +		return SUCCESS;
> > +
> >   	default:
> >   		return SUCCESS;
> >   	}
> > @@ -785,6 +811,14 @@ static enum scsi_disposition scsi_eh_completed_normally(struct scsi_cmnd *scmd)
> >   	switch (get_status_byte(scmd)) {
> >   	case SAM_STAT_GOOD:
> >   		scsi_handle_queue_ramp_up(scmd->device);
> > +		if (scmd->sense_buffer && SCSI_SENSE_VALID(scmd))
> > +			/*
> > +			 * If we have sense data, call scsi_check_sense() in
> > +			 * order to set the correct SCSI ML byte (if any).
> > +			 * No point in checking the return value, since the
> > +			 * command has already completed successfully.
> > +			 */
> > +			scsi_check_sense(scmd);
> 
> I am every so slightly nervous here.
> We never checked the sense code for GOOD status, so heaven knows if there
> are devices out there which return something here.

Well, so far we have had quite the opposite problem: even allowing
sense data when the status is not CHECK CONDITION, since the two have
been very intertwined historically.

That alone makes me seriously doubt that this is going to be a problem.

But if there is a device that returns sense data when it shouldn't,
that is obviously a spec violation, and I assume that we would deal
with it in the same way we usually do, i.e. a device that does not
follow the spec has to be quirked. However, right now, I think that
it is sheer speculation that such a device even exists.

> And you have checked that we've cleared the sense code before submitting (or
> retrying, even), right?

Yes, scsi_queue_rq() does an unconditional memset(cmd->sense_buffer, ...).

A retried command will call scsi_queue_insert() -> blk_mq_requeue_request(),
which will eventually reach blk_mq_dispatch_rq_list(), which will again lead
to scsi_queue_rq() getting called (which will memset the sense_buffer).
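
Putting the pieces of this patch together, the GOOD-status path can be modelled
roughly as below. This is a standalone sketch, not the kernel code: the struct,
enums and function names are stand-ins invented for the example, and only the
values 0xf (COMPLETED sense key), 0x55/0x0a (ASC/ASCQ) and 0x05 (the ML byte in
the quoted hunk) are taken from the patch and the SCSI definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standalone model; these types are stand-ins, not the kernel's definitions. */
enum sense_key { SK_NO_SENSE = 0x0, SK_COMPLETED = 0xf };
enum ml_status { ML_NONE = 0x00, ML_DL_TIMEOUT = 0x05 };
enum blk_result { RES_OK, RES_DURATION_LIMIT };

struct cmd_model {
	bool sense_valid;		/* sense buffer holds valid data */
	uint8_t sense_key, asc, ascq;
	enum ml_status ml;		/* models the SCSI ML byte */
};

/* Mirrors the COMPLETED case: ASC/ASCQ 0x55/0x0a marks a duration limit timeout. */
static void check_sense(struct cmd_model *c)
{
	if (c->sense_key == SK_COMPLETED && c->asc == 0x55 && c->ascq == 0x0a)
		c->ml = ML_DL_TIMEOUT;
}

/* GOOD status path: only look at the sense buffer when it holds valid data. */
static enum blk_result complete_good(struct cmd_model *c)
{
	if (c->sense_valid)
		check_sense(c);
	return c->ml == ML_DL_TIMEOUT ? RES_DURATION_LIMIT : RES_OK;
}

/* Commands that hit a duration limit are never retried. */
static bool noretry(const struct cmd_model *c)
{
	return c->ml == ML_DL_TIMEOUT;
}
```

Because the sense buffer is cleared on every submission, a command that
completes normally has sense_valid false in this model and never reaches the
duration limit branch.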

> 
> >   		fallthrough;
> >   	case SAM_STAT_COMMAND_TERMINATED:
> >   		return SUCCESS;
> > @@ -1807,6 +1841,10 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
> >   		return !!(req->cmd_flags & REQ_FAILFAST_DRIVER);
> >   	}
> > +	/* Never retry commands aborted due to a duration limit timeout */
> > +	if (scsi_ml_byte(scmd->result) == SCSIML_STAT_DL_TIMEOUT)
> > +		return true;
> > +
> >   	if (!scsi_status_is_check_condition(scmd->result))
> >   		return false;
> > @@ -1966,6 +2004,14 @@ enum scsi_disposition scsi_decide_disposition(struct scsi_cmnd *scmd)
> >   		if (scmd->cmnd[0] == REPORT_LUNS)
> >   			scmd->device->sdev_target->expecting_lun_change = 0;
> >   		scsi_handle_queue_ramp_up(scmd->device);
> > +		if (scmd->sense_buffer && SCSI_SENSE_VALID(scmd))
> > +			/*
> > +			 * If we have sense data, call scsi_check_sense() in
> > +			 * order to set the correct SCSI ML byte (if any).
> > +			 * No point in checking the return value, since the
> > +			 * command has already completed successfully.
> > +			 */
> > +			scsi_check_sense(scmd);
> >   		fallthrough;
> >   	case SAM_STAT_COMMAND_TERMINATED:
> >   		return SUCCESS;
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index e1a021dd4da2..406952e72a68 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -600,6 +600,8 @@ static blk_status_t scsi_result_to_blk_status(int result)
> >   		return BLK_STS_MEDIUM;
> >   	case SCSIML_STAT_TGT_FAILURE:
> >   		return BLK_STS_TARGET;
> > +	case SCSIML_STAT_DL_TIMEOUT:
> > +		return BLK_STS_DURATION_LIMIT;
> >   	}
> >   	switch (host_byte(result)) {
> > @@ -797,6 +799,8 @@ static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result)
> >   				blk_stat = BLK_STS_ZONE_OPEN_RESOURCE;
> >   			}
> >   			break;
> > +		case COMPLETED:
> > +			fallthrough;
> >   		default:
> >   			action = ACTION_FAIL;
> >   			break;
> > diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
> > index 74324fba4281..f42388ecb024 100644
> > --- a/drivers/scsi/scsi_priv.h
> > +++ b/drivers/scsi/scsi_priv.h
> > @@ -27,6 +27,7 @@ enum scsi_ml_status {
> >   	SCSIML_STAT_NOSPC		= 0x02,	/* Space allocation on the dev failed */
> >   	SCSIML_STAT_MED_ERROR		= 0x03,	/* Medium error */
> >   	SCSIML_STAT_TGT_FAILURE		= 0x04,	/* Permanent target failure */
> > +	SCSIML_STAT_DL_TIMEOUT		= 0x05, /* Command Duration Limit timeout */
> >   };
> >   static inline u8 scsi_ml_byte(int result)
> 
> Cheers,
> 
> Hannes
> 

end of thread

Thread overview: 82+ messages
2023-01-24 19:02 [PATCH v3 00/18] Add Command Duration Limits support Niklas Cassel
2023-01-24 19:02 ` [PATCH v3 01/18] block: introduce duration-limits priority class Niklas Cassel
2023-01-24 19:27   ` Bart Van Assche
2023-01-24 20:36     ` Bart Van Assche
2023-01-24 21:48       ` Damien Le Moal
2023-01-24 21:29     ` Damien Le Moal
2023-01-24 22:43       ` Bart Van Assche
2023-01-24 22:59         ` Damien Le Moal
2023-01-25  0:05           ` Bart Van Assche
2023-01-25  1:19             ` Damien Le Moal
2023-01-25 18:37               ` Bart Van Assche
2023-01-25 21:23                 ` Niklas Cassel
2023-01-26  0:24                   ` Damien Le Moal
2023-01-26 13:53                     ` Niklas Cassel
2023-01-26 17:33                       ` Bart Van Assche
2023-01-27  0:18                         ` Damien Le Moal
2023-01-27  1:40                           ` Damien Le Moal
2023-01-27 17:23                             ` Bart Van Assche
2023-01-28  0:40                               ` Damien Le Moal
2023-01-28  0:47                                 ` Bart Van Assche
2023-01-28  0:59                                   ` Damien Le Moal
2023-01-28 20:25                                     ` Martin K. Petersen
2023-01-29  3:52                                       ` Damien Le Moal
2023-01-30 13:44                                         ` Hannes Reinecke
2023-01-30 14:55                                           ` Damien Le Moal
2023-01-30 19:24                                         ` Bart Van Assche
2023-01-30 20:40                                         ` Bart Van Assche
2023-01-31  2:49                                         ` Martin K. Petersen
2023-01-31  3:10                                           ` Damien Le Moal
2023-01-30 19:13                                       ` Bart Van Assche
2023-01-31  2:58                                         ` Martin K. Petersen
2023-01-31  3:03                                           ` Damien Le Moal
2023-01-25 23:11               ` Keith Busch
2023-01-26  0:08                 ` Damien Le Moal
2023-01-26  5:26                 ` Christoph Hellwig
2023-01-25  6:33         ` Christoph Hellwig
2023-01-27 12:43   ` Hannes Reinecke
2023-01-24 19:02 ` [PATCH v3 02/18] block: introduce BLK_STS_DURATION_LIMIT Niklas Cassel
2023-01-24 19:29   ` Bart Van Assche
2023-01-24 19:59     ` Keith Busch
2023-01-24 20:32       ` Bart Van Assche
2023-01-24 21:39         ` Damien Le Moal
2023-01-24 21:36       ` Damien Le Moal
2023-01-24 21:34     ` Damien Le Moal
2023-01-24 19:02 ` [PATCH v3 03/18] scsi: core: allow libata to complete successful commands via EH Niklas Cassel
2023-01-24 19:02 ` [PATCH v3 04/18] scsi: rename and move get_scsi_ml_byte() Niklas Cassel
2023-01-24 19:32   ` Bart Van Assche
2023-01-24 19:02 ` [PATCH v3 05/18] scsi: support retrieving sub-pages of mode pages Niklas Cassel
2023-01-24 19:34   ` Bart Van Assche
2023-01-24 19:02 ` [PATCH v3 06/18] scsi: support service action in scsi_report_opcode() Niklas Cassel
2023-01-24 19:36   ` Bart Van Assche
2023-01-24 19:02 ` [PATCH v3 07/18] scsi: sd: detect support for command duration limits Niklas Cassel
2023-01-24 19:39   ` Bart Van Assche
2023-01-27 13:00   ` Hannes Reinecke
2023-01-28  0:51     ` Damien Le Moal
2023-01-28  2:52       ` Bart Van Assche
2023-01-29  2:05         ` Damien Le Moal
2023-01-24 19:02 ` [PATCH v3 08/18] scsi: sd: set read/write commands CDL index Niklas Cassel
2023-01-27 15:30   ` Hannes Reinecke
2023-01-28  0:03     ` Damien Le Moal
2023-01-30 18:13       ` Hannes Reinecke
2023-01-24 19:02 ` [PATCH v3 09/18] scsi: sd: handle read/write CDL timeout failures Niklas Cassel
2023-01-27 15:34   ` Hannes Reinecke
2023-01-28  0:06     ` Damien Le Moal
2023-02-03 16:49     ` Niklas Cassel
2023-01-24 19:02 ` [PATCH v3 10/18] ata: libata-scsi: remove unnecessary !cmd checks Niklas Cassel
2023-01-27 15:35   ` Hannes Reinecke
2023-01-24 19:02 ` [PATCH v3 11/18] ata: libata: change ata_eh_request_sense() to not set CHECK_CONDITION Niklas Cassel
2023-01-27 15:36   ` Hannes Reinecke
2023-01-24 19:02 ` [PATCH v3 12/18] ata: libata: detect support for command duration limits Niklas Cassel
2023-01-24 19:02 ` [PATCH v3 13/18] ata: libata-scsi: handle CDL bits in ata_scsiop_maint_in() Niklas Cassel
2023-01-27 15:37   ` Hannes Reinecke
2023-01-24 19:03 ` [PATCH v3 14/18] ata: libata-scsi: add support for CDL pages mode sense Niklas Cassel
2023-01-27 15:38   ` Hannes Reinecke
2023-01-24 19:03 ` [PATCH v3 15/18] ata: libata: add ATA feature control sub-page translation Niklas Cassel
2023-01-27 15:40   ` Hannes Reinecke
2023-01-24 19:03 ` [PATCH v3 16/18] ata: libata: set read/write commands CDL index Niklas Cassel
2023-01-27 15:41   ` Hannes Reinecke
2023-01-24 19:03 ` [PATCH v3 17/18] ata: libata: handle completion of CDL commands using policy 0xD Niklas Cassel
2023-01-27 15:43   ` Hannes Reinecke
2023-01-24 19:03 ` [PATCH v3 18/18] Documentation: sysfs-block-device: document command duration limits Niklas Cassel
2023-01-27 15:43   ` Hannes Reinecke
