All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 00/15] Pass data temperature information to UFS devices
@ 2023-10-05 19:40 Bart Van Assche
  2023-10-05 19:40 ` [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits Bart Van Assche
                   ` (14 more replies)
  0 siblings, 15 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche

Hi Jens,

UFS vendors need the data lifetime information to achieve good performance.
Without this information there is significantly higher write amplification due
to garbage collection. Hence this patch series that add support in F2FS and
also in the block layer for data lifetime information. The SCSI disk (sd)
driver is modified such that it passes write hint information to SCSI devices
via the GROUP NUMBER field.

Please consider this patch series for the next merge window.

Thanks,

Bart.

Changes compared to v1:
- Use six bits from the ioprio field for data lifetime information. The
  bio->bi_write_hint / req->write_hint / iocb->ki_hint members that were
  introduced in v1 have been removed again.
- The F_GET_FILE_RW_HINT and F_SET_FILE_RW_HINT fcntls have been removed.
- In the SCSI disk (sd) driver, query the stream status and check the PERM bit.
- The GET STREAM STATUS command has been implemented in the scsi_debug driver.

Bart Van Assche (15):
  block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits
  blk-ioprio: Modify fewer bio->bi_ioprio bits
  block: Support data lifetime in the I/O priority bitfield
  fs: Restore write hint support
  fs/f2fs: Restore the whint_mode mount option
  scsi: core: Query the Block Limits Extension VPD page
  scsi_proto: Add structures and constants related to I/O groups and
    streams
  sd: Translate data lifetime information
  scsi_debug: Reduce code duplication
  scsi_debug: Support the block limits extension VPD page
  scsi_debug: Rework page code error handling
  scsi_debug: Rework subpage code error handling
  scsi_debug: Implement the IO Advice Hints Grouping mode page
  scsi_debug: Implement GET STREAM STATUS
  scsi_debug: Maintain write statistics per group number

 Documentation/filesystems/f2fs.rst |  70 ++++++++
 block/blk-ioprio.c                 |   9 +-
 block/blk-mq.c                     |   3 +-
 drivers/scsi/scsi.c                |   2 +
 drivers/scsi/scsi_debug.c          | 247 +++++++++++++++++++++--------
 drivers/scsi/scsi_sysfs.c          |  10 ++
 drivers/scsi/sd.c                  | 112 ++++++++++++-
 drivers/scsi/sd.h                  |   3 +
 fs/f2fs/data.c                     |   3 +
 fs/f2fs/f2fs.h                     |   9 ++
 fs/f2fs/segment.c                  |  95 +++++++++++
 fs/f2fs/super.c                    |  32 +++-
 fs/iomap/buffered-io.c             |   3 +
 fs/mpage.c                         |   2 +
 include/linux/fs-lifetime.h        |  20 +++
 include/linux/ioprio.h             |  10 ++
 include/scsi/scsi_device.h         |   1 +
 include/scsi/scsi_proto.h          |  75 +++++++++
 include/uapi/linux/ioprio.h        |   8 +-
 19 files changed, 639 insertions(+), 75 deletions(-)
 create mode 100644 include/linux/fs-lifetime.h


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-06  6:28   ` Kanchan Joshi
  2023-10-10  5:22   ` Kanchan Joshi
  2023-10-05 19:40 ` [PATCH v2 02/15] blk-ioprio: Modify " Bart Van Assche
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Damien Le Moal, Niklas Cassel

A later patch will store the data lifetime in the bio->bi_ioprio member
before bio_set_ioprio() is called. Make sure that bio_set_ioprio()
doesn't clear more bits than necessary.

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c         |  3 ++-
 include/linux/ioprio.h | 10 ++++++++++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e2d11183f62e..bc086660ffd3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2926,7 +2926,8 @@ static void bio_set_ioprio(struct bio *bio)
 {
 	/* Nobody set ioprio so far? Initialize it based on task's nice value */
 	if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
-		bio->bi_ioprio = get_current_ioprio();
+		ioprio_set_class_and_level(&bio->bi_ioprio,
+					   get_current_ioprio());
 	blkcg_set_ioprio(bio);
 }
 
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 7578d4f6a969..f2e768ab4b35 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -71,4 +71,14 @@ static inline int ioprio_check_cap(int ioprio)
 }
 #endif /* CONFIG_BLOCK */
 
+#define IOPRIO_CLASS_LEVEL_MASK ((IOPRIO_CLASS_MASK << IOPRIO_CLASS_SHIFT) | \
+				 (IOPRIO_LEVEL_MASK << 0))
+
+static inline void ioprio_set_class_and_level(u16 *prio, u16 class_level)
+{
+	WARN_ON_ONCE(class_level & ~IOPRIO_CLASS_LEVEL_MASK);
+	*prio &= ~IOPRIO_CLASS_LEVEL_MASK;
+	*prio |= class_level;
+}
+
 #endif

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 02/15] blk-ioprio: Modify fewer bio->bi_ioprio bits
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
  2023-10-05 19:40 ` [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-06  6:36   ` Kanchan Joshi
  2023-10-05 19:40 ` [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield Bart Van Assche
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Damien Le Moal, Niklas Cassel

A later patch will store the data lifetime in the bio->bi_ioprio member
before the blk-ioprio policy is applied. Make sure that this policy doesn't
clear more bits than necessary.

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-ioprio.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
index 4051fada01f1..2db86f153b6d 100644
--- a/block/blk-ioprio.c
+++ b/block/blk-ioprio.c
@@ -202,7 +202,8 @@ void blkcg_set_ioprio(struct bio *bio)
 		 * to achieve this.
 		 */
 		if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) != IOPRIO_CLASS_RT)
-			bio->bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4);
+			ioprio_set_class_and_level(&bio->bi_ioprio,
+					IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4));
 		return;
 	}
 
@@ -213,10 +214,10 @@ void blkcg_set_ioprio(struct bio *bio)
 	 * If the bio I/O priority equals IOPRIO_CLASS_NONE, the cgroup I/O
 	 * priority is assigned to the bio.
 	 */
-	prio = max_t(u16, bio->bi_ioprio,
+	prio = max_t(u16, bio->bi_ioprio & IOPRIO_CLASS_LEVEL_MASK,
 			IOPRIO_PRIO_VALUE(blkcg->prio_policy, 0));
-	if (prio > bio->bi_ioprio)
-		bio->bi_ioprio = prio;
+	if (prio > (bio->bi_ioprio & IOPRIO_CLASS_LEVEL_MASK))
+		ioprio_set_class_and_level(&bio->bi_ioprio, prio);
 }
 
 void blk_ioprio_exit(struct gendisk *disk)

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
  2023-10-05 19:40 ` [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits Bart Van Assche
  2023-10-05 19:40 ` [PATCH v2 02/15] blk-ioprio: Modify " Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-06  6:42   ` Kanchan Joshi
  2023-10-06  8:19   ` Damien Le Moal
  2023-10-05 19:40 ` [PATCH v2 04/15] fs: Restore write hint support Bart Van Assche
                   ` (11 subsequent siblings)
  14 siblings, 2 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Damien Le Moal, Niklas Cassel,
	Hannes Reinecke

The NVMe and SCSI standards define 64 different data lifetimes. Support
storing this information in the I/O priority bitfield.

The current allocation of the 16 bits in the I/O priority bitfield is as
follows:
* 15..13: I/O priority class
* 12..6: unused
* 5..3: I/O hint (CDL)
* 2..0: I/O priority level

This patch changes this into the following:
* 15..13: I/O priority class
* 12: unused
* 11..6: data lifetime
* 5..3: I/O hint (CDL)
* 2..0: I/O priority level

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 include/uapi/linux/ioprio.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index bee2bdb0eedb..efe9bc450872 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -71,7 +71,7 @@ enum {
  * class and level.
  */
 #define IOPRIO_HINT_SHIFT		IOPRIO_LEVEL_NR_BITS
-#define IOPRIO_HINT_NR_BITS		10
+#define IOPRIO_HINT_NR_BITS		3
 #define IOPRIO_NR_HINTS			(1 << IOPRIO_HINT_NR_BITS)
 #define IOPRIO_HINT_MASK		(IOPRIO_NR_HINTS - 1)
 #define IOPRIO_PRIO_HINT(ioprio)	\
@@ -102,6 +102,12 @@ enum {
 	IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7,
 };
 
+#define IOPRIO_LIFETIME_SHIFT		(IOPRIO_HINT_SHIFT + IOPRIO_HINT_NR_BITS)
+#define IOPRIO_LIFETIME_NR_BITS		6
+#define IOPRIO_LIFETIME_MASK		((1u << IOPRIO_LIFETIME_NR_BITS) - 1)
+#define IOPRIO_PRIO_LIFETIME(ioprio)					\
+	((ioprio >> IOPRIO_LIFETIME_SHIFT) & IOPRIO_LIFETIME_MASK)
+
 #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))
 
 /*

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 04/15] fs: Restore write hint support
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
                   ` (2 preceding siblings ...)
  2023-10-05 19:40 ` [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-10  5:42   ` Kanchan Joshi
  2023-10-16  6:20   ` Christoph Hellwig
  2023-10-05 19:40 ` [PATCH v2 05/15] fs/f2fs: Restore the whint_mode mount option Bart Van Assche
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Jaegeuk Kim, Chao Yu,
	Darrick J. Wong, Alexander Viro, Christian Brauner

This patch reverts a small subset of commit c75e707fe1aa ("block: remove
the per-bio/request write hint"). The following functionality has been
restored:
- In F2FS, store data lifetime information in struct bio.
- In fs/iomap and fs/mpage.c, restore the code that sets the data
  lifetime.

A new header file is introduced for the new bio_[sg]et_data_lifetime()
functions because there is no other header file yet that includes both
<linux/fs.h> and <linux/ioprio.h>.

The value WRITE_LIFE_NONE is mapped onto the data lifetime 0. This is
consistent with NVMe TPAR4093a. From that TPAR: "A value of 1h specifies
the shortest Data Lifetime. A value of 3Fh specifies the longest Data
Lifetime." This is also consistent with the SCSI specifications. From
T10 document 23-024r3: "0h: no relative lifetime is applicable; 1h:
shortest relative lifetime; ...; 3fh: longest relative lifetime".

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 fs/f2fs/data.c              |  3 +++
 fs/iomap/buffered-io.c      |  3 +++
 fs/mpage.c                  |  2 ++
 include/linux/fs-lifetime.h | 20 ++++++++++++++++++++
 4 files changed, 28 insertions(+)
 create mode 100644 include/linux/fs-lifetime.h

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 916e317ac925..2962cb335897 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -6,6 +6,7 @@
  *             http://www.samsung.com/
  */
 #include <linux/fs.h>
+#include <linux/fs-lifetime.h>
 #include <linux/f2fs_fs.h>
 #include <linux/buffer_head.h>
 #include <linux/sched/mm.h>
@@ -478,6 +479,8 @@ static struct bio *__bio_alloc(struct f2fs_io_info *fio, int npages)
 	} else {
 		bio->bi_end_io = f2fs_write_end_io;
 		bio->bi_private = sbi;
+		bio_set_data_lifetime(bio,
+			f2fs_io_type_to_rw_hint(sbi, fio->type, fio->temp));
 	}
 	iostat_alloc_and_bind_ctx(sbi, bio, NULL);
 
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 644479ccefbd..9bf05342ca65 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -6,6 +6,7 @@
 #include <linux/module.h>
 #include <linux/compiler.h>
 #include <linux/fs.h>
+#include <linux/fs-lifetime.h>
 #include <linux/iomap.h>
 #include <linux/pagemap.h>
 #include <linux/uio.h>
@@ -1660,6 +1661,7 @@ iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc,
 			       REQ_OP_WRITE | wbc_to_write_flags(wbc),
 			       GFP_NOFS, &iomap_ioend_bioset);
 	bio->bi_iter.bi_sector = sector;
+	bio_set_data_lifetime(bio, inode->i_write_hint);
 	wbc_init_bio(wbc, bio);
 
 	ioend = container_of(bio, struct iomap_ioend, io_inline_bio);
@@ -1690,6 +1692,7 @@ iomap_chain_bio(struct bio *prev)
 	new = bio_alloc(prev->bi_bdev, BIO_MAX_VECS, prev->bi_opf, GFP_NOFS);
 	bio_clone_blkg_association(new, prev);
 	new->bi_iter.bi_sector = bio_end_sector(prev);
+	bio_set_data_lifetime(new, bio_get_data_lifetime(prev));
 
 	bio_chain(prev, new);
 	bio_get(prev);		/* for iomap_finish_ioend */
diff --git a/fs/mpage.c b/fs/mpage.c
index 242e213ee064..888ca71c9ea7 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -20,6 +20,7 @@
 #include <linux/gfp.h>
 #include <linux/bio.h>
 #include <linux/fs.h>
+#include <linux/fs-lifetime.h>
 #include <linux/buffer_head.h>
 #include <linux/blkdev.h>
 #include <linux/highmem.h>
@@ -612,6 +613,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 				GFP_NOFS);
 		bio->bi_iter.bi_sector = blocks[0] << (blkbits - 9);
 		wbc_init_bio(wbc, bio);
+		bio_set_data_lifetime(bio, inode->i_write_hint);
 	}
 
 	/*
diff --git a/include/linux/fs-lifetime.h b/include/linux/fs-lifetime.h
new file mode 100644
index 000000000000..0e652e00cfab
--- /dev/null
+++ b/include/linux/fs-lifetime.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/bio.h>
+#include <linux/fs.h>
+#include <linux/ioprio.h>
+
+static inline enum rw_hint bio_get_data_lifetime(struct bio *bio)
+{
+	/* +1 to map 0 onto WRITE_LIFE_NONE. */
+	return IOPRIO_PRIO_LIFETIME(bio->bi_ioprio) + 1;
+}
+
+static inline void bio_set_data_lifetime(struct bio *bio, enum rw_hint lifetime)
+{
+	/* -1 to map WRITE_LIFE_NONE onto 0. */
+	if (lifetime != 0)
+		lifetime--;
+	WARN_ON_ONCE(lifetime & ~IOPRIO_LIFETIME_MASK);
+	bio->bi_ioprio &= ~(IOPRIO_LIFETIME_MASK << IOPRIO_LIFETIME_SHIFT);
+	bio->bi_ioprio |= lifetime << IOPRIO_LIFETIME_SHIFT;
+}

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 05/15] fs/f2fs: Restore the whint_mode mount option
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
                   ` (3 preceding siblings ...)
  2023-10-05 19:40 ` [PATCH v2 04/15] fs: Restore write hint support Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-05 19:40 ` [PATCH v2 06/15] scsi: core: Query the Block Limits Extension VPD page Bart Van Assche
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Avri Altman, Bean Huo, Jaegeuk Kim,
	Chao Yu, Jonathan Corbet

Restore support for the whint_mode mount option by reverting commit
930e2607638d ("f2fs: remove obsolete whint_mode").

Reviewed-by: Avri Altman <avri.altman@wdc.com>
Reviewed-by: Bean Huo <beanhuo@micron.com>
Reviewed-by: Daejun Park <daejun7.park@samsung.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 Documentation/filesystems/f2fs.rst | 70 ++++++++++++++++++++++
 fs/f2fs/f2fs.h                     |  9 +++
 fs/f2fs/segment.c                  | 95 ++++++++++++++++++++++++++++++
 fs/f2fs/super.c                    | 32 +++++++++-
 4 files changed, 205 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index d32c6209685d..de412ddebcc8 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -242,6 +242,12 @@ offgrpjquota		 Turn off group journalled quota.
 offprjjquota		 Turn off project journalled quota.
 quota			 Enable plain user disk quota accounting.
 noquota			 Disable all plain disk quota option.
+whint_mode=%s		 Control which write hints are passed down to block
+			 layer. This supports "off", "user-based", and
+			 "fs-based".  In "off" mode (default), f2fs does not pass
+			 down hints. In "user-based" mode, f2fs tries to pass
+			 down hints given by users. And in "fs-based" mode, f2fs
+			 passes down hints with its policy.
 alloc_mode=%s		 Adjust block allocation policy, which supports "reuse"
 			 and "default".
 fsync_mode=%s		 Control the policy of fsync. Currently supports "posix",
@@ -776,6 +782,70 @@ In order to identify whether the data in the victim segment are valid or not,
 F2FS manages a bitmap. Each bit represents the validity of a block, and the
 bitmap is composed of a bit stream covering whole blocks in main area.
 
+Write-hint Policy
+-----------------
+
+1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.
+
+2) whint_mode=user-based. F2FS tries to pass down hints given by
+users.
+
+===================== ======================== ===================
+User                  F2FS                     Block
+===================== ======================== ===================
+N/A                   META                     WRITE_LIFE_NOT_SET
+N/A                   HOT_NODE                 "
+N/A                   WARM_NODE                "
+N/A                   COLD_NODE                "
+ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
+extension list        "                        "
+
+-- buffered io
+WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
+WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
+WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
+WRITE_LIFE_NONE       "                        "
+WRITE_LIFE_MEDIUM     "                        "
+WRITE_LIFE_LONG       "                        "
+
+-- direct io
+WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
+WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
+WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
+WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
+WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
+WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG
+===================== ======================== ===================
+
+3) whint_mode=fs-based. F2FS passes down hints with its policy.
+
+===================== ======================== ===================
+User                  F2FS                     Block
+===================== ======================== ===================
+N/A                   META                     WRITE_LIFE_MEDIUM;
+N/A                   HOT_NODE                 WRITE_LIFE_NOT_SET
+N/A                   WARM_NODE                "
+N/A                   COLD_NODE                WRITE_LIFE_NONE
+ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
+extension list        "                        "
+
+-- buffered io
+WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
+WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
+WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_LONG
+WRITE_LIFE_NONE       "                        "
+WRITE_LIFE_MEDIUM     "                        "
+WRITE_LIFE_LONG       "                        "
+
+-- direct io
+WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
+WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
+WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
+WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
+WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
+WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG
+===================== ======================== ===================
+
 Fallocate(2) Policy
 -------------------
 
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 6d688e42d89c..39ffad5c4087 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -157,6 +157,7 @@ struct f2fs_mount_info {
 	int s_jquota_fmt;			/* Format of quota to use */
 #endif
 	/* For which write hints are passed down to block layer */
+	int whint_mode;
 	int alloc_mode;			/* segment allocation policy */
 	int fsync_mode;			/* fsync policy */
 	int fs_mode;			/* fs mode: LFS or ADAPTIVE */
@@ -1343,6 +1344,12 @@ enum {
 	FS_MODE_FRAGMENT_BLK,		/* block fragmentation mode */
 };
 
+enum {
+	WHINT_MODE_OFF,		/* not pass down write hints */
+	WHINT_MODE_USER,	/* try to pass down hints given by users */
+	WHINT_MODE_FS,		/* pass down hints with F2FS policy */
+};
+
 enum {
 	ALLOC_MODE_DEFAULT,	/* stay default */
 	ALLOC_MODE_REUSE,	/* reuse segments as much as possible */
@@ -3727,6 +3734,8 @@ void f2fs_destroy_segment_manager(struct f2fs_sb_info *sbi);
 int __init f2fs_create_segment_manager_caches(void);
 void f2fs_destroy_segment_manager_caches(void);
 int f2fs_rw_hint_to_seg_type(enum rw_hint hint);
+enum rw_hint f2fs_io_type_to_rw_hint(struct f2fs_sb_info *sbi,
+			enum page_type type, enum temp_type temp);
 unsigned int f2fs_usable_segs_in_sec(struct f2fs_sb_info *sbi,
 			unsigned int segno);
 unsigned int f2fs_usable_blks_in_seg(struct f2fs_sb_info *sbi,
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index d05b41608fc0..38c0cb8d9571 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -3290,6 +3290,101 @@ int f2fs_rw_hint_to_seg_type(enum rw_hint hint)
 	}
 }
 
+/* This returns write hints for each segment type. This hints will be
+ * passed down to block layer. There are mapping tables which depend on
+ * the mount option 'whint_mode'.
+ *
+ * 1) whint_mode=off. F2FS only passes down WRITE_LIFE_NOT_SET.
+ *
+ * 2) whint_mode=user-based. F2FS tries to pass down hints given by users.
+ *
+ * User                  F2FS                     Block
+ * ----                  ----                     -----
+ *                       META                     WRITE_LIFE_NOT_SET
+ *                       HOT_NODE                 "
+ *                       WARM_NODE                "
+ *                       COLD_NODE                "
+ * ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
+ * extension list        "                        "
+ *
+ * -- buffered io
+ * WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
+ * WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
+ * WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
+ * WRITE_LIFE_NONE       "                        "
+ * WRITE_LIFE_MEDIUM     "                        "
+ * WRITE_LIFE_LONG       "                        "
+ *
+ * -- direct io
+ * WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
+ * WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
+ * WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
+ * WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
+ * WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
+ * WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG
+ *
+ * 3) whint_mode=fs-based. F2FS passes down hints with its policy.
+ *
+ * User                  F2FS                     Block
+ * ----                  ----                     -----
+ *                       META                     WRITE_LIFE_MEDIUM;
+ *                       HOT_NODE                 WRITE_LIFE_NOT_SET
+ *                       WARM_NODE                "
+ *                       COLD_NODE                WRITE_LIFE_NONE
+ * ioctl(COLD)           COLD_DATA                WRITE_LIFE_EXTREME
+ * extension list        "                        "
+ *
+ * -- buffered io
+ * WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
+ * WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
+ * WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_LONG
+ * WRITE_LIFE_NONE       "                        "
+ * WRITE_LIFE_MEDIUM     "                        "
+ * WRITE_LIFE_LONG       "                        "
+ *
+ * -- direct io
+ * WRITE_LIFE_EXTREME    COLD_DATA                WRITE_LIFE_EXTREME
+ * WRITE_LIFE_SHORT      HOT_DATA                 WRITE_LIFE_SHORT
+ * WRITE_LIFE_NOT_SET    WARM_DATA                WRITE_LIFE_NOT_SET
+ * WRITE_LIFE_NONE       "                        WRITE_LIFE_NONE
+ * WRITE_LIFE_MEDIUM     "                        WRITE_LIFE_MEDIUM
+ * WRITE_LIFE_LONG       "                        WRITE_LIFE_LONG
+ */
+
+enum rw_hint f2fs_io_type_to_rw_hint(struct f2fs_sb_info *sbi,
+				enum page_type type, enum temp_type temp)
+{
+	if (F2FS_OPTION(sbi).whint_mode == WHINT_MODE_USER) {
+		if (type == DATA) {
+			if (temp == WARM)
+				return WRITE_LIFE_NOT_SET;
+			else if (temp == HOT)
+				return WRITE_LIFE_SHORT;
+			else if (temp == COLD)
+				return WRITE_LIFE_EXTREME;
+		} else {
+			return WRITE_LIFE_NOT_SET;
+		}
+	} else if (F2FS_OPTION(sbi).whint_mode == WHINT_MODE_FS) {
+		if (type == DATA) {
+			if (temp == WARM)
+				return WRITE_LIFE_LONG;
+			else if (temp == HOT)
+				return WRITE_LIFE_SHORT;
+			else if (temp == COLD)
+				return WRITE_LIFE_EXTREME;
+		} else if (type == NODE) {
+			if (temp == WARM || temp == HOT)
+				return WRITE_LIFE_NOT_SET;
+			else if (temp == COLD)
+				return WRITE_LIFE_NONE;
+		} else if (type == META) {
+			return WRITE_LIFE_MEDIUM;
+		}
+	}
+	return WRITE_LIFE_NOT_SET;
+}
+
 static int __get_segment_type_2(struct f2fs_io_info *fio)
 {
 	if (fio->type == DATA)
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index a8c8232852bb..5bb062075acf 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -141,6 +141,7 @@ enum {
 	Opt_jqfmt_vfsold,
 	Opt_jqfmt_vfsv0,
 	Opt_jqfmt_vfsv1,
+	Opt_whint,
 	Opt_alloc,
 	Opt_fsync,
 	Opt_test_dummy_encryption,
@@ -220,6 +221,7 @@ static match_table_t f2fs_tokens = {
 	{Opt_jqfmt_vfsold, "jqfmt=vfsold"},
 	{Opt_jqfmt_vfsv0, "jqfmt=vfsv0"},
 	{Opt_jqfmt_vfsv1, "jqfmt=vfsv1"},
+	{Opt_whint, "whint_mode=%s"},
 	{Opt_alloc, "alloc_mode=%s"},
 	{Opt_fsync, "fsync_mode=%s"},
 	{Opt_test_dummy_encryption, "test_dummy_encryption=%s"},
@@ -988,6 +990,22 @@ static int parse_options(struct super_block *sb, char *options, bool is_remount)
 			f2fs_info(sbi, "quota operations not supported");
 			break;
 #endif
+		case Opt_whint:
+			name = match_strdup(&args[0]);
+			if (!name)
+				return -ENOMEM;
+			if (!strcmp(name, "user-based")) {
+				F2FS_OPTION(sbi).whint_mode = WHINT_MODE_USER;
+			} else if (!strcmp(name, "off")) {
+				F2FS_OPTION(sbi).whint_mode = WHINT_MODE_OFF;
+			} else if (!strcmp(name, "fs-based")) {
+				F2FS_OPTION(sbi).whint_mode = WHINT_MODE_FS;
+			} else {
+				kfree(name);
+				return -EINVAL;
+			}
+			kfree(name);
+			break;
 		case Opt_alloc:
 			name = match_strdup(&args[0]);
 			if (!name)
@@ -1389,6 +1407,12 @@ static int parse_options(struct super_block *sb, char *options, bool is_remount)
 		return -EINVAL;
 	}
 
+	/* Not pass down write hints if the number of active logs is lesser
+	 * than NR_CURSEG_PERSIST_TYPE.
+	 */
+	if (F2FS_OPTION(sbi).active_logs != NR_CURSEG_PERSIST_TYPE)
+		F2FS_OPTION(sbi).whint_mode = WHINT_MODE_OFF;
+
 	if (f2fs_sb_has_readonly(sbi) && !f2fs_readonly(sbi->sb)) {
 		f2fs_err(sbi, "Allow to mount readonly mode only");
 		return -EROFS;
@@ -2060,6 +2084,10 @@ static int f2fs_show_options(struct seq_file *seq, struct dentry *root)
 		seq_puts(seq, ",prjquota");
 #endif
 	f2fs_show_quota_options(seq, sbi->sb);
+	if (F2FS_OPTION(sbi).whint_mode == WHINT_MODE_USER)
+		seq_printf(seq, ",whint_mode=%s", "user-based");
+	else if (F2FS_OPTION(sbi).whint_mode == WHINT_MODE_FS)
+		seq_printf(seq, ",whint_mode=%s", "fs-based");
 
 	fscrypt_show_test_dummy_encryption(seq, ',', sbi->sb);
 
@@ -2129,6 +2157,7 @@ static void default_options(struct f2fs_sb_info *sbi, bool remount)
 		F2FS_OPTION(sbi).active_logs = NR_CURSEG_PERSIST_TYPE;
 
 	F2FS_OPTION(sbi).inline_xattr_size = DEFAULT_INLINE_XATTR_ADDRS;
+	F2FS_OPTION(sbi).whint_mode = WHINT_MODE_OFF;
 	if (le32_to_cpu(F2FS_RAW_SUPER(sbi)->segment_count_main) <=
 							SMALL_VOLUME_SEGMENTS)
 		F2FS_OPTION(sbi).alloc_mode = ALLOC_MODE_REUSE;
@@ -2443,7 +2472,8 @@ static int f2fs_remount(struct super_block *sb, int *flags, char *data)
 		need_stop_gc = true;
 	}
 
-	if (*flags & SB_RDONLY) {
+	if (*flags & SB_RDONLY ||
+	    F2FS_OPTION(sbi).whint_mode != org_mount_opt.whint_mode) {
 		sync_inodes_sb(sb);
 
 		set_sbi_flag(sbi, SBI_IS_DIRTY);

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 06/15] scsi: core: Query the Block Limits Extension VPD page
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
                   ` (4 preceding siblings ...)
  2023-10-05 19:40 ` [PATCH v2 05/15] fs/f2fs: Restore the whint_mode mount option Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-05 19:40 ` [PATCH v2 07/15] scsi_proto: Add structures and constants related to I/O groups and streams Bart Van Assche
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Avri Altman, James E.J. Bottomley

Parse the Reduced Stream Control Supported (RSCS) bit from the block
limits extension VPD page. The RSCS bit is defined in T10 document
"SBC-5 Constrained Streams with Data Lifetimes"
(https://www.t10.org/cgi-bin/ac.pl?t=d&f=23-024r3.pdf).

Reviewed-by: Avri Altman <avri.altman@wdc.com>
Reviewed-by: Daejun Park <daejun7.park@samsung.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi.c        |  2 ++
 drivers/scsi/scsi_sysfs.c  | 10 ++++++++++
 drivers/scsi/sd.c          | 13 +++++++++++++
 drivers/scsi/sd.h          |  1 +
 include/scsi/scsi_device.h |  1 +
 5 files changed, 27 insertions(+)

diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
index d0911bc28663..5ad291770806 100644
--- a/drivers/scsi/scsi.c
+++ b/drivers/scsi/scsi.c
@@ -499,6 +499,8 @@ void scsi_attach_vpd(struct scsi_device *sdev)
 			scsi_update_vpd_page(sdev, 0xb1, &sdev->vpd_pgb1);
 		if (vpd_buf->data[i] == 0xb2)
 			scsi_update_vpd_page(sdev, 0xb2, &sdev->vpd_pgb2);
+		if (vpd_buf->data[i] == 0xb7)
+			scsi_update_vpd_page(sdev, 0xb7, &sdev->vpd_pgb7);
 	}
 	kfree(vpd_buf);
 }
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index 24f6eefb6803..93652a786a46 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -449,6 +449,7 @@ static void scsi_device_dev_release(struct device *dev)
 	struct scsi_vpd *vpd_pg80 = NULL, *vpd_pg83 = NULL;
 	struct scsi_vpd *vpd_pg0 = NULL, *vpd_pg89 = NULL;
 	struct scsi_vpd *vpd_pgb0 = NULL, *vpd_pgb1 = NULL, *vpd_pgb2 = NULL;
+	struct scsi_vpd *vpd_pgb7 = NULL;
 	unsigned long flags;
 
 	might_sleep();
@@ -494,6 +495,8 @@ static void scsi_device_dev_release(struct device *dev)
 				       lockdep_is_held(&sdev->inquiry_mutex));
 	vpd_pgb2 = rcu_replace_pointer(sdev->vpd_pgb2, vpd_pgb2,
 				       lockdep_is_held(&sdev->inquiry_mutex));
+	vpd_pgb7 = rcu_replace_pointer(sdev->vpd_pgb7, vpd_pgb7,
+				       lockdep_is_held(&sdev->inquiry_mutex));
 	mutex_unlock(&sdev->inquiry_mutex);
 
 	if (vpd_pg0)
@@ -510,6 +513,8 @@ static void scsi_device_dev_release(struct device *dev)
 		kfree_rcu(vpd_pgb1, rcu);
 	if (vpd_pgb2)
 		kfree_rcu(vpd_pgb2, rcu);
+	if (vpd_pgb7)
+		kfree_rcu(vpd_pgb7, rcu);
 	kfree(sdev->inquiry);
 	kfree(sdev);
 
@@ -921,6 +926,7 @@ sdev_vpd_pg_attr(pg89);
 sdev_vpd_pg_attr(pgb0);
 sdev_vpd_pg_attr(pgb1);
 sdev_vpd_pg_attr(pgb2);
+sdev_vpd_pg_attr(pgb7);
 sdev_vpd_pg_attr(pg0);
 
 static ssize_t show_inquiry(struct file *filep, struct kobject *kobj,
@@ -1295,6 +1301,9 @@ static umode_t scsi_sdev_bin_attr_is_visible(struct kobject *kobj,
 	if (attr == &dev_attr_vpd_pgb2 && !sdev->vpd_pgb2)
 		return 0;
 
+	if (attr == &dev_attr_vpd_pgb7 && !sdev->vpd_pgb7)
+		return 0;
+
 	return S_IRUGO;
 }
 
@@ -1347,6 +1356,7 @@ static struct bin_attribute *scsi_sdev_bin_attrs[] = {
 	&dev_attr_vpd_pgb0,
 	&dev_attr_vpd_pgb1,
 	&dev_attr_vpd_pgb2,
+	&dev_attr_vpd_pgb7,
 	&dev_attr_inquiry,
 	NULL
 };
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index c92a317ba547..879edbc1a065 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3019,6 +3019,18 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 	rcu_read_unlock();
 }
 
+/* Parse the Block Limits Extension VPD page (0xb7) */
+static void sd_read_block_limits_ext(struct scsi_disk *sdkp)
+{
+	struct scsi_vpd *vpd;
+
+	rcu_read_lock();
+	vpd = rcu_dereference(sdkp->device->vpd_pgb7);
+	if (vpd && vpd->len >= 2)
+		sdkp->rscs = vpd->data[5] & 1;
+	rcu_read_unlock();
+}
+
 /**
  * sd_read_block_characteristics - Query block dev. characteristics
  * @sdkp: disk to query
@@ -3373,6 +3385,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
 		if (scsi_device_supports_vpd(sdp)) {
 			sd_read_block_provisioning(sdkp);
 			sd_read_block_limits(sdkp);
+			sd_read_block_limits_ext(sdkp);
 			sd_read_block_characteristics(sdkp);
 			sd_zbc_read_zones(sdkp, buffer);
 			sd_read_cpr(sdkp);
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 5eea762f84d1..84685168b6e0 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -150,6 +150,7 @@ struct scsi_disk {
 	unsigned	urswrz : 1;
 	unsigned	security : 1;
 	unsigned	ignore_medium_access_errors : 1;
+	bool		rscs : 1; /* reduced stream control support */
 };
 #define to_scsi_disk(obj) container_of(obj, struct scsi_disk, disk_dev)
 
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index b9230b6add04..2dd96ae101e1 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -153,6 +153,7 @@ struct scsi_device {
 	struct scsi_vpd __rcu *vpd_pgb0;
 	struct scsi_vpd __rcu *vpd_pgb1;
 	struct scsi_vpd __rcu *vpd_pgb2;
+	struct scsi_vpd __rcu *vpd_pgb7;
 
 	struct scsi_target      *sdev_target;
 

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 07/15] scsi_proto: Add structures and constants related to I/O groups and streams
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
                   ` (5 preceding siblings ...)
  2023-10-05 19:40 ` [PATCH v2 06/15] scsi: core: Query the Block Limits Extension VPD page Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-05 19:40 ` [PATCH v2 08/15] sd: Translate data lifetime information Bart Van Assche
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, James E.J. Bottomley

Prepare for adding code that will query the I/O advice hints group
descriptors and for adding code that will retrieve the stream status.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 include/scsi/scsi_proto.h | 75 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 07d65c1f59db..9ee4983c23b4 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -10,6 +10,7 @@
 #ifndef _SCSI_PROTO_H_
 #define _SCSI_PROTO_H_
 
+#include <linux/build_bug.h>
 #include <linux/types.h>
 
 /*
@@ -126,6 +127,7 @@
 #define	SAI_READ_CAPACITY_16  0x10
 #define SAI_GET_LBA_STATUS    0x12
 #define SAI_REPORT_REFERRALS  0x13
+#define SAI_GET_STREAM_STATUS 0x16
 /* values for maintenance in */
 #define MI_REPORT_IDENTIFYING_INFORMATION 0x05
 #define MI_REPORT_TARGET_PGS  0x0a
@@ -275,6 +277,79 @@ struct scsi_lun {
 	__u8 scsi_lun[8];
 };
 
+/* SBC-5 IO advice hints group descriptor */
+struct scsi_io_group_descriptor {
+#if defined(__BIG_ENDIAN)
+	u8 io_advice_hints_mode: 2;
+	u8 reserved1: 3;
+	u8 st_enble: 1;
+	u8 cs_enble: 1;
+	u8 ic_enable: 1;
+#elif defined(__LITTLE_ENDIAN)
+	u8 ic_enable: 1;
+	u8 cs_enble: 1;
+	u8 st_enble: 1;
+	u8 reserved1: 3;
+	u8 io_advice_hints_mode: 2;
+#else
+#error
+#endif
+	u8 reserved2[3];
+	/* Logical block markup descriptor */
+#if defined(__BIG_ENDIAN)
+	u8 acdlu: 1;
+	u8 reserved3: 1;
+	u8 rlbsr: 2;
+	u8 lbm_descriptor_type: 4;
+#elif defined(__LITTLE_ENDIAN)
+	u8 lbm_descriptor_type: 4;
+	u8 rlbsr: 2;
+	u8 reserved3: 1;
+	u8 acdlu: 1;
+#else
+#error
+#endif
+	u8 params[2];
+	u8 reserved4;
+	u8 reserved5[8];
+};
+
+static_assert(sizeof(struct scsi_io_group_descriptor) == 16);
+
+struct scsi_stream_status {
+#if defined(__BIG_ENDIAN)
+	u16 perm: 1;
+	u16 reserved1: 15;
+#elif defined(__LITTLE_ENDIAN)
+	u16 reserved1: 15;
+	u16 perm: 1;
+#else
+#error
+#endif
+	__be16 stream_identifier;
+#if defined(__BIG_ENDIAN)
+	u8 reserved2: 2;
+	u8 rel_lifetime: 6;
+#elif defined(__LITTLE_ENDIAN)
+	u8 rel_lifetime: 6;
+	u8 reserved2: 2;
+#else
+#error
+#endif
+	u8 reserved3[3];
+};
+
+static_assert(sizeof(struct scsi_stream_status) == 8);
+
+struct scsi_stream_status_header {
+	__be32 len;	/* length in bytes of stream_status[] array. */
+	u16 reserved;
+	u16 number_of_open_streams;
+	DECLARE_FLEX_ARRAY(struct scsi_stream_status, stream_status);
+};
+
+static_assert(sizeof(struct scsi_stream_status_header) == 8);
+
 /* SPC asymmetric access states */
 #define SCSI_ACCESS_STATE_OPTIMAL     0x00
 #define SCSI_ACCESS_STATE_ACTIVE      0x01

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 08/15] sd: Translate data lifetime information
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
                   ` (6 preceding siblings ...)
  2023-10-05 19:40 ` [PATCH v2 07/15] scsi_proto: Add structures and constants related to I/O groups and streams Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-05 19:40 ` [PATCH v2 09/15] scsi_debug: Reduce code duplication Bart Van Assche
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Damien Le Moal,
	James E.J. Bottomley

Recently T10 standardized SBC constrained streams. This mechanism allows
to pass data lifetime information to SCSI devices in the group number
field. Add support for translating write hint information into a
permanent stream number in the sd driver. Use WRITE(10) instead of
WRITE(6) if data lifetime information is present because the WRITE(6)
command does not have a GROUP NUMBER field.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/sd.c | 99 +++++++++++++++++++++++++++++++++++++++++++++--
 drivers/scsi/sd.h |  2 +
 2 files changed, 98 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 879edbc1a065..5d4b6c8cfd50 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -47,6 +47,7 @@
 #include <linux/blkpg.h>
 #include <linux/blk-pm.h>
 #include <linux/delay.h>
+#include <linux/fs-lifetime.h>
 #include <linux/major.h>
 #include <linux/mutex.h>
 #include <linux/string_helpers.h>
@@ -1001,12 +1002,38 @@ static blk_status_t sd_setup_flush_cmnd(struct scsi_cmnd *cmd)
 	return BLK_STS_OK;
 }
 
+/**
+ * sd_group_number() - Compute the GROUP NUMBER field
+ * @cmd: SCSI command for which to compute the value of the six-bit GROUP NUMBER
+ *	field.
+ *
+ * From "SBC-5 Constrained Streams with Data Lifetimes"
+ * (https://www.t10.org/cgi-bin/ac.pl?t=d&f=23-024r3.pdf):
+ * 0: no relative lifetime.
+ * 1: shortest relative lifetime.
+ * 2: second shortest relative lifetime.
+ * 3 - 0x3d: intermediate relative lifetimes.
+ * 0x3e: second longest relative lifetime.
+ * 0x3f: longest relative lifetime.
+ */
+static u8 sd_group_number(struct scsi_cmnd *cmd)
+{
+	const struct request *rq = scsi_cmd_to_rq(cmd);
+	struct scsi_disk *sdkp = scsi_disk(rq->q->disk);
+	const u32 max_gn = min_t(u32, sdkp->permanent_stream_count, 0x3f);
+
+	if (!sdkp->rscs)
+		return 0;
+	return min(IOPRIO_PRIO_LIFETIME(rq->ioprio), max_gn);
+}
+
 static blk_status_t sd_setup_rw32_cmnd(struct scsi_cmnd *cmd, bool write,
 				       sector_t lba, unsigned int nr_blocks,
 				       unsigned char flags, unsigned int dld)
 {
 	cmd->cmd_len = SD_EXT_CDB_SIZE;
 	cmd->cmnd[0]  = VARIABLE_LENGTH_CMD;
+	cmd->cmnd[6]  = sd_group_number(cmd);
 	cmd->cmnd[7]  = 0x18; /* Additional CDB len */
 	cmd->cmnd[9]  = write ? WRITE_32 : READ_32;
 	cmd->cmnd[10] = flags;
@@ -1025,7 +1052,7 @@ static blk_status_t sd_setup_rw16_cmnd(struct scsi_cmnd *cmd, bool write,
 	cmd->cmd_len  = 16;
 	cmd->cmnd[0]  = write ? WRITE_16 : READ_16;
 	cmd->cmnd[1]  = flags | ((dld >> 2) & 0x01);
-	cmd->cmnd[14] = (dld & 0x03) << 6;
+	cmd->cmnd[14] = ((dld & 0x03) << 6) | sd_group_number(cmd);
 	cmd->cmnd[15] = 0;
 	put_unaligned_be64(lba, &cmd->cmnd[2]);
 	put_unaligned_be32(nr_blocks, &cmd->cmnd[10]);
@@ -1040,7 +1067,7 @@ static blk_status_t sd_setup_rw10_cmnd(struct scsi_cmnd *cmd, bool write,
 	cmd->cmd_len = 10;
 	cmd->cmnd[0] = write ? WRITE_10 : READ_10;
 	cmd->cmnd[1] = flags;
-	cmd->cmnd[6] = 0;
+	cmd->cmnd[6] = sd_group_number(cmd);
 	cmd->cmnd[9] = 0;
 	put_unaligned_be32(lba, &cmd->cmnd[2]);
 	put_unaligned_be16(nr_blocks, &cmd->cmnd[7]);
@@ -1177,7 +1204,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 		ret = sd_setup_rw16_cmnd(cmd, write, lba, nr_blocks,
 					 protect | fua, dld);
 	} else if ((nr_blocks > 0xff) || (lba > 0x1fffff) ||
-		   sdp->use_10_for_rw || protect) {
+		   sdp->use_10_for_rw || protect ||
+		   IOPRIO_PRIO_LIFETIME(rq->ioprio)) {
 		ret = sd_setup_rw10_cmnd(cmd, write, lba, nr_blocks,
 					 protect | fua);
 	} else {
@@ -2912,6 +2940,70 @@ sd_read_cache_type(struct scsi_disk *sdkp, unsigned char *buffer)
 	sdkp->DPOFUA = 0;
 }
 
+static bool sd_is_perm_stream(struct scsi_disk *sdkp, unsigned stream_id)
+{
+	u8 cdb[16] = { SERVICE_ACTION_IN_16, SAI_GET_STREAM_STATUS };
+	struct {
+		struct scsi_stream_status_header h;
+		struct scsi_stream_status s;
+	} buf;
+	struct scsi_device *sdev = sdkp->device;
+	struct scsi_sense_hdr sshdr;
+	const struct scsi_exec_args exec_args = {
+		.sshdr = &sshdr,
+	};
+	int res;
+
+	put_unaligned_be16(stream_id, &cdb[4]);
+	put_unaligned_be32(sizeof(buf), &cdb[10]);
+
+	res = scsi_execute_cmd(sdev, cdb, REQ_OP_DRV_IN, &buf, sizeof(buf),
+			       SD_TIMEOUT, sdkp->max_retries, &exec_args);
+	if (res < 0)
+		return false;
+	if (scsi_status_is_check_condition(res) && scsi_sense_valid(&sshdr))
+		sd_print_sense_hdr(sdkp, &sshdr);
+	if (res)
+		return false;
+	if (get_unaligned_be32(&buf.h.len) < sizeof(struct scsi_stream_status))
+		return false;
+	return buf.h.stream_status[0].perm;
+}
+
+static void sd_read_io_hints(struct scsi_disk *sdkp, unsigned char *buffer)
+{
+	struct scsi_device *sdp = sdkp->device;
+	const struct scsi_io_group_descriptor *desc, *start, *end;
+	struct scsi_sense_hdr sshdr;
+	struct scsi_mode_data data;
+	int res;
+
+	res = scsi_mode_sense(sdp, /*dbd=*/0x8, /*modepage=*/0x0a,
+			      /*subpage=*/0x05, buffer, SD_BUF_SIZE,
+			      SD_TIMEOUT, sdkp->max_retries, &data, &sshdr);
+	if (res < 0)
+		return;
+	start = (void *)buffer + data.header_length + 16;
+	end = (void *)buffer + ALIGN_DOWN(data.header_length + data.length,
+					  sizeof(*end));
+	/*
+	 * From "SBC-5 Constrained Streams with Data Lifetimes": Device severs
+	 * should assign the lowest numbered stream identifiers to permanent
+	 * streams.
+	 */
+	for (desc = start; desc < end; desc++)
+		if (!desc->st_enble || !sd_is_perm_stream(sdkp, desc - start))
+			break;
+	sdkp->permanent_stream_count = desc - start;
+	if (sdkp->rscs && sdkp->permanent_stream_count < 2)
+		sd_printk(KERN_INFO, sdkp,
+			  "Unexpected: RSCS has been set and the permanent stream count is %u\n",
+			  sdkp->permanent_stream_count);
+	else if (sdkp->permanent_stream_count)
+		sd_printk(KERN_INFO, sdkp, "permanent stream count = %d\n",
+			  sdkp->permanent_stream_count);
+}
+
 /*
  * The ATO bit indicates whether the DIF application tag is available
  * for use by the operating system.
@@ -3395,6 +3487,7 @@ static int sd_revalidate_disk(struct gendisk *disk)
 
 		sd_read_write_protect_flag(sdkp, buffer);
 		sd_read_cache_type(sdkp, buffer);
+		sd_read_io_hints(sdkp, buffer);
 		sd_read_app_tag_own(sdkp, buffer);
 		sd_read_write_same(sdkp, buffer);
 		sd_read_security(sdkp, buffer);
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 84685168b6e0..570d5a72749a 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -125,6 +125,8 @@ struct scsi_disk {
 	unsigned int	physical_block_size;
 	unsigned int	max_medium_access_timeouts;
 	unsigned int	medium_access_timed_out;
+			/* number of permanent streams */
+	u16		permanent_stream_count;
 	u8		media_present;
 	u8		write_prot;
 	u8		protection_type;/* Data Integrity Field */

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 09/15] scsi_debug: Reduce code duplication
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
                   ` (7 preceding siblings ...)
  2023-10-05 19:40 ` [PATCH v2 08/15] sd: Translate data lifetime information Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-05 19:40 ` [PATCH v2 10/15] scsi_debug: Support the block limits extension VPD page Bart Van Assche
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Avri Altman, Douglas Gilbert,
	James E.J. Bottomley

All VPD pages have the page code in byte one. Reduce code duplication by
storing the VPD page code once.

Reviewed-by: Avri Altman <avri.altman@wdc.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_debug.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 9c0af50501f9..46eaa2f9e63b 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -1598,7 +1598,8 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		u32 len;
 		char lu_id_str[6];
 		int host_no = devip->sdbg_host->shost->host_no;
-		
+
+		arr[1] = cmd[2];
 		port_group_id = (((host_no + 1) & 0x7f) << 8) +
 		    (devip->channel & 0x7f);
 		if (sdebug_vpd_use_hostno == 0)
@@ -1609,7 +1610,6 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 				 (devip->target * 1000) - 3;
 		len = scnprintf(lu_id_str, 6, "%d", lu_id_num);
 		if (0 == cmd[2]) { /* supported vital product data pages */
-			arr[1] = cmd[2];	/*sanity */
 			n = 4;
 			arr[n++] = 0x0;   /* this page */
 			arr[n++] = 0x80;  /* unit serial number */
@@ -1630,23 +1630,18 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 			}
 			arr[3] = n - 4;	  /* number of supported VPD pages */
 		} else if (0x80 == cmd[2]) { /* unit serial number */
-			arr[1] = cmd[2];	/*sanity */
 			arr[3] = len;
 			memcpy(&arr[4], lu_id_str, len);
 		} else if (0x83 == cmd[2]) { /* device identification */
-			arr[1] = cmd[2];	/*sanity */
 			arr[3] = inquiry_vpd_83(&arr[4], port_group_id,
 						target_dev_id, lu_id_num,
 						lu_id_str, len,
 						&devip->lu_name);
 		} else if (0x84 == cmd[2]) { /* Software interface ident. */
-			arr[1] = cmd[2];	/*sanity */
 			arr[3] = inquiry_vpd_84(&arr[4]);
 		} else if (0x85 == cmd[2]) { /* Management network addresses */
-			arr[1] = cmd[2];	/*sanity */
 			arr[3] = inquiry_vpd_85(&arr[4]);
 		} else if (0x86 == cmd[2]) { /* extended inquiry */
-			arr[1] = cmd[2];	/*sanity */
 			arr[3] = 0x3c;	/* number of following entries */
 			if (sdebug_dif == T10_PI_TYPE3_PROTECTION)
 				arr[4] = 0x4;	/* SPT: GRD_CHK:1 */
@@ -1656,30 +1651,23 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 				arr[4] = 0x0;   /* no protection stuff */
 			arr[5] = 0x7;   /* head of q, ordered + simple q's */
 		} else if (0x87 == cmd[2]) { /* mode page policy */
-			arr[1] = cmd[2];	/*sanity */
 			arr[3] = 0x8;	/* number of following entries */
 			arr[4] = 0x2;	/* disconnect-reconnect mp */
 			arr[6] = 0x80;	/* mlus, shared */
 			arr[8] = 0x18;	 /* protocol specific lu */
 			arr[10] = 0x82;	 /* mlus, per initiator port */
 		} else if (0x88 == cmd[2]) { /* SCSI Ports */
-			arr[1] = cmd[2];	/*sanity */
 			arr[3] = inquiry_vpd_88(&arr[4], target_dev_id);
 		} else if (is_disk_zbc && 0x89 == cmd[2]) { /* ATA info */
-			arr[1] = cmd[2];        /*sanity */
 			n = inquiry_vpd_89(&arr[4]);
 			put_unaligned_be16(n, arr + 2);
 		} else if (is_disk_zbc && 0xb0 == cmd[2]) { /* Block limits */
-			arr[1] = cmd[2];        /*sanity */
 			arr[3] = inquiry_vpd_b0(&arr[4]);
 		} else if (is_disk_zbc && 0xb1 == cmd[2]) { /* Block char. */
-			arr[1] = cmd[2];        /*sanity */
 			arr[3] = inquiry_vpd_b1(devip, &arr[4]);
 		} else if (is_disk && 0xb2 == cmd[2]) { /* LB Prov. */
-			arr[1] = cmd[2];        /*sanity */
 			arr[3] = inquiry_vpd_b2(&arr[4]);
 		} else if (is_zbc && cmd[2] == 0xb6) { /* ZB dev. charact. */
-			arr[1] = cmd[2];        /*sanity */
 			arr[3] = inquiry_vpd_b6(devip, &arr[4]);
 		} else {
 			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 2, -1);

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 10/15] scsi_debug: Support the block limits extension VPD page
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
                   ` (8 preceding siblings ...)
  2023-10-05 19:40 ` [PATCH v2 09/15] scsi_debug: Reduce code duplication Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-05 19:40 ` [PATCH v2 11/15] scsi_debug: Rework page code error handling Bart Van Assche
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Douglas Gilbert,
	James E.J. Bottomley

From T10 document 23-024r3.pdf:

"Reduced stream control:
a) reduces the maximum number of streams that the device server supports;
   and
b) increases the number of write commands that are able to specify a stream
   to be written in any write command that contains the GROUP NUMBER field
   in its CDB.

If the RSCS bit (see 6.6.5) is set to one, then the device server shall:
a) support per group stream identifier usage as described in 4.32.2;
b) support the IO Advice Hints Grouping mode page (see 6.5.7); and
c) set the MAXIMUM NUMBER OF STREAMS field (see 6.6.5) to a value that is
   less than 64.

Device servers that set the RSCS bit to one may support other features
(e.g., permanent streams (see 4.32.4)).

4.32.4 Permanent streams

A permanent stream is a stream for which the device server does not allow
closing or otherwise modifying the configuration of that stream. The PERM
bit (see 5.9.2.3) indicates whether a stream is a permanent stream. If a
STREAM CONTROL command (see 5.32) specifies the closing of a permanent
stream, the device server terminates that command with CHECK CONDITION
status instead of closing the specified stream. A permanent stream is always
an open stream. Device severs should assign the lowest numbered stream
identifiers to permanent streams."

Report that reduced stream control is supported.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_debug.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 46eaa2f9e63b..88cba9374166 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -1627,6 +1627,7 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 					arr[n++] = 0xb2;  /* LB Provisioning */
 				if (is_zbc)
 					arr[n++] = 0xb6;  /* ZB dev. char. */
+				arr[n++] = 0xb7;  /* Block limits extension */
 			}
 			arr[3] = n - 4;	  /* number of supported VPD pages */
 		} else if (0x80 == cmd[2]) { /* unit serial number */
@@ -1669,6 +1670,9 @@ static int resp_inquiry(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 			arr[3] = inquiry_vpd_b2(&arr[4]);
 		} else if (is_zbc && cmd[2] == 0xb6) { /* ZB dev. charact. */
 			arr[3] = inquiry_vpd_b6(devip, &arr[4]);
+		} else if (cmd[2] == 0xb7) { /* block limits extension page */
+			arr[3] = 2; /* page length */
+			arr[5] = 1; /* Reduced stream control support (RSCS) */
 		} else {
 			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 2, -1);
 			kfree(arr);

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 11/15] scsi_debug: Rework page code error handling
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
                   ` (9 preceding siblings ...)
  2023-10-05 19:40 ` [PATCH v2 10/15] scsi_debug: Support the block limits extension VPD page Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-05 19:40 ` [PATCH v2 12/15] scsi_debug: Rework subpage " Bart Van Assche
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Douglas Gilbert,
	James E.J. Bottomley

Instead of tracking whether or not the page code is valid in a boolean
variable, jump to error handling code if an unsupported page code is
encountered.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_debug.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 88cba9374166..6b87d267c9c5 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -2327,7 +2327,7 @@ static int resp_mode_sense(struct scsi_cmnd *scp,
 	unsigned char *ap;
 	unsigned char arr[SDEBUG_MAX_MSENSE_SZ];
 	unsigned char *cmd = scp->cmnd;
-	bool dbd, llbaa, msense_6, is_disk, is_zbc, bad_pcode;
+	bool dbd, llbaa, msense_6, is_disk, is_zbc;
 
 	dbd = !!(cmd[1] & 0x8);		/* disable block descriptors */
 	pcontrol = (cmd[2] & 0xc0) >> 6;
@@ -2391,7 +2391,6 @@ static int resp_mode_sense(struct scsi_cmnd *scp,
 		mk_sense_invalid_fld(scp, SDEB_IN_CDB, 3, -1);
 		return check_condition_result;
 	}
-	bad_pcode = false;
 
 	switch (pcode) {
 	case 0x1:	/* Read-Write error recovery page, direct access */
@@ -2406,15 +2405,17 @@ static int resp_mode_sense(struct scsi_cmnd *scp,
 		if (is_disk) {
 			len = resp_format_pg(ap, pcontrol, target);
 			offset += len;
-		} else
-			bad_pcode = true;
+		} else {
+			goto bad_pcode;
+		}
 		break;
 	case 0x8:	/* Caching page, direct access */
 		if (is_disk || is_zbc) {
 			len = resp_caching_pg(ap, pcontrol, target);
 			offset += len;
-		} else
-			bad_pcode = true;
+		} else {
+			goto bad_pcode;
+		}
 		break;
 	case 0xa:	/* Control Mode page, all devices */
 		len = resp_ctrl_m_pg(ap, pcontrol, target);
@@ -2467,18 +2468,17 @@ static int resp_mode_sense(struct scsi_cmnd *scp,
 		}
 		break;
 	default:
-		bad_pcode = true;
-		break;
-	}
-	if (bad_pcode) {
-		mk_sense_invalid_fld(scp, SDEB_IN_CDB, 2, 5);
-		return check_condition_result;
+		goto bad_pcode;
 	}
 	if (msense_6)
 		arr[0] = offset - 1;
 	else
 		put_unaligned_be16((offset - 2), arr + 0);
 	return fill_from_dev_buffer(scp, arr, min_t(u32, alloc_len, offset));
+
+bad_pcode:
+	mk_sense_invalid_fld(scp, SDEB_IN_CDB, 2, 5);
+	return check_condition_result;
 }
 
 #define SDEBUG_MAX_MSELECT_SZ 512

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 12/15] scsi_debug: Rework subpage code error handling
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
                   ` (10 preceding siblings ...)
  2023-10-05 19:40 ` [PATCH v2 11/15] scsi_debug: Rework page code error handling Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-05 19:40 ` [PATCH v2 13/15] scsi_debug: Implement the IO Advice Hints Grouping mode page Bart Van Assche
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Douglas Gilbert,
	James E.J. Bottomley

Move the subpage code checks into the switch statement to make it easier
to add support for new page code / subpage code combinations.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_debug.c | 70 ++++++++++++++++++++-------------------
 1 file changed, 36 insertions(+), 34 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 6b87d267c9c5..a96eb0d10346 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -2386,22 +2386,22 @@ static int resp_mode_sense(struct scsi_cmnd *scp,
 		ap = arr + offset;
 	}
 
-	if ((subpcode > 0x0) && (subpcode < 0xff) && (0x19 != pcode)) {
-		/* TODO: Control Extension page */
-		mk_sense_invalid_fld(scp, SDEB_IN_CDB, 3, -1);
-		return check_condition_result;
-	}
-
 	switch (pcode) {
 	case 0x1:	/* Read-Write error recovery page, direct access */
+		if (subpcode > 0x0 && subpcode < 0xff)
+			goto bad_subpcode;
 		len = resp_err_recov_pg(ap, pcontrol, target);
 		offset += len;
 		break;
 	case 0x2:	/* Disconnect-Reconnect page, all devices */
+		if (subpcode > 0x0 && subpcode < 0xff)
+			goto bad_subpcode;
 		len = resp_disconnect_pg(ap, pcontrol, target);
 		offset += len;
 		break;
 	case 0x3:       /* Format device page, direct access */
+		if (subpcode > 0x0 && subpcode < 0xff)
+			goto bad_subpcode;
 		if (is_disk) {
 			len = resp_format_pg(ap, pcontrol, target);
 			offset += len;
@@ -2410,6 +2410,8 @@ static int resp_mode_sense(struct scsi_cmnd *scp,
 		}
 		break;
 	case 0x8:	/* Caching page, direct access */
+		if (subpcode > 0x0 && subpcode < 0xff)
+			goto bad_subpcode;
 		if (is_disk || is_zbc) {
 			len = resp_caching_pg(ap, pcontrol, target);
 			offset += len;
@@ -2418,14 +2420,14 @@ static int resp_mode_sense(struct scsi_cmnd *scp,
 		}
 		break;
 	case 0xa:	/* Control Mode page, all devices */
+		if (subpcode > 0x0 && subpcode < 0xff)
+			goto bad_subpcode;
 		len = resp_ctrl_m_pg(ap, pcontrol, target);
 		offset += len;
 		break;
 	case 0x19:	/* if spc==1 then sas phy, control+discover */
-		if ((subpcode > 0x2) && (subpcode < 0xff)) {
-			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 3, -1);
-			return check_condition_result;
-		}
+		if (subpcode > 0x2 && subpcode < 0xff)
+			goto bad_subpcode;
 		len = 0;
 		if ((0x0 == subpcode) || (0xff == subpcode))
 			len += resp_sas_sf_m_pg(ap + len, pcontrol, target);
@@ -2437,35 +2439,31 @@ static int resp_mode_sense(struct scsi_cmnd *scp,
 		offset += len;
 		break;
 	case 0x1c:	/* Informational Exceptions Mode page, all devices */
+		if (subpcode > 0x0 && subpcode < 0xff)
+			goto bad_subpcode;
 		len = resp_iec_m_pg(ap, pcontrol, target);
 		offset += len;
 		break;
 	case 0x3f:	/* Read all Mode pages */
-		if ((0 == subpcode) || (0xff == subpcode)) {
-			len = resp_err_recov_pg(ap, pcontrol, target);
-			len += resp_disconnect_pg(ap + len, pcontrol, target);
-			if (is_disk) {
-				len += resp_format_pg(ap + len, pcontrol,
-						      target);
-				len += resp_caching_pg(ap + len, pcontrol,
-						       target);
-			} else if (is_zbc) {
-				len += resp_caching_pg(ap + len, pcontrol,
-						       target);
-			}
-			len += resp_ctrl_m_pg(ap + len, pcontrol, target);
-			len += resp_sas_sf_m_pg(ap + len, pcontrol, target);
-			if (0xff == subpcode) {
-				len += resp_sas_pcd_m_spg(ap + len, pcontrol,
-						  target, target_dev_id);
-				len += resp_sas_sha_m_spg(ap + len, pcontrol);
-			}
-			len += resp_iec_m_pg(ap + len, pcontrol, target);
-			offset += len;
-		} else {
-			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 3, -1);
-			return check_condition_result;
+		if (subpcode > 0x0 && subpcode < 0xff)
+			goto bad_subpcode;
+		len = resp_err_recov_pg(ap, pcontrol, target);
+		len += resp_disconnect_pg(ap + len, pcontrol, target);
+		if (is_disk) {
+			len += resp_format_pg(ap + len, pcontrol, target);
+			len += resp_caching_pg(ap + len, pcontrol, target);
+		} else if (is_zbc) {
+			len += resp_caching_pg(ap + len, pcontrol, target);
+		}
+		len += resp_ctrl_m_pg(ap + len, pcontrol, target);
+		len += resp_sas_sf_m_pg(ap + len, pcontrol, target);
+		if (0xff == subpcode) {
+			len += resp_sas_pcd_m_spg(ap + len, pcontrol, target,
+						  target_dev_id);
+			len += resp_sas_sha_m_spg(ap + len, pcontrol);
 		}
+		len += resp_iec_m_pg(ap + len, pcontrol, target);
+		offset += len;
 		break;
 	default:
 		goto bad_pcode;
@@ -2479,6 +2477,10 @@ static int resp_mode_sense(struct scsi_cmnd *scp,
 bad_pcode:
 	mk_sense_invalid_fld(scp, SDEB_IN_CDB, 2, 5);
 	return check_condition_result;
+
+bad_subpcode:
+	mk_sense_invalid_fld(scp, SDEB_IN_CDB, 3, -1);
+	return check_condition_result;
 }
 
 #define SDEBUG_MAX_MSELECT_SZ 512

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 13/15] scsi_debug: Implement the IO Advice Hints Grouping mode page
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
                   ` (11 preceding siblings ...)
  2023-10-05 19:40 ` [PATCH v2 12/15] scsi_debug: Rework subpage " Bart Van Assche
@ 2023-10-05 19:40 ` Bart Van Assche
  2023-10-05 19:41 ` [PATCH v2 14/15] scsi_debug: Implement GET STREAM STATUS Bart Van Assche
  2023-10-05 19:41 ` [PATCH v2 15/15] scsi_debug: Maintain write statistics per group number Bart Van Assche
  14 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Douglas Gilbert,
	James E.J. Bottomley

Implement an IO Advice Hints Grouping mode page with three permanent
streams. A permanent stream is a stream for which the device server does
not allow closing or otherwise modifying the configuration of that
stream. The stream identifier enable (ST_ENBLE) bit specifies whether
the stream identifier may be used in the GROUP NUMBER field of SCSI
WRITE commands.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_debug.c | 42 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index a96eb0d10346..d56989e94c4a 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -2241,6 +2241,36 @@ static int resp_ctrl_m_pg(unsigned char *p, int pcontrol, int target)
 	return sizeof(ctrl_m_pg);
 }
 
+enum { MAXIMUM_NUMBER_OF_STREAMS = 4 };
+
+/* IO Advice Hints Grouping mode page */
+static int resp_grouping_m_pg(unsigned char *p, int pcontrol, int target)
+{
+	/* IO Advice Hints Grouping mode page */
+	struct grouping_m_pg {
+		u8 page_code;
+		u8 subpage_code;
+		__be16 page_length;
+		u8 reserved[12];
+		struct scsi_io_group_descriptor
+			descr[MAXIMUM_NUMBER_OF_STREAMS];
+	};
+	static const struct grouping_m_pg gr_m_pg = {
+		.page_code = 0xa,
+		.subpage_code = 5,
+		.page_length = cpu_to_be16(sizeof(gr_m_pg) - 4),
+		.descr = {
+			{ .st_enble = 1 },
+			{ .st_enble = 1 },
+			{ .st_enble = 1 },
+			{ .st_enble = 0 },
+		}
+	};
+
+	BUILD_BUG_ON(sizeof(struct grouping_m_pg) != 16 + 4 * 16);
+	memcpy(p, &gr_m_pg, sizeof(gr_m_pg));
+	return sizeof(gr_m_pg);
+}
 
 static int resp_iec_m_pg(unsigned char *p, int pcontrol, int target)
 {	/* Informational Exceptions control mode page for mode_sense */
@@ -2420,9 +2450,17 @@ static int resp_mode_sense(struct scsi_cmnd *scp,
 		}
 		break;
 	case 0xa:	/* Control Mode page, all devices */
-		if (subpcode > 0x0 && subpcode < 0xff)
+		switch (subpcode) {
+		case 0:
+		case 0xff:
+			len = resp_ctrl_m_pg(ap, pcontrol, target);
+			break;
+		case 0x05:
+			len = resp_grouping_m_pg(ap, pcontrol, target);
+			break;
+		default:
 			goto bad_subpcode;
-		len = resp_ctrl_m_pg(ap, pcontrol, target);
+		}
 		offset += len;
 		break;
 	case 0x19:	/* if spc==1 then sas phy, control+discover */

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 14/15] scsi_debug: Implement GET STREAM STATUS
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
                   ` (12 preceding siblings ...)
  2023-10-05 19:40 ` [PATCH v2 13/15] scsi_debug: Implement the IO Advice Hints Grouping mode page Bart Van Assche
@ 2023-10-05 19:41 ` Bart Van Assche
  2023-10-05 19:41 ` [PATCH v2 15/15] scsi_debug: Maintain write statistics per group number Bart Van Assche
  14 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, James E.J. Bottomley

Implement the GET STREAM STATUS SCSI command. Report that the first
three stream indexes correspond to permanent streams.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_debug.c | 44 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index d56989e94c4a..801448570960 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -481,6 +481,8 @@ static int resp_write_scat(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_start_stop(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_readcap16(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_get_lba_status(struct scsi_cmnd *, struct sdebug_dev_info *);
+static int resp_get_stream_status(struct scsi_cmnd *scp,
+				  struct sdebug_dev_info *devip);
 static int resp_report_tgtpgs(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_unmap(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_rsup_opcodes(struct scsi_cmnd *, struct sdebug_dev_info *);
@@ -555,6 +557,9 @@ static const struct opcode_info_t sa_in_16_iarr[] = {
 	{0, 0x9e, 0x12, F_SA_LOW | F_D_IN, resp_get_lba_status, NULL,
 	    {16,  0x12, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 	     0xff, 0xff, 0xff, 0, 0xc7} },	/* GET LBA STATUS(16) */
+	{0, 0x9e, 0x16, F_SA_LOW | F_D_IN, resp_get_stream_status, NULL,
+	    {16, 0x16, 0, 0, 0xff, 0xff, 0, 0, 0, 0, 0xff, 0xff, 0xff, 0xff,
+	     0, 0} },	/* GET STREAM STATUS */
 };
 
 static const struct opcode_info_t vl_iarr[] = {	/* VARIABLE LENGTH */
@@ -2241,7 +2246,7 @@ static int resp_ctrl_m_pg(unsigned char *p, int pcontrol, int target)
 	return sizeof(ctrl_m_pg);
 }
 
-enum { MAXIMUM_NUMBER_OF_STREAMS = 4 };
+enum { MAXIMUM_NUMBER_OF_STREAMS = 4, PERMANENT_STREAM_COUNT = 3 };
 
 /* IO Advice Hints Grouping mode page */
 static int resp_grouping_m_pg(unsigned char *p, int pcontrol, int target)
@@ -4236,6 +4241,43 @@ static int resp_get_lba_status(struct scsi_cmnd *scp,
 	return fill_from_dev_buffer(scp, arr, SDEBUG_GET_LBA_STATUS_LEN);
 }
 
+static int resp_get_stream_status(struct scsi_cmnd *scp,
+				  struct sdebug_dev_info *devip)
+{
+	u16 starting_stream_id, stream_id;
+	const u8 *cmd = scp->cmnd;
+	u32 alloc_len, offset;
+	u8 arr[256];
+
+	starting_stream_id = get_unaligned_be16(cmd + 4);
+	alloc_len = get_unaligned_be32(cmd + 10);
+
+	if (alloc_len < 8) {
+		mk_sense_invalid_fld(scp, SDEB_IN_CDB, 10, -1);
+		return check_condition_result;
+	}
+
+	if (starting_stream_id >= MAXIMUM_NUMBER_OF_STREAMS) {
+		mk_sense_invalid_fld(scp, SDEB_IN_CDB, 4, -1);
+		return check_condition_result;
+	}
+
+	for (offset = 8, stream_id = starting_stream_id;
+	     offset + 8 <= min_t(u32, alloc_len, sizeof(arr)) &&
+		     stream_id < MAXIMUM_NUMBER_OF_STREAMS;
+	     offset += 8, stream_id++) {
+		struct scsi_stream_status *stream_status = (void *)arr + offset;
+
+		stream_status->perm = stream_id < PERMANENT_STREAM_COUNT;
+		put_unaligned_be16(stream_id,
+				   &stream_status->stream_identifier);
+		stream_status->rel_lifetime = stream_id + 1;
+	}
+	put_unaligned_be32(offset - 8, arr + 0); /* PARAMETER DATA LENGTH */
+
+	return fill_from_dev_buffer(scp, arr, min(offset, alloc_len));
+}
+
 static int resp_sync_cache(struct scsi_cmnd *scp,
 			   struct sdebug_dev_info *devip)
 {

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 15/15] scsi_debug: Maintain write statistics per group number
  2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
                   ` (13 preceding siblings ...)
  2023-10-05 19:41 ` [PATCH v2 14/15] scsi_debug: Implement GET STREAM STATUS Bart Van Assche
@ 2023-10-05 19:41 ` Bart Van Assche
  14 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-05 19:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Bart Van Assche, Douglas Gilbert,
	James E.J. Bottomley

Track per GROUP NUMBER how many write commands have been processed. Make
this information available in sysfs. Reset these statistics if any data
is written into the sysfs attribute.

Note: SCSI devices should only interpret the information in the GROUP
NUMBER field as a stream identifier if the ST_ENBLE bit has been set to
one. This patch follows a simpler approach: count the number of writes
per GROUP NUMBER whether or not the group number represents a stream
identifier.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Douglas Gilbert <dgilbert@interlog.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_debug.c | 51 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 47 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 801448570960..c2102c0046ad 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -846,6 +846,8 @@ static int sdeb_zbc_nr_conv = DEF_ZBC_NR_CONV_ZONES;
 static int submit_queues = DEF_SUBMIT_QUEUES;  /* > 1 for multi-queue (mq) */
 static int poll_queues; /* iouring iopoll interface.*/
 
+static atomic_long_t writes_by_group_number[64];
+
 static char sdebug_proc_name[] = MY_NAME;
 static const char *my_name = MY_NAME;
 
@@ -3040,7 +3042,8 @@ static inline struct sdeb_store_info *devip2sip(struct sdebug_dev_info *devip,
 
 /* Returns number of bytes copied or -1 if error. */
 static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp,
-			    u32 sg_skip, u64 lba, u32 num, bool do_write)
+			    u32 sg_skip, u64 lba, u32 num, bool do_write,
+			    u8 group_number)
 {
 	int ret;
 	u64 block, rest = 0;
@@ -3059,6 +3062,10 @@ static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp,
 		return 0;
 	if (scp->sc_data_direction != dir)
 		return -1;
+
+	if (do_write && group_number < ARRAY_SIZE(writes_by_group_number))
+		atomic_long_inc(&writes_by_group_number[group_number]);
+
 	fsp = sip->storep;
 
 	block = do_div(lba, sdebug_store_sectors);
@@ -3432,7 +3439,7 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		}
 	}
 
-	ret = do_device_access(sip, scp, 0, lba, num, false);
+	ret = do_device_access(sip, scp, 0, lba, num, false, 0);
 	sdeb_read_unlock(sip);
 	if (unlikely(ret == -1))
 		return DID_ERROR << 16;
@@ -3617,6 +3624,7 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 {
 	bool check_prot;
 	u32 num;
+	u8 group = 0;
 	u32 ei_lba;
 	int ret;
 	u64 lba;
@@ -3628,11 +3636,13 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		ei_lba = 0;
 		lba = get_unaligned_be64(cmd + 2);
 		num = get_unaligned_be32(cmd + 10);
+		group = cmd[14] & 0x3f;
 		check_prot = true;
 		break;
 	case WRITE_10:
 		ei_lba = 0;
 		lba = get_unaligned_be32(cmd + 2);
+		group = cmd[6] & 0x3f;
 		num = get_unaligned_be16(cmd + 7);
 		check_prot = true;
 		break;
@@ -3647,15 +3657,18 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		ei_lba = 0;
 		lba = get_unaligned_be32(cmd + 2);
 		num = get_unaligned_be32(cmd + 6);
+		group = cmd[6] & 0x3f;
 		check_prot = true;
 		break;
 	case 0x53:	/* XDWRITEREAD(10) */
 		ei_lba = 0;
 		lba = get_unaligned_be32(cmd + 2);
+		group = cmd[6] & 0x1f;
 		num = get_unaligned_be16(cmd + 7);
 		check_prot = false;
 		break;
 	default:	/* assume WRITE(32) */
+		group = cmd[6] & 0x3f;
 		lba = get_unaligned_be64(cmd + 12);
 		ei_lba = get_unaligned_be32(cmd + 20);
 		num = get_unaligned_be32(cmd + 28);
@@ -3710,7 +3723,7 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		}
 	}
 
-	ret = do_device_access(sip, scp, 0, lba, num, true);
+	ret = do_device_access(sip, scp, 0, lba, num, true, group);
 	if (unlikely(scsi_debug_lbp()))
 		map_region(sip, lba, num);
 	/* If ZBC zone then bump its write pointer */
@@ -3762,12 +3775,14 @@ static int resp_write_scat(struct scsi_cmnd *scp,
 	u32 lb_size = sdebug_sector_size;
 	u32 ei_lba;
 	u64 lba;
+	u8 group;
 	int ret, res;
 	bool is_16;
 	static const u32 lrd_size = 32; /* + parameter list header size */
 
 	if (cmd[0] == VARIABLE_LENGTH_CMD) {
 		is_16 = false;
+		group = cmd[6] & 0x3f;
 		wrprotect = (cmd[10] >> 5) & 0x7;
 		lbdof = get_unaligned_be16(cmd + 12);
 		num_lrd = get_unaligned_be16(cmd + 16);
@@ -3778,6 +3793,7 @@ static int resp_write_scat(struct scsi_cmnd *scp,
 		lbdof = get_unaligned_be16(cmd + 4);
 		num_lrd = get_unaligned_be16(cmd + 8);
 		bt_len = get_unaligned_be32(cmd + 10);
+		group = cmd[14] & 0x3f;
 		if (unlikely(have_dif_prot)) {
 			if (sdebug_dif == T10_PI_TYPE2_PROTECTION &&
 			    wrprotect) {
@@ -3866,7 +3882,8 @@ static int resp_write_scat(struct scsi_cmnd *scp,
 			}
 		}
 
-		ret = do_device_access(sip, scp, sg_off, lba, num, true);
+		ret = do_device_access(sip, scp, sg_off, lba, num, true,
+				       group);
 		/* If ZBC zone then bump its write pointer */
 		if (sdebug_dev_is_zoned(devip))
 			zbc_inc_wp(devip, lba, num);
@@ -6828,6 +6845,31 @@ static ssize_t tur_ms_to_ready_show(struct device_driver *ddp, char *buf)
 }
 static DRIVER_ATTR_RO(tur_ms_to_ready);
 
+static ssize_t group_number_stats_show(struct device_driver *ddp, char *buf)
+{
+	char *p = buf, *end = buf + PAGE_SIZE;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(writes_by_group_number); i++)
+		p += scnprintf(p, end - p, "%d %ld\n", i,
+			       atomic_long_read(&writes_by_group_number[i]));
+
+	return p - buf;
+}
+
+static ssize_t group_number_stats_store(struct device_driver *ddp,
+					const char *buf,
+				  size_t count)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(writes_by_group_number); i++)
+		atomic_long_set(&writes_by_group_number[i], 0);
+
+	return 0;
+}
+static DRIVER_ATTR_RW(group_number_stats);
+
 /* Note: The following array creates attribute files in the
    /sys/bus/pseudo/drivers/scsi_debug directory. The advantage of these
    files (over those found in the /sys/module/scsi_debug/parameters
@@ -6874,6 +6916,7 @@ static struct attribute *sdebug_drv_attrs[] = {
 	&driver_attr_cdb_len.attr,
 	&driver_attr_tur_ms_to_ready.attr,
 	&driver_attr_zbc.attr,
+	&driver_attr_group_number_stats.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(sdebug_drv);

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits
  2023-10-05 19:40 ` [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits Bart Van Assche
@ 2023-10-06  6:28   ` Kanchan Joshi
  2023-10-06 18:20     ` Bart Van Assche
  2023-10-10  5:22   ` Kanchan Joshi
  1 sibling, 1 reply; 45+ messages in thread
From: Kanchan Joshi @ 2023-10-06  6:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, linux-fsdevel,
	Martin K . Petersen, Christoph Hellwig, Niklas Cassel,
	Avri Altman, Bean Huo, Daejun Park, Damien Le Moal

[-- Attachment #1: Type: text/plain, Size: 349 bytes --]

On Thu, Oct 05, 2023 at 12:40:47PM -0700, Bart Van Assche wrote:
>A later patch will store the data lifetime in the bio->bi_ioprio member
>before bio_set_ioprio() is called. Make sure that bio_set_ioprio()
>doesn't clear more bits than necessary.

Only lifetime bits need to be retained, but the patch retains the CDL
bits too. Is that intentional?

[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/15] blk-ioprio: Modify fewer bio->bi_ioprio bits
  2023-10-05 19:40 ` [PATCH v2 02/15] blk-ioprio: Modify " Bart Van Assche
@ 2023-10-06  6:36   ` Kanchan Joshi
  2023-10-06 18:25     ` Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Kanchan Joshi @ 2023-10-06  6:36 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, linux-fsdevel,
	Martin K . Petersen, Christoph Hellwig, Niklas Cassel,
	Avri Altman, Bean Huo, Daejun Park, Damien Le Moal

[-- Attachment #1: Type: text/plain, Size: 1594 bytes --]

On Thu, Oct 05, 2023 at 12:40:48PM -0700, Bart Van Assche wrote:
>A later patch will store the data lifetime in the bio->bi_ioprio member
>before the blk-ioprio policy is applied. Make sure that this policy doesn't
>clear more bits than necessary.
>
>Cc: Damien Le Moal <dlemoal@kernel.org>
>Cc: Niklas Cassel <niklas.cassel@wdc.com>
>Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>---
> block/blk-ioprio.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
>diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
>index 4051fada01f1..2db86f153b6d 100644
>--- a/block/blk-ioprio.c
>+++ b/block/blk-ioprio.c
>@@ -202,7 +202,8 @@ void blkcg_set_ioprio(struct bio *bio)
> 		 * to achieve this.
> 		 */
> 		if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) != IOPRIO_CLASS_RT)
>-			bio->bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4);
>+			ioprio_set_class_and_level(&bio->bi_ioprio,
>+					IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4));
> 		return;
> 	}
>
>@@ -213,10 +214,10 @@ void blkcg_set_ioprio(struct bio *bio)
> 	 * If the bio I/O priority equals IOPRIO_CLASS_NONE, the cgroup I/O
> 	 * priority is assigned to the bio.
> 	 */
>-	prio = max_t(u16, bio->bi_ioprio,
>+	prio = max_t(u16, bio->bi_ioprio & IOPRIO_CLASS_LEVEL_MASK,
> 			IOPRIO_PRIO_VALUE(blkcg->prio_policy, 0));

All 9 bits (including CDL) are not taking part in this decision making
now. Maybe you want to exclude only lifetime bits.

>-	if (prio > bio->bi_ioprio)
>-		bio->bi_ioprio = prio;
>+	if (prio > (bio->bi_ioprio & IOPRIO_CLASS_LEVEL_MASK))
>+		ioprio_set_class_and_level(&bio->bi_ioprio, prio);

Same as above.


[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-05 19:40 ` [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield Bart Van Assche
@ 2023-10-06  6:42   ` Kanchan Joshi
  2023-10-06  8:19   ` Damien Le Moal
  1 sibling, 0 replies; 45+ messages in thread
From: Kanchan Joshi @ 2023-10-06  6:42 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, linux-fsdevel,
	Martin K . Petersen, Christoph Hellwig, Niklas Cassel,
	Avri Altman, Bean Huo, Daejun Park, Damien Le Moal,
	Hannes Reinecke

[-- Attachment #1: Type: text/plain, Size: 1365 bytes --]

On Thu, Oct 05, 2023 at 12:40:49PM -0700, Bart Van Assche wrote:
>The NVMe and SCSI standards define 64 different data lifetimes. Support
>storing this information in the I/O priority bitfield.
>
>The current allocation of the 16 bits in the I/O priority bitfield is as
>follows:
>* 15..13: I/O priority class
>* 12..6: unused
>* 5..3: I/O hint (CDL)
>* 2..0: I/O priority level
>
>This patch changes this into the following:
>* 15..13: I/O priority class
>* 12: unused
>* 11..6: data lifetime
>* 5..3: I/O hint (CDL)
>* 2..0: I/O priority level
>
>Cc: Damien Le Moal <dlemoal@kernel.org>
>Cc: Niklas Cassel <niklas.cassel@wdc.com>
>Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>---
> include/uapi/linux/ioprio.h | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
>diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
>index bee2bdb0eedb..efe9bc450872 100644
>--- a/include/uapi/linux/ioprio.h
>+++ b/include/uapi/linux/ioprio.h
>@@ -71,7 +71,7 @@ enum {
>  * class and level.
>  */
> #define IOPRIO_HINT_SHIFT		IOPRIO_LEVEL_NR_BITS
>-#define IOPRIO_HINT_NR_BITS		10
>+#define IOPRIO_HINT_NR_BITS		3

Should the comment[*] also be modified to reflect this change?

[*]
/*
 * The 10 bits between the priority class and the priority level are used to
 * optionally define I/O hints for any combination of I/O priority class and


[-- Attachment #2: Type: text/plain, Size: 0 bytes --]



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-05 19:40 ` [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield Bart Van Assche
  2023-10-06  6:42   ` Kanchan Joshi
@ 2023-10-06  8:19   ` Damien Le Moal
  2023-10-06  9:53     ` Niklas Cassel
                       ` (2 more replies)
  1 sibling, 3 replies; 45+ messages in thread
From: Damien Le Moal @ 2023-10-06  8:19 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Hannes Reinecke

On 10/6/23 04:40, Bart Van Assche wrote:
> The NVMe and SCSI standards define 64 different data lifetimes. Support
> storing this information in the I/O priority bitfield.
> 
> The current allocation of the 16 bits in the I/O priority bitfield is as
> follows:
> * 15..13: I/O priority class
> * 12..6: unused
> * 5..3: I/O hint (CDL)
> * 2..0: I/O priority level
> 
> This patch changes this into the following:
> * 15..13: I/O priority class
> * 12: unused
> * 11..6: data lifetime
> * 5..3: I/O hint (CDL)
> * 2..0: I/O priority level
> 
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  include/uapi/linux/ioprio.h | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
> index bee2bdb0eedb..efe9bc450872 100644
> --- a/include/uapi/linux/ioprio.h
> +++ b/include/uapi/linux/ioprio.h
> @@ -71,7 +71,7 @@ enum {
>   * class and level.
>   */
>  #define IOPRIO_HINT_SHIFT		IOPRIO_LEVEL_NR_BITS
> -#define IOPRIO_HINT_NR_BITS		10
> +#define IOPRIO_HINT_NR_BITS		3
>  #define IOPRIO_NR_HINTS			(1 << IOPRIO_HINT_NR_BITS)
>  #define IOPRIO_HINT_MASK		(IOPRIO_NR_HINTS - 1)
>  #define IOPRIO_PRIO_HINT(ioprio)	\
> @@ -102,6 +102,12 @@ enum {
>  	IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7,
>  };
>  
> +#define IOPRIO_LIFETIME_SHIFT		(IOPRIO_HINT_SHIFT + IOPRIO_HINT_NR_BITS)
> +#define IOPRIO_LIFETIME_NR_BITS		6
> +#define IOPRIO_LIFETIME_MASK		((1u << IOPRIO_LIFETIME_NR_BITS) - 1)
> +#define IOPRIO_PRIO_LIFETIME(ioprio)					\
> +	((ioprio >> IOPRIO_LIFETIME_SHIFT) & IOPRIO_LIFETIME_MASK)
> +
>  #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))

I am really not a fan of this. This essentially limits prio hints to CDL, while
the initial intent was to define the hints as something generic that depend on
the device features. With your change, we will not be able to support new
features in the future.

Your change seem to assume that it makes sense to be able to combine CDL with
lifetime hints. But does it really ? CDL is of dubious value for solid state
media and as far as I know, UFS world has not expressed interest. Conversely,
data lifetime hints do not make much sense for spin rust media where CDL is
important. So I would say that the combination of CDL and lifetime hints is of
dubious value.

Given this, why not simply define the 64 possible lifetime values as plain hint
values (8 to 71, following 1 to 7 for CDL) ?

The other question here if you really want to keep the bit separation approach
is: do we really need up to 64 different lifetime hints ? While the scsi
standard allows that much, does this many different lifetime make sense in
practice ? Can we ever think of a usecase that needs more than say 8 different
liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8,
then we can keep 4 bits unused in the hint field for future features.


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-06  8:19   ` Damien Le Moal
@ 2023-10-06  9:53     ` Niklas Cassel
  2023-10-06 18:07     ` Bart Van Assche
  2023-10-16  6:17     ` Christoph Hellwig
  2 siblings, 0 replies; 45+ messages in thread
From: Niklas Cassel @ 2023-10-06  9:53 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Jens Axboe, linux-block, linux-scsi,
	linux-fsdevel, Martin K . Petersen, Christoph Hellwig,
	Avri Altman, Bean Huo, Daejun Park, Hannes Reinecke

On Fri, Oct 06, 2023 at 05:19:52PM +0900, Damien Le Moal wrote:
> On 10/6/23 04:40, Bart Van Assche wrote:
> > The NVMe and SCSI standards define 64 different data lifetimes. Support
> > storing this information in the I/O priority bitfield.
> > 
> > The current allocation of the 16 bits in the I/O priority bitfield is as
> > follows:
> > * 15..13: I/O priority class
> > * 12..6: unused
> > * 5..3: I/O hint (CDL)
> > * 2..0: I/O priority level
> > 
> > This patch changes this into the following:
> > * 15..13: I/O priority class
> > * 12: unused
> > * 11..6: data lifetime
> > * 5..3: I/O hint (CDL)
> > * 2..0: I/O priority level
> > 
> > Cc: Damien Le Moal <dlemoal@kernel.org>
> > Cc: Niklas Cassel <niklas.cassel@wdc.com>
> > Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> > ---
> >  include/uapi/linux/ioprio.h | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
> > index bee2bdb0eedb..efe9bc450872 100644
> > --- a/include/uapi/linux/ioprio.h
> > +++ b/include/uapi/linux/ioprio.h
> > @@ -71,7 +71,7 @@ enum {
> >   * class and level.
> >   */
> >  #define IOPRIO_HINT_SHIFT		IOPRIO_LEVEL_NR_BITS
> > -#define IOPRIO_HINT_NR_BITS		10
> > +#define IOPRIO_HINT_NR_BITS		3

Can we really redefine this?
This is defined to 10 in released kernels, e.g. kernel v6.5.

Perhaps a better option would be to keep this defined to 10,
and then add new macros that define "specific" IO prio hint "classes".

something like
IOPRIO_HINT_NR_BITS                10

IOPRIO_HINT_CDL
IOPRIO_HINT_CDL_NR_BITS
IOPRIO_HINT_LIFETIME
IOPRIO_HINT_LIFETIME_NR_BITS

lifetime is really a hint, so I think it makes sense for it to be part of
the "IOPRIO_HINT bits".


> >  #define IOPRIO_NR_HINTS			(1 << IOPRIO_HINT_NR_BITS)
> >  #define IOPRIO_HINT_MASK		(IOPRIO_NR_HINTS - 1)
> >  #define IOPRIO_PRIO_HINT(ioprio)	\
> > @@ -102,6 +102,12 @@ enum {
> >  	IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7,
> >  };
> >  
> > +#define IOPRIO_LIFETIME_SHIFT		(IOPRIO_HINT_SHIFT + IOPRIO_HINT_NR_BITS)
> > +#define IOPRIO_LIFETIME_NR_BITS		6
> > +#define IOPRIO_LIFETIME_MASK		((1u << IOPRIO_LIFETIME_NR_BITS) - 1)
> > +#define IOPRIO_PRIO_LIFETIME(ioprio)					\
> > +	((ioprio >> IOPRIO_LIFETIME_SHIFT) & IOPRIO_LIFETIME_MASK)
> > +
> >  #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max))
> 
> I am really not a fan of this. This essentially limits prio hints to CDL, while
> the initial intent was to define the hints as something generic that depend on
> the device features. With your change, we will not be able to support new
> features in the future.
> 
> Your change seem to assume that it makes sense to be able to combine CDL with
> lifetime hints. But does it really ? CDL is of dubious value for solid state
> media and as far as I know, UFS world has not expressed interest. Conversely,
> data lifetime hints do not make much sense for spin rust media where CDL is
> important. So I would say that the combination of CDL and lifetime hints is of
> dubious value.
> 
> Given this, why not simply define the 64 possible lifetime values as plain hint
> values (8 to 71, following 1 to 7 for CDL) ?
> 
> The other question here if you really want to keep the bit separation approach
> is: do we really need up to 64 different lifetime hints ? While the scsi
> standard allows that much, does this many different lifetime make sense in
> practice ? Can we ever think of a usecase that needs more than say 8 different
> liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8,
> then we can keep 4 bits unused in the hint field for future features.

I think the question is: do we ever want to allow ioprio hints to be combined
or not?

If the answer is no, then go ahead and add the life time hints as values
8 to 71.

If the answer is yes, then life time hints need to use unique bits, even if it
might not make sense for CDL and life times to be combined.


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-06  8:19   ` Damien Le Moal
  2023-10-06  9:53     ` Niklas Cassel
@ 2023-10-06 18:07     ` Bart Van Assche
  2023-10-11 20:51       ` Bart Van Assche
  2023-10-16  6:17     ` Christoph Hellwig
  2 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2023-10-06 18:07 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Hannes Reinecke, Niklas Cassel

On 10/6/23 01:19, Damien Le Moal wrote:
> Your change seem to assume that it makes sense to be able to combine CDL with
> lifetime hints. But does it really ? CDL is of dubious value for solid state
> media and as far as I know, UFS world has not expressed interest. Conversely,
> data lifetime hints do not make much sense for spin rust media where CDL is
> important. So I would say that the combination of CDL and lifetime hints is of
> dubious value.
> 
> Given this, why not simply define the 64 possible lifetime values as plain hint
> values (8 to 71, following 1 to 7 for CDL) ?
> 
> The other question here if you really want to keep the bit separation approach
> is: do we really need up to 64 different lifetime hints ? While the scsi
> standard allows that much, does this many different lifetime make sense in
> practice ? Can we ever think of a usecase that needs more than say 8 different
> liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8,
> then we can keep 4 bits unused in the hint field for future features.

Hi Damien,

Not supporting CDL for solid state media and supporting eight different
lifetime values sounds good to me. Is this perhaps what you had in mind?

Thanks,

Bart.

--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -100,6 +100,14 @@ enum {
         IOPRIO_HINT_DEV_DURATION_LIMIT_5 = 5,
         IOPRIO_HINT_DEV_DURATION_LIMIT_6 = 6,
         IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7,
+       IOPRIO_HINT_DATA_LIFE_TIME_0 = 8,
+       IOPRIO_HINT_DATA_LIFE_TIME_1 = 9,
+       IOPRIO_HINT_DATA_LIFE_TIME_2 = 10,
+       IOPRIO_HINT_DATA_LIFE_TIME_3 = 11,
+       IOPRIO_HINT_DATA_LIFE_TIME_4 = 12,
+       IOPRIO_HINT_DATA_LIFE_TIME_5 = 13,
+       IOPRIO_HINT_DATA_LIFE_TIME_6 = 14,
+       IOPRIO_HINT_DATA_LIFE_TIME_7 = 15,
  };



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits
  2023-10-06  6:28   ` Kanchan Joshi
@ 2023-10-06 18:20     ` Bart Van Assche
  0 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-06 18:20 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Jens Axboe, linux-block, linux-scsi, linux-fsdevel,
	Martin K . Petersen, Christoph Hellwig, Niklas Cassel,
	Avri Altman, Bean Huo, Daejun Park, Damien Le Moal

On 10/5/23 23:28, Kanchan Joshi wrote:
> On Thu, Oct 05, 2023 at 12:40:47PM -0700, Bart Van Assche wrote:
>> A later patch will store the data lifetime in the bio->bi_ioprio member
>> before bio_set_ioprio() is called. Make sure that bio_set_ioprio()
>> doesn't clear more bits than necessary.
> 
> Only lifetime bits need to be retained, but the patch retains the CDL
> bits too. Is that intentional?

Hi Kanchan,

Yes, this is intentional.

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/15] blk-ioprio: Modify fewer bio->bi_ioprio bits
  2023-10-06  6:36   ` Kanchan Joshi
@ 2023-10-06 18:25     ` Bart Van Assche
  0 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-06 18:25 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Jens Axboe, linux-block, linux-scsi, linux-fsdevel,
	Martin K . Petersen, Christoph Hellwig, Niklas Cassel,
	Avri Altman, Bean Huo, Daejun Park, Damien Le Moal

On 10/5/23 23:36, Kanchan Joshi wrote:
> On Thu, Oct 05, 2023 at 12:40:48PM -0700, Bart Van Assche wrote:
>> A later patch will store the data lifetime in the bio->bi_ioprio member
>> before the blk-ioprio policy is applied. Make sure that this policy 
>> doesn't
>> clear more bits than necessary.
>>
>> Cc: Damien Le Moal <dlemoal@kernel.org>
>> Cc: Niklas Cassel <niklas.cassel@wdc.com>
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>> ---
>> block/blk-ioprio.c | 9 +++++----
>> 1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
>> index 4051fada01f1..2db86f153b6d 100644
>> --- a/block/blk-ioprio.c
>> +++ b/block/blk-ioprio.c
>> @@ -202,7 +202,8 @@ void blkcg_set_ioprio(struct bio *bio)
>>          * to achieve this.
>>          */
>>         if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) != IOPRIO_CLASS_RT)
>> -            bio->bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4);
>> +            ioprio_set_class_and_level(&bio->bi_ioprio,
>> +                    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4));
>>         return;
>>     }
>>
>> @@ -213,10 +214,10 @@ void blkcg_set_ioprio(struct bio *bio)
>>      * If the bio I/O priority equals IOPRIO_CLASS_NONE, the cgroup I/O
>>      * priority is assigned to the bio.
>>      */
>> -    prio = max_t(u16, bio->bi_ioprio,
>> +    prio = max_t(u16, bio->bi_ioprio & IOPRIO_CLASS_LEVEL_MASK,
>>             IOPRIO_PRIO_VALUE(blkcg->prio_policy, 0));
> 
> All 9 bits (including CDL) are not taking part in this decision making
> now. Maybe you want to exclude only lifetime bits.
> 
>> -    if (prio > bio->bi_ioprio)
>> -        bio->bi_ioprio = prio;
>> +    if (prio > (bio->bi_ioprio & IOPRIO_CLASS_LEVEL_MASK))
>> +        ioprio_set_class_and_level(&bio->bi_ioprio, prio);
> 
> Same as above.

Hi Kanchan,

It is intentional that the CDL bits are left out from these decisions.
I think the decisions made in this policy should be based on the I/O
priority class and level only and not on the CDL value.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits
  2023-10-05 19:40 ` [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits Bart Van Assche
  2023-10-06  6:28   ` Kanchan Joshi
@ 2023-10-10  5:22   ` Kanchan Joshi
  2023-10-11 16:52     ` Bart Van Assche
  1 sibling, 1 reply; 45+ messages in thread
From: Kanchan Joshi @ 2023-10-10  5:22 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Damien Le Moal

On 10/6/2023 1:10 AM, Bart Van Assche wrote:
> A later patch will store the data lifetime in the bio->bi_ioprio member
> before bio_set_ioprio() is called. Make sure that bio_set_ioprio()
> doesn't clear more bits than necessary.
> 
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Niklas Cassel <niklas.cassel@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/blk-mq.c         |  3 ++-
>   include/linux/ioprio.h | 10 ++++++++++
>   2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e2d11183f62e..bc086660ffd3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2926,7 +2926,8 @@ static void bio_set_ioprio(struct bio *bio)
>   {
>   	/* Nobody set ioprio so far? Initialize it based on task's nice value */
>   	if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
> -		bio->bi_ioprio = get_current_ioprio();
> +		ioprio_set_class_and_level(&bio->bi_ioprio,
> +					   get_current_ioprio());
>   	blkcg_set_ioprio(bio);
>   }
>   
> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> index 7578d4f6a969..f2e768ab4b35 100644
> --- a/include/linux/ioprio.h
> +++ b/include/linux/ioprio.h
> @@ -71,4 +71,14 @@ static inline int ioprio_check_cap(int ioprio)
>   }
>   #endif /* CONFIG_BLOCK */
>   
> +#define IOPRIO_CLASS_LEVEL_MASK ((IOPRIO_CLASS_MASK << IOPRIO_CLASS_SHIFT) | \
> +				 (IOPRIO_LEVEL_MASK << 0))
> +
> +static inline void ioprio_set_class_and_level(u16 *prio, u16 class_level)
> +{
> +	WARN_ON_ONCE(class_level & ~IOPRIO_CLASS_LEVEL_MASK);
> +	*prio &= ~IOPRIO_CLASS_LEVEL_MASK;
> +	*prio |= class_level;

Return of get_current_ioprio() will touch all 16 bits here. So 
user-defined value can alter whatever was set in bio by F2FS (patch 4 in 
this series). Is that not an issue?

And what is the user interface you have in mind. Is it ioprio based, or 
write-hint based or mix of both?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 04/15] fs: Restore write hint support
  2023-10-05 19:40 ` [PATCH v2 04/15] fs: Restore write hint support Bart Van Assche
@ 2023-10-10  5:42   ` Kanchan Joshi
  2023-10-11 16:56     ` Bart Van Assche
  2023-10-16  6:20   ` Christoph Hellwig
  1 sibling, 1 reply; 45+ messages in thread
From: Kanchan Joshi @ 2023-10-10  5:42 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Jaegeuk Kim, Chao Yu, Darrick J. Wong,
	Alexander Viro, Christian Brauner

On 10/6/2023 1:10 AM, Bart Van Assche wrote:
> This patch reverts a small subset of commit c75e707fe1aa ("block: remove
> the per-bio/request write hint"). The following functionality has been
> restored:
> - In F2FS, store data lifetime information in struct bio.
> - In fs/iomap and fs/mpage.c, restore the code that sets the data
>    lifetime.
> 
> A new header file is introduced for the new bio_[sg]et_data_lifetime()
> functions because there is no other header file yet that includes both
> <linux/fs.h> and <linux/ioprio.h>.
> 
> The value WRITE_LIFE_NONE is mapped onto the data lifetime 0. This is
> consistent with NVMe TPAR4093a. From that TPAR: "A value of 1h specifies
> the shortest Data Lifetime. A value of 3Fh specifies the longest Data
> Lifetime." This is also consistent with the SCSI specifications. From
> T10 document 23-024r3: "0h: no relative lifetime is applicable; 1h:
> shortest relative lifetime; ...; 3fh: longest relative lifetime".
> 
> Cc: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   fs/f2fs/data.c              |  3 +++
>   fs/iomap/buffered-io.c      |  3 +++
>   fs/mpage.c                  |  2 ++
>   include/linux/fs-lifetime.h | 20 ++++++++++++++++++++
>   4 files changed, 28 insertions(+)
>   create mode 100644 include/linux/fs-lifetime.h
> 
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 916e317ac925..2962cb335897 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -6,6 +6,7 @@
>    *             http://www.samsung.com/
>    */
>   #include <linux/fs.h>
> +#include <linux/fs-lifetime.h>
>   #include <linux/f2fs_fs.h>
>   #include <linux/buffer_head.h>
>   #include <linux/sched/mm.h>
> @@ -478,6 +479,8 @@ static struct bio *__bio_alloc(struct f2fs_io_info *fio, int npages)
>   	} else {
>   		bio->bi_end_io = f2fs_write_end_io;
>   		bio->bi_private = sbi;
> +		bio_set_data_lifetime(bio,
> +			f2fs_io_type_to_rw_hint(sbi, fio->type, fio->temp));
>   	}
>   	iostat_alloc_and_bind_ctx(sbi, bio, NULL);
>   
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 644479ccefbd..9bf05342ca65 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -6,6 +6,7 @@
>   #include <linux/module.h>
>   #include <linux/compiler.h>
>   #include <linux/fs.h>
> +#include <linux/fs-lifetime.h>
>   #include <linux/iomap.h>
>   #include <linux/pagemap.h>
>   #include <linux/uio.h>
> @@ -1660,6 +1661,7 @@ iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc,
>   			       REQ_OP_WRITE | wbc_to_write_flags(wbc),
>   			       GFP_NOFS, &iomap_ioend_bioset);
>   	bio->bi_iter.bi_sector = sector;
> +	bio_set_data_lifetime(bio, inode->i_write_hint);
>   	wbc_init_bio(wbc, bio);
>   
>   	ioend = container_of(bio, struct iomap_ioend, io_inline_bio);
> @@ -1690,6 +1692,7 @@ iomap_chain_bio(struct bio *prev)
>   	new = bio_alloc(prev->bi_bdev, BIO_MAX_VECS, prev->bi_opf, GFP_NOFS);
>   	bio_clone_blkg_association(new, prev);
>   	new->bi_iter.bi_sector = bio_end_sector(prev);
> +	bio_set_data_lifetime(new, bio_get_data_lifetime(prev));
>   
>   	bio_chain(prev, new);
>   	bio_get(prev);		/* for iomap_finish_ioend */
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 242e213ee064..888ca71c9ea7 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -20,6 +20,7 @@
>   #include <linux/gfp.h>
>   #include <linux/bio.h>
>   #include <linux/fs.h>
> +#include <linux/fs-lifetime.h>
>   #include <linux/buffer_head.h>
>   #include <linux/blkdev.h>
>   #include <linux/highmem.h>
> @@ -612,6 +613,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
>   				GFP_NOFS);
>   		bio->bi_iter.bi_sector = blocks[0] << (blkbits - 9);
>   		wbc_init_bio(wbc, bio);
> +		bio_set_data_lifetime(bio, inode->i_write_hint);
>   	}
>   
>   	/*
> diff --git a/include/linux/fs-lifetime.h b/include/linux/fs-lifetime.h
> new file mode 100644
> index 000000000000..0e652e00cfab
> --- /dev/null
> +++ b/include/linux/fs-lifetime.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <linux/bio.h>
> +#include <linux/fs.h>
> +#include <linux/ioprio.h>
> +
> +static inline enum rw_hint bio_get_data_lifetime(struct bio *bio)
> +{
> +	/* +1 to map 0 onto WRITE_LIFE_NONE. */
> +	return IOPRIO_PRIO_LIFETIME(bio->bi_ioprio) + 1;
> +}
> +
> +static inline void bio_set_data_lifetime(struct bio *bio, enum rw_hint lifetime)
> +{
> +	/* -1 to map WRITE_LIFE_NONE onto 0. */
> +	if (lifetime != 0)
> +		lifetime--;

How the driver can figure when lifetime is not set, and when it is set 
to WRITE_LIFE_NONE? If it uses IOPRIO_PRIO_LIFETIME (as patch 8 does), 
it will see 0 in both cases.
F2FS fs-based whint_mode seems to expect distinct streams for 
WRITE_LIFE_NOT_SET and WRITE_LIFE_NONE.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits
  2023-10-10  5:22   ` Kanchan Joshi
@ 2023-10-11 16:52     ` Bart Van Assche
  2023-10-12  8:49       ` Kanchan Joshi
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2023-10-11 16:52 UTC (permalink / raw)
  To: Kanchan Joshi, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Damien Le Moal

On 10/9/23 22:22, Kanchan Joshi wrote:
> On 10/6/2023 1:10 AM, Bart Van Assche wrote:
>> A later patch will store the data lifetime in the bio->bi_ioprio member
>> before bio_set_ioprio() is called. Make sure that bio_set_ioprio()
>> doesn't clear more bits than necessary.
>>
>> Cc: Damien Le Moal <dlemoal@kernel.org>
>> Cc: Niklas Cassel <niklas.cassel@wdc.com>
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>> ---
>>    block/blk-mq.c         |  3 ++-
>>    include/linux/ioprio.h | 10 ++++++++++
>>    2 files changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index e2d11183f62e..bc086660ffd3 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2926,7 +2926,8 @@ static void bio_set_ioprio(struct bio *bio)
>>    {
>>    	/* Nobody set ioprio so far? Initialize it based on task's nice value */
>>    	if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
>> -		bio->bi_ioprio = get_current_ioprio();
>> +		ioprio_set_class_and_level(&bio->bi_ioprio,
>> +					   get_current_ioprio());
>>    	blkcg_set_ioprio(bio);
>>    }
>>    
>> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
>> index 7578d4f6a969..f2e768ab4b35 100644
>> --- a/include/linux/ioprio.h
>> +++ b/include/linux/ioprio.h
>> @@ -71,4 +71,14 @@ static inline int ioprio_check_cap(int ioprio)
>>    }
>>    #endif /* CONFIG_BLOCK */
>>    
>> +#define IOPRIO_CLASS_LEVEL_MASK ((IOPRIO_CLASS_MASK << IOPRIO_CLASS_SHIFT) | \
>> +				 (IOPRIO_LEVEL_MASK << 0))
>> +
>> +static inline void ioprio_set_class_and_level(u16 *prio, u16 class_level)
>> +{
>> +	WARN_ON_ONCE(class_level & ~IOPRIO_CLASS_LEVEL_MASK);
>> +	*prio &= ~IOPRIO_CLASS_LEVEL_MASK;
>> +	*prio |= class_level;
> 
> Return of get_current_ioprio() will touch all 16 bits here. So
> user-defined value can alter whatever was set in bio by F2FS (patch 4 in
> this series). Is that not an issue?

The above is incomprehensible to me. Anyway, I will try to answer.

It is not clear to me why it is claimed that "get_current_ioprio() will
touch all 16 bits here"? The return value of get_current_ioprio() is
passed to ioprio_set_class_and_level() and that function clears the hint
bits from the get_current_ioprio() return value.

ioprio_set_class_and_level() preserves the hint bits set by F2FS.

> And what is the user interface you have in mind. Is it ioprio based, or
> write-hint based or mix of both?

Since the data lifetime is encoded in the hint bits, the hint bits need
to be set by user space to set a data lifetime. In case you would help,
the blktest test that I wrote to test this functionality is available
below.

Thanks,

Bart.


diff --git a/tests/scsi/097 b/tests/scsi/097
new file mode 100755
index 000000000000..01d280021653
--- /dev/null
+++ b/tests/scsi/097
@@ -0,0 +1,62 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2022 Google LLC
+
+. tests/zbd/rc
+. common/null_blk
+. common/scsi_debug
+
+DESCRIPTION="test block data lifetime support"
+QUICK=1
+
+requires() {
+	_have_fio
+	_have_module scsi_debug
+}
+
+test() {
+	echo "Running ${TEST_NAME}"
+
+	local scsi_debug_params=(
+		delay=0
+		dev_size_mb=1024
+		sector_size=4096
+	)
+	_init_scsi_debug "${scsi_debug_params[@]}" &&
+	local dev="/dev/${SCSI_DEBUG_DEVICES[0]}" fail &&
+	ls -ld "${dev}" >>"${FULL}" &&
+	local i fio_args=(
+		--group_reporting=1
+		--gtod_reduce=1
+		--ioscheduler=none
+		--norandommap
+		--direct=1
+		--rw=randwrite
+		--ioengine=io_uring
+		--time_based
+	) &&
+	for ((i=1; i<=3; i++)); do
+		fio_args+=(
+			--name=whint"$i"
+			--filename="${dev}"
+			--prio=$((i<<6))
+			--time_based
+			--runtime=10
+		)
+	done &&
+	_run_fio "${fio_args[@]}" >>"${FULL}" 2>&1 ||
+	fail=true
+
+	head -n 999 /sys/bus/pseudo/drivers/scsi_debug/group_number_stats >> 
"${FULL}"
+	while read -r group count; do
+		if [ "$count" -gt 0 ]; then echo "$group"; fi
+	done < /sys/bus/pseudo/drivers/scsi_debug/group_number_stats
+	_exit_scsi_debug
+
+	if [ -z "$fail" ]; then
+		echo "Test complete"
+	else
+		echo "Test failed"
+		return 1
+	fi
+}
diff --git a/tests/scsi/097.out b/tests/scsi/097.out
new file mode 100644
index 000000000000..f6ed3d6de861
--- /dev/null
+++ b/tests/scsi/097.out
@@ -0,0 +1,5 @@
+Running scsi/097
+1
+2
+3
+Test complete


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 04/15] fs: Restore write hint support
  2023-10-10  5:42   ` Kanchan Joshi
@ 2023-10-11 16:56     ` Bart Van Assche
  0 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-11 16:56 UTC (permalink / raw)
  To: Kanchan Joshi, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Jaegeuk Kim, Chao Yu, Darrick J. Wong,
	Alexander Viro, Christian Brauner

On 10/9/23 22:42, Kanchan Joshi wrote:
> On 10/6/2023 1:10 AM, Bart Van Assche wrote:
>> +static inline enum rw_hint bio_get_data_lifetime(struct bio *bio)
>> +{
>> +	/* +1 to map 0 onto WRITE_LIFE_NONE. */
>> +	return IOPRIO_PRIO_LIFETIME(bio->bi_ioprio) + 1;
>> +}
>> +
>> +static inline void bio_set_data_lifetime(struct bio *bio, enum rw_hint lifetime)
>> +{
>> +	/* -1 to map WRITE_LIFE_NONE onto 0. */
>> +	if (lifetime != 0)
>> +		lifetime--;
> 
> How the driver can figure when lifetime is not set, and when it is set
> to WRITE_LIFE_NONE? If it uses IOPRIO_PRIO_LIFETIME (as patch 8 does),
> it will see 0 in both cases.
> F2FS fs-based whint_mode seems to expect distinct streams for
> WRITE_LIFE_NOT_SET and WRITE_LIFE_NONE.

I will remove the -1 / +1 from the above code.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-06 18:07     ` Bart Van Assche
@ 2023-10-11 20:51       ` Bart Van Assche
  2023-10-12  1:02         ` Damien Le Moal
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2023-10-11 20:51 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Hannes Reinecke

On 10/6/23 11:07, Bart Van Assche wrote:
> On 10/6/23 01:19, Damien Le Moal wrote:
>> Your change seem to assume that it makes sense to be able to combine 
>> CDL with
>> lifetime hints. But does it really ? CDL is of dubious value for solid 
>> state
>> media and as far as I know, UFS world has not expressed interest. 
>> Conversely,
>> data lifetime hints do not make much sense for spin rust media where 
>> CDL is
>> important. So I would say that the combination of CDL and lifetime 
>> hints is of
>> dubious value.
>>
>> Given this, why not simply define the 64 possible lifetime values as 
>> plain hint
>> values (8 to 71, following 1 to 7 for CDL) ?
>>
>> The other question here if you really want to keep the bit separation 
>> approach
>> is: do we really need up to 64 different lifetime hints ? While the scsi
>> standard allows that much, does this many different lifetime make 
>> sense in
>> practice ? Can we ever think of a usecase that needs more than say 8 
>> different
>> liftimes (3 bits) ? If you limit the number of possible lifetime hints 
>> to 8,
>> then we can keep 4 bits unused in the hint field for future features.
> 
> Hi Damien,
> 
> Not supporting CDL for solid state media and supporting eight different
> lifetime values sounds good to me. Is this perhaps what you had in mind?
> 
> Thanks,
> 
> Bart.
> 
> --- a/include/uapi/linux/ioprio.h
> +++ b/include/uapi/linux/ioprio.h
> @@ -100,6 +100,14 @@ enum {
>          IOPRIO_HINT_DEV_DURATION_LIMIT_5 = 5,
>          IOPRIO_HINT_DEV_DURATION_LIMIT_6 = 6,
>          IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7,
> +       IOPRIO_HINT_DATA_LIFE_TIME_0 = 8,
> +       IOPRIO_HINT_DATA_LIFE_TIME_1 = 9,
> +       IOPRIO_HINT_DATA_LIFE_TIME_2 = 10,
> +       IOPRIO_HINT_DATA_LIFE_TIME_3 = 11,
> +       IOPRIO_HINT_DATA_LIFE_TIME_4 = 12,
> +       IOPRIO_HINT_DATA_LIFE_TIME_5 = 13,
> +       IOPRIO_HINT_DATA_LIFE_TIME_6 = 14,
> +       IOPRIO_HINT_DATA_LIFE_TIME_7 = 15,
>   };

(replying to my own e-mail)

Hi Damien,

Does the above look good to you?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-11 20:51       ` Bart Van Assche
@ 2023-10-12  1:02         ` Damien Le Moal
  2023-10-12 18:00           ` Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Damien Le Moal @ 2023-10-12  1:02 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Hannes Reinecke

On 10/12/23 05:51, Bart Van Assche wrote:
> On 10/6/23 11:07, Bart Van Assche wrote:
>> On 10/6/23 01:19, Damien Le Moal wrote:
>>> Your change seem to assume that it makes sense to be able to combine CDL with
>>> lifetime hints. But does it really ? CDL is of dubious value for solid state
>>> media and as far as I know, UFS world has not expressed interest. Conversely,
>>> data lifetime hints do not make much sense for spin rust media where CDL is
>>> important. So I would say that the combination of CDL and lifetime hints is of
>>> dubious value.
>>>
>>> Given this, why not simply define the 64 possible lifetime values as plain hint
>>> values (8 to 71, following 1 to 7 for CDL) ?
>>>
>>> The other question here if you really want to keep the bit separation approach
>>> is: do we really need up to 64 different lifetime hints ? While the scsi
>>> standard allows that much, does this many different lifetime make sense in
>>> practice ? Can we ever think of a usecase that needs more than say 8 different
>>> liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8,
>>> then we can keep 4 bits unused in the hint field for future features.
>>
>> Hi Damien,
>>
>> Not supporting CDL for solid state media and supporting eight different
>> lifetime values sounds good to me. Is this perhaps what you had in mind?
>>
>> Thanks,
>>
>> Bart.
>>
>> --- a/include/uapi/linux/ioprio.h
>> +++ b/include/uapi/linux/ioprio.h
>> @@ -100,6 +100,14 @@ enum {
>>          IOPRIO_HINT_DEV_DURATION_LIMIT_5 = 5,
>>          IOPRIO_HINT_DEV_DURATION_LIMIT_6 = 6,
>>          IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7,
>> +       IOPRIO_HINT_DATA_LIFE_TIME_0 = 8,
>> +       IOPRIO_HINT_DATA_LIFE_TIME_1 = 9,
>> +       IOPRIO_HINT_DATA_LIFE_TIME_2 = 10,
>> +       IOPRIO_HINT_DATA_LIFE_TIME_3 = 11,
>> +       IOPRIO_HINT_DATA_LIFE_TIME_4 = 12,
>> +       IOPRIO_HINT_DATA_LIFE_TIME_5 = 13,
>> +       IOPRIO_HINT_DATA_LIFE_TIME_6 = 14,
>> +       IOPRIO_HINT_DATA_LIFE_TIME_7 = 15,
>>   };
> 
> (replying to my own e-mail)
> 
> Hi Damien,
> 
> Does the above look good to you?

Yes, it is what I was thinking of and I think it looks much better (and
simpler) than coding with different bits. However, I think this is acceptable
only if everyone agrees that combining CDL and lifetime (and potentially other
hints) is not a sensible thing to do. I stated that my thinking is that CDL is
more geared toward rotating media while lifetime is more suitable for solid
state media. But does everyone agree ?

Some have stated interest in CDL in NVMe-oF context, which could imply that
combining CDL and lifetime may be something useful to do in that space...

Getting more opinions about this would be nice. If we do not get any, I would
say that we should go with this.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits
  2023-10-11 16:52     ` Bart Van Assche
@ 2023-10-12  8:49       ` Kanchan Joshi
  2023-10-12 14:03         ` Niklas Cassel
  2023-10-12 17:42         ` Bart Van Assche
  0 siblings, 2 replies; 45+ messages in thread
From: Kanchan Joshi @ 2023-10-12  8:49 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Damien Le Moal

On 10/11/2023 10:22 PM, Bart Van Assche wrote:
>>> @@ -2926,7 +2926,8 @@ static void bio_set_ioprio(struct bio *bio)
>>>    {
>>>        /* Nobody set ioprio so far? Initialize it based on task's 
>>> nice value */
>>>        if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
>>> -        bio->bi_ioprio = get_current_ioprio();
>>> +        ioprio_set_class_and_level(&bio->bi_ioprio,
>>> +                       get_current_ioprio());
>>>        blkcg_set_ioprio(bio);
>>>    }
>>> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
>>> index 7578d4f6a969..f2e768ab4b35 100644
>>> --- a/include/linux/ioprio.h
>>> +++ b/include/linux/ioprio.h
>>> @@ -71,4 +71,14 @@ static inline int ioprio_check_cap(int ioprio)
>>>    }
>>>    #endif /* CONFIG_BLOCK */
>>> +#define IOPRIO_CLASS_LEVEL_MASK ((IOPRIO_CLASS_MASK << 
>>> IOPRIO_CLASS_SHIFT) | \
>>> +                 (IOPRIO_LEVEL_MASK << 0))
>>> +
>>> +static inline void ioprio_set_class_and_level(u16 *prio, u16 
>>> class_level)
>>> +{
>>> +    WARN_ON_ONCE(class_level & ~IOPRIO_CLASS_LEVEL_MASK);
>>> +    *prio &= ~IOPRIO_CLASS_LEVEL_MASK;
>>> +    *prio |= class_level;
>>
>> Return of get_current_ioprio() will touch all 16 bits here. So
>> user-defined value can alter whatever was set in bio by F2FS (patch 4 in
>> this series). Is that not an issue?
> 
> The above is incomprehensible to me. Anyway, I will try to answer.
> 
> It is not clear to me why it is claimed that "get_current_ioprio() will
> touch all 16 bits here"? The return value of get_current_ioprio() is
> passed to ioprio_set_class_and_level() and that function clears the hint
> bits from the get_current_ioprio() return value.

Function does OR bio->bi_ioprio with whatever is the return of 
get_current_ioprio(). So if lifetime bits were set in 
get_current_ioprio(), you will end up setting that in bio->bi_ioprio too.


> ioprio_set_class_and_level() preserves the hint bits set by F2FS.
> 
>> And what is the user interface you have in mind. Is it ioprio based, or
>> write-hint based or mix of both?
> 
> Since the data lifetime is encoded in the hint bits, the hint bits need
> to be set by user space to set a data lifetime.

I asked because more than one way seems to emerge here. Parts of this 
series (Patch 4) are taking inode->i_write_hint (and not ioprio value) 
and putting that into bio.
I wonder what to expect if application get to send one lifetime with 
fcntl (write-hints) and different one with ioprio. Is that not racy?


> In case you would help,
> the blktest test that I wrote to test this functionality is available
> below.
> 
> Thanks,
> 
> Bart.
> 
> 
> diff --git a/tests/scsi/097 b/tests/scsi/097
> new file mode 100755
> index 000000000000..01d280021653
> --- /dev/null
> +++ b/tests/scsi/097
> @@ -0,0 +1,62 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-3.0+
> +# Copyright (C) 2022 Google LLC
> +
> +. tests/zbd/rc
> +. common/null_blk
> +. common/scsi_debug
> +
> +DESCRIPTION="test block data lifetime support"
> +QUICK=1
> +
> +requires() {
> +    _have_fio
> +    _have_module scsi_debug
> +}
> +
> +test() {
> +    echo "Running ${TEST_NAME}"
> +
> +    local scsi_debug_params=(
> +        delay=0
> +        dev_size_mb=1024
> +        sector_size=4096
> +    )
> +    _init_scsi_debug "${scsi_debug_params[@]}" &&
> +    local dev="/dev/${SCSI_DEBUG_DEVICES[0]}" fail &&
> +    ls -ld "${dev}" >>"${FULL}" &&
> +    local i fio_args=(
> +        --group_reporting=1
> +        --gtod_reduce=1
> +        --ioscheduler=none
> +        --norandommap
> +        --direct=1
> +        --rw=randwrite
> +        --ioengine=io_uring
> +        --time_based
> +    ) &&
> +    for ((i=1; i<=3; i++)); do
> +        fio_args+=(
> +            --name=whint"$i"
> +            --filename="${dev}"
> +            --prio=$((i<<6))

This will not work as prio can only take values between 0-7.
Perhaps you want to use "priohint" to send lifetime.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits
  2023-10-12  8:49       ` Kanchan Joshi
@ 2023-10-12 14:03         ` Niklas Cassel
  2023-10-12 17:42         ` Bart Van Assche
  1 sibling, 0 replies; 45+ messages in thread
From: Niklas Cassel @ 2023-10-12 14:03 UTC (permalink / raw)
  To: Kanchan Joshi
  Cc: Bart Van Assche, Jens Axboe, linux-block, linux-scsi,
	linux-fsdevel, Martin K . Petersen, Christoph Hellwig,
	Avri Altman, Bean Huo, Daejun Park, Damien Le Moal

On Thu, Oct 12, 2023 at 02:19:02PM +0530, Kanchan Joshi wrote:
> On 10/11/2023 10:22 PM, Bart Van Assche wrote:
> >>> @@ -2926,7 +2926,8 @@ static void bio_set_ioprio(struct bio *bio)
> >>>    {
> >>>        /* Nobody set ioprio so far? Initialize it based on task's 
> >>> nice value */
> >>>        if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
> >>> -        bio->bi_ioprio = get_current_ioprio();
> >>> +        ioprio_set_class_and_level(&bio->bi_ioprio,
> >>> +                       get_current_ioprio());
> >>>        blkcg_set_ioprio(bio);
> >>>    }
> >>> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> >>> index 7578d4f6a969..f2e768ab4b35 100644
> >>> --- a/include/linux/ioprio.h
> >>> +++ b/include/linux/ioprio.h
> >>> @@ -71,4 +71,14 @@ static inline int ioprio_check_cap(int ioprio)
> >>>    }
> >>>    #endif /* CONFIG_BLOCK */
> >>> +#define IOPRIO_CLASS_LEVEL_MASK ((IOPRIO_CLASS_MASK << 
> >>> IOPRIO_CLASS_SHIFT) | \
> >>> +                 (IOPRIO_LEVEL_MASK << 0))
> >>> +
> >>> +static inline void ioprio_set_class_and_level(u16 *prio, u16 
> >>> class_level)
> >>> +{
> >>> +    WARN_ON_ONCE(class_level & ~IOPRIO_CLASS_LEVEL_MASK);
> >>> +    *prio &= ~IOPRIO_CLASS_LEVEL_MASK;
> >>> +    *prio |= class_level;
> >>
> >> Return of get_current_ioprio() will touch all 16 bits here. So
> >> user-defined value can alter whatever was set in bio by F2FS (patch 4 in
> >> this series). Is that not an issue?
> > 
> > The above is incomprehensible to me. Anyway, I will try to answer.
> > 
> > It is not clear to me why it is claimed that "get_current_ioprio() will
> > touch all 16 bits here"? The return value of get_current_ioprio() is
> > passed to ioprio_set_class_and_level() and that function clears the hint
> > bits from the get_current_ioprio() return value.
> 
> Function does OR bio->bi_ioprio with whatever is the return of 
> get_current_ioprio(). So if lifetime bits were set in 
> get_current_ioprio(), you will end up setting that in bio->bi_ioprio too.
> 
> 
> > ioprio_set_class_and_level() preserves the hint bits set by F2FS.
> > 
> >> And what is the user interface you have in mind. Is it ioprio based, or
> >> write-hint based or mix of both?
> > 
> > Since the data lifetime is encoded in the hint bits, the hint bits need
> > to be set by user space to set a data lifetime.
> 
> I asked because more than one way seems to emerge here. Parts of this 
> series (Patch 4) are taking inode->i_write_hint (and not ioprio value) 
> and putting that into bio.
> I wonder what to expect if application get to send one lifetime with 
> fcntl (write-hints) and different one with ioprio. Is that not racy?

Hello Kanchan,

The fcntl F_SET_RW_HINT still exists, which sets inode->i_write_hint.
This is currently only used by f2fs.

The usage of inode->i_write_hint was removed from all filesystems
(except f2fs) in:
c75e707fe1aa ("block: remove the per-bio/request write hint").
This commit also removed bi_write_hint from struct bio.

The fcntl F_SET_FILE_RW_HINT, which used to set f->f_write_hint was removed
in:
7b12e49669c9 ("fs: remove fs.f_write_hint")
This commit also removed f_write_hint from struct file.

My thinking when suggesting to reuse ioprio hints, was that we don't need
to readd write_hint struct members to struct bio and struct request.

SCSI can just reuse the hints bits in ioprio.



Note that my filesystem knowledge is not the best...

But for f2fs, I guess it just needs to set the bio->ioprio hint bits
correctly.

I guess the confusion is if an application does both:
ioprio_set() and fcntl(.., F_SET_RW_HINT, ..), what should the filesystem
use?

Or.. if you use e.g. io_uring to write to a file stored on f2fs...
io_uring has sqe->ioprio, which can contain a write hint, does this get
propagated to the filesystem?
And if so, what if you also did fcntl(.., F_SET_RW_HINT, ..) to set
i_write_hint? Which should the filesystem use to set bio->ioprio?


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits
  2023-10-12  8:49       ` Kanchan Joshi
  2023-10-12 14:03         ` Niklas Cassel
@ 2023-10-12 17:42         ` Bart Van Assche
  1 sibling, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-12 17:42 UTC (permalink / raw)
  To: Kanchan Joshi, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Damien Le Moal

On 10/12/23 01:49, Kanchan Joshi wrote:
> Function does OR bio->bi_ioprio with whatever is the return of
> get_current_ioprio().

No, that's not what ioprio_set_class_and_level() does. It clears the 
hint bits before it performs a logical OR.

> So if lifetime bits were set in get_current_ioprio(), you will end up
> setting that in bio->bi_ioprio too.
I'm not sure there are any use cases where it is useful to set the data
lifetime for an entire process.

Anyway, how about replacing this patch with the patch below? This will
allow to set hint information for an entire process.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e2d11183f62e..3419ca4c1bf4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2924,9 +2924,14 @@ static inline struct request 
*blk_mq_get_cached_request(struct request_queue *q,

  static void bio_set_ioprio(struct bio *bio)
  {
+	u16 cur_ioprio = get_current_ioprio();
+
  	/* Nobody set ioprio so far? Initialize it based on task's nice value */
  	if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE)
-		bio->bi_ioprio = get_current_ioprio();
+		bio->bi_ioprio |= cur_ioprio & IOPRIO_CLASS_LEVEL_MASK;
+	if (IOPRIO_PRIO_HINT(bio->bi_ioprio) == 0)
+		bio->bi_ioprio |= cur_ioprio &
+			(IOPRIO_HINT_MASK << IOPRIO_HINT_SHIFT);
  	blkcg_set_ioprio(bio);
  }

diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 7578d4f6a969..5697832f35a3 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -71,4 +71,7 @@ static inline int ioprio_check_cap(int ioprio)
  }
  #endif /* CONFIG_BLOCK */

+#define IOPRIO_CLASS_LEVEL_MASK ((IOPRIO_CLASS_MASK << 
IOPRIO_CLASS_SHIFT) | \
+				 (IOPRIO_LEVEL_MASK << 0))
+
  #endif


>> ioprio_set_class_and_level() preserves the hint bits set by F2FS.
>>
>>> And what is the user interface you have in mind. Is it ioprio based, or
>>> write-hint based or mix of both?
>>
>> Since the data lifetime is encoded in the hint bits, the hint bits need
>> to be set by user space to set a data lifetime.
> 
> I asked because more than one way seems to emerge here. Parts of this
> series (Patch 4) are taking inode->i_write_hint (and not ioprio value)
> and putting that into bio.
> I wonder what to expect if application get to send one lifetime with
> fcntl (write-hints) and different one with ioprio. Is that not racy?

There is no race condition. F_SET_RW_HINT can be used to set 
inode->i_write_hint. The filesystem may use the inode->i_write_hint 
information. I think F2FS uses this information in its block allocator.

>> +            --prio=$((i<<6))
> 
> This will not work as prio can only take values between 0-7.
> Perhaps you want to use "priohint" to send lifetime.

Thanks for having mentioned the priohint option. The above works on my
test setup since I modified fio locally to accept larger prio values. I
will switch to the priohint option.

Bart.

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-12  1:02         ` Damien Le Moal
@ 2023-10-12 18:00           ` Bart Van Assche
  2023-10-13  1:08             ` Damien Le Moal
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2023-10-12 18:00 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Hannes Reinecke

On 10/11/23 18:02, Damien Le Moal wrote:
> Some have stated interest in CDL in NVMe-oF context, which could
> imply that combining CDL and lifetime may be something useful to do
> in that space...

We are having this discussion because bi_ioprio is sixteen bits wide and
because we don't want to make struct bio larger. How about expanding the
bi_ioprio field from 16 to 32 bits and to use separate bits for CDL
information and data lifetimes?

This patch does not make struct bio bigger because it changes a three
byte hole with a one byte hole:

--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -268,8 +268,8 @@ struct bio {
                                                  * req_flags.
                                                  */
         unsigned short          bi_flags;       /* BIO_* below */
-       unsigned short          bi_ioprio;
         blk_status_t            bi_status;
+       u32                     bi_ioprio;
         atomic_t                __bi_remaining;

         struct bvec_iter        bi_iter;


 From the pahole output:

struct bio {
         struct bio *               bi_next;              /*     0     8 */
         struct block_device *      bi_bdev;              /*     8     8 */
         blk_opf_t                  bi_opf;               /*    16     4 */
         short unsigned int         bi_flags;             /*    20     2 */
         blk_status_t               bi_status;            /*    22     1 */

         /* XXX 1 byte hole, try to pack */

         u32                        bi_ioprio;            /*    24     4 */
         atomic_t                   __bi_remaining;       /*    28     4 */
         struct bvec_iter           bi_iter;              /*    32    20 */
         blk_qc_t                   bi_cookie;            /*    52     4 */
         bio_end_io_t *             bi_end_io;            /*    56     8 */
         /* --- cacheline 1 boundary (64 bytes) --- */
         void *                     bi_private;           /*    64     8 */
         struct bio_crypt_ctx *     bi_crypt_context;     /*    72     8 */
         union {
                 struct bio_integrity_payload * bi_integrity; /*    80 
   8 */
         };                                               /*    80     8 */
         short unsigned int         bi_vcnt;              /*    88     2 */
         short unsigned int         bi_max_vecs;          /*    90     2 */
         atomic_t                   __bi_cnt;             /*    92     4 */
         struct bio_vec *           bi_io_vec;            /*    96     8 */
         struct bio_set *           bi_pool;              /*   104     8 */
         struct bio_vec             bi_inline_vecs[];     /*   112     0 */

         /* size: 112, cachelines: 2, members: 19 */
         /* sum members: 111, holes: 1, sum holes: 1 */
         /* last cacheline: 48 bytes */
};

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-12 18:00           ` Bart Van Assche
@ 2023-10-13  1:08             ` Damien Le Moal
  2023-10-13  9:33               ` Niklas Cassel
  2023-10-13 20:18               ` Bart Van Assche
  0 siblings, 2 replies; 45+ messages in thread
From: Damien Le Moal @ 2023-10-13  1:08 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Hannes Reinecke

On 10/13/23 03:00, Bart Van Assche wrote:
> On 10/11/23 18:02, Damien Le Moal wrote:
>> Some have stated interest in CDL in NVMe-oF context, which could
>> imply that combining CDL and lifetime may be something useful to do
>> in that space...
> 
> We are having this discussion because bi_ioprio is sixteen bits wide and
> because we don't want to make struct bio larger. How about expanding the
> bi_ioprio field from 16 to 32 bits and to use separate bits for CDL
> information and data lifetimes?

I guess we could do that as well. User side aio_reqprio field of struct aiocb,
which is used by io_uring and libaio, is an int, so 32-bits also. Changing
bi_ioprio to match that should not cause regressions or break user space I
think. Kernel uapi ioprio.h will need some massaging though.

> This patch does not make struct bio bigger because it changes a three
> byte hole with a one byte hole:

Yeah, but if the kernel is compiled with struct randomization, that does not
really apply, doesn't it ?

Reading Niklas's reply to Kanchan, I was reminded that using ioprio hint for
the lifetime may have one drawback: that information will be propagated to the
device only for direct IOs, no ? For buffered IOs, the information will be
lost. The other potential disadvantage of the ioprio interface is that we
cannot define ioprio+hint per file (or per inode really), unlike the old
write_hint that you initially reintroduced. Are these points blockers for the
user API you were thinking of ? How do you envision the user specifying
lifetime ? Per file ? Or are you thinking of not relying on the user to specify
that but rather the FS (e.g. f2fs) deciding on its own ? If it is the latter, I
think ioprio+hint is fine (it is simple). But if it is the former, the ioprio
API may not be the best suited for the job at hand.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-13  1:08             ` Damien Le Moal
@ 2023-10-13  9:33               ` Niklas Cassel
  2023-10-13 21:20                 ` Bart Van Assche
  2023-10-13 20:18               ` Bart Van Assche
  1 sibling, 1 reply; 45+ messages in thread
From: Niklas Cassel @ 2023-10-13  9:33 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Jens Axboe, linux-block, linux-scsi,
	linux-fsdevel, Martin K . Petersen, Christoph Hellwig,
	Avri Altman, Bean Huo, Daejun Park, Hannes Reinecke

On Fri, Oct 13, 2023 at 10:08:29AM +0900, Damien Le Moal wrote:
> On 10/13/23 03:00, Bart Van Assche wrote:
> > On 10/11/23 18:02, Damien Le Moal wrote:
> >> Some have stated interest in CDL in NVMe-oF context, which could
> >> imply that combining CDL and lifetime may be something useful to do
> >> in that space...
> > 
> > We are having this discussion because bi_ioprio is sixteen bits wide and
> > because we don't want to make struct bio larger. How about expanding the
> > bi_ioprio field from 16 to 32 bits and to use separate bits for CDL
> > information and data lifetimes?
> 
> I guess we could do that as well. User side aio_reqprio field of struct aiocb,
> which is used by io_uring and libaio, is an int, so 32-bits also. Changing
> bi_ioprio to match that should not cause regressions or break user space I
> think. Kernel uapi ioprio.h will need some massaging though.
> 
> > This patch does not make struct bio bigger because it changes a three
> > byte hole with a one byte hole:
> 
> Yeah, but if the kernel is compiled with struct randomization, that does not
> really apply, doesn't it ?
> 
> Reading Niklas's reply to Kanchan, I was reminded that using ioprio hint for
> the lifetime may have one drawback: that information will be propagated to the
> device only for direct IOs, no ? For buffered IOs, the information will be
> lost. The other potential disadvantage of the ioprio interface is that we
> cannot define ioprio+hint per file (or per inode really), unlike the old
> write_hint that you initially reintroduced. Are these points blockers for the
> user API you were thinking of ? How do you envision the user specifying
> lifetime ? Per file ? Or are you thinking of not relying on the user to specify
> that but rather the FS (e.g. f2fs) deciding on its own ? If it is the latter, I
> think ioprio+hint is fine (it is simple). But if it is the former, the ioprio
> API may not be the best suited for the job at hand.

Hello Damien,

If you look closer at this series, you will see that even V2 of this series
still uses fcntl F_SET_RW_HINT as the user facing API.

This series simply take the value from fcntl F_SET_RW_HINT
(inode->i_write_hint) and stores it in bio->ioprio.

So it is not really using the ioprio user API.

See the patch to e.g. buffered-io.c:

--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -6,6 +6,7 @@
 #include <linux/module.h>
 #include <linux/compiler.h>
 #include <linux/fs.h>
+#include <linux/fs-lifetime.h>
 #include <linux/iomap.h>
 #include <linux/pagemap.h>
 #include <linux/uio.h>
@@ -1660,6 +1661,7 @@ iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc,
 			       REQ_OP_WRITE | wbc_to_write_flags(wbc),
 			       GFP_NOFS, &iomap_ioend_bioset);
 	bio->bi_iter.bi_sector = sector;
+	bio_set_data_lifetime(bio, inode->i_write_hint);
 	wbc_init_bio(wbc, bio);
 
 	ioend = container_of(bio, struct iomap_ioend, io_inline_bio);




In commit c75e707fe1aa ("block: remove the per-bio/request write hint")
this line from fs/direct-io.c was removed:
-       bio->bi_write_hint = dio->iocb->ki_hint;

I'm not sure why this series does not readd a similar line to set the
lifetime (using bio_set_data_lifetime()) also for fs/direct-io.c.


I still don't understand what happens if one uses io_uring to write
to a file on a f2fs filesystem using buffered-io, with both
inode->i_write_hint set using fcntl F_SET_RW_HINT, and bits belonging
to life time hints set in the io_uring SQE (sqe->ioprio).

I'm guessing that the:
bio_set_data_lifetime(bio, inode->i_write_hint);
call above in buffered-io.c, will simply overwrite whatever value
that was already stored in bio->ioprio. (Because if I understand
correctly, bio->ioprio will initially be set to the value in the
io_uring SQE (sqe->ioprio).)


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-13  1:08             ` Damien Le Moal
  2023-10-13  9:33               ` Niklas Cassel
@ 2023-10-13 20:18               ` Bart Van Assche
  2023-10-15 22:22                 ` Damien Le Moal
  1 sibling, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2023-10-13 20:18 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Hannes Reinecke

On 10/12/23 18:08, Damien Le Moal wrote:
> On 10/13/23 03:00, Bart Van Assche wrote:
>> We are having this discussion because bi_ioprio is sixteen bits wide and
>> because we don't want to make struct bio larger. How about expanding the
>> bi_ioprio field from 16 to 32 bits and to use separate bits for CDL
>> information and data lifetimes?
> 
> I guess we could do that as well. User side aio_reqprio field of struct aiocb,
> which is used by io_uring and libaio, is an int, so 32-bits also. Changing
> bi_ioprio to match that should not cause regressions or break user space I
> think. Kernel uapi ioprio.h will need some massaging though.

Hmm ... are we perhaps looking at different kernel versions? This is
what I found:

$ git grep -nHE 'ioprio;|reqprio;' include/uapi/linux/{io_uring,aio_abi}.h
include/uapi/linux/aio_abi.h:89:	__s16	aio_reqprio;
include/uapi/linux/io_uring.h:33:	__u16	ioprio;		/* ioprio for the 
request */

The struct iocb used for asynchronous I/O has a size of 64 bytes and
does not have any holes. struct io_uring_sqe also has a size of 64 bytes
and does not have any holes either. The ioprio_set() and ioprio_get()
system calls use the data type int so these wouldn't need any changes to
increase the number of ioprio bits.

> Reading Niklas's reply to Kanchan, I was reminded that using ioprio hint for
> the lifetime may have one drawback: that information will be propagated to the
> device only for direct IOs, no ? For buffered IOs, the information will be
> lost. The other potential disadvantage of the ioprio interface is that we
> cannot define ioprio+hint per file (or per inode really), unlike the old
> write_hint that you initially reintroduced. Are these points blockers for the
> user API you were thinking of ? How do you envision the user specifying
> lifetime ? Per file ? Or are you thinking of not relying on the user to specify
> that but rather the FS (e.g. f2fs) deciding on its own ? If it is the latter, I
> think ioprio+hint is fine (it is simple). But if it is the former, the ioprio
> API may not be the best suited for the job at hand.

The way I see it is that the primary purpose of the bits in the
bi_ioprio member that are used for the data lifetime is to allow
filesystems to provide data lifetime information to block drivers.

Specifying data lifetime information for direct I/O is convenient when
writing test scripts that verify whether data lifetime supports works
correctly. There may be other use cases but this is not my primary
focus.

I think that applications that want to specify data lifetime information
should use fcntl(fd, F_SET_RW_HINT, ...). It is up to the filesystem to
make sure that this information ends up in the bi_ioprio field. The
block layer is responsible for passing the information in the bi_ioprio
member to block drivers. Filesystems can support multiple policies for
combining the i_write_hint and other information into a data lifetime.
See also the whint_mode restored by patch 05/15 in this series.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-13  9:33               ` Niklas Cassel
@ 2023-10-13 21:20                 ` Bart Van Assche
  2023-10-16  9:20                   ` Niklas Cassel
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2023-10-13 21:20 UTC (permalink / raw)
  To: Niklas Cassel, Damien Le Moal
  Cc: Jens Axboe, linux-block, linux-scsi, linux-fsdevel,
	Martin K . Petersen, Christoph Hellwig, Avri Altman, Bean Huo,
	Daejun Park, Hannes Reinecke

On 10/13/23 02:33, Niklas Cassel wrote:
> In commit c75e707fe1aa ("block: remove the per-bio/request write hint")
> this line from fs/direct-io.c was removed:
> -       bio->bi_write_hint = dio->iocb->ki_hint;
> 
> I'm not sure why this series does not readd a similar line to set the
> lifetime (using bio_set_data_lifetime()) also for fs/direct-io.c.

It depends on how we want the user to specify the data lifetime for
direct I/O. This assignment is not modified by this patch series and
copies the data lifetime information from the ioprio bitfield from user
space into the bio:

		bio->bi_ioprio = dio->iocb->ki_ioprio;

> I still don't understand what happens if one uses io_uring to write
> to a file on a f2fs filesystem using buffered-io, with both
> inode->i_write_hint set using fcntl F_SET_RW_HINT, and bits belonging
> to life time hints set in the io_uring SQE (sqe->ioprio).

Is the documentation of the whint_mode mount option in patch 5/15 of this
series sufficient to answer the above question?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-13 20:18               ` Bart Van Assche
@ 2023-10-15 22:22                 ` Damien Le Moal
  2023-10-16 16:31                   ` Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Damien Le Moal @ 2023-10-15 22:22 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Hannes Reinecke

On 10/14/23 05:18, Bart Van Assche wrote:
> On 10/12/23 18:08, Damien Le Moal wrote:
>> On 10/13/23 03:00, Bart Van Assche wrote:
>>> We are having this discussion because bi_ioprio is sixteen bits wide and
>>> because we don't want to make struct bio larger. How about expanding the
>>> bi_ioprio field from 16 to 32 bits and to use separate bits for CDL
>>> information and data lifetimes?
>>
>> I guess we could do that as well. User side aio_reqprio field of struct aiocb,
>> which is used by io_uring and libaio, is an int, so 32-bits also. Changing
>> bi_ioprio to match that should not cause regressions or break user space I
>> think. Kernel uapi ioprio.h will need some massaging though.
> 
> Hmm ... are we perhaps looking at different kernel versions? This is
> what I found:
> 
> $ git grep -nHE 'ioprio;|reqprio;' include/uapi/linux/{io_uring,aio_abi}.h
> include/uapi/linux/aio_abi.h:89:	__s16	aio_reqprio;
> include/uapi/linux/io_uring.h:33:	__u16	ioprio;		/* ioprio for the 
> request */

My bad. I looked at "man aio" but that is the posix AIO API, not Linux native.

> The struct iocb used for asynchronous I/O has a size of 64 bytes and
> does not have any holes. struct io_uring_sqe also has a size of 64 bytes
> and does not have any holes either. The ioprio_set() and ioprio_get()
> system calls use the data type int so these wouldn't need any changes to
> increase the number of ioprio bits.

Yes, but I think it would be better to keep the bio bi_ioprio field size synced
with the per AIO aio_reqprio/ioprio for libaio and io_uring, that is, 16-bits.

>> Reading Niklas's reply to Kanchan, I was reminded that using ioprio hint for
>> the lifetime may have one drawback: that information will be propagated to the
>> device only for direct IOs, no ? For buffered IOs, the information will be
>> lost. The other potential disadvantage of the ioprio interface is that we
>> cannot define ioprio+hint per file (or per inode really), unlike the old
>> write_hint that you initially reintroduced. Are these points blockers for the
>> user API you were thinking of ? How do you envision the user specifying
>> lifetime ? Per file ? Or are you thinking of not relying on the user to specify
>> that but rather the FS (e.g. f2fs) deciding on its own ? If it is the latter, I
>> think ioprio+hint is fine (it is simple). But if it is the former, the ioprio
>> API may not be the best suited for the job at hand.
> 
> The way I see it is that the primary purpose of the bits in the
> bi_ioprio member that are used for the data lifetime is to allow
> filesystems to provide data lifetime information to block drivers.
> 
> Specifying data lifetime information for direct I/O is convenient when
> writing test scripts that verify whether data lifetime supports works
> correctly. There may be other use cases but this is not my primary
> focus.
> 
> I think that applications that want to specify data lifetime information
> should use fcntl(fd, F_SET_RW_HINT, ...). It is up to the filesystem to
> make sure that this information ends up in the bi_ioprio field. The
> block layer is responsible for passing the information in the bi_ioprio
> member to block drivers. Filesystems can support multiple policies for
> combining the i_write_hint and other information into a data lifetime.
> See also the whint_mode restored by patch 05/15 in this series.

Explaining this in the cover letter of the series would be helpful for one to
understand your view of how the information is propagated from user to device.

I am not a fan of having a fcntl() call ending up modifying the ioprio of IOs
using hints, given that hints in themselves are already a user facing
information/API. This is confusing... What if we have a user issue direct IOs
with a lifetime value hint on a file that has a different lifetime set with
fcntl() ? And I am sure there are other corner cases like this.

Given that lifetime is per file (inode) and IO prio is per process or per I/O,
having different user APIs makes sense. The issue of not growing (if possible)
the bio and request structures remains. For bio, you identified a hole already,
so what about using another 16-bits field for lifetime ? Not sure for requests.
I thought also of a union with bi_ioprio, but that would prevent using lifetime
and IO priority together, which is not ideal.

> 
> Thanks,
> 
> Bart.

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-06  8:19   ` Damien Le Moal
  2023-10-06  9:53     ` Niklas Cassel
  2023-10-06 18:07     ` Bart Van Assche
@ 2023-10-16  6:17     ` Christoph Hellwig
  2023-10-16 16:32       ` Bart Van Assche
  2 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2023-10-16  6:17 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Bart Van Assche, Jens Axboe, linux-block, linux-scsi,
	linux-fsdevel, Martin K . Petersen, Christoph Hellwig,
	Niklas Cassel, Avri Altman, Bean Huo, Daejun Park,
	Hannes Reinecke

On Fri, Oct 06, 2023 at 05:19:52PM +0900, Damien Le Moal wrote:
> Your change seem to assume that it makes sense to be able to combine CDL with
> lifetime hints. But does it really ? 

Yes, it does.


> CDL is of dubious value for solid state
> media and as far as I know,

No, it's pretty useful and I'd bet my 2 cents that it will eventually
show up in relevant standards and devices.

Even if it wasn't making our user interfaces exclusive would be a
massive pain.

> The other question here if you really want to keep the bit separation approach
> is: do we really need up to 64 different lifetime hints ? While the scsi
> standard allows that much, does this many different lifetime make sense in
> practice ? Can we ever think of a usecase that needs more than say 8 different
> liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8,
> then we can keep 4 bits unused in the hint field for future features.

Yes, I think this is the smoking gun.  We should be fine with a much
more limited number of lifetime hints, i.e. the user interface only
exposes 5 hints, and supporting more in the in-kernel interfaces seems
of rather doubtfuŀ use.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 04/15] fs: Restore write hint support
  2023-10-05 19:40 ` [PATCH v2 04/15] fs: Restore write hint support Bart Van Assche
  2023-10-10  5:42   ` Kanchan Joshi
@ 2023-10-16  6:20   ` Christoph Hellwig
  1 sibling, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2023-10-16  6:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-scsi, linux-fsdevel,
	Martin K . Petersen, Christoph Hellwig, Niklas Cassel,
	Avri Altman, Bean Huo, Daejun Park, Jaegeuk Kim, Chao Yu,
	Darrick J. Wong, Alexander Viro, Christian Brauner

On Thu, Oct 05, 2023 at 12:40:50PM -0700, Bart Van Assche wrote:
> This patch reverts a small subset of commit c75e707fe1aa ("block: remove
> the per-bio/request write hint"). The following functionality has been
> restored:

Please explain this in terms of what you add.  The fact that it restores
something isn't more than a little footnote added at the end.

> --- /dev/null
> +++ b/include/linux/fs-lifetime.h

The name seems a bit odd for something that primarily deals with bios.
bio-lifetime.h would seem like a better fit.

> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <linux/bio.h>
> +#include <linux/fs.h>
> +#include <linux/ioprio.h>
> +
> +static inline enum rw_hint bio_get_data_lifetime(struct bio *bio)
> +{
> +	/* +1 to map 0 onto WRITE_LIFE_NONE. */
> +	return IOPRIO_PRIO_LIFETIME(bio->bi_ioprio) + 1;

This seems a little to magic.  Why not a lookup table?

> +}
> +
> +static inline void bio_set_data_lifetime(struct bio *bio, enum rw_hint lifetime)

Please avoid the overly long line.

> +	/* -1 to map WRITE_LIFE_NONE onto 0. */
> +	if (lifetime != 0)
> +		lifetime--;
> +	WARN_ON_ONCE(lifetime & ~IOPRIO_LIFETIME_MASK);

I'd return here instead of propagating the bogus value.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-13 21:20                 ` Bart Van Assche
@ 2023-10-16  9:20                   ` Niklas Cassel
  2023-10-16 16:36                     ` Bart Van Assche
  0 siblings, 1 reply; 45+ messages in thread
From: Niklas Cassel @ 2023-10-16  9:20 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Damien Le Moal, Jens Axboe, linux-block, linux-scsi,
	linux-fsdevel, Martin K . Petersen, Christoph Hellwig,
	Avri Altman, Bean Huo, Daejun Park, Hannes Reinecke

On Fri, Oct 13, 2023 at 02:20:23PM -0700, Bart Van Assche wrote:
> On 10/13/23 02:33, Niklas Cassel wrote:
> > In commit c75e707fe1aa ("block: remove the per-bio/request write hint")
> > this line from fs/direct-io.c was removed:
> > -       bio->bi_write_hint = dio->iocb->ki_hint;
> > 
> > I'm not sure why this series does not readd a similar line to set the
> > lifetime (using bio_set_data_lifetime()) also for fs/direct-io.c.
> 
> It depends on how we want the user to specify the data lifetime for
> direct I/O. This assignment is not modified by this patch series and
> copies the data lifetime information from the ioprio bitfield from user
> space into the bio:
> 
> 		bio->bi_ioprio = dio->iocb->ki_ioprio;

Before per-bio/request write hints were removed, things looked like this:

io_uring.c:
req->rw.kiocb.ki_hint = ki_hint_validate(file_write_hint(req->file));

fs/fcntl.c:
static inline enum rw_hint file_write_hint(struct file *file)
{
	if (file->f_write_hint != WRITE_LIFE_NOT_SET)
		return file->f_write_hint;

	return file_inode(file)->i_write_hint;
}

direct-io.c:
bio->bi_write_hint = dio->iocb->ki_hint;

buffered-io.c:
bio->bi_write_hint = inode->i_write_hint;



After this series, things instead look like this:

direct-io.c:
bio->bi_ioprio = dio->iocb->ki_ioprio;

buffered-io.c:
bio_set_data_lifetime(bio, inode->i_write_hint);


So when you say:
"It depends on how we want the user to specify the data lifetime for
direct I/O.", do you mean that buffered I/O should use fcntl() to specify
data lifetime, but direct I/O should used Linux IO priority API to specify
the same?

Because, to me that seems to be how the series is currently working.
(I am sorry if I am missing something.)


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-15 22:22                 ` Damien Le Moal
@ 2023-10-16 16:31                   ` Bart Van Assche
  0 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-16 16:31 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, linux-scsi, linux-fsdevel, Martin K . Petersen,
	Christoph Hellwig, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Hannes Reinecke

On 10/15/23 15:22, Damien Le Moal wrote:
> Given that lifetime is per file (inode) and IO prio is per process or per I/O,
> having different user APIs makes sense. The issue of not growing (if possible)
> the bio and request structures remains. For bio, you identified a hole already,
> so what about using another 16-bits field for lifetime ? Not sure for requests.
> I thought also of a union with bi_ioprio, but that would prevent using lifetime
> and IO priority together, which is not ideal.

There is a challenge: F_SET_RW_HINT / F_GET_RW_HINT are per inode and
hence submitting direct I/O with different lifetimes to a single file
by opening that file multiple times and by using F_SET_FILE_RW_HINT
won't be possible. fio supports this use case. See also fio commit
bd553af6c849 ("Update API for file write hints"). If nobody objects I
won't restore F_SET_FILE_RW_HINT support.

Furthermore, I plan to drop the bi_ioprio changes and introduce a new u8
struct bio member instead: bi_lifetime. I think 8 bits is enough since
the NVMe and SCSI data lifetime fields are six bits wide.

There is a two byte hole in struct request past the ioprio field. I'd
like to use that space for a new data lifetime member.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-16  6:17     ` Christoph Hellwig
@ 2023-10-16 16:32       ` Bart Van Assche
  0 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-16 16:32 UTC (permalink / raw)
  To: Christoph Hellwig, Damien Le Moal
  Cc: Jens Axboe, linux-block, linux-scsi, linux-fsdevel,
	Martin K . Petersen, Niklas Cassel, Avri Altman, Bean Huo,
	Daejun Park, Hannes Reinecke


On 10/15/23 23:17, Christoph Hellwig wrote:
> On Fri, Oct 06, 2023 at 05:19:52PM +0900, Damien Le Moal wrote:
>> Your change seem to assume that it makes sense to be able to combine CDL with
>> lifetime hints. But does it really ?
> 
> Yes, it does.
> 
> 
>> CDL is of dubious value for solid state
>> media and as far as I know,
> 
> No, it's pretty useful and I'd bet my 2 cents that it will eventually
> show up in relevant standards and devices.
> 
> Even if it wasn't making our user interfaces exclusive would be a
> massive pain.
> 
>> The other question here if you really want to keep the bit separation approach
>> is: do we really need up to 64 different lifetime hints ? While the scsi
>> standard allows that much, does this many different lifetime make sense in
>> practice ? Can we ever think of a usecase that needs more than say 8 different
>> liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8,
>> then we can keep 4 bits unused in the hint field for future features.
> 
> Yes, I think this is the smoking gun.  We should be fine with a much
> more limited number of lifetime hints, i.e. the user interface only
> exposes 5 hints, and supporting more in the in-kernel interfaces seems
> of rather doubtfuŀ use.

Thanks for the feedback Christoph. I will address this feedback in the
next version of this patch series.

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield
  2023-10-16  9:20                   ` Niklas Cassel
@ 2023-10-16 16:36                     ` Bart Van Assche
  0 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2023-10-16 16:36 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Damien Le Moal, Jens Axboe, linux-block, linux-scsi,
	linux-fsdevel, Martin K . Petersen, Christoph Hellwig,
	Avri Altman, Bean Huo, Daejun Park, Hannes Reinecke

On 10/16/23 02:20, Niklas Cassel wrote:
> So when you say:
> "It depends on how we want the user to specify the data lifetime for
> direct I/O.", do you mean that buffered I/O should use fcntl() to specify
> data lifetime, but direct I/O should used Linux IO priority API to specify
> the same?
> 
> Because, to me that seems to be how the series is currently working.
> (I am sorry if I am missing something.)

Let's drop support for passing the data lifetime in the I/O priority
field from user space as Damien proposed. The remaining question is
whether only to restore support for F_SET_RW_HINT (per inode) or also
for F_SET_FILE_RW_HINT (per file)?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2023-10-16 16:59 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-05 19:40 [PATCH v2 00/15] Pass data temperature information to UFS devices Bart Van Assche
2023-10-05 19:40 ` [PATCH v2 01/15] block: Make bio_set_ioprio() modify fewer bio->bi_ioprio bits Bart Van Assche
2023-10-06  6:28   ` Kanchan Joshi
2023-10-06 18:20     ` Bart Van Assche
2023-10-10  5:22   ` Kanchan Joshi
2023-10-11 16:52     ` Bart Van Assche
2023-10-12  8:49       ` Kanchan Joshi
2023-10-12 14:03         ` Niklas Cassel
2023-10-12 17:42         ` Bart Van Assche
2023-10-05 19:40 ` [PATCH v2 02/15] blk-ioprio: Modify " Bart Van Assche
2023-10-06  6:36   ` Kanchan Joshi
2023-10-06 18:25     ` Bart Van Assche
2023-10-05 19:40 ` [PATCH v2 03/15] block: Support data lifetime in the I/O priority bitfield Bart Van Assche
2023-10-06  6:42   ` Kanchan Joshi
2023-10-06  8:19   ` Damien Le Moal
2023-10-06  9:53     ` Niklas Cassel
2023-10-06 18:07     ` Bart Van Assche
2023-10-11 20:51       ` Bart Van Assche
2023-10-12  1:02         ` Damien Le Moal
2023-10-12 18:00           ` Bart Van Assche
2023-10-13  1:08             ` Damien Le Moal
2023-10-13  9:33               ` Niklas Cassel
2023-10-13 21:20                 ` Bart Van Assche
2023-10-16  9:20                   ` Niklas Cassel
2023-10-16 16:36                     ` Bart Van Assche
2023-10-13 20:18               ` Bart Van Assche
2023-10-15 22:22                 ` Damien Le Moal
2023-10-16 16:31                   ` Bart Van Assche
2023-10-16  6:17     ` Christoph Hellwig
2023-10-16 16:32       ` Bart Van Assche
2023-10-05 19:40 ` [PATCH v2 04/15] fs: Restore write hint support Bart Van Assche
2023-10-10  5:42   ` Kanchan Joshi
2023-10-11 16:56     ` Bart Van Assche
2023-10-16  6:20   ` Christoph Hellwig
2023-10-05 19:40 ` [PATCH v2 05/15] fs/f2fs: Restore the whint_mode mount option Bart Van Assche
2023-10-05 19:40 ` [PATCH v2 06/15] scsi: core: Query the Block Limits Extension VPD page Bart Van Assche
2023-10-05 19:40 ` [PATCH v2 07/15] scsi_proto: Add structures and constants related to I/O groups and streams Bart Van Assche
2023-10-05 19:40 ` [PATCH v2 08/15] sd: Translate data lifetime information Bart Van Assche
2023-10-05 19:40 ` [PATCH v2 09/15] scsi_debug: Reduce code duplication Bart Van Assche
2023-10-05 19:40 ` [PATCH v2 10/15] scsi_debug: Support the block limits extension VPD page Bart Van Assche
2023-10-05 19:40 ` [PATCH v2 11/15] scsi_debug: Rework page code error handling Bart Van Assche
2023-10-05 19:40 ` [PATCH v2 12/15] scsi_debug: Rework subpage " Bart Van Assche
2023-10-05 19:40 ` [PATCH v2 13/15] scsi_debug: Implement the IO Advice Hints Grouping mode page Bart Van Assche
2023-10-05 19:41 ` [PATCH v2 14/15] scsi_debug: Implement GET STREAM STATUS Bart Van Assche
2023-10-05 19:41 ` [PATCH v2 15/15] scsi_debug: Maintain write statistics per group number Bart Van Assche

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.