All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 00/16] block atomic writes
@ 2023-12-12 11:08 John Garry
  2023-12-12 11:08 ` [PATCH v2 01/16] block: Add atomic write operations to request_queue limits John Garry
                   ` (16 more replies)
  0 siblings, 17 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	John Garry

This series introduces a proposal to implementing atomic writes in the
kernel for torn-write protection.

This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically".

Only direct IO is supported and for block devices here. For this, atomic
write HW is required, like SCSI ATOMIC WRITE (16). For now, XFS support
from earlier versions is sidelined. That is until the interface is agreed
which does not fully rely on HW offload support.

man pages update has been posted at:
https://lore.kernel.org/linux-api/20230929093717.2972367-1-john.g.garry@oracle.com/T/#t
(not updated since I posted v1 kernel series)

The goal here is to provide an interface that allows applications use
application-specific block sizes larger than logical block size
reported by the storage device or larger than filesystem block size as
reported by stat().

With this new interface, application blocks will never be torn or
fractured when written. For a power fail, for each individual application
block, all or none of the data to be written. A racing atomic write and
read will mean that the read sees all the old data or all the new data,
but never a mix of old and new.

Two new fields are added to struct statx - atomic_write_unit_min and
atomic_write_unit_max. For each atomic individual write, the total length
of a write must be a between atomic_write_unit_min and
atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
at a natural offset in the file wrt the write length.

SCSI sd.c and scsi_debug and NVMe kernel support is added.

Some open questions:
- How to make API extensible for when we have no HW support? In that case,
  we would prob not have to follow rule of power-of-2 length et al.
  As a possible solution, maybe we can say that atomic writes are
  supported for the file via statx, but not set unit_min and max values,
  and this means that writes need to be just FS block aligned there.
- For block layer, should atomic_write_unit_max be limited by
  max_sectors_kb? Currently it is not.
- How to improve requirement that iovecs are PAGE-aligned.
  There are 2x issues:
  a. We impose this rule to not split BIOs due to virt boundary for
     NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
     16K/64K pages). Easy solution is to impose requirement that iovecs
     are 4K-aligned.
  b. We don't enforce this rule for virt boundary == 0, i.e. SCSI
- Since debugging torn-writes due to unwanted kernel BIO splitting/merging
  would be horrible, should we add some kernel storage stack software
  integrity checks?

This series is based on v6.7-rc5.

Changes since v1:
- Drop XFS support for now
- Tidy NVMe changes and also add checks for atomic write violating max
  AW PF length and boundary (if any)
- Reject - instead of ignoring - RWF_ATOMIC for files which do not
  support atomic writes
- Update block sysfs documentation
- Various tidy-ups

Alan Adamson (2):
  nvme: Support atomic writes
  nvme: Ensure atomic writes will be executed atomically

Himanshu Madhani (2):
  block: Add atomic write operations to request_queue limits
  block: Add REQ_ATOMIC flag

John Garry (10):
  block: Limit atomic writes according to bio and queue limits
  fs: Increase fmode_t size
  block: Pass blk_queue_get_max_sectors() a request pointer
  block: Limit atomic write IO size according to
    atomic_write_max_sectors
  block: Error an attempt to split an atomic write bio
  block: Add checks to merging of atomic writes
  block: Add fops atomic write support
  scsi: sd: Support reading atomic write properties from block limits
    VPD
  scsi: sd: Add WRITE_ATOMIC_16 support
  scsi: scsi_debug: Atomic write support

Prasad Singamsetty (2):
  fs/bdev: Add atomic write support info to statx
  fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support

 Documentation/ABI/stable/sysfs-block |  47 +++
 block/bdev.c                         |  31 +-
 block/blk-merge.c                    |  95 ++++-
 block/blk-mq.c                       |   2 +-
 block/blk-settings.c                 |  84 ++++
 block/blk-sysfs.c                    |  33 ++
 block/blk.h                          |   9 +-
 block/fops.c                         |  40 +-
 drivers/dma-buf/dma-buf.c            |   2 +-
 drivers/nvme/host/core.c             | 108 ++++-
 drivers/nvme/host/nvme.h             |   2 +
 drivers/scsi/scsi_debug.c            | 590 +++++++++++++++++++++------
 drivers/scsi/scsi_trace.c            |  22 +
 drivers/scsi/sd.c                    |  93 ++++-
 drivers/scsi/sd.h                    |   8 +
 fs/stat.c                            |  44 +-
 include/linux/blk_types.h            |   2 +
 include/linux/blkdev.h               |  41 +-
 include/linux/fs.h                   |  11 +
 include/linux/stat.h                 |   2 +
 include/linux/types.h                |   2 +-
 include/scsi/scsi_proto.h            |   1 +
 include/trace/events/scsi.h          |   1 +
 include/uapi/linux/fs.h              |   5 +-
 include/uapi/linux/stat.h            |   7 +-
 25 files changed, 1098 insertions(+), 184 deletions(-)

-- 
2.35.3


^ permalink raw reply	[flat|nested] 73+ messages in thread

* [PATCH v2 01/16] block: Add atomic write operations to request_queue limits
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-13  1:25   ` Ming Lei
  2023-12-12 11:08 ` [PATCH v2 02/16] block: Limit atomic writes according to bio and queue limits John Garry
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	Himanshu Madhani, John Garry

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Add the following limits:
- atomic_write_boundary_bytes
- atomic_write_max_bytes
- atomic_write_unit_max_bytes
- atomic_write_unit_min_bytes

All atomic writes limits are initialised to 0 to indicate no atomic write
support. Stacked devices are just not supported either for now.

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
#jpg: Heavy rewrite
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 Documentation/ABI/stable/sysfs-block | 47 ++++++++++++++++++++++
 block/blk-settings.c                 | 60 ++++++++++++++++++++++++++++
 block/blk-sysfs.c                    | 33 +++++++++++++++
 include/linux/blkdev.h               | 37 +++++++++++++++++
 4 files changed, 177 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 1fe9a553c37b..ba81a081522f 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -21,6 +21,53 @@ Description:
 		device is offset from the internal allocation unit's
 		natural alignment.
 
+What:		/sys/block/<disk>/atomic_write_max_bytes
+Date:		May 2023
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] This parameter specifies the maximum atomic write
+		size reported by the device. This parameter is relevant
+		for merging of writes, where a merged atomic write
+		operation must not exceed this number of bytes.
+		The atomic_write_max_bytes may exceed the value in
+		atomic_write_unit_max_bytes if atomic_write_max_bytes
+		is not a power-of-two or atomic_write_unit_max_bytes is
+		limited by some queue limits, such as max_segments.
+
+
+What:		/sys/block/<disk>/atomic_write_unit_min_bytes
+Date:		May 2023
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] This parameter specifies the smallest block which can
+		be written atomically with an atomic write operation. All
+		atomic write operations must begin at a
+		atomic_write_unit_min boundary and must be multiples of
+		atomic_write_unit_min. This value must be a power-of-two.
+
+
+What:		/sys/block/<disk>/atomic_write_unit_max_bytes
+Date:		January 2023
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] This parameter defines the largest block which can be
+		written atomically with an atomic write operation. This
+		value must be a multiple of atomic_write_unit_min and must
+		be a power-of-two.
+
+
+What:		/sys/block/<disk>/atomic_write_boundary_bytes
+Date:		May 2023
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] A device may need to internally split I/Os which
+		straddle a given logical block address boundary. In that
+		case a single atomic write operation will be processed as
+		one of more sub-operations which each complete atomically.
+		This parameter specifies the size in bytes of the atomic
+		boundary if one is reported by the device. This value must
+		be a power-of-two.
+
 
 What:		/sys/block/<disk>/diskseq
 Date:		February 2021
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 0046b447268f..d151be394c98 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,10 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->zoned = BLK_ZONED_NONE;
 	lim->zone_write_granularity = 0;
 	lim->dma_alignment = 511;
+	lim->atomic_write_unit_min_sectors = 0;
+	lim->atomic_write_unit_max_sectors = 0;
+	lim->atomic_write_max_sectors = 0;
+	lim->atomic_write_boundary_sectors = 0;
 }
 
 /**
@@ -183,6 +187,62 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/**
+ * blk_queue_atomic_write_max_bytes - set max bytes supported by
+ * the device for atomic write operations.
+ * @q:  the request queue for the device
+ * @size: maximum bytes supported
+ */
+void blk_queue_atomic_write_max_bytes(struct request_queue *q,
+				      unsigned int bytes)
+{
+	q->limits.atomic_write_max_sectors = bytes >> SECTOR_SHIFT;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
+
+/**
+ * blk_queue_atomic_write_boundary_bytes - Device's logical block address space
+ * which an atomic write should not cross.
+ * @q:  the request queue for the device
+ * @bytes: must be a power-of-two.
+ */
+void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
+					   unsigned int bytes)
+{
+	q->limits.atomic_write_boundary_sectors = bytes >> SECTOR_SHIFT;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);
+
+/**
+ * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
+ * atomically to the device.
+ * @q:  the request queue for the device
+ * @sectors: must be a power-of-two.
+ */
+void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
+					     unsigned int sectors)
+{
+	struct queue_limits *limits = &q->limits;
+
+	limits->atomic_write_unit_min_sectors = sectors;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);
+
+/*
+ * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
+ * atomically to the device.
+ * @q: the request queue for the device
+ * @sectors: must be a power-of-two.
+ */
+void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
+					     unsigned int sectors)
+{
+	struct queue_limits *limits = &q->limits;
+
+	limits->atomic_write_unit_max_sectors = sectors;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 0b2d04766324..4ebf148cf356 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -118,6 +118,30 @@ static ssize_t queue_max_discard_segments_show(struct request_queue *q,
 	return queue_var_show(queue_max_discard_segments(q), page);
 }
 
+static ssize_t queue_atomic_write_max_bytes_show(struct request_queue *q,
+						char *page)
+{
+	return queue_var_show(queue_atomic_write_max_bytes(q), page);
+}
+
+static ssize_t queue_atomic_write_boundary_show(struct request_queue *q,
+						char *page)
+{
+	return queue_var_show(queue_atomic_write_boundary_bytes(q), page);
+}
+
+static ssize_t queue_atomic_write_unit_min_show(struct request_queue *q,
+						char *page)
+{
+	return queue_var_show(queue_atomic_write_unit_min_bytes(q), page);
+}
+
+static ssize_t queue_atomic_write_unit_max_show(struct request_queue *q,
+						char *page)
+{
+	return queue_var_show(queue_atomic_write_unit_max_bytes(q), page);
+}
+
 static ssize_t queue_max_integrity_segments_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(q->limits.max_integrity_segments, page);
@@ -507,6 +531,11 @@ QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes");
 QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes");
 QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
 
+QUEUE_RO_ENTRY(queue_atomic_write_max_bytes, "atomic_write_max_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_boundary, "atomic_write_boundary_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
+
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes");
 QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");
@@ -634,6 +663,10 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_max_entry.attr,
 	&queue_discard_max_hw_entry.attr,
 	&queue_discard_zeroes_data_entry.attr,
+	&queue_atomic_write_max_bytes_entry.attr,
+	&queue_atomic_write_boundary_entry.attr,
+	&queue_atomic_write_unit_min_entry.attr,
+	&queue_atomic_write_unit_max_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
 	&queue_zone_append_max_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 51fa7ffdee83..ab53163dd187 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -309,6 +309,11 @@ struct queue_limits {
 	unsigned int		discard_alignment;
 	unsigned int		zone_write_granularity;
 
+	unsigned int		atomic_write_boundary_sectors;
+	unsigned int		atomic_write_max_sectors;
+	unsigned int		atomic_write_unit_min_sectors;
+	unsigned int		atomic_write_unit_max_sectors;
+
 	unsigned short		max_segments;
 	unsigned short		max_integrity_segments;
 	unsigned short		max_discard_segments;
@@ -908,6 +913,14 @@ void blk_queue_zone_write_granularity(struct request_queue *q,
 				      unsigned int size);
 extern void blk_queue_alignment_offset(struct request_queue *q,
 				       unsigned int alignment);
+void blk_queue_atomic_write_max_bytes(struct request_queue *q,
+				unsigned int bytes);
+void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
+				unsigned int sectors);
+void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
+				unsigned int sectors);
+void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
+				unsigned int bytes);
 void disk_update_readahead(struct gendisk *disk);
 extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
 extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
@@ -1312,6 +1325,30 @@ static inline int queue_dma_alignment(const struct request_queue *q)
 	return q ? q->limits.dma_alignment : 511;
 }
 
+static inline unsigned int
+queue_atomic_write_unit_max_bytes(const struct request_queue *q)
+{
+	return q->limits.atomic_write_unit_max_sectors << SECTOR_SHIFT;
+}
+
+static inline unsigned int
+queue_atomic_write_unit_min_bytes(const struct request_queue *q)
+{
+	return q->limits.atomic_write_unit_min_sectors << SECTOR_SHIFT;
+}
+
+static inline unsigned int
+queue_atomic_write_boundary_bytes(const struct request_queue *q)
+{
+	return q->limits.atomic_write_boundary_sectors << SECTOR_SHIFT;
+}
+
+static inline unsigned int
+queue_atomic_write_max_bytes(const struct request_queue *q)
+{
+	return q->limits.atomic_write_max_sectors << SECTOR_SHIFT;
+}
+
 static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
 {
 	return queue_dma_alignment(bdev_get_queue(bdev));
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 02/16] block: Limit atomic writes according to bio and queue limits
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
  2023-12-12 11:08 ` [PATCH v2 01/16] block: Add atomic write operations to request_queue limits John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-12 11:08 ` [PATCH v2 03/16] fs/bdev: Add atomic write support info to statx John Garry
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	John Garry

We rely the block layer always being able to send a bio of size
atomic_write_unit_max without being required to split it due to request
queue or other bio limits.

A bio may contain min(BIO_MAX_VECS, limits->max_segments) vectors on the
relevant submission paths for atomic writes and each vector contains at
least a PAGE_SIZE of data, apart from the first vector.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-settings.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index d151be394c98..dd699580fccd 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -213,6 +213,26 @@ void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);
 
+
+/*
+ * Returns max guaranteed sectors which we can fit in a bio. For convenience of
+ * users, rounddown_pow_of_two() the return value.
+ *
+ * We always assume that we can fit in at least PAGE_SIZE in a segment.
+ */
+static unsigned int blk_queue_max_guaranteed_bio_sectors(
+					struct queue_limits *limits)
+{
+	unsigned int max_segments = min_t(unsigned int, BIO_MAX_VECS,
+					  limits->max_segments);
+
+	if (max_segments < 2)
+		return 0;
+
+	/* subtract 2 to assume PAGE-misaligned IOV start address */
+	return rounddown_pow_of_two((max_segments - 1) * PAGE_SECTORS);
+}
+
 /**
  * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
  * atomically to the device.
@@ -223,8 +243,10 @@ void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
 					     unsigned int sectors)
 {
 	struct queue_limits *limits = &q->limits;
+	unsigned int guaranteed_sectors =
+		blk_queue_max_guaranteed_bio_sectors(limits);
 
-	limits->atomic_write_unit_min_sectors = sectors;
+	limits->atomic_write_unit_min_sectors = min(guaranteed_sectors, sectors);
 }
 EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);
 
@@ -238,8 +260,10 @@ void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
 					     unsigned int sectors)
 {
 	struct queue_limits *limits = &q->limits;
+	unsigned int guaranteed_sectors =
+		blk_queue_max_guaranteed_bio_sectors(limits);
 
-	limits->atomic_write_unit_max_sectors = sectors;
+	limits->atomic_write_unit_max_sectors = min(guaranteed_sectors, sectors);
 }
 EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 03/16] fs/bdev: Add atomic write support info to statx
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
  2023-12-12 11:08 ` [PATCH v2 01/16] block: Add atomic write operations to request_queue limits John Garry
  2023-12-12 11:08 ` [PATCH v2 02/16] block: Limit atomic writes according to bio and queue limits John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-13 10:24   ` Jan Kara
  2023-12-12 11:08 ` [PATCH v2 04/16] fs: Increase fmode_t size John Garry
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	Prasad Singamsetty, John Garry

From: Prasad Singamsetty <prasad.singamsetty@oracle.com>

Extend statx system call to return additional info for atomic write support
support if the specified file is a block device.

Add initial support for a block device.

Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/bdev.c              | 31 ++++++++++++++++++---------
 fs/stat.c                 | 44 ++++++++++++++++++++++++++++++++-------
 include/linux/blkdev.h    |  4 ++--
 include/linux/fs.h        |  3 +++
 include/linux/stat.h      |  2 ++
 include/uapi/linux/stat.h |  7 ++++++-
 6 files changed, 71 insertions(+), 20 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 750aec178b6a..1b514df48dac 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1081,24 +1081,35 @@ void sync_bdevs(bool wait)
 	iput(old_inode);
 }
 
+#define BDEV_STATX_SUPPORTED_MSK (STATX_DIOALIGN | STATX_WRITE_ATOMIC)
+
 /*
- * Handle STATX_DIOALIGN for block devices.
- *
- * Note that the inode passed to this is the inode of a block device node file,
- * not the block device's internal inode.  Therefore it is *not* valid to use
- * I_BDEV() here; the block device has to be looked up by i_rdev instead.
+ * Handle STATX_{DIOALIGN, WRITE_ATOMIC} for block devices.
  */
-void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
+void bdev_statx(struct dentry *dentry, struct kstat *stat, u32 request_mask)
 {
 	struct block_device *bdev;
 
-	bdev = blkdev_get_no_open(inode->i_rdev);
+	if (!(request_mask & BDEV_STATX_SUPPORTED_MSK))
+		return;
+
+	bdev = blkdev_get_no_open(d_backing_inode(dentry)->i_rdev);
 	if (!bdev)
 		return;
 
-	stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
-	stat->dio_offset_align = bdev_logical_block_size(bdev);
-	stat->result_mask |= STATX_DIOALIGN;
+	if (request_mask & STATX_DIOALIGN) {
+		stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
+		stat->dio_offset_align = bdev_logical_block_size(bdev);
+		stat->result_mask |= STATX_DIOALIGN;
+	}
+
+	if (request_mask & STATX_WRITE_ATOMIC) {
+		struct request_queue *bd_queue = bdev->bd_queue;
+
+		generic_fill_statx_atomic_writes(stat,
+			queue_atomic_write_unit_min_bytes(bd_queue),
+			queue_atomic_write_unit_max_bytes(bd_queue));
+	}
 
 	blkdev_put_no_open(bdev);
 }
diff --git a/fs/stat.c b/fs/stat.c
index f721d26ec3f7..ad8f9235f1c9 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -89,6 +89,35 @@ void generic_fill_statx_attr(struct inode *inode, struct kstat *stat)
 }
 EXPORT_SYMBOL(generic_fill_statx_attr);
 
+/**
+ * generic_fill_statx_atomic_writes - Fill in the atomic writes statx attributes
+ * @stat:	Where to fill in the attribute flags
+ * @unit_min:	Minimum supported atomic write length
+ * @unit_max:	Maximum supported atomic write length
+ *
+ * Fill in the STATX{_ATTR}_WRITE_ATOMIC flags in the kstat structure from
+ * atomic write unit_min and unit_max values.
+ */
+void generic_fill_statx_atomic_writes(struct kstat *stat,
+				      unsigned int unit_min,
+				      unsigned int unit_max)
+{
+	/* Confirm that the request type is known */
+	stat->result_mask |= STATX_WRITE_ATOMIC;
+
+	/* Confirm that the file attribute type is known */
+	stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
+
+	if (unit_min) {
+		stat->atomic_write_unit_min = unit_min;
+		stat->atomic_write_unit_max = unit_max;
+
+		/* Confirm atomic writes are actually supported */
+		stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
+	}
+}
+EXPORT_SYMBOL(generic_fill_statx_atomic_writes);
+
 /**
  * vfs_getattr_nosec - getattr without security checks
  * @path: file to get attributes from
@@ -254,13 +283,12 @@ static int vfs_statx(int dfd, struct filename *filename, int flags,
 		stat->attributes |= STATX_ATTR_MOUNT_ROOT;
 	stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT;
 
-	/* Handle STATX_DIOALIGN for block devices. */
-	if (request_mask & STATX_DIOALIGN) {
-		struct inode *inode = d_backing_inode(path.dentry);
-
-		if (S_ISBLK(inode->i_mode))
-			bdev_statx_dioalign(inode, stat);
-	}
+	/* If this is a block device inode, override the filesystem
+	 * attributes with the block device specific parameters
+	 * that need to be obtained from the bdev backing inode
+	 */
+	if (S_ISBLK(d_backing_inode(path.dentry)->i_mode))
+		bdev_statx(path.dentry, stat, request_mask);
 
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
@@ -653,6 +681,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_mnt_id = stat->mnt_id;
 	tmp.stx_dio_mem_align = stat->dio_mem_align;
 	tmp.stx_dio_offset_align = stat->dio_offset_align;
+	tmp.stx_atomic_write_unit_min = stat->atomic_write_unit_min;
+	tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ab53163dd187..ab7289995615 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1548,7 +1548,7 @@ int sync_blockdev(struct block_device *bdev);
 int sync_blockdev_range(struct block_device *bdev, loff_t lstart, loff_t lend);
 int sync_blockdev_nowait(struct block_device *bdev);
 void sync_bdevs(bool wait);
-void bdev_statx_dioalign(struct inode *inode, struct kstat *stat);
+void bdev_statx(struct dentry *dentry, struct kstat *stat, u32 request_mask);
 void printk_all_partitions(void);
 int __init early_lookup_bdev(const char *pathname, dev_t *dev);
 #else
@@ -1566,7 +1566,7 @@ static inline int sync_blockdev_nowait(struct block_device *bdev)
 static inline void sync_bdevs(bool wait)
 {
 }
-static inline void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
+static inline void bdev_statx(struct dentry *dentry, struct kstat *stat, u32 request_mask)
 {
 }
 static inline void printk_all_partitions(void)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98b7a7a8c42e..70329c81be31 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3084,6 +3084,9 @@ extern const struct inode_operations page_symlink_inode_operations;
 extern void kfree_link(void *);
 void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *);
 void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
+void generic_fill_statx_atomic_writes(struct kstat *stat,
+				      unsigned int unit_min,
+				      unsigned int unit_max);
 extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
 extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int);
 void __inode_add_bytes(struct inode *inode, loff_t bytes);
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 52150570d37a..ba690ff89efd 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -53,6 +53,8 @@ struct kstat {
 	u32		dio_mem_align;
 	u32		dio_offset_align;
 	u64		change_cookie;
+	u32		atomic_write_unit_min;
+	u32		atomic_write_unit_max;
 };
 
 /* These definitions are internal to the kernel for now. Mainly used by nfsd. */
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7cab2c65d3d7..64dfc1baa640 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -127,7 +127,10 @@ struct statx {
 	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
 	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
 	/* 0xa0 */
-	__u64	__spare3[12];	/* Spare space for future expansion */
+	__u32	stx_atomic_write_unit_min;
+	__u32	stx_atomic_write_unit_max;
+	/* 0xb0 */
+	__u64	__spare3[11];	/* Spare space for future expansion */
 	/* 0x100 */
 };
 
@@ -154,6 +157,7 @@ struct statx {
 #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
 #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
 #define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
+#define STATX_WRITE_ATOMIC	0x00004000U	/* Want/got atomic_write_* fields */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
@@ -189,6 +193,7 @@ struct statx {
 #define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
 #define STATX_ATTR_DAX			0x00200000 /* File is currently in DAX state */
+#define STATX_ATTR_WRITE_ATOMIC		0x00400000 /* File supports atomic write operations */
 
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 04/16] fs: Increase fmode_t size
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (2 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 03/16] fs/bdev: Add atomic write support info to statx John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-13 11:20   ` Jan Kara
  2023-12-13 13:02   ` Christian Brauner
  2023-12-12 11:08 ` [PATCH v2 05/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
                   ` (12 subsequent siblings)
  16 siblings, 2 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	John Garry

Currently all bits are being used in fmode_t.

To allow for further expansion, increase from unsigned int to unsigned
long.

Since the dma-buf driver prints the file->f_mode member, change the print
as necessary to deal with the larger size.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/dma-buf/dma-buf.c | 2 +-
 include/linux/types.h     | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 21916bba77d5..a5227ae3d637 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -1628,7 +1628,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
 
 
 		spin_lock(&buf_obj->name_lock);
-		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\t%08lu\t%s\n",
+		seq_printf(s, "%08zu\t%08x\t%08lx\t%08ld\t%s\t%08lu\t%s\n",
 				buf_obj->size,
 				buf_obj->file->f_flags, buf_obj->file->f_mode,
 				file_count(buf_obj->file),
diff --git a/include/linux/types.h b/include/linux/types.h
index 253168bb3fe1..49c754fde1d6 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -153,7 +153,7 @@ typedef u32 dma_addr_t;
 
 typedef unsigned int __bitwise gfp_t;
 typedef unsigned int __bitwise slab_flags_t;
-typedef unsigned int __bitwise fmode_t;
+typedef unsigned long __bitwise fmode_t;
 
 #ifdef CONFIG_PHYS_ADDR_T_64BIT
 typedef u64 phys_addr_t;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 05/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (3 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 04/16] fs: Increase fmode_t size John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-13 13:31   ` Al Viro
  2024-01-22  8:29   ` John Garry
  2023-12-12 11:08 ` [PATCH v2 06/16] block: Add REQ_ATOMIC flag John Garry
                   ` (11 subsequent siblings)
  16 siblings, 2 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	Prasad Singamsetty, John Garry

From: Prasad Singamsetty <prasad.singamsetty@oracle.com>

Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
write is to be issued with torn write prevention, according to special
alignment and length rules.

Torn write prevention means that for a power or any other HW failure, all
or none of the data will be committed to storage, but never a mix of old
and new.

For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
iocb->ki_flags field to indicate the same.

A call to statx will give the relevant atomic write info:
- atomic_write_unit_min
- atomic_write_unit_max

Both values are a power-of-2.

Applications can avail of atomic write feature by ensuring that the total
length of a write is a power-of-2 in size and also sized between
atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
must ensure that the write is at a naturally-aligned offset in the file
wrt the total write length.

Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
flag set will have RWF_ATOMIC rejected and not just ignored.

Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/linux/fs.h      | 8 ++++++++
 include/uapi/linux/fs.h | 5 ++++-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 70329c81be31..d725c194243c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -185,6 +185,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File supports async nowait buffered writes */
 #define FMODE_BUF_WASYNC	((__force fmode_t)0x80000000)
 
+/* File supports atomic writes */
+#define FMODE_CAN_ATOMIC_WRITE	((__force fmode_t)0x100000000)
+
 /*
  * Attribute flags.  These should be or-ed together to figure out what
  * has been changed!
@@ -328,6 +331,7 @@ enum rw_hint {
 #define IOCB_SYNC		(__force int) RWF_SYNC
 #define IOCB_NOWAIT		(__force int) RWF_NOWAIT
 #define IOCB_APPEND		(__force int) RWF_APPEND
+#define IOCB_ATOMIC		(__force int) RWF_ATOMIC
 
 /* non-RWF related bits - start at 16 */
 #define IOCB_EVENTFD		(1 << 16)
@@ -3265,6 +3269,10 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
 			return -EOPNOTSUPP;
 		kiocb_flags |= IOCB_NOIO;
 	}
+	if (flags & RWF_ATOMIC) {
+		if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
+			return -EOPNOTSUPP;
+	}
 	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
 	if (flags & RWF_SYNC)
 		kiocb_flags |= IOCB_DSYNC;
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index da43810b7485..1a5c68d31ef5 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -301,9 +301,12 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* Atomic Write */
+#define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000020)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_ATOMIC)
 
 /* Pagemap ioctl */
 #define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 06/16] block: Add REQ_ATOMIC flag
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (4 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 05/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-12 11:08 ` [PATCH v2 07/16] block: Pass blk_queue_get_max_sectors() a request pointer John Garry
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	Himanshu Madhani, John Garry

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Add flag REQ_ATOMIC, meaning an atomic operation. This should only be
used in conjunction with REQ_OP_WRITE.

We will not add a special "request atomic write" operation, as to try to
avoid maintenance effort for an operation which is almost the same as
REQ_OP_WRITE.

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/linux/blk_types.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index b29ebd53417d..b10d4db753b9 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -424,6 +424,7 @@ enum req_flag_bits {
 	__REQ_DRV,		/* for driver use */
 	__REQ_FS_PRIVATE,	/* for file system (submitter) use */
 
+	__REQ_ATOMIC,		/* for atomic write operations */
 	/*
 	 * Command specific flags, keep last:
 	 */
@@ -450,6 +451,7 @@ enum req_flag_bits {
 #define REQ_RAHEAD	(__force blk_opf_t)(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND	(__force blk_opf_t)(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT	(__force blk_opf_t)(1ULL << __REQ_NOWAIT)
+#define REQ_ATOMIC	(__force blk_opf_t)(1ULL << __REQ_ATOMIC)
 #define REQ_POLLED	(__force blk_opf_t)(1ULL << __REQ_POLLED)
 #define REQ_ALLOC_CACHE	(__force blk_opf_t)(1ULL << __REQ_ALLOC_CACHE)
 #define REQ_SWAP	(__force blk_opf_t)(1ULL << __REQ_SWAP)
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 07/16] block: Pass blk_queue_get_max_sectors() a request pointer
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (5 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 06/16] block: Add REQ_ATOMIC flag John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-12 11:08 ` [PATCH v2 08/16] block: Limit atomic write IO size according to atomic_write_max_sectors John Garry
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	John Garry

Currently blk_queue_get_max_sectors() is passed a enum req_op, which does
not work for atomic writes. This is because an atomic write has a different
max sectors values to a regular write, and we need the rq->cmd_flags
to know that we have an atomic write, so pass the request pointer, which
has all information available.

Also use rq->cmd_flags instead of rq->bio->bi_opf when possible.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-merge.c | 3 ++-
 block/blk-mq.c    | 2 +-
 block/blk.h       | 6 ++++--
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 65e75efa9bd3..0ccc251e22ff 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -596,7 +596,8 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 	if (blk_rq_is_passthrough(rq))
 		return q->limits.max_hw_sectors;
 
-	max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
+	max_sectors = blk_queue_get_max_sectors(rq);
+
 	if (!q->limits.chunk_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ac18f802c027..58bacf9029b3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3027,7 +3027,7 @@ void blk_mq_submit_bio(struct bio *bio)
 blk_status_t blk_insert_cloned_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
+	unsigned int max_sectors = blk_queue_get_max_sectors(rq);
 	unsigned int max_segments = blk_rq_get_max_segments(rq);
 	blk_status_t ret;
 
diff --git a/block/blk.h b/block/blk.h
index 08a358bc0919..94e330e9c853 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -166,9 +166,11 @@ static inline unsigned int blk_rq_get_max_segments(struct request *rq)
 	return queue_max_segments(rq->q);
 }
 
-static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
-						     enum req_op op)
+static inline unsigned int blk_queue_get_max_sectors(struct request *rq)
 {
+	struct request_queue *q = rq->q;
+	enum req_op op = req_op(rq);
+
 	if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 08/16] block: Limit atomic write IO size according to atomic_write_max_sectors
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (6 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 07/16] block: Pass blk_queue_get_max_sectors() a request pointer John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-15  2:27   ` Ming Lei
  2023-12-12 11:08 ` [PATCH v2 09/16] block: Error an attempt to split an atomic write bio John Garry
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	John Garry

Currently an IO size is limited to the request_queue limits max_sectors.
Limit the size for an atomic write to queue limit atomic_write_max_sectors
value.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-merge.c | 12 +++++++++++-
 block/blk.h       |  3 +++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 0ccc251e22ff..8d4de9253fe9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -171,7 +171,17 @@ static inline unsigned get_max_io_size(struct bio *bio,
 {
 	unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
 	unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
-	unsigned max_sectors = lim->max_sectors, start, end;
+	unsigned max_sectors, start, end;
+
+	/*
+	 * We ignore lim->max_sectors for atomic writes simply because
+	 * it may less than bio->write_atomic_unit, which we cannot
+	 * tolerate.
+	 */
+	if (bio->bi_opf & REQ_ATOMIC)
+		max_sectors = lim->atomic_write_max_sectors;
+	else
+		max_sectors = lim->max_sectors;
 
 	if (lim->chunk_sectors) {
 		max_sectors = min(max_sectors,
diff --git a/block/blk.h b/block/blk.h
index 94e330e9c853..6f6cd5b1830a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -178,6 +178,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request *rq)
 	if (unlikely(op == REQ_OP_WRITE_ZEROES))
 		return q->limits.max_write_zeroes_sectors;
 
+	if (rq->cmd_flags & REQ_ATOMIC)
+		return q->limits.atomic_write_max_sectors;
+
 	return q->limits.max_sectors;
 }
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 09/16] block: Error an attempt to split an atomic write bio
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (7 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 08/16] block: Limit atomic write IO size according to atomic_write_max_sectors John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-12 11:08 ` [PATCH v2 10/16] block: Add checks to merging of atomic writes John Garry
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	John Garry

As the name suggests, we should not be splitting these.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-merge.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8d4de9253fe9..bc21f8ff4842 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -319,6 +319,11 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
 	*segs = nsegs;
 	return NULL;
 split:
+	if (bio->bi_opf & REQ_ATOMIC) {
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+		return ERR_PTR(-EIO);
+	}
 	/*
 	 * We can't sanely support splitting for a REQ_NOWAIT bio. End it
 	 * with EAGAIN if splitting is required and return an error pointer.
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 10/16] block: Add checks to merging of atomic writes
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (8 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 09/16] block: Error an attempt to split an atomic write bio John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-12 11:08 ` [PATCH v2 11/16] block: Add fops atomic write support John Garry
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	John Garry

For atomic writes we allow merging, but we must adhere to some additional
rules:
- Only allow merging of atomic writes with other atomic writes
- Ensure that the merged IO would not cross an atomic write boundary, if
  any

We already ensure that we don't exceed the atomic writes size limit in
get_max_io_size().

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-merge.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index bc21f8ff4842..05eb227a5644 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -18,6 +18,42 @@
 #include "blk-rq-qos.h"
 #include "blk-throttle.h"
 
+static bool blk_rq_straddles_atomic_write_boundary(struct request *rq,
+						unsigned int front,
+						unsigned int back)
+{
+	unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
+	unsigned int mask, imask;
+	loff_t start, end;
+
+	if (!boundary)
+		return false;
+
+	start = rq->__sector << SECTOR_SHIFT;
+	end = start + rq->__data_len;
+
+	start -= front;
+	end += back;
+
+	/* We're longer than the boundary, so must be crossing it */
+	if (end - start > boundary)
+		return true;
+
+	mask = boundary - 1;
+
+	/* start/end are boundary-aligned, so cannot be crossing */
+	if (!(start & mask) || !(end & mask))
+		return false;
+
+	imask = ~mask;
+
+	/* Top bits are different, so crossed a boundary */
+	if ((start & imask) != (end & imask))
+		return true;
+
+	return false;
+}
+
 static inline void bio_get_first_bvec(struct bio *bio, struct bio_vec *bv)
 {
 	*bv = mp_bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
@@ -664,6 +700,13 @@ int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs)
 		return 0;
 	}
 
+	if (req->cmd_flags & REQ_ATOMIC) {
+		if (blk_rq_straddles_atomic_write_boundary(req,
+				0, bio->bi_iter.bi_size)) {
+			return 0;
+		}
+	}
+
 	return ll_new_hw_segment(req, bio, nr_segs);
 }
 
@@ -683,6 +726,13 @@ static int ll_front_merge_fn(struct request *req, struct bio *bio,
 		return 0;
 	}
 
+	if (req->cmd_flags & REQ_ATOMIC) {
+		if (blk_rq_straddles_atomic_write_boundary(req,
+				bio->bi_iter.bi_size, 0)) {
+			return 0;
+		}
+	}
+
 	return ll_new_hw_segment(req, bio, nr_segs);
 }
 
@@ -719,6 +769,13 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
 	    blk_rq_get_max_sectors(req, blk_rq_pos(req)))
 		return 0;
 
+	if (req->cmd_flags & REQ_ATOMIC) {
+		if (blk_rq_straddles_atomic_write_boundary(req,
+				0, blk_rq_bytes(next))) {
+			return 0;
+		}
+	}
+
 	total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
 	if (total_phys_segments > blk_rq_get_max_segments(req))
 		return 0;
@@ -814,6 +871,18 @@ static enum elv_merge blk_try_req_merge(struct request *req,
 	return ELEVATOR_NO_MERGE;
 }
 
+static bool blk_atomic_write_mergeable_rq_bio(struct request *rq,
+					      struct bio *bio)
+{
+	return (rq->cmd_flags & REQ_ATOMIC) == (bio->bi_opf & REQ_ATOMIC);
+}
+
+static bool blk_atomic_write_mergeable_rqs(struct request *rq,
+					   struct request *next)
+{
+	return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC);
+}
+
 /*
  * For non-mq, this has to be called with the request spinlock acquired.
  * For mq with scheduling, the appropriate queue wide lock should be held.
@@ -833,6 +902,9 @@ static struct request *attempt_merge(struct request_queue *q,
 	if (req->ioprio != next->ioprio)
 		return NULL;
 
+	if (!blk_atomic_write_mergeable_rqs(req, next))
+		return NULL;
+
 	/*
 	 * If we are allowed to merge, then append bio list
 	 * from next to rq and release next. merge_requests_fn
@@ -960,6 +1032,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (rq->ioprio != bio_prio(bio))
 		return false;
 
+	if (blk_atomic_write_mergeable_rq_bio(rq, bio) == false)
+		return false;
+
 	return true;
 }
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 11/16] block: Add fops atomic write support
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (9 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 10/16] block: Add checks to merging of atomic writes John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-12 11:08 ` [PATCH v2 12/16] scsi: sd: Support reading atomic write properties from block limits VPD John Garry
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	John Garry

Add support for atomic writes, as follows:
- Ensure that the IO follows all the atomic writes rules, like must be
  naturally aligned
- Set REQ_ATOMIC

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/fops.c | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/block/fops.c b/block/fops.c
index 0abaac705daf..ba6a2c5a74b1 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -41,6 +41,24 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
 		!bdev_iter_is_aligned(bdev, iter);
 }
 
+static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
+			      struct iov_iter *iter)
+{
+	struct request_queue *q = bdev_get_queue(bdev);
+	unsigned int min_bytes = queue_atomic_write_unit_min_bytes(q);
+	unsigned int max_bytes = queue_atomic_write_unit_max_bytes(q);
+
+	if (iov_iter_count(iter) & (min_bytes - 1))
+		return false;
+	if (!is_power_of_2(iov_iter_count(iter)))
+		return false;
+	if (pos & (iov_iter_count(iter) - 1))
+		return false;
+	if (iov_iter_count(iter) > max_bytes)
+		return false;
+	return true;
+}
+
 #define DIO_INLINE_BIO_VECS 4
 
 static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
@@ -48,6 +66,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 {
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
 	struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs;
+	bool is_read = iov_iter_rw(iter) == READ;
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 	loff_t pos = iocb->ki_pos;
 	bool should_dirty = false;
 	struct bio bio;
@@ -56,6 +76,9 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	if (blkdev_dio_unaligned(bdev, pos, iter))
 		return -EINVAL;
 
+	if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
+		return -EINVAL;
+
 	if (nr_pages <= DIO_INLINE_BIO_VECS)
 		vecs = inline_vecs;
 	else {
@@ -65,7 +88,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 			return -ENOMEM;
 	}
 
-	if (iov_iter_rw(iter) == READ) {
+	if (is_read) {
 		bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
 		if (user_backed_iter(iter))
 			should_dirty = true;
@@ -74,6 +97,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
 	bio.bi_ioprio = iocb->ki_ioprio;
+	if (atomic_write)
+		bio.bi_opf |= REQ_ATOMIC;
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
@@ -167,10 +192,14 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	struct blkdev_dio *dio;
 	struct bio *bio;
 	bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	loff_t pos = iocb->ki_pos;
 	int ret = 0;
 
+	if (atomic_write)
+		return -EINVAL;
+
 	if (blkdev_dio_unaligned(bdev, pos, iter))
 		return -EINVAL;
 
@@ -305,6 +334,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
 	bool is_read = iov_iter_rw(iter) == READ;
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 	struct blkdev_dio *dio;
 	struct bio *bio;
 	loff_t pos = iocb->ki_pos;
@@ -313,6 +343,9 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	if (blkdev_dio_unaligned(bdev, pos, iter))
 		return -EINVAL;
 
+	if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
+		return -EINVAL;
+
 	if (iocb->ki_flags & IOCB_ALLOC_CACHE)
 		opf |= REQ_ALLOC_CACHE;
 	bio = bio_alloc_bioset(bdev, nr_pages, opf, GFP_KERNEL,
@@ -347,6 +380,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 			bio_set_pages_dirty(bio);
 		}
 	} else {
+		if (atomic_write)
+			bio->bi_opf |= REQ_ATOMIC;
 		task_io_account_write(bio->bi_iter.bi_size);
 	}
 
@@ -605,6 +640,9 @@ static int blkdev_open(struct inode *inode, struct file *filp)
 	if (bdev_nowait(handle->bdev))
 		filp->f_mode |= FMODE_NOWAIT;
 
+	if (queue_atomic_write_unit_min_bytes(bdev_get_queue(handle->bdev)))
+		filp->f_mode |= FMODE_CAN_ATOMIC_WRITE;
+
 	filp->f_mapping = handle->bdev->bd_inode->i_mapping;
 	filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
 	filp->private_data = handle;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 12/16] scsi: sd: Support reading atomic write properties from block limits VPD
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (10 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 11/16] block: Add fops atomic write support John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-12 11:08 ` [PATCH v2 13/16] scsi: sd: Add WRITE_ATOMIC_16 support John Garry
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	John Garry

Also update block layer request queue sysfs properties.

See sbc4r22 section 6.6.4 - Block limits VPD page.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/sd.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/scsi/sd.h |  8 ++++++
 2 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 542a4bbb21bc..fa857a08341f 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -916,6 +916,65 @@ static blk_status_t sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
 	return scsi_alloc_sgtables(cmd);
 }
 
+static void sd_config_atomic(struct scsi_disk *sdkp)
+{
+	unsigned int logical_block_size = sdkp->device->sector_size,
+		physical_block_size_sectors, max_atomic, unit_min, unit_max;
+	struct request_queue *q = sdkp->disk->queue;
+
+	if ((!sdkp->max_atomic && !sdkp->max_atomic_with_boundary) ||
+	    sdkp->protection_type == T10_PI_TYPE2_PROTECTION)
+		return;
+
+	physical_block_size_sectors = sdkp->physical_block_size /
+					sdkp->device->sector_size;
+
+	unit_min = rounddown_pow_of_two(sdkp->atomic_granularity ?
+					sdkp->atomic_granularity :
+					physical_block_size_sectors);
+
+	/*
+	 * Only use atomic boundary when we have the odd scenario of
+	 * sdkp->max_atomic == 0, which the spec does permit.
+	 */
+	if (sdkp->max_atomic) {
+		max_atomic = sdkp->max_atomic;
+		unit_max = rounddown_pow_of_two(sdkp->max_atomic);
+		sdkp->use_atomic_write_boundary = 0;
+	} else {
+		max_atomic = sdkp->max_atomic_with_boundary;
+		unit_max = rounddown_pow_of_two(sdkp->max_atomic_boundary);
+		sdkp->use_atomic_write_boundary = 1;
+	}
+
+	/*
+	 * Ensure compliance with granularity and alignment. For now, keep it
+	 * simple and just don't support atomic writes for values mismatched
+	 * with max_{boundary}atomic, physical block size, and
+	 * atomic_granularity itself.
+	 *
+	 * We're really being distrustful by checking unit_max also...
+	 */
+	if (sdkp->atomic_granularity > 1) {
+		if (unit_min > 1 && unit_min % sdkp->atomic_granularity)
+			return;
+		if (unit_max > 1 && unit_max % sdkp->atomic_granularity)
+			return;
+	}
+
+	if (sdkp->atomic_alignment > 1) {
+		if (unit_min > 1 && unit_min % sdkp->atomic_alignment)
+			return;
+		if (unit_max > 1 && unit_max % sdkp->atomic_alignment)
+			return;
+	}
+
+	blk_queue_atomic_write_max_bytes(q, max_atomic * logical_block_size);
+	blk_queue_atomic_write_unit_min_sectors(q, unit_min);
+	blk_queue_atomic_write_unit_max_sectors(q, unit_max);
+	blk_queue_atomic_write_boundary_bytes(q, 0);
+}
+
 static blk_status_t sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd,
 		bool unmap)
 {
@@ -3071,7 +3130,7 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 		sdkp->max_ws_blocks = (u32)get_unaligned_be64(&vpd->data[36]);
 
 		if (!sdkp->lbpme)
-			goto out;
+			goto read_atomics;
 
 		lba_count = get_unaligned_be32(&vpd->data[20]);
 		desc_count = get_unaligned_be32(&vpd->data[24]);
@@ -3102,6 +3161,14 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 			else
 				sd_config_discard(sdkp, SD_LBP_DISABLE);
 		}
+read_atomics:
+		sdkp->max_atomic = get_unaligned_be32(&vpd->data[44]);
+		sdkp->atomic_alignment = get_unaligned_be32(&vpd->data[48]);
+		sdkp->atomic_granularity = get_unaligned_be32(&vpd->data[52]);
+		sdkp->max_atomic_with_boundary = get_unaligned_be32(&vpd->data[56]);
+		sdkp->max_atomic_boundary = get_unaligned_be32(&vpd->data[60]);
+
+		sd_config_atomic(sdkp);
 	}
 
  out:
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 409dda5350d1..990188a56b51 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -121,6 +121,13 @@ struct scsi_disk {
 	u32		max_unmap_blocks;
 	u32		unmap_granularity;
 	u32		unmap_alignment;
+
+	u32		max_atomic;
+	u32		atomic_alignment;
+	u32		atomic_granularity;
+	u32		max_atomic_with_boundary;
+	u32		max_atomic_boundary;
+
 	u32		index;
 	unsigned int	physical_block_size;
 	unsigned int	max_medium_access_timeouts;
@@ -151,6 +158,7 @@ struct scsi_disk {
 	unsigned	urswrz : 1;
 	unsigned	security : 1;
 	unsigned	ignore_medium_access_errors : 1;
+	unsigned	use_atomic_write_boundary : 1;
 };
 #define to_scsi_disk(obj) container_of(obj, struct scsi_disk, disk_dev)
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 13/16] scsi: sd: Add WRITE_ATOMIC_16 support
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (11 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 12/16] scsi: sd: Support reading atomic write properties from block limits VPD John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-12 11:08 ` [PATCH v2 14/16] scsi: scsi_debug: Atomic write support John Garry
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	John Garry

Add function sd_setup_atomic_cmnd() to setup an WRITE_ATOMIC_16
CDB for when REQ_ATOMIC flag is set for the request.

Also add trace info.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_trace.c   | 22 ++++++++++++++++++++++
 drivers/scsi/sd.c           | 24 ++++++++++++++++++++++++
 include/scsi/scsi_proto.h   |  1 +
 include/trace/events/scsi.h |  1 +
 4 files changed, 48 insertions(+)

diff --git a/drivers/scsi/scsi_trace.c b/drivers/scsi/scsi_trace.c
index 41a950075913..3e47c4472a80 100644
--- a/drivers/scsi/scsi_trace.c
+++ b/drivers/scsi/scsi_trace.c
@@ -325,6 +325,26 @@ scsi_trace_zbc_out(struct trace_seq *p, unsigned char *cdb, int len)
 	return ret;
 }
 
+static const char *
+scsi_trace_atomic_write16_out(struct trace_seq *p, unsigned char *cdb, int len)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	unsigned int boundary_size;
+	unsigned int nr_blocks;
+	sector_t lba;
+
+	lba = get_unaligned_be64(&cdb[2]);
+	boundary_size = get_unaligned_be16(&cdb[10]);
+	nr_blocks = get_unaligned_be16(&cdb[12]);
+
+	trace_seq_printf(p, "lba=%llu txlen=%u boundary_size=%u",
+			  lba, nr_blocks, boundary_size);
+
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
 static const char *
 scsi_trace_varlen(struct trace_seq *p, unsigned char *cdb, int len)
 {
@@ -385,6 +405,8 @@ scsi_trace_parse_cdb(struct trace_seq *p, unsigned char *cdb, int len)
 		return scsi_trace_zbc_in(p, cdb, len);
 	case ZBC_OUT:
 		return scsi_trace_zbc_out(p, cdb, len);
+	case WRITE_ATOMIC_16:
+		return scsi_trace_atomic_write16_out(p, cdb, len);
 	default:
 		return scsi_trace_misc(p, cdb, len);
 	}
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fa857a08341f..10942e322253 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1240,6 +1240,26 @@ static int sd_cdl_dld(struct scsi_disk *sdkp, struct scsi_cmnd *scmd)
 	return (hint - IOPRIO_HINT_DEV_DURATION_LIMIT_1) + 1;
 }
 
+static blk_status_t sd_setup_atomic_cmnd(struct scsi_cmnd *cmd,
+					sector_t lba, unsigned int nr_blocks,
+					bool boundary, unsigned char flags)
+{
+	cmd->cmd_len  = 16;
+	cmd->cmnd[0]  = WRITE_ATOMIC_16;
+	cmd->cmnd[1]  = flags;
+	put_unaligned_be64(lba, &cmd->cmnd[2]);
+	put_unaligned_be16(nr_blocks, &cmd->cmnd[12]);
+	if (boundary)
+		put_unaligned_be16(nr_blocks, &cmd->cmnd[10]);
+	else
+		put_unaligned_be16(0, &cmd->cmnd[10]);
+	put_unaligned_be16(nr_blocks, &cmd->cmnd[12]);
+	cmd->cmnd[14] = 0;
+	cmd->cmnd[15] = 0;
+
+	return BLK_STS_OK;
+}
+
 static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 {
 	struct request *rq = scsi_cmd_to_rq(cmd);
@@ -1311,6 +1331,10 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	if (protect && sdkp->protection_type == T10_PI_TYPE2_PROTECTION) {
 		ret = sd_setup_rw32_cmnd(cmd, write, lba, nr_blocks,
 					 protect | fua, dld);
+	} else if (rq->cmd_flags & REQ_ATOMIC && write) {
+		ret = sd_setup_atomic_cmnd(cmd, lba, nr_blocks,
+				sdkp->use_atomic_write_boundary,
+				protect | fua);
 	} else if (sdp->use_16_for_rw || (nr_blocks > 0xffff)) {
 		ret = sd_setup_rw16_cmnd(cmd, write, lba, nr_blocks,
 					 protect | fua, dld);
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 07d65c1f59db..833de67305b5 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -119,6 +119,7 @@
 #define WRITE_SAME_16	      0x93
 #define ZBC_OUT		      0x94
 #define ZBC_IN		      0x95
+#define WRITE_ATOMIC_16	0x9c
 #define SERVICE_ACTION_BIDIRECTIONAL 0x9d
 #define SERVICE_ACTION_IN_16  0x9e
 #define SERVICE_ACTION_OUT_16 0x9f
diff --git a/include/trace/events/scsi.h b/include/trace/events/scsi.h
index 8e2d9b1b0e77..05f1945ed204 100644
--- a/include/trace/events/scsi.h
+++ b/include/trace/events/scsi.h
@@ -102,6 +102,7 @@
 		scsi_opcode_name(WRITE_32),			\
 		scsi_opcode_name(WRITE_SAME_32),		\
 		scsi_opcode_name(ATA_16),			\
+		scsi_opcode_name(WRITE_ATOMIC_16),		\
 		scsi_opcode_name(ATA_12))
 
 #define scsi_hostbyte_name(result)	{ result, #result }
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 14/16] scsi: scsi_debug: Atomic write support
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (12 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 13/16] scsi: sd: Add WRITE_ATOMIC_16 support John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-12 11:08 ` [PATCH v2 15/16] nvme: Support atomic writes John Garry
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	John Garry

Add initial support for atomic writes.

As is standard method, feed device properties via modules param, those
being:
- atomic_max_size_blks
- atomic_alignment_blks
- atomic_granularity_blks
- atomic_max_size_with_boundary_blks
- atomic_max_boundary_blks

These just match sbc4r22 section 6.6.4 - Block limits VPD page.

We just support ATOMIC_WRITE_16.

The major change in the driver is how we lock the device for RW accesses.

Currently the driver uses a per-device lock for accessing device metadata
and "media" data (calls to do_device_access()) atomically for the duration
of the whole read/write command.

This should not suit verifying atomic writes. Reason being that currently
all reads/writes are atomic, so using atomic writes does not prove
anything.

Change device access model to basis that regular writes only atomic on a
per-sector basis, while reads and atomic writes are fully atomic.

As mentioned, since accessing metadata and device media is atomic,
continue to have regular writes involving metadata - like discard or PI -
as atomic. We can improve this later.

Currently we only support model where overlapping going reads or writes
wait for current access to complete before commencing an atomic write.
This is described in 4.29.3.2 section of the SBC. However, we simplify,
things and wait for all accesses to complete (when issuing an atomic
write).

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_debug.c | 590 +++++++++++++++++++++++++++++---------
 1 file changed, 457 insertions(+), 133 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 512fae20c56c..645571b0fd2c 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -68,6 +68,8 @@ static const char *sdebug_version_date = "20210520";
 
 /* Additional Sense Code (ASC) */
 #define NO_ADDITIONAL_SENSE 0x0
+#define OVERLAP_ATOMIC_COMMAND_ASC 0x0
+#define OVERLAP_ATOMIC_COMMAND_ASCQ 0x23
 #define LOGICAL_UNIT_NOT_READY 0x4
 #define LOGICAL_UNIT_COMMUNICATION_FAILURE 0x8
 #define UNRECOVERED_READ_ERR 0x11
@@ -102,6 +104,7 @@ static const char *sdebug_version_date = "20210520";
 #define READ_BOUNDARY_ASCQ 0x7
 #define ATTEMPT_ACCESS_GAP 0x9
 #define INSUFF_ZONE_ASCQ 0xe
+/* see drivers/scsi/sense_codes.h */
 
 /* Additional Sense Code Qualifier (ASCQ) */
 #define ACK_NAK_TO 0x3
@@ -151,6 +154,12 @@ static const char *sdebug_version_date = "20210520";
 #define DEF_VIRTUAL_GB   0
 #define DEF_VPD_USE_HOSTNO 1
 #define DEF_WRITESAME_LENGTH 0xFFFF
+#define DEF_ATOMIC_WR 0
+#define DEF_ATOMIC_WR_MAX_LENGTH 8192
+#define DEF_ATOMIC_WR_ALIGN 2
+#define DEF_ATOMIC_WR_GRAN 2
+#define DEF_ATOMIC_WR_MAX_LENGTH_BNDRY (DEF_ATOMIC_WR_MAX_LENGTH)
+#define DEF_ATOMIC_WR_MAX_BNDRY 128
 #define DEF_STRICT 0
 #define DEF_STATISTICS false
 #define DEF_SUBMIT_QUEUES 1
@@ -373,7 +382,9 @@ struct sdebug_host_info {
 
 /* There is an xarray of pointers to this struct's objects, one per host */
 struct sdeb_store_info {
-	rwlock_t macc_lck;	/* for atomic media access on this store */
+	rwlock_t macc_data_lck;	/* for media data access on this store */
+	rwlock_t macc_meta_lck;	/* for atomic media meta access on this store */
+	rwlock_t macc_sector_lck;	/* per-sector media data access on this store */
 	u8 *storep;		/* user data storage (ram) */
 	struct t10_pi_tuple *dif_storep; /* protection info */
 	void *map_storep;	/* provisioning map */
@@ -397,12 +408,20 @@ struct sdebug_defer {
 	enum sdeb_defer_type defer_t;
 };
 
+struct sdebug_device_access_info {
+	bool atomic_write;
+	u64 lba;
+	u32 num;
+	struct scsi_cmnd *self;
+};
+
 struct sdebug_queued_cmd {
 	/* corresponding bit set in in_use_bm[] in owning struct sdebug_queue
 	 * instance indicates this slot is in use.
 	 */
 	struct sdebug_defer sd_dp;
 	struct scsi_cmnd *scmd;
+	struct sdebug_device_access_info *i;
 };
 
 struct sdebug_scsi_cmd {
@@ -462,7 +481,8 @@ enum sdeb_opcode_index {
 	SDEB_I_PRE_FETCH = 29,		/* 10, 16 */
 	SDEB_I_ZONE_OUT = 30,		/* 0x94+SA; includes no data xfer */
 	SDEB_I_ZONE_IN = 31,		/* 0x95+SA; all have data-in */
-	SDEB_I_LAST_ELEM_P1 = 32,	/* keep this last (previous + 1) */
+	SDEB_I_ATOMIC_WRITE_16 = 32,
+	SDEB_I_LAST_ELEM_P1 = 33,	/* keep this last (previous + 1) */
 };
 
 
@@ -496,7 +516,8 @@ static const unsigned char opcode_ind_arr[256] = {
 	0, 0, 0, SDEB_I_VERIFY,
 	SDEB_I_PRE_FETCH, SDEB_I_SYNC_CACHE, 0, SDEB_I_WRITE_SAME,
 	SDEB_I_ZONE_OUT, SDEB_I_ZONE_IN, 0, 0,
-	0, 0, 0, 0, 0, 0, SDEB_I_SERV_ACT_IN_16, SDEB_I_SERV_ACT_OUT_16,
+	0, 0, 0, 0,
+	SDEB_I_ATOMIC_WRITE_16, 0, SDEB_I_SERV_ACT_IN_16, SDEB_I_SERV_ACT_OUT_16,
 /* 0xa0; 0xa0->0xbf: 12 byte cdbs */
 	SDEB_I_REPORT_LUNS, SDEB_I_ATA_PT, 0, SDEB_I_MAINT_IN,
 	     SDEB_I_MAINT_OUT, 0, 0, 0,
@@ -544,6 +565,7 @@ static int resp_write_buffer(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_sync_cache(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_pre_fetch(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_report_zones(struct scsi_cmnd *, struct sdebug_dev_info *);
+static int resp_atomic_write(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_open_zone(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_close_zone(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_finish_zone(struct scsi_cmnd *, struct sdebug_dev_info *);
@@ -782,6 +804,11 @@ static const struct opcode_info_t opcode_info_arr[SDEB_I_LAST_ELEM_P1 + 1] = {
 	    resp_report_zones, zone_in_iarr, /* ZONE_IN(16), REPORT ZONES) */
 		{16,  0x0 /* SA */, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 		 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xbf, 0xc7} },
+/* 31 */
+	{0, 0x0, 0x0, F_D_OUT | FF_MEDIA_IO,
+	    resp_atomic_write, NULL, /* ATOMIC WRITE 16 */
+		{16,  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff} },
 /* sentinel */
 	{0xff, 0, 0, 0, NULL, NULL,		/* terminating element */
 	    {0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} },
@@ -829,6 +856,13 @@ static unsigned int sdebug_unmap_granularity = DEF_UNMAP_GRANULARITY;
 static unsigned int sdebug_unmap_max_blocks = DEF_UNMAP_MAX_BLOCKS;
 static unsigned int sdebug_unmap_max_desc = DEF_UNMAP_MAX_DESC;
 static unsigned int sdebug_write_same_length = DEF_WRITESAME_LENGTH;
+static unsigned int sdebug_atomic_wr = DEF_ATOMIC_WR;
+static unsigned int sdebug_atomic_wr_max_length = DEF_ATOMIC_WR_MAX_LENGTH;
+static unsigned int sdebug_atomic_wr_align = DEF_ATOMIC_WR_ALIGN;
+static unsigned int sdebug_atomic_wr_gran = DEF_ATOMIC_WR_GRAN;
+static unsigned int sdebug_atomic_wr_max_length_bndry =
+			DEF_ATOMIC_WR_MAX_LENGTH_BNDRY;
+static unsigned int sdebug_atomic_wr_max_bndry = DEF_ATOMIC_WR_MAX_BNDRY;
 static int sdebug_uuid_ctl = DEF_UUID_CTL;
 static bool sdebug_random = DEF_RANDOM;
 static bool sdebug_per_host_store = DEF_PER_HOST_STORE;
@@ -1177,6 +1211,11 @@ static inline bool scsi_debug_lbp(void)
 		(sdebug_lbpu || sdebug_lbpws || sdebug_lbpws10);
 }
 
+static inline bool scsi_debug_atomic_write(void)
+{
+	return 0 == sdebug_fake_rw && sdebug_atomic_wr;
+}
+
 static void *lba2fake_store(struct sdeb_store_info *sip,
 			    unsigned long long lba)
 {
@@ -1804,6 +1843,14 @@ static int inquiry_vpd_b0(unsigned char *arr)
 	/* Maximum WRITE SAME Length */
 	put_unaligned_be64(sdebug_write_same_length, &arr[32]);
 
+	if (sdebug_atomic_wr) {
+		put_unaligned_be32(sdebug_atomic_wr_max_length, &arr[40]);
+		put_unaligned_be32(sdebug_atomic_wr_align, &arr[44]);
+		put_unaligned_be32(sdebug_atomic_wr_gran, &arr[48]);
+		put_unaligned_be32(sdebug_atomic_wr_max_length_bndry, &arr[52]);
+		put_unaligned_be32(sdebug_atomic_wr_max_bndry, &arr[56]);
+	}
+
 	return 0x3c; /* Mandatory page length for Logical Block Provisioning */
 }
 
@@ -3305,15 +3352,240 @@ static inline struct sdeb_store_info *devip2sip(struct sdebug_dev_info *devip,
 	return xa_load(per_store_ap, devip->sdbg_host->si_idx);
 }
 
+
+static inline void
+sdeb_read_lock(rwlock_t *lock)
+{
+	if (sdebug_no_rwlock)
+		__acquire(lock);
+	else
+		read_lock(lock);
+}
+
+static inline void
+sdeb_read_unlock(rwlock_t *lock)
+{
+	if (sdebug_no_rwlock)
+		__release(lock);
+	else
+		read_unlock(lock);
+}
+
+static inline void
+sdeb_write_lock(rwlock_t *lock)
+{
+	if (sdebug_no_rwlock)
+		__acquire(lock);
+	else
+		write_lock(lock);
+}
+
+static inline void
+sdeb_write_unlock(rwlock_t *lock)
+{
+	if (sdebug_no_rwlock)
+		__release(lock);
+	else
+		write_unlock(lock);
+}
+
+static inline void
+sdeb_data_read_lock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_read_lock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_read_unlock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_read_unlock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_write_lock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_write_lock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_write_unlock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_write_unlock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_sector_read_lock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_read_lock(&sip->macc_sector_lck);
+}
+
+static inline void
+sdeb_data_sector_read_unlock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_read_unlock(&sip->macc_sector_lck);
+}
+
+static inline void
+sdeb_data_sector_write_lock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_write_lock(&sip->macc_sector_lck);
+}
+
+static inline void
+sdeb_data_sector_write_unlock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_write_unlock(&sip->macc_sector_lck);
+}
+
+/*
+Atomic locking:
+We simplify the atomic model to allow only 1x atomic
+write and many non-atomic reads or writes for all
+LBAs.
+
+A RW lock has a similar bahaviour:
+Only 1x writer and many readers.
+
+So use a RW lock for per-device read and write locking:
+An atomic access grabs the lock as a writer and
+non-atomic grabs the lock as a reader.
+*/
+
+static inline void
+sdeb_data_lock(struct sdeb_store_info *sip, bool atomic_write)
+{
+	if (atomic_write)
+		sdeb_data_write_lock(sip);
+	else
+		sdeb_data_read_lock(sip);
+}
+
+static inline void
+sdeb_data_unlock(struct sdeb_store_info *sip, bool atomic_write)
+{
+	if (atomic_write)
+		sdeb_data_write_unlock(sip);
+	else
+		sdeb_data_read_unlock(sip);
+}
+
+/* Allow many reads but only 1x write per sector */
+static inline void
+sdeb_data_sector_lock(struct sdeb_store_info *sip, bool do_write)
+{
+	if (do_write)
+		sdeb_data_sector_write_lock(sip);
+	else
+		sdeb_data_sector_read_lock(sip);
+}
+
+static inline void
+sdeb_data_sector_unlock(struct sdeb_store_info *sip, bool do_write)
+{
+	if (do_write)
+		sdeb_data_sector_write_unlock(sip);
+	else
+		sdeb_data_sector_read_unlock(sip);
+}
+
+static inline void
+sdeb_meta_read_lock(struct sdeb_store_info *sip)
+{
+	if (sdebug_no_rwlock) {
+		if (sip)
+			__acquire(&sip->macc_meta_lck);
+		else
+			__acquire(&sdeb_fake_rw_lck);
+	} else {
+		if (sip)
+			read_lock(&sip->macc_meta_lck);
+		else
+			read_lock(&sdeb_fake_rw_lck);
+	}
+}
+
+static inline void
+sdeb_meta_read_unlock(struct sdeb_store_info *sip)
+{
+	if (sdebug_no_rwlock) {
+		if (sip)
+			__release(&sip->macc_meta_lck);
+		else
+			__release(&sdeb_fake_rw_lck);
+	} else {
+		if (sip)
+			read_unlock(&sip->macc_meta_lck);
+		else
+			read_unlock(&sdeb_fake_rw_lck);
+	}
+}
+
+static inline void
+sdeb_meta_write_lock(struct sdeb_store_info *sip)
+{
+	if (sdebug_no_rwlock) {
+		if (sip)
+			__acquire(&sip->macc_meta_lck);
+		else
+			__acquire(&sdeb_fake_rw_lck);
+	} else {
+		if (sip)
+			write_lock(&sip->macc_meta_lck);
+		else
+			write_lock(&sdeb_fake_rw_lck);
+	}
+}
+
+static inline void
+sdeb_meta_write_unlock(struct sdeb_store_info *sip)
+{
+	if (sdebug_no_rwlock) {
+		if (sip)
+			__release(&sip->macc_meta_lck);
+		else
+			__release(&sdeb_fake_rw_lck);
+	} else {
+		if (sip)
+			write_unlock(&sip->macc_meta_lck);
+		else
+			write_unlock(&sdeb_fake_rw_lck);
+	}
+}
+
 /* Returns number of bytes copied or -1 if error. */
 static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp,
-			    u32 sg_skip, u64 lba, u32 num, bool do_write)
+			    u32 sg_skip, u64 lba, u32 num, bool do_write,
+			    bool atomic_write)
 {
 	int ret;
-	u64 block, rest = 0;
+	u64 block;
 	enum dma_data_direction dir;
 	struct scsi_data_buffer *sdb = &scp->sdb;
 	u8 *fsp;
+	int i;
+
+	/*
+	 * Even though reads are inherently atomic (in this driver), we expect
+	 * the atomic flag only for writes.
+	 */
+	if (!do_write && atomic_write)
+		return -1;
 
 	if (do_write) {
 		dir = DMA_TO_DEVICE;
@@ -3329,21 +3601,26 @@ static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp,
 	fsp = sip->storep;
 
 	block = do_div(lba, sdebug_store_sectors);
-	if (block + num > sdebug_store_sectors)
-		rest = block + num - sdebug_store_sectors;
 
-	ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
+	/* Only allow 1x atomic write or multiple non-atomic writes at any given time */
+	sdeb_data_lock(sip, atomic_write);
+	for (i = 0; i < num; i++) {
+		/* We shouldn't need to lock for atomic writes, but do it anyway */
+		sdeb_data_sector_lock(sip, do_write);
+		ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 		   fsp + (block * sdebug_sector_size),
-		   (num - rest) * sdebug_sector_size, sg_skip, do_write);
-	if (ret != (num - rest) * sdebug_sector_size)
-		return ret;
-
-	if (rest) {
-		ret += sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
-			    fsp, rest * sdebug_sector_size,
-			    sg_skip + ((num - rest) * sdebug_sector_size),
-			    do_write);
+		   sdebug_sector_size, sg_skip, do_write);
+		sdeb_data_sector_unlock(sip, do_write);
+		if (ret != sdebug_sector_size) {
+			ret += (i * sdebug_sector_size);
+			break;
+		}
+		sg_skip += sdebug_sector_size;
+		if (++block >= sdebug_store_sectors)
+			block = 0;
 	}
+	ret = num * sdebug_sector_size;
+	sdeb_data_unlock(sip, atomic_write);
 
 	return ret;
 }
@@ -3519,70 +3796,6 @@ static int prot_verify_read(struct scsi_cmnd *scp, sector_t start_sec,
 	return ret;
 }
 
-static inline void
-sdeb_read_lock(struct sdeb_store_info *sip)
-{
-	if (sdebug_no_rwlock) {
-		if (sip)
-			__acquire(&sip->macc_lck);
-		else
-			__acquire(&sdeb_fake_rw_lck);
-	} else {
-		if (sip)
-			read_lock(&sip->macc_lck);
-		else
-			read_lock(&sdeb_fake_rw_lck);
-	}
-}
-
-static inline void
-sdeb_read_unlock(struct sdeb_store_info *sip)
-{
-	if (sdebug_no_rwlock) {
-		if (sip)
-			__release(&sip->macc_lck);
-		else
-			__release(&sdeb_fake_rw_lck);
-	} else {
-		if (sip)
-			read_unlock(&sip->macc_lck);
-		else
-			read_unlock(&sdeb_fake_rw_lck);
-	}
-}
-
-static inline void
-sdeb_write_lock(struct sdeb_store_info *sip)
-{
-	if (sdebug_no_rwlock) {
-		if (sip)
-			__acquire(&sip->macc_lck);
-		else
-			__acquire(&sdeb_fake_rw_lck);
-	} else {
-		if (sip)
-			write_lock(&sip->macc_lck);
-		else
-			write_lock(&sdeb_fake_rw_lck);
-	}
-}
-
-static inline void
-sdeb_write_unlock(struct sdeb_store_info *sip)
-{
-	if (sdebug_no_rwlock) {
-		if (sip)
-			__release(&sip->macc_lck);
-		else
-			__release(&sdeb_fake_rw_lck);
-	} else {
-		if (sip)
-			write_unlock(&sip->macc_lck);
-		else
-			write_unlock(&sdeb_fake_rw_lck);
-	}
-}
-
 static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 {
 	bool check_prot;
@@ -3592,6 +3805,7 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 	u64 lba;
 	struct sdeb_store_info *sip = devip2sip(devip, true);
 	u8 *cmd = scp->cmnd;
+	bool meta_data_locked = false;
 
 	switch (cmd[0]) {
 	case READ_16:
@@ -3650,6 +3864,10 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		atomic_set(&sdeb_inject_pending, 0);
 	}
 
+	/*
+	 * When checking device access params, for reads we only check data
+	 * versus what is set at init time, so no need to lock.
+	 */
 	ret = check_device_access_params(scp, lba, num, false);
 	if (ret)
 		return ret;
@@ -3669,29 +3887,33 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		return check_condition_result;
 	}
 
-	sdeb_read_lock(sip);
+	if (sdebug_dev_is_zoned(devip) ||
+	    (sdebug_dix && scsi_prot_sg_count(scp)))  {
+		sdeb_meta_read_lock(sip);
+		meta_data_locked = true;
+	}
 
 	/* DIX + T10 DIF */
 	if (unlikely(sdebug_dix && scsi_prot_sg_count(scp))) {
 		switch (prot_verify_read(scp, lba, num, ei_lba)) {
 		case 1: /* Guard tag error */
 			if (cmd[1] >> 5 != 3) { /* RDPROTECT != 3 */
-				sdeb_read_unlock(sip);
+				sdeb_meta_read_unlock(sip);
 				mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 1);
 				return check_condition_result;
 			} else if (scp->prot_flags & SCSI_PROT_GUARD_CHECK) {
-				sdeb_read_unlock(sip);
+				sdeb_meta_read_unlock(sip);
 				mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 1);
 				return illegal_condition_result;
 			}
 			break;
 		case 3: /* Reference tag error */
 			if (cmd[1] >> 5 != 3) { /* RDPROTECT != 3 */
-				sdeb_read_unlock(sip);
+				sdeb_meta_read_unlock(sip);
 				mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 3);
 				return check_condition_result;
 			} else if (scp->prot_flags & SCSI_PROT_REF_CHECK) {
-				sdeb_read_unlock(sip);
+				sdeb_meta_read_unlock(sip);
 				mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 3);
 				return illegal_condition_result;
 			}
@@ -3699,8 +3921,9 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		}
 	}
 
-	ret = do_device_access(sip, scp, 0, lba, num, false);
-	sdeb_read_unlock(sip);
+	ret = do_device_access(sip, scp, 0, lba, num, false, false);
+	if (meta_data_locked)
+		sdeb_meta_read_unlock(sip);
 	if (unlikely(ret == -1))
 		return DID_ERROR << 16;
 
@@ -3889,6 +4112,7 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 	u64 lba;
 	struct sdeb_store_info *sip = devip2sip(devip, true);
 	u8 *cmd = scp->cmnd;
+	bool meta_data_locked = false;
 
 	switch (cmd[0]) {
 	case WRITE_16:
@@ -3942,10 +4166,17 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 				    "to DIF device\n");
 	}
 
-	sdeb_write_lock(sip);
+	if (sdebug_dev_is_zoned(devip) ||
+	    (sdebug_dix && scsi_prot_sg_count(scp)) ||
+	    scsi_debug_lbp())  {
+		sdeb_meta_write_lock(sip);
+		meta_data_locked = true;
+	}
+
 	ret = check_device_access_params(scp, lba, num, true);
 	if (ret) {
-		sdeb_write_unlock(sip);
+		if (meta_data_locked)
+			sdeb_meta_write_unlock(sip);
 		return ret;
 	}
 
@@ -3954,22 +4185,22 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		switch (prot_verify_write(scp, lba, num, ei_lba)) {
 		case 1: /* Guard tag error */
 			if (scp->prot_flags & SCSI_PROT_GUARD_CHECK) {
-				sdeb_write_unlock(sip);
+				sdeb_meta_write_unlock(sip);
 				mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 1);
 				return illegal_condition_result;
 			} else if (scp->cmnd[1] >> 5 != 3) { /* WRPROTECT != 3 */
-				sdeb_write_unlock(sip);
+				sdeb_meta_write_unlock(sip);
 				mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 1);
 				return check_condition_result;
 			}
 			break;
 		case 3: /* Reference tag error */
 			if (scp->prot_flags & SCSI_PROT_REF_CHECK) {
-				sdeb_write_unlock(sip);
+				sdeb_meta_write_unlock(sip);
 				mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 3);
 				return illegal_condition_result;
 			} else if (scp->cmnd[1] >> 5 != 3) { /* WRPROTECT != 3 */
-				sdeb_write_unlock(sip);
+				sdeb_meta_write_unlock(sip);
 				mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 3);
 				return check_condition_result;
 			}
@@ -3977,13 +4208,16 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		}
 	}
 
-	ret = do_device_access(sip, scp, 0, lba, num, true);
+	ret = do_device_access(sip, scp, 0, lba, num, true, false);
 	if (unlikely(scsi_debug_lbp()))
 		map_region(sip, lba, num);
+
 	/* If ZBC zone then bump its write pointer */
 	if (sdebug_dev_is_zoned(devip))
 		zbc_inc_wp(devip, lba, num);
-	sdeb_write_unlock(sip);
+	if (meta_data_locked)
+		sdeb_meta_write_unlock(sip);
+
 	if (unlikely(-1 == ret))
 		return DID_ERROR << 16;
 	else if (unlikely(sdebug_verbose &&
@@ -4090,7 +4324,8 @@ static int resp_write_scat(struct scsi_cmnd *scp,
 		goto err_out;
 	}
 
-	sdeb_write_lock(sip);
+	/* Just keep it simple and always lock for now */
+	sdeb_meta_write_lock(sip);
 	sg_off = lbdof_blen;
 	/* Spec says Buffer xfer Length field in number of LBs in dout */
 	cum_lb = 0;
@@ -4133,7 +4368,11 @@ static int resp_write_scat(struct scsi_cmnd *scp,
 			}
 		}
 
-		ret = do_device_access(sip, scp, sg_off, lba, num, true);
+		/*
+		 * Write ranges atomically to keep as close to pre-atomic
+		 * writes behaviour as possible.
+		 */
+		ret = do_device_access(sip, scp, sg_off, lba, num, true, true);
 		/* If ZBC zone then bump its write pointer */
 		if (sdebug_dev_is_zoned(devip))
 			zbc_inc_wp(devip, lba, num);
@@ -4172,7 +4411,7 @@ static int resp_write_scat(struct scsi_cmnd *scp,
 	}
 	ret = 0;
 err_out_unlock:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 err_out:
 	kfree(lrdp);
 	return ret;
@@ -4191,14 +4430,16 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
 						scp->device->hostdata, true);
 	u8 *fs1p;
 	u8 *fsp;
+	bool meta_data_locked = false;
 
-	sdeb_write_lock(sip);
+	if (sdebug_dev_is_zoned(devip) || scsi_debug_lbp()) {
+		sdeb_meta_write_lock(sip);
+		meta_data_locked = true;
+	}
 
 	ret = check_device_access_params(scp, lba, num, true);
-	if (ret) {
-		sdeb_write_unlock(sip);
-		return ret;
-	}
+	if (ret)
+		goto out;
 
 	if (unmap && scsi_debug_lbp()) {
 		unmap_region(sip, lba, num);
@@ -4209,6 +4450,7 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
 	/* if ndob then zero 1 logical block, else fetch 1 logical block */
 	fsp = sip->storep;
 	fs1p = fsp + (block * lb_size);
+	sdeb_data_write_lock(sip);
 	if (ndob) {
 		memset(fs1p, 0, lb_size);
 		ret = 0;
@@ -4216,8 +4458,8 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
 		ret = fetch_to_dev_buffer(scp, fs1p, lb_size);
 
 	if (-1 == ret) {
-		sdeb_write_unlock(sip);
-		return DID_ERROR << 16;
+		ret = DID_ERROR << 16;
+		goto out;
 	} else if (sdebug_verbose && !ndob && (ret < lb_size))
 		sdev_printk(KERN_INFO, scp->device,
 			    "%s: %s: lb size=%u, IO sent=%d bytes\n",
@@ -4234,10 +4476,12 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
 	/* If ZBC zone then bump its write pointer */
 	if (sdebug_dev_is_zoned(devip))
 		zbc_inc_wp(devip, lba, num);
+	sdeb_data_write_unlock(sip);
+	ret = 0;
 out:
-	sdeb_write_unlock(sip);
-
-	return 0;
+	if (meta_data_locked)
+		sdeb_meta_write_unlock(sip);
+	return ret;
 }
 
 static int resp_write_same_10(struct scsi_cmnd *scp,
@@ -4380,25 +4624,30 @@ static int resp_comp_write(struct scsi_cmnd *scp,
 		return check_condition_result;
 	}
 
-	sdeb_write_lock(sip);
-
 	ret = do_dout_fetch(scp, dnum, arr);
 	if (ret == -1) {
 		retval = DID_ERROR << 16;
-		goto cleanup;
+		goto cleanup_free;
 	} else if (sdebug_verbose && (ret < (dnum * lb_size)))
 		sdev_printk(KERN_INFO, scp->device, "%s: compare_write: cdb "
 			    "indicated=%u, IO sent=%d bytes\n", my_name,
 			    dnum * lb_size, ret);
+
+	sdeb_data_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 	if (!comp_write_worker(sip, lba, num, arr, false)) {
 		mk_sense_buffer(scp, MISCOMPARE, MISCOMPARE_VERIFY_ASC, 0);
 		retval = check_condition_result;
-		goto cleanup;
+		goto cleanup_unlock;
 	}
+
+	/* Cover sip->map_storep (which map_region()) sets with data lock */
 	if (scsi_debug_lbp())
 		map_region(sip, lba, num);
-cleanup:
-	sdeb_write_unlock(sip);
+cleanup_unlock:
+	sdeb_meta_write_unlock(sip);
+	sdeb_data_write_unlock(sip);
+cleanup_free:
 	kfree(arr);
 	return retval;
 }
@@ -4442,7 +4691,7 @@ static int resp_unmap(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 
 	desc = (void *)&buf[8];
 
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	for (i = 0 ; i < descriptors ; i++) {
 		unsigned long long lba = get_unaligned_be64(&desc[i].lba);
@@ -4458,7 +4707,7 @@ static int resp_unmap(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 	ret = 0;
 
 out:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	kfree(buf);
 
 	return ret;
@@ -4571,12 +4820,13 @@ static int resp_pre_fetch(struct scsi_cmnd *scp,
 		rest = block + nblks - sdebug_store_sectors;
 
 	/* Try to bring the PRE-FETCH range into CPU's cache */
-	sdeb_read_lock(sip);
+	sdeb_data_read_lock(sip);
 	prefetch_range(fsp + (sdebug_sector_size * block),
 		       (nblks - rest) * sdebug_sector_size);
 	if (rest)
 		prefetch_range(fsp, rest * sdebug_sector_size);
-	sdeb_read_unlock(sip);
+
+	sdeb_data_read_unlock(sip);
 fini:
 	if (cmd[1] & 0x2)
 		res = SDEG_RES_IMMED_MASK;
@@ -4735,7 +4985,7 @@ static int resp_verify(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		return check_condition_result;
 	}
 	/* Not changing store, so only need read access */
-	sdeb_read_lock(sip);
+	sdeb_data_read_lock(sip);
 
 	ret = do_dout_fetch(scp, a_num, arr);
 	if (ret == -1) {
@@ -4757,7 +5007,7 @@ static int resp_verify(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		goto cleanup;
 	}
 cleanup:
-	sdeb_read_unlock(sip);
+	sdeb_data_read_unlock(sip);
 	kfree(arr);
 	return ret;
 }
@@ -4803,7 +5053,7 @@ static int resp_report_zones(struct scsi_cmnd *scp,
 		return check_condition_result;
 	}
 
-	sdeb_read_lock(sip);
+	sdeb_meta_read_lock(sip);
 
 	desc = arr + 64;
 	for (lba = zs_lba; lba < sdebug_capacity;
@@ -4901,11 +5151,70 @@ static int resp_report_zones(struct scsi_cmnd *scp,
 	ret = fill_from_dev_buffer(scp, arr, min_t(u32, alloc_len, rep_len));
 
 fini:
-	sdeb_read_unlock(sip);
+	sdeb_meta_read_unlock(sip);
 	kfree(arr);
 	return ret;
 }
 
+static int resp_atomic_write(struct scsi_cmnd *scp,
+			     struct sdebug_dev_info *devip)
+{
+	struct sdeb_store_info *sip;
+	u8 *cmd = scp->cmnd;
+	u16 boundary, len;
+	u64 lba, lba_tmp;
+	int ret;
+
+	if (!scsi_debug_atomic_write()) {
+		mk_sense_invalid_opcode(scp);
+		return check_condition_result;
+	}
+
+	sip = devip2sip(devip, true);
+
+	lba = get_unaligned_be64(cmd + 2);
+	boundary = get_unaligned_be16(cmd + 10);
+	len = get_unaligned_be16(cmd + 12);
+
+	lba_tmp = lba;
+	if (sdebug_atomic_wr_align &&
+	    do_div(lba_tmp, sdebug_atomic_wr_align)) {
+		/* Does not meet alignment requirement */
+		mk_sense_buffer(scp, ILLEGAL_REQUEST, INVALID_FIELD_IN_CDB, 0);
+		return check_condition_result;
+	}
+
+	if (sdebug_atomic_wr_gran && len % sdebug_atomic_wr_gran) {
+		/* Does not meet alignment requirement */
+		mk_sense_buffer(scp, ILLEGAL_REQUEST, INVALID_FIELD_IN_CDB, 0);
+		return check_condition_result;
+	}
+
+	if (boundary > 0) {
+		if (boundary > sdebug_atomic_wr_max_bndry) {
+			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1);
+			return check_condition_result;
+		}
+
+		if (len > sdebug_atomic_wr_max_length_bndry) {
+			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1);
+			return check_condition_result;
+		}
+	} else {
+		if (len > sdebug_atomic_wr_max_length) {
+			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1);
+			return check_condition_result;
+		}
+	}
+
+	ret = do_device_access(sip, scp, 0, lba, len, true, true);
+	if (unlikely(ret == -1))
+		return DID_ERROR << 16;
+	if (unlikely(ret != len * sdebug_sector_size))
+		return DID_ERROR << 16;
+	return 0;
+}
+
 /* Logic transplanted from tcmu-runner, file_zbc.c */
 static void zbc_open_all(struct sdebug_dev_info *devip)
 {
@@ -4932,8 +5241,7 @@ static int resp_open_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		mk_sense_invalid_opcode(scp);
 		return check_condition_result;
 	}
-
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	if (all) {
 		/* Check if all closed zones can be open */
@@ -4982,7 +5290,7 @@ static int resp_open_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 
 	zbc_open_zone(devip, zsp, true);
 fini:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	return res;
 }
 
@@ -5009,7 +5317,7 @@ static int resp_close_zone(struct scsi_cmnd *scp,
 		return check_condition_result;
 	}
 
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	if (all) {
 		zbc_close_all(devip);
@@ -5038,7 +5346,7 @@ static int resp_close_zone(struct scsi_cmnd *scp,
 
 	zbc_close_zone(devip, zsp);
 fini:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	return res;
 }
 
@@ -5081,7 +5389,7 @@ static int resp_finish_zone(struct scsi_cmnd *scp,
 		return check_condition_result;
 	}
 
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	if (all) {
 		zbc_finish_all(devip);
@@ -5110,7 +5418,7 @@ static int resp_finish_zone(struct scsi_cmnd *scp,
 
 	zbc_finish_zone(devip, zsp, true);
 fini:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	return res;
 }
 
@@ -5161,7 +5469,7 @@ static int resp_rwp_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		return check_condition_result;
 	}
 
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	if (all) {
 		zbc_rwp_all(devip);
@@ -5189,7 +5497,7 @@ static int resp_rwp_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 
 	zbc_rwp_zone(devip, zsp);
 fini:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	return res;
 }
 
@@ -5216,6 +5524,7 @@ static void sdebug_q_cmd_complete(struct sdebug_defer *sd_dp)
 	if (!scp) {
 		pr_err("scmd=NULL\n");
 		goto out;
+
 	}
 
 	sdsc = scsi_cmd_priv(scp);
@@ -6153,6 +6462,7 @@ module_param_named(lbprz, sdebug_lbprz, int, S_IRUGO);
 module_param_named(lbpu, sdebug_lbpu, int, S_IRUGO);
 module_param_named(lbpws, sdebug_lbpws, int, S_IRUGO);
 module_param_named(lbpws10, sdebug_lbpws10, int, S_IRUGO);
+module_param_named(atomic_wr, sdebug_atomic_wr, int, S_IRUGO);
 module_param_named(lowest_aligned, sdebug_lowest_aligned, int, S_IRUGO);
 module_param_named(lun_format, sdebug_lun_am_i, int, S_IRUGO | S_IWUSR);
 module_param_named(max_luns, sdebug_max_luns, int, S_IRUGO | S_IWUSR);
@@ -6187,6 +6497,11 @@ module_param_named(unmap_alignment, sdebug_unmap_alignment, int, S_IRUGO);
 module_param_named(unmap_granularity, sdebug_unmap_granularity, int, S_IRUGO);
 module_param_named(unmap_max_blocks, sdebug_unmap_max_blocks, int, S_IRUGO);
 module_param_named(unmap_max_desc, sdebug_unmap_max_desc, int, S_IRUGO);
+module_param_named(atomic_wr_max_length, sdebug_atomic_wr_max_length, int, S_IRUGO);
+module_param_named(atomic_wr_align, sdebug_atomic_wr_align, int, S_IRUGO);
+module_param_named(atomic_wr_gran, sdebug_atomic_wr_gran, int, S_IRUGO);
+module_param_named(atomic_wr_max_length_bndry, sdebug_atomic_wr_max_length_bndry, int, S_IRUGO);
+module_param_named(atomic_wr_max_bndry, sdebug_atomic_wr_max_bndry, int, S_IRUGO);
 module_param_named(uuid_ctl, sdebug_uuid_ctl, int, S_IRUGO);
 module_param_named(virtual_gb, sdebug_virtual_gb, int, S_IRUGO | S_IWUSR);
 module_param_named(vpd_use_hostno, sdebug_vpd_use_hostno, int,
@@ -6230,6 +6545,7 @@ MODULE_PARM_DESC(lbprz,
 MODULE_PARM_DESC(lbpu, "enable LBP, support UNMAP command (def=0)");
 MODULE_PARM_DESC(lbpws, "enable LBP, support WRITE SAME(16) with UNMAP bit (def=0)");
 MODULE_PARM_DESC(lbpws10, "enable LBP, support WRITE SAME(10) with UNMAP bit (def=0)");
+MODULE_PARM_DESC(atomic_write, "enable ATOMIC WRITE support, support WRITE ATOMIC(16) (def=1)");
 MODULE_PARM_DESC(lowest_aligned, "lowest aligned lba (def=0)");
 MODULE_PARM_DESC(lun_format, "LUN format: 0->peripheral (def); 1 --> flat address method");
 MODULE_PARM_DESC(max_luns, "number of LUNs per target to simulate(def=1)");
@@ -6261,6 +6577,11 @@ MODULE_PARM_DESC(unmap_alignment, "lowest aligned thin provisioning lba (def=0)"
 MODULE_PARM_DESC(unmap_granularity, "thin provisioning granularity in blocks (def=1)");
 MODULE_PARM_DESC(unmap_max_blocks, "max # of blocks can be unmapped in one cmd (def=0xffffffff)");
 MODULE_PARM_DESC(unmap_max_desc, "max # of ranges that can be unmapped in one cmd (def=256)");
+MODULE_PARM_DESC(atomic_wr_max_length, "max # of blocks can be atomically written in one cmd (def=8192)");
+MODULE_PARM_DESC(atomic_wr_align, "minimum alignment of atomic write in blocks (def=2)");
+MODULE_PARM_DESC(atomic_wr_gran, "minimum granularity of atomic write in blocks (def=2)");
+MODULE_PARM_DESC(atomic_wr_max_length_bndry, "max # of blocks can be atomically written in one cmd with boundary set (def=8192)");
+MODULE_PARM_DESC(atomic_wr_max_bndry, "max # boundaries per atomic write (def=128)");
 MODULE_PARM_DESC(uuid_ctl,
 		 "1->use uuid for lu name, 0->don't, 2->all use same (def=0)");
 MODULE_PARM_DESC(virtual_gb, "virtual gigabyte (GiB) size (def=0 -> use dev_size_mb)");
@@ -7407,6 +7728,7 @@ static int __init scsi_debug_init(void)
 			return -EINVAL;
 		}
 	}
+
 	xa_init_flags(per_store_ap, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
 	if (want_store) {
 		idx = sdebug_add_store();
@@ -7614,7 +7936,9 @@ static int sdebug_add_store(void)
 			map_region(sip, 0, 2);
 	}
 
-	rwlock_init(&sip->macc_lck);
+	rwlock_init(&sip->macc_data_lck);
+	rwlock_init(&sip->macc_meta_lck);
+	rwlock_init(&sip->macc_sector_lck);
 	return (int)n_idx;
 err:
 	sdebug_erase_store((int)n_idx, sip);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 15/16] nvme: Support atomic writes
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (13 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 14/16] scsi: scsi_debug: Atomic write support John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-12 11:08 ` [PATCH v2 16/16] nvme: Ensure atomic writes will be executed atomically John Garry
  2023-12-12 16:32 ` [PATCH v2 00/16] block atomic writes Christoph Hellwig
  16 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	Alan Adamson, John Garry

From: Alan Adamson <alan.adamson@oracle.com>

Support reading atomic write registers to fill in request_queue
properties.

Use following method to calculate limits:
atomic_write_max_bytes = flp2(NAWUPF ?: AWUPF)
atomic_write_unit_min = logical_block_size
atomic_write_unit_max = flp2(NAWUPF ?: AWUPF)
atomic_write_boundary = NABSPF

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
#jpg: some rewrite
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/nvme/host/core.c | 85 ++++++++++++++++++++++++++++++----------
 1 file changed, 64 insertions(+), 21 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8ebdfd623e0f..fd4f09f9dbe8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1894,12 +1894,68 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 	blk_queue_write_cache(q, vwc, vwc);
 }
 
+static void nvme_update_atomic_write_disk_info(struct gendisk *disk,
+		struct nvme_ns *ns, struct nvme_id_ns *id, u32 bs, u32 phys_bs)
+{
+	unsigned int max_bytes = 0, unit_min = 0, unit_max = 0, boundary = 0;
+	u32 atomic_bs = bs;
+
+	if (id->nabo == 0) {
+		unsigned int ns_boundary;
+
+		/*
+		 * Bit 1 indicates whether NAWUPF is defined for this namespace
+		 * and whether it should be used instead of AWUPF. If NAWUPF ==
+		 * 0 then AWUPF must be used instead.
+		 */
+		if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf)
+			atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
+		else
+			atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs;
+
+		if (le16_to_cpu(id->nabspf))
+			ns_boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
+		else
+			ns_boundary = 0;
+
+		/*
+		 * The boundary size just needs to be a multiple
+		 * of unit_max (and not necessarily a power-of-2), so
+		 * this could be relaxed in the block layer in future.
+		 */
+		if (!ns_boundary || is_power_of_2(ns_boundary)) {
+			max_bytes = atomic_bs;
+			unit_min = bs >> SECTOR_SHIFT;
+			unit_max = rounddown_pow_of_two(atomic_bs) >> SECTOR_SHIFT;
+			boundary = ns_boundary;
+		} else {
+			dev_notice(ns->ctrl->device, "Unsupported atomic write boundary (%d bytes)\n",
+				ns_boundary);
+		}
+	} else {
+		dev_info(ns->ctrl->device, "Atomic writes not supported for NABO set (%d blocks)\n",
+				id->nabo);
+	}
+
+	blk_queue_atomic_write_max_bytes(disk->queue, atomic_bs);
+	blk_queue_atomic_write_unit_min_sectors(disk->queue, bs >> SECTOR_SHIFT);
+	blk_queue_atomic_write_unit_max_sectors(disk->queue,
+		rounddown_pow_of_two(atomic_bs) >> SECTOR_SHIFT);
+	blk_queue_atomic_write_boundary_bytes(disk->queue, boundary);
+
+	/*
+	 * Linux filesystems assume writing a single physical block is
+	 * an atomic operation. Hence limit the physical block size to the
+	 * value of the Atomic Write Unit Power Fail parameter.
+	 */
+	blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));
+}
+
 static void nvme_update_disk_info(struct gendisk *disk,
 		struct nvme_ns *ns, struct nvme_id_ns *id)
 {
 	sector_t capacity = nvme_lba_to_sect(ns, le64_to_cpu(id->nsze));
-	u32 bs = 1U << ns->lba_shift;
-	u32 atomic_bs, phys_bs, io_opt = 0;
+	u32 bs, phys_bs, io_opt = 0;
 
 	/*
 	 * The block layer can't support LBA sizes larger than the page size
@@ -1909,37 +1965,24 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	if (ns->lba_shift > PAGE_SHIFT || ns->lba_shift < SECTOR_SHIFT) {
 		capacity = 0;
 		bs = (1 << 9);
+	} else {
+		bs = 1U << ns->lba_shift;
 	}
 
 	blk_integrity_unregister(disk);
 
-	atomic_bs = phys_bs = bs;
-	if (id->nabo == 0) {
-		/*
-		 * Bit 1 indicates whether NAWUPF is defined for this namespace
-		 * and whether it should be used instead of AWUPF. If NAWUPF ==
-		 * 0 then AWUPF must be used instead.
-		 */
-		if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf)
-			atomic_bs = (1 + le16_to_cpu(id->nawupf)) * bs;
-		else
-			atomic_bs = (1 + ns->ctrl->subsys->awupf) * bs;
-	}
-
 	if (id->nsfeat & NVME_NS_FEAT_IO_OPT) {
 		/* NPWG = Namespace Preferred Write Granularity */
 		phys_bs = bs * (1 + le16_to_cpu(id->npwg));
 		/* NOWS = Namespace Optimal Write Size */
 		io_opt = bs * (1 + le16_to_cpu(id->nows));
+	} else {
+		phys_bs = bs;
 	}
 
+	nvme_update_atomic_write_disk_info(disk, ns, id, bs, phys_bs);
+
 	blk_queue_logical_block_size(disk->queue, bs);
-	/*
-	 * Linux filesystems assume writing a single physical block is
-	 * an atomic operation. Hence limit the physical block size to the
-	 * value of the Atomic Write Unit Power Fail parameter.
-	 */
-	blk_queue_physical_block_size(disk->queue, min(phys_bs, atomic_bs));
 	blk_queue_io_min(disk->queue, phys_bs);
 	blk_queue_io_opt(disk->queue, io_opt);
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v2 16/16] nvme: Ensure atomic writes will be executed atomically
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (14 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 15/16] nvme: Support atomic writes John Garry
@ 2023-12-12 11:08 ` John Garry
  2023-12-12 16:32 ` [PATCH v2 00/16] block atomic writes Christoph Hellwig
  16 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-12 11:08 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, jaswin, bvanassche,
	Alan Adamson, John Garry

From: Alan Adamson <alan.adamson@oracle.com>

There is no dedicated NVMe atomic write command (which may error for a
command which exceeds the controller atomic write limits).

As an insurance policy against the block layer sending requests which
cannot be executed atomically, add a check in the queue path.

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/nvme/host/core.c | 23 +++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  2 ++
 2 files changed, 25 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fd4f09f9dbe8..6b89ee7e9921 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -905,6 +905,26 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 	if (req->cmd_flags & REQ_RAHEAD)
 		dsmgmt |= NVME_RW_DSM_FREQ_PREFETCH;
 
+	/*
+	 * Ensure that nothing has been sent which cannot be executed
+	 * atomically.
+	 */
+	if (req->cmd_flags & REQ_ATOMIC) {
+		if (blk_rq_bytes(req) > ns->atomic_max)
+			return BLK_STS_IOERR;
+
+		if (ns->atomic_boundary) {
+			u32 boundary = ns->atomic_boundary >> ns->lba_shift;
+			u32 imask = ~(boundary - 1);
+			u32 lba = nvme_sect_to_lba(ns, blk_rq_pos(req));
+			u32 end = lba + (blk_rq_bytes(req) >> ns->lba_shift)
+					- 1;
+
+			if ((lba & imask) != (end & imask))
+				return BLK_STS_IOERR;
+		}
+	}
+
 	cmnd->rw.opcode = op;
 	cmnd->rw.flags = 0;
 	cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
@@ -1937,6 +1957,9 @@ static void nvme_update_atomic_write_disk_info(struct gendisk *disk,
 				id->nabo);
 	}
 
+	ns->atomic_max = atomic_bs;
+	ns->atomic_boundary = boundary;
+
 	blk_queue_atomic_write_max_bytes(disk->queue, atomic_bs);
 	blk_queue_atomic_write_unit_min_sectors(disk->queue, bs >> SECTOR_SHIFT);
 	blk_queue_atomic_write_unit_max_sectors(disk->queue,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e7411dac00f7..5a3d76bc816f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -520,6 +520,8 @@ struct nvme_ns {
 
 	struct nvme_fault_inject fault_inject;
 
+	u32 atomic_max;
+	u32 atomic_boundary;
 };
 
 /* NVMe ns supports metadata actions by the controller (generate/strip) */
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
                   ` (15 preceding siblings ...)
  2023-12-12 11:08 ` [PATCH v2 16/16] nvme: Ensure atomic writes will be executed atomically John Garry
@ 2023-12-12 16:32 ` Christoph Hellwig
  2023-12-13  9:32   ` John Garry
  16 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2023-12-12 16:32 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei,
	jaswin, bvanassche

On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote:
> Two new fields are added to struct statx - atomic_write_unit_min and
> atomic_write_unit_max. For each atomic individual write, the total length
> of a write must be a between atomic_write_unit_min and
> atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
> at a natural offset in the file wrt the write length.
> 
> SCSI sd.c and scsi_debug and NVMe kernel support is added.
> 
> Some open questions:
> - How to make API extensible for when we have no HW support? In that case,
>   we would prob not have to follow rule of power-of-2 length et al.
>   As a possible solution, maybe we can say that atomic writes are
>   supported for the file via statx, but not set unit_min and max values,
>   and this means that writes need to be just FS block aligned there.

I don't think the power of two length is much of a problem to be
honest, and if we every want to lift it we can still do that easily
by adding a new flag or limit.

What I'm a lot more worried about is how to tell the file system that
allocations are done right for these requirement.  There is no way
a user can know that allocations in an existing file are properly
aligned, so atomic writes will just fail on existing files.

I suspect we need an on-disk flag that forces allocations to be
aligned to the atomic write limit, in some ways similar how the
XFS rt flag works.  You'd need to set it on an empty file, and all
allocations after that are guaranteed to be properly aligned.

> - For block layer, should atomic_write_unit_max be limited by
>   max_sectors_kb? Currently it is not.

Well.  It must be limited to max_hw_sectors to actually work.
max_sectors is a software limit below that, which with modern hardware
is actually pretty silly and a real performance issue with todays
workloads when people don't tweak it..

> - How to improve requirement that iovecs are PAGE-aligned.
>   There are 2x issues:
>   a. We impose this rule to not split BIOs due to virt boundary for
>      NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
>      16K/64K pages). Easy solution is to impose requirement that iovecs
>      are 4K-aligned.
>   b. We don't enforce this rule for virt boundary == 0, i.e. SCSI

.. we require any device that wants to support atomic writes to not
have that silly limit.  For NVMe that would require SGL support
(and some driver changes I've been wanting to make for long where
we always use SGLs for transfers larger than a single PRP if supported)


> - Since debugging torn-writes due to unwanted kernel BIO splitting/merging
>   would be horrible, should we add some kernel storage stack software
>   integrity checks?

Yes, I think we'll need asserts in the drivers.  At least for NVMe I
will insist on them.  For SCSI I think the device actually checks
because the atomic writes are a different command anyway, or am I
misunderstanding how SCSI works here?


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits
  2023-12-12 11:08 ` [PATCH v2 01/16] block: Add atomic write operations to request_queue limits John Garry
@ 2023-12-13  1:25   ` Ming Lei
  2023-12-13  9:13     ` John Garry
                       ` (2 more replies)
  0 siblings, 3 replies; 73+ messages in thread
From: Ming Lei @ 2023-12-13  1:25 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, jaswin,
	bvanassche, Himanshu Madhani

On Tue, Dec 12, 2023 at 11:08:29AM +0000, John Garry wrote:
> From: Himanshu Madhani <himanshu.madhani@oracle.com>
> 
> Add the following limits:
> - atomic_write_boundary_bytes
> - atomic_write_max_bytes
> - atomic_write_unit_max_bytes
> - atomic_write_unit_min_bytes
> 
> All atomic writes limits are initialised to 0 to indicate no atomic write
> support. Stacked devices are just not supported either for now.
> 
> Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> #jpg: Heavy rewrite
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  Documentation/ABI/stable/sysfs-block | 47 ++++++++++++++++++++++
>  block/blk-settings.c                 | 60 ++++++++++++++++++++++++++++
>  block/blk-sysfs.c                    | 33 +++++++++++++++
>  include/linux/blkdev.h               | 37 +++++++++++++++++
>  4 files changed, 177 insertions(+)
> 
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index 1fe9a553c37b..ba81a081522f 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -21,6 +21,53 @@ Description:
>  		device is offset from the internal allocation unit's
>  		natural alignment.
>  
> +What:		/sys/block/<disk>/atomic_write_max_bytes
> +Date:		May 2023
> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> +Description:
> +		[RO] This parameter specifies the maximum atomic write
> +		size reported by the device. This parameter is relevant
> +		for merging of writes, where a merged atomic write
> +		operation must not exceed this number of bytes.
> +		The atomic_write_max_bytes may exceed the value in
> +		atomic_write_unit_max_bytes if atomic_write_max_bytes
> +		is not a power-of-two or atomic_write_unit_max_bytes is
> +		limited by some queue limits, such as max_segments.
> +
> +
> +What:		/sys/block/<disk>/atomic_write_unit_min_bytes
> +Date:		May 2023
> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> +Description:
> +		[RO] This parameter specifies the smallest block which can
> +		be written atomically with an atomic write operation. All
> +		atomic write operations must begin at a
> +		atomic_write_unit_min boundary and must be multiples of
> +		atomic_write_unit_min. This value must be a power-of-two.
> +
> +
> +What:		/sys/block/<disk>/atomic_write_unit_max_bytes
> +Date:		January 2023
> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> +Description:
> +		[RO] This parameter defines the largest block which can be
> +		written atomically with an atomic write operation. This
> +		value must be a multiple of atomic_write_unit_min and must
> +		be a power-of-two.
> +
> +
> +What:		/sys/block/<disk>/atomic_write_boundary_bytes
> +Date:		May 2023
> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> +Description:
> +		[RO] A device may need to internally split I/Os which
> +		straddle a given logical block address boundary. In that
> +		case a single atomic write operation will be processed as
> +		one of more sub-operations which each complete atomically.
> +		This parameter specifies the size in bytes of the atomic
> +		boundary if one is reported by the device. This value must
> +		be a power-of-two.
> +
>  
>  What:		/sys/block/<disk>/diskseq
>  Date:		February 2021
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 0046b447268f..d151be394c98 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -59,6 +59,10 @@ void blk_set_default_limits(struct queue_limits *lim)
>  	lim->zoned = BLK_ZONED_NONE;
>  	lim->zone_write_granularity = 0;
>  	lim->dma_alignment = 511;
> +	lim->atomic_write_unit_min_sectors = 0;
> +	lim->atomic_write_unit_max_sectors = 0;
> +	lim->atomic_write_max_sectors = 0;
> +	lim->atomic_write_boundary_sectors = 0;

Can we move the four into single structure and setup them in single
API? Then cross-validation can be done in this API.

>  }
>  
>  /**
> @@ -183,6 +187,62 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
>  }
>  EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>  
> +/**
> + * blk_queue_atomic_write_max_bytes - set max bytes supported by
> + * the device for atomic write operations.
> + * @q:  the request queue for the device
> + * @size: maximum bytes supported
> + */
> +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
> +				      unsigned int bytes)
> +{
> +	q->limits.atomic_write_max_sectors = bytes >> SECTOR_SHIFT;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);

What if driver doesn't call it but driver supports atomic write?

I guess the default max sectors should be atomic_write_unit_max_sectors
if the feature is enabled.

> +
> +/**
> + * blk_queue_atomic_write_boundary_bytes - Device's logical block address space
> + * which an atomic write should not cross.
> + * @q:  the request queue for the device
> + * @bytes: must be a power-of-two.
> + */
> +void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
> +					   unsigned int bytes)
> +{
> +	q->limits.atomic_write_boundary_sectors = bytes >> SECTOR_SHIFT;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);

Default atomic_write_boundary_sectors should be
atomic_write_unit_max_sectors in case of atomic write?

> +
> +/**
> + * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
> + * atomically to the device.
> + * @q:  the request queue for the device
> + * @sectors: must be a power-of-two.
> + */
> +void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
> +					     unsigned int sectors)
> +{
> +	struct queue_limits *limits = &q->limits;
> +
> +	limits->atomic_write_unit_min_sectors = sectors;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);

atomic_write_unit_min_sectors should be >= (physical block size >> 9)
given the minimized atomic write unit is physical sector for all disk.

> +
> +/*
> + * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
> + * atomically to the device.
> + * @q: the request queue for the device
> + * @sectors: must be a power-of-two.
> + */
> +void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
> +					     unsigned int sectors)
> +{
> +	struct queue_limits *limits = &q->limits;
> +
> +	limits->atomic_write_unit_max_sectors = sectors;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);

atomic_write_unit_max_sectors should be >= atomic_write_unit_min_sectors.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits
  2023-12-13  1:25   ` Ming Lei
@ 2023-12-13  9:13     ` John Garry
  2023-12-13 12:28       ` Ming Lei
  2023-12-14  4:34     ` Martin K. Petersen
  2023-12-14 16:12     ` Christoph Hellwig
  2 siblings, 1 reply; 73+ messages in thread
From: John Garry @ 2023-12-13  9:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, jaswin,
	bvanassche, Himanshu Madhani

>> +
>>   
>>   What:		/sys/block/<disk>/diskseq
>>   Date:		February 2021
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 0046b447268f..d151be394c98 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -59,6 +59,10 @@ void blk_set_default_limits(struct queue_limits *lim)
>>   	lim->zoned = BLK_ZONED_NONE;
>>   	lim->zone_write_granularity = 0;
>>   	lim->dma_alignment = 511;
>> +	lim->atomic_write_unit_min_sectors = 0;
>> +	lim->atomic_write_unit_max_sectors = 0;
>> +	lim->atomic_write_max_sectors = 0;
>> +	lim->atomic_write_boundary_sectors = 0;
> 
> Can we move the four into single structure

There is no precedent for a similar structure in struct queue_limits. So 
would only passing a structure to the blk-settings.c API be ok?

> and setup them in single
> API? Then cross-validation can be done in this API.

I suppose so, if you think that it is better.

We rely on the driver to provide sound values. I suppose that we can 
sanitize them also (in a single API).

> 
>>   }
>>   
>>   /**
>> @@ -183,6 +187,62 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
>>   }
>>   EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>>   
>> +/**
>> + * blk_queue_atomic_write_max_bytes - set max bytes supported by
>> + * the device for atomic write operations.
>> + * @q:  the request queue for the device
>> + * @size: maximum bytes supported
>> + */
>> +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
>> +				      unsigned int bytes)
>> +{
>> +	q->limits.atomic_write_max_sectors = bytes >> SECTOR_SHIFT;
>> +}
>> +EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
> 
> What if driver doesn't call it but driver supports atomic write?

We rely on the driver to do this. Any basic level of testing will show 
an issue if they don't.

> 
> I guess the default max sectors should be atomic_write_unit_max_sectors
> if the feature is enabled.

Sure. If we have a single API to set all values, then we don't need to 
worry about this (assuming the values are filled in properly).

> 
>> +
>> +/**
>> + * blk_queue_atomic_write_boundary_bytes - Device's logical block address space
>> + * which an atomic write should not cross.
>> + * @q:  the request queue for the device
>> + * @bytes: must be a power-of-two.
>> + */
>> +void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
>> +					   unsigned int bytes)
>> +{
>> +	q->limits.atomic_write_boundary_sectors = bytes >> SECTOR_SHIFT;
>> +}
>> +EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);
> 
> Default atomic_write_boundary_sectors should be
> atomic_write_unit_max_sectors in case of atomic write?

Having atomic_write_boundary_sectors default to 
atomic_write_unit_max_sectors is effectively same as a default of 0.

> 
>> +
>> +/**
>> + * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
>> + * atomically to the device.
>> + * @q:  the request queue for the device
>> + * @sectors: must be a power-of-two.
>> + */
>> +void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
>> +					     unsigned int sectors)
>> +{
>> +	struct queue_limits *limits = &q->limits;
>> +
>> +	limits->atomic_write_unit_min_sectors = sectors;
>> +}
>> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);
> 
> atomic_write_unit_min_sectors should be >= (physical block size >> 9)
> given the minimized atomic write unit is physical sector for all disk.

For SCSI, we have a granularity VPD value, and when set we pay attention 
to that. If not, we use the phys block size.

For NVMe, we use the logical block size. For physical block size, that 
can be greater than the logical block size for npwg set, and I don't 
think it's suitable use that as minimum atomic write unit.

Anyway, I am not too keen on sanitizing this value in this way.

> 
>> +
>> +/*
>> + * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
>> + * atomically to the device.
>> + * @q: the request queue for the device
>> + * @sectors: must be a power-of-two.
>> + */
>> +void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
>> +					     unsigned int sectors)
>> +{
>> +	struct queue_limits *limits = &q->limits;
>> +
>> +	limits->atomic_write_unit_max_sectors = sectors;
>> +}
>> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
> 
> atomic_write_unit_max_sectors should be >= atomic_write_unit_min_sectors.
> 

Again, we rely on the driver to provide sound values. However, as 
mentioned, we can sanitize.

Thanks,
John

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-12 16:32 ` [PATCH v2 00/16] block atomic writes Christoph Hellwig
@ 2023-12-13  9:32   ` John Garry
  2023-12-13 15:44     ` Christoph Hellwig
  0 siblings, 1 reply; 73+ messages in thread
From: John Garry @ 2023-12-13  9:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei,
	jaswin, bvanassche

On 12/12/2023 16:32, Christoph Hellwig wrote:
> On Tue, Dec 12, 2023 at 11:08:28AM +0000, John Garry wrote:
>> Two new fields are added to struct statx - atomic_write_unit_min and
>> atomic_write_unit_max. For each atomic individual write, the total length
>> of a write must be a between atomic_write_unit_min and
>> atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
>> at a natural offset in the file wrt the write length.
>>
>> SCSI sd.c and scsi_debug and NVMe kernel support is added.
>>
>> Some open questions:
>> - How to make API extensible for when we have no HW support? In that case,
>>    we would prob not have to follow rule of power-of-2 length et al.
>>    As a possible solution, maybe we can say that atomic writes are
>>    supported for the file via statx, but not set unit_min and max values,
>>    and this means that writes need to be just FS block aligned there.
> I don't think the power of two length is much of a problem to be
> honest, and if we every want to lift it we can still do that easily
> by adding a new flag or limit.

ok, but it would be nice to have some idea on what that flag or limit 
change would be.

> 
> What I'm a lot more worried about is how to tell the file system that
> allocations are done right for these requirement.  There is no way
> a user can know that allocations in an existing file are properly
> aligned, so atomic writes will just fail on existing files.
> 
> I suspect we need an on-disk flag that forces allocations to be
> aligned to the atomic write limit, in some ways similar how the
> XFS rt flag works.  You'd need to set it on an empty file, and all
> allocations after that are guaranteed to be properly aligned.

Hmmm... so how is this different to the XFS forcealign feature?

For XFS, I thought that your idea was to always CoW new extents for 
misaligned extents or writes which spanned multiple extents.

> 
>> - For block layer, should atomic_write_unit_max be limited by
>>    max_sectors_kb? Currently it is not.
> Well.  It must be limited to max_hw_sectors to actually work.

Sure, as max_hw_sectors may also be limited by DMA controller max 
mapping size.

> max_sectors is a software limit below that, which with modern hardware
> is actually pretty silly and a real performance issue with todays
> workloads when people don't tweak it..

Right, so we should limit atomic write queue limits to max_hw_sectors. 
But people can still tweak max_sectors, and I am inclined to say that 
atomic_write_unit_max et al should be (dynamically) limited to 
max_sectors also.

> 
>> - How to improve requirement that iovecs are PAGE-aligned.
>>    There are 2x issues:
>>    a. We impose this rule to not split BIOs due to virt boundary for
>>       NVMe, but there virt boundary is 4K (and not PAGE size, so broken for
>>       16K/64K pages). Easy solution is to impose requirement that iovecs
>>       are 4K-aligned.
>>    b. We don't enforce this rule for virt boundary == 0, i.e. SCSI
> .. we require any device that wants to support atomic writes to not
> have that silly limit.  For NVMe that would require SGL support
> (and some driver changes I've been wanting to make for long where
> we always use SGLs for transfers larger than a single PRP if supported)

If we could avoid dealing with a virt boundary, then that would be nice.

Are there any patches yet for the change to always use SGLs for 
transfers larger than a single PRP?

On a related topic, I am not sure about how - or if we even should - 
enforce iovec PAGE-alignment or length; rather, the user could just be 
advised that iovecs must be PAGE-aligned and min PAGE length to achieve 
atomic_write_unit_max.

> 
> 
>> - Since debugging torn-writes due to unwanted kernel BIO splitting/merging
>>    would be horrible, should we add some kernel storage stack software
>>    integrity checks?
> Yes, I think we'll need asserts in the drivers.  At least for NVMe I
> will insist on them. 

Please see 16/16 in this series.

> For SCSI I think the device actually checks
> because the atomic writes are a different command anyway, or am I
> misunderstanding how SCSI works here?

Right, for WRITE ATOMIC (16) the target will reject a command when it 
does not respect the device atomic write limits.

Thanks,
John

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 03/16] fs/bdev: Add atomic write support info to statx
  2023-12-12 11:08 ` [PATCH v2 03/16] fs/bdev: Add atomic write support info to statx John Garry
@ 2023-12-13 10:24   ` Jan Kara
  2023-12-13 11:02     ` John Garry
  0 siblings, 1 reply; 73+ messages in thread
From: Jan Kara @ 2023-12-13 10:24 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei,
	jaswin, bvanassche, Prasad Singamsetty

On Tue 12-12-23 11:08:31, John Garry wrote:
> From: Prasad Singamsetty <prasad.singamsetty@oracle.com>
> 
> Extend statx system call to return additional info for atomic write support
> support if the specified file is a block device.
> 
> Add initial support for a block device.
> 
> Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>

Just some nits below.

> +#define BDEV_STATX_SUPPORTED_MSK (STATX_DIOALIGN | STATX_WRITE_ATOMIC)
                                ^^^
				I believe saving one letter here is not
really beneficial so just spell out MASK here...

>  /*
> - * Handle STATX_DIOALIGN for block devices.
> - *
> - * Note that the inode passed to this is the inode of a block device node file,
> - * not the block device's internal inode.  Therefore it is *not* valid to use
> - * I_BDEV() here; the block device has to be looked up by i_rdev instead.
> + * Handle STATX_{DIOALIGN, WRITE_ATOMIC} for block devices.
>   */

Please keep "Note ..." from the above comment (or you can move the note in
front of blkdev_get_no_open() if you want).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 03/16] fs/bdev: Add atomic write support info to statx
  2023-12-13 10:24   ` Jan Kara
@ 2023-12-13 11:02     ` John Garry
  0 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-13 11:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei,
	jaswin, bvanassche, Prasad Singamsetty

On 13/12/2023 10:24, Jan Kara wrote:
>> Signed-off-by: Prasad Singamsetty<prasad.singamsetty@oracle.com>
>> Signed-off-by: John Garry<john.g.garry@oracle.com>
> Just some nits below.
> 
>> +#define BDEV_STATX_SUPPORTED_MSK (STATX_DIOALIGN | STATX_WRITE_ATOMIC)
>                                  ^^^
> 				I believe saving one letter here is not
> really beneficial so just spell out MASK here...

ok

> 
>>   /*
>> - * Handle STATX_DIOALIGN for block devices.
>> - *
>> - * Note that the inode passed to this is the inode of a block device node file,
>> - * not the block device's internal inode.  Therefore it is*not*  valid to use
>> - * I_BDEV() here; the block device has to be looked up by i_rdev instead.
>> + * Handle STATX_{DIOALIGN, WRITE_ATOMIC} for block devices.
>>    */
> Please keep "Note ..." from the above comment (or you can move the note in
> front of blkdev_get_no_open() if you want).

ok, fine, I think that moving it to in front of blkdev_get_no_open() may 
be best.

Thanks,
John



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 04/16] fs: Increase fmode_t size
  2023-12-12 11:08 ` [PATCH v2 04/16] fs: Increase fmode_t size John Garry
@ 2023-12-13 11:20   ` Jan Kara
  2023-12-13 13:03     ` John Garry
  2023-12-13 13:02   ` Christian Brauner
  1 sibling, 1 reply; 73+ messages in thread
From: Jan Kara @ 2023-12-13 11:20 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei,
	jaswin, bvanassche

On Tue 12-12-23 11:08:32, John Garry wrote:
> Currently all bits are being used in fmode_t.
> 
> To allow for further expansion, increase from unsigned int to unsigned
> long.
> 
> Since the dma-buf driver prints the file->f_mode member, change the print
> as necessary to deal with the larger size.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>

Uh, Al has more experience with fmode_t changes so I'd defer final decision
to him but to me this seems dangerous. Firstly, this breaks packing of
struct file on 64-bit architectures and struct file is highly optimized for
cache efficiency (see the comment before the struct definition). Secondly
this will probably generate warnings on 32-bit architectures as there
sizeof(unsigned long) == sizeof(unsigned int) and so your new flags won't
fit anyway?

								Honza

> ---
>  drivers/dma-buf/dma-buf.c | 2 +-
>  include/linux/types.h     | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 21916bba77d5..a5227ae3d637 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -1628,7 +1628,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
>  
>  
>  		spin_lock(&buf_obj->name_lock);
> -		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\t%08lu\t%s\n",
> +		seq_printf(s, "%08zu\t%08x\t%08lx\t%08ld\t%s\t%08lu\t%s\n",
>  				buf_obj->size,
>  				buf_obj->file->f_flags, buf_obj->file->f_mode,
>  				file_count(buf_obj->file),
> diff --git a/include/linux/types.h b/include/linux/types.h
> index 253168bb3fe1..49c754fde1d6 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -153,7 +153,7 @@ typedef u32 dma_addr_t;
>  
>  typedef unsigned int __bitwise gfp_t;
>  typedef unsigned int __bitwise slab_flags_t;
> -typedef unsigned int __bitwise fmode_t;
> +typedef unsigned long __bitwise fmode_t;
>  
>  #ifdef CONFIG_PHYS_ADDR_T_64BIT
>  typedef u64 phys_addr_t;
> -- 
> 2.35.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits
  2023-12-13  9:13     ` John Garry
@ 2023-12-13 12:28       ` Ming Lei
  2023-12-13 19:01         ` John Garry
  2023-12-14  4:38         ` Martin K. Petersen
  0 siblings, 2 replies; 73+ messages in thread
From: Ming Lei @ 2023-12-13 12:28 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, jaswin,
	bvanassche, Himanshu Madhani

On Wed, Dec 13, 2023 at 09:13:48AM +0000, John Garry wrote:
> > > +
> > >   What:		/sys/block/<disk>/diskseq
> > >   Date:		February 2021
> > > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > > index 0046b447268f..d151be394c98 100644
> > > --- a/block/blk-settings.c
> > > +++ b/block/blk-settings.c
> > > @@ -59,6 +59,10 @@ void blk_set_default_limits(struct queue_limits *lim)
> > >   	lim->zoned = BLK_ZONED_NONE;
> > >   	lim->zone_write_granularity = 0;
> > >   	lim->dma_alignment = 511;
> > > +	lim->atomic_write_unit_min_sectors = 0;
> > > +	lim->atomic_write_unit_max_sectors = 0;
> > > +	lim->atomic_write_max_sectors = 0;
> > > +	lim->atomic_write_boundary_sectors = 0;
> > 
> > Can we move the four into single structure
> 
> There is no precedent for a similar structure in struct queue_limits. So
> would only passing a structure to the blk-settings.c API be ok?

Yes, this structure is part of the new API.

> 
> > and setup them in single
> > API? Then cross-validation can be done in this API.
> 
> I suppose so, if you think that it is better.
> 
> We rely on the driver to provide sound values. I suppose that we can
> sanitize them also (in a single API).

Please make the interface correct from beginning, and one good API is
helpful for both sides, such as isolating problems, easy to locate
bug, abstracting common logic, ...

And relying on API users is absolutely not good design.

> 
> > 
> > >   }
> > >   /**
> > > @@ -183,6 +187,62 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
> > >   }
> > >   EXPORT_SYMBOL(blk_queue_max_discard_sectors);
> > > +/**
> > > + * blk_queue_atomic_write_max_bytes - set max bytes supported by
> > > + * the device for atomic write operations.
> > > + * @q:  the request queue for the device
> > > + * @size: maximum bytes supported
> > > + */
> > > +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
> > > +				      unsigned int bytes)
> > > +{
> > > +	q->limits.atomic_write_max_sectors = bytes >> SECTOR_SHIFT;
> > > +}
> > > +EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
> > 
> > What if driver doesn't call it but driver supports atomic write?
> 
> We rely on the driver to do this. Any basic level of testing will show an
> issue if they don't.

Software quality depends on good requirement analysis, design and
implementation, instead of test.

Simply you can not cover all possibilities in test.

> 
> > 
> > I guess the default max sectors should be atomic_write_unit_max_sectors
> > if the feature is enabled.
> 
> Sure. If we have a single API to set all values, then we don't need to worry
> about this (assuming the values are filled in properly).
> 
> > 
> > > +
> > > +/**
> > > + * blk_queue_atomic_write_boundary_bytes - Device's logical block address space
> > > + * which an atomic write should not cross.
> > > + * @q:  the request queue for the device
> > > + * @bytes: must be a power-of-two.
> > > + */
> > > +void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
> > > +					   unsigned int bytes)
> > > +{
> > > +	q->limits.atomic_write_boundary_sectors = bytes >> SECTOR_SHIFT;
> > > +}
> > > +EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);
> > 
> > Default atomic_write_boundary_sectors should be
> > atomic_write_unit_max_sectors in case of atomic write?
> 
> Having atomic_write_boundary_sectors default to
> atomic_write_unit_max_sectors is effectively same as a default of 0.
> 
> > 
> > > +
> > > +/**
> > > + * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
> > > + * atomically to the device.
> > > + * @q:  the request queue for the device
> > > + * @sectors: must be a power-of-two.
> > > + */
> > > +void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
> > > +					     unsigned int sectors)
> > > +{
> > > +	struct queue_limits *limits = &q->limits;
> > > +
> > > +	limits->atomic_write_unit_min_sectors = sectors;
> > > +}
> > > +EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);
> > 
> > atomic_write_unit_min_sectors should be >= (physical block size >> 9)
> > given the minimized atomic write unit is physical sector for all disk.
> 
> For SCSI, we have a granularity VPD value, and when set we pay attention to
> that. If not, we use the phys block size.
> 
> For NVMe, we use the logical block size. For physical block size, that can
> be greater than the logical block size for npwg set, and I don't think it's
> suitable use that as minimum atomic write unit.

I highly suspect it is wrong to use logical block size as minimum
support atomic write unit, given physical block size is supposed to
be the minimum atomic write unit.

> 
> Anyway, I am not too keen on sanitizing this value in this way.
> 
> > 
> > > +
> > > +/*
> > > + * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
> > > + * atomically to the device.
> > > + * @q: the request queue for the device
> > > + * @sectors: must be a power-of-two.
> > > + */
> > > +void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
> > > +					     unsigned int sectors)
> > > +{
> > > +	struct queue_limits *limits = &q->limits;
> > > +
> > > +	limits->atomic_write_unit_max_sectors = sectors;
> > > +}
> > > +EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
> > 
> > atomic_write_unit_max_sectors should be >= atomic_write_unit_min_sectors.
> > 
> 
> Again, we rely on the driver to provide sound values. However, as mentioned,
> we can sanitize.

Relying on driver to provide sound value is absolutely bad design from API
viewpoint.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 04/16] fs: Increase fmode_t size
  2023-12-12 11:08 ` [PATCH v2 04/16] fs: Increase fmode_t size John Garry
  2023-12-13 11:20   ` Jan Kara
@ 2023-12-13 13:02   ` Christian Brauner
  2023-12-13 13:15     ` John Garry
  2023-12-13 16:03     ` Christoph Hellwig
  1 sibling, 2 replies; 73+ messages in thread
From: Christian Brauner @ 2023-12-13 13:02 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	dchinner, jack, linux-block, linux-kernel, linux-nvme, linux-xfs,
	linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei, jaswin,
	bvanassche

On Tue, Dec 12, 2023 at 11:08:32AM +0000, John Garry wrote:
> Currently all bits are being used in fmode_t.
> 
> To allow for further expansion, increase from unsigned int to unsigned
> long.
> 
> Since the dma-buf driver prints the file->f_mode member, change the print
> as necessary to deal with the larger size.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  drivers/dma-buf/dma-buf.c | 2 +-
>  include/linux/types.h     | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 21916bba77d5..a5227ae3d637 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -1628,7 +1628,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
>  
>  
>  		spin_lock(&buf_obj->name_lock);
> -		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\t%08lu\t%s\n",
> +		seq_printf(s, "%08zu\t%08x\t%08lx\t%08ld\t%s\t%08lu\t%s\n",
>  				buf_obj->size,
>  				buf_obj->file->f_flags, buf_obj->file->f_mode,
>  				file_count(buf_obj->file),
> diff --git a/include/linux/types.h b/include/linux/types.h
> index 253168bb3fe1..49c754fde1d6 100644
> --- a/include/linux/types.h
> +++ b/include/linux/types.h
> @@ -153,7 +153,7 @@ typedef u32 dma_addr_t;
>  
>  typedef unsigned int __bitwise gfp_t;
>  typedef unsigned int __bitwise slab_flags_t;
> -typedef unsigned int __bitwise fmode_t;
> +typedef unsigned long __bitwise fmode_t;

As Jan said, that's likely a bad idea. There's a bunch of places that
assume fmode_t is 32bit. So not really a change we want to make if we
can avoid it.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 04/16] fs: Increase fmode_t size
  2023-12-13 11:20   ` Jan Kara
@ 2023-12-13 13:03     ` John Garry
  0 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-13 13:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei,
	jaswin, bvanassche

On 13/12/2023 11:20, Jan Kara wrote:
>> To allow for further expansion, increase from unsigned int to unsigned
>> long.
>>
>> Since the dma-buf driver prints the file->f_mode member, change the print
>> as necessary to deal with the larger size.
>>
>> Signed-off-by: John Garry<john.g.garry@oracle.com>
> Uh, Al has more experience with fmode_t changes so I'd defer final decision
> to him but to me this seems dangerous.

Ack

> Firstly, this breaks packing of
> struct file on 64-bit architectures and struct file is highly optimized for
> cache efficiency (see the comment before the struct definition).

 From pahole, I think that we still fit on the same 64B cacheline 
(x86_64), but some padding has been added.

Before:
struct file {
union {
	struct llist_node  f_llist; 	/*     0     8 */
                 struct callback_head f_rcuhead 
__attribute__((__aligned__(8))); /*     0    16 */
	unsigned int       f_iocb_flags;         /*     0     4 */
         } __attribute__((__aligned__(8)));	/*     0    16 */
	spinlock_t                 f_lock;	/*    16     4 */
	fmode_t                    f_mode;	/*    20     4 */
	atomic_long_t              f_count;	/*    24     8 */
	struct mutex               f_pos_lock;	/*    32    32 */
         /* --- cacheline 1 boundary (64 bytes) --- */

After:

struct file {
union {
	struct llist_node  f_llist	/*     0     8 */
                 struct callback_head f_rcuhead 
__attribute__((__aligned__(8))); /*     0    16 */
	unsigned int       f_iocb_flags;	/*     0     4 */
         } __attribute__((__aligned__(8)));	/*     0    16 */
	spinlock_t                 f_lock;	 /*    16     4 */

         /* XXX 4 bytes hole, try to pack */

         fmode_t                    f_mode;	/*    24     8 */
         atomic_long_t              f_count;	/*    32     8 */
         struct mutex               f_pos_lock;	/*    40    32 */
         /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */

> Secondly
> this will probably generate warnings on 32-bit architectures as there
> sizeof(unsigned long) == sizeof(unsigned int) and so your new flags won't
> fit anyway?

Right, it would then need to be unsigned long long. Or add another 32b 
member for extended modes. There were no i386 build warnings.

Thanks,
John

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 04/16] fs: Increase fmode_t size
  2023-12-13 13:02   ` Christian Brauner
@ 2023-12-13 13:15     ` John Garry
  2023-12-13 16:03     ` Christoph Hellwig
  1 sibling, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-13 13:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	dchinner, jack, linux-block, linux-kernel, linux-nvme, linux-xfs,
	linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei, jaswin,
	bvanassche

On 13/12/2023 13:02, Christian Brauner wrote:
> On Tue, Dec 12, 2023 at 11:08:32AM +0000, John Garry wrote:
>> Currently all bits are being used in fmode_t.
>>
>> To allow for further expansion, increase from unsigned int to unsigned
>> long.
>>
>> Since the dma-buf driver prints the file->f_mode member, change the print
>> as necessary to deal with the larger size.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   drivers/dma-buf/dma-buf.c | 2 +-
>>   include/linux/types.h     | 2 +-
>>   2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>> index 21916bba77d5..a5227ae3d637 100644
>> --- a/drivers/dma-buf/dma-buf.c
>> +++ b/drivers/dma-buf/dma-buf.c
>> @@ -1628,7 +1628,7 @@ static int dma_buf_debug_show(struct seq_file *s, void *unused)
>>   
>>   
>>   		spin_lock(&buf_obj->name_lock);
>> -		seq_printf(s, "%08zu\t%08x\t%08x\t%08ld\t%s\t%08lu\t%s\n",
>> +		seq_printf(s, "%08zu\t%08x\t%08lx\t%08ld\t%s\t%08lu\t%s\n",
>>   				buf_obj->size,
>>   				buf_obj->file->f_flags, buf_obj->file->f_mode,
>>   				file_count(buf_obj->file),
>> diff --git a/include/linux/types.h b/include/linux/types.h
>> index 253168bb3fe1..49c754fde1d6 100644
>> --- a/include/linux/types.h
>> +++ b/include/linux/types.h
>> @@ -153,7 +153,7 @@ typedef u32 dma_addr_t;
>>   
>>   typedef unsigned int __bitwise gfp_t;
>>   typedef unsigned int __bitwise slab_flags_t;
>> -typedef unsigned int __bitwise fmode_t;
>> +typedef unsigned long __bitwise fmode_t;
> 
> As Jan said, that's likely a bad idea. There's a bunch of places that
> assume fmode_t is 32bit. So not really a change we want to make if we
> can avoid it.

ok, understood.

Some strictly unnecessary bits in f_mode could be recycled (if there 
were any), but this issue will prob come up again.

Could it be considered to add an extended fmode_t member in struct file?

Thanks,
John

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 05/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support
  2023-12-12 11:08 ` [PATCH v2 05/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
@ 2023-12-13 13:31   ` Al Viro
  2023-12-13 16:02     ` John Garry
  2024-01-22  8:29   ` John Garry
  1 sibling, 1 reply; 73+ messages in thread
From: Al Viro @ 2023-12-13 13:31 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, brauner,
	dchinner, jack, linux-block, linux-kernel, linux-nvme, linux-xfs,
	linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei, jaswin,
	bvanassche, Prasad Singamsetty

On Tue, Dec 12, 2023 at 11:08:33AM +0000, John Garry wrote:

> Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
> flag set will have RWF_ATOMIC rejected and not just ignored.
> 
> Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  include/linux/fs.h      | 8 ++++++++
>  include/uapi/linux/fs.h | 5 ++++-
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 70329c81be31..d725c194243c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -185,6 +185,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  /* File supports async nowait buffered writes */
>  #define FMODE_BUF_WASYNC	((__force fmode_t)0x80000000)
>  
> +/* File supports atomic writes */
> +#define FMODE_CAN_ATOMIC_WRITE	((__force fmode_t)0x100000000)

Have you even tried to compile that on e.g. arm?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-13  9:32   ` John Garry
@ 2023-12-13 15:44     ` Christoph Hellwig
  2023-12-13 16:27       ` John Garry
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2023-12-13 15:44 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, axboe, kbusch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, jaswin, bvanassche

On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote:
>>> - How to make API extensible for when we have no HW support? In that case,
>>>    we would prob not have to follow rule of power-of-2 length et al.
>>>    As a possible solution, maybe we can say that atomic writes are
>>>    supported for the file via statx, but not set unit_min and max values,
>>>    and this means that writes need to be just FS block aligned there.
>> I don't think the power of two length is much of a problem to be
>> honest, and if we every want to lift it we can still do that easily
>> by adding a new flag or limit.
>
> ok, but it would be nice to have some idea on what that flag or limit 
> change would be.

That would require a concrete use case.  The simples thing for a file
system that can or does log I/O it would simply be a flag waving all
the alignment and size requirements.

>> I suspect we need an on-disk flag that forces allocations to be
>> aligned to the atomic write limit, in some ways similar how the
>> XFS rt flag works.  You'd need to set it on an empty file, and all
>> allocations after that are guaranteed to be properly aligned.
>
> Hmmm... so how is this different to the XFS forcealign feature?

Maybe not much.  But that's not what it is about - we need a common
API for this and not some XFS internal flag.  So if this is something
we could support in ext4 as well that would be a good step.  And for
btrfs you'd probably want to support something like it in nocow mode
if people care enough, or always support atomics and write out of
place.

> For XFS, I thought that your idea was to always CoW new extents for 
> misaligned extents or writes which spanned multiple extents.

Well, that is useful for two things:

 - atomic writes on hardware that does not support it
 - atomic writes for bufferd I/O
 - supporting other sizes / alignments than the strict power of
   two above.

> Right, so we should limit atomic write queue limits to max_hw_sectors. But 
> people can still tweak max_sectors, and I am inclined to say that 
> atomic_write_unit_max et al should be (dynamically) limited to max_sectors 
> also.

Allowing people to tweak it seems to be asking for trouble.

>> have that silly limit.  For NVMe that would require SGL support
>> (and some driver changes I've been wanting to make for long where
>> we always use SGLs for transfers larger than a single PRP if supported)
>
> If we could avoid dealing with a virt boundary, then that would be nice.
>
> Are there any patches yet for the change to always use SGLs for transfers 
> larger than a single PRP?

No.

> On a related topic, I am not sure about how - or if we even should - 
> enforce iovec PAGE-alignment or length; rather, the user could just be 
> advised that iovecs must be PAGE-aligned and min PAGE length to achieve 
> atomic_write_unit_max.

Anything that just advices the user an it not clear cut and results in
an error is data loss waiting to happen.  Even more so if it differs
from device to device.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 05/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support
  2023-12-13 13:31   ` Al Viro
@ 2023-12-13 16:02     ` John Garry
  0 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-13 16:02 UTC (permalink / raw)
  To: Al Viro
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, brauner,
	dchinner, jack, linux-block, linux-kernel, linux-nvme, linux-xfs,
	linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei, jaswin,
	bvanassche, Prasad Singamsetty

On 13/12/2023 13:31, Al Viro wrote:
>> Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
>> flag set will have RWF_ATOMIC rejected and not just ignored.
>>
>> Signed-off-by: Prasad Singamsetty<prasad.singamsetty@oracle.com>
>> Signed-off-by: John Garry<john.g.garry@oracle.com>
>> ---
>>   include/linux/fs.h      | 8 ++++++++
>>   include/uapi/linux/fs.h | 5 ++++-
>>   2 files changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 70329c81be31..d725c194243c 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -185,6 +185,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>>   /* File supports async nowait buffered writes */
>>   #define FMODE_BUF_WASYNC	((__force fmode_t)0x80000000)
>>   
>> +/* File supports atomic writes */
>> +#define FMODE_CAN_ATOMIC_WRITE	((__force fmode_t)0x100000000)
> Have you even tried to compile that on e.g. arm?

i386 and now arm32, and no grumblings.

I think that the issue is that we only ever do a bitwise OR or test that 
bit 33 for a 32b value, and it is a void operation and ignored.

However if I have file.f_mode = FMODE_CAN_ATOMIC_WRITE and compile for 
arm32, then it complains.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 04/16] fs: Increase fmode_t size
  2023-12-13 13:02   ` Christian Brauner
  2023-12-13 13:15     ` John Garry
@ 2023-12-13 16:03     ` Christoph Hellwig
  2023-12-14  8:56       ` John Garry
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2023-12-13 16:03 UTC (permalink / raw)
  To: Christian Brauner
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, jaswin, bvanassche

On Wed, Dec 13, 2023 at 02:02:31PM +0100, Christian Brauner wrote:
> >  typedef unsigned int __bitwise gfp_t;
> >  typedef unsigned int __bitwise slab_flags_t;
> > -typedef unsigned int __bitwise fmode_t;
> > +typedef unsigned long __bitwise fmode_t;
> 
> As Jan said, that's likely a bad idea. There's a bunch of places that
> assume fmode_t is 32bit. So not really a change we want to make if we
> can avoid it.

Oh well, let me dust of my series to move the fairly static flags out
of it.  But even without that do we even need to increase it?  There's
still quite a lot of space after FMODE_EXEC for example.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-13 15:44     ` Christoph Hellwig
@ 2023-12-13 16:27       ` John Garry
  2023-12-14 14:37         ` Christoph Hellwig
  2023-12-19  5:14         ` Darrick J. Wong
  0 siblings, 2 replies; 73+ messages in thread
From: John Garry @ 2023-12-13 16:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei,
	jaswin, bvanassche

On 13/12/2023 15:44, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote:
>>>> - How to make API extensible for when we have no HW support? In that case,
>>>>     we would prob not have to follow rule of power-of-2 length et al.
>>>>     As a possible solution, maybe we can say that atomic writes are
>>>>     supported for the file via statx, but not set unit_min and max values,
>>>>     and this means that writes need to be just FS block aligned there.
>>> I don't think the power of two length is much of a problem to be
>>> honest, and if we every want to lift it we can still do that easily
>>> by adding a new flag or limit.
>> ok, but it would be nice to have some idea on what that flag or limit
>> change would be.
> That would require a concrete use case.  The simples thing for a file
> system that can or does log I/O it would simply be a flag waving all
> the alignment and size requirements.

ok...

> 
>>> I suspect we need an on-disk flag that forces allocations to be
>>> aligned to the atomic write limit, in some ways similar how the
>>> XFS rt flag works.  You'd need to set it on an empty file, and all
>>> allocations after that are guaranteed to be properly aligned.
>> Hmmm... so how is this different to the XFS forcealign feature?
> Maybe not much.  But that's not what it is about - we need a common
> API for this and not some XFS internal flag.  So if this is something
> we could support in ext4 as well that would be a good step.  And for
> btrfs you'd probably want to support something like it in nocow mode
> if people care enough, or always support atomics and write out of
> place.

The forcealign feature was a bit of case of knowing specifically what to 
do for XFS to enable atomic writes.

But some common method to configure any FS file for atomic writes (if 
supported) would be preferable, right?

> 
>> For XFS, I thought that your idea was to always CoW new extents for
>> misaligned extents or writes which spanned multiple extents.
> Well, that is useful for two things:
> 
>   - atomic writes on hardware that does not support it
>   - atomic writes for bufferd I/O
>   - supporting other sizes / alignments than the strict power of
>     two above.

Sure

> 
>> Right, so we should limit atomic write queue limits to max_hw_sectors. But
>> people can still tweak max_sectors, and I am inclined to say that
>> atomic_write_unit_max et al should be (dynamically) limited to max_sectors
>> also.
> Allowing people to tweak it seems to be asking for trouble.

I tend to agree....

Let's just limit at max_hw_sectors.

> 
>>> have that silly limit.  For NVMe that would require SGL support
>>> (and some driver changes I've been wanting to make for long where
>>> we always use SGLs for transfers larger than a single PRP if supported)
>> If we could avoid dealing with a virt boundary, then that would be nice.
>>
>> Are there any patches yet for the change to always use SGLs for transfers
>> larger than a single PRP?
> No.

As below, if we are going to enforce alignment/size of iovecs elsewhere, 
then I don't need to worry about virt boundary.

> 
>> On a related topic, I am not sure about how - or if we even should -
>> enforce iovec PAGE-alignment or length; rather, the user could just be
>> advised that iovecs must be PAGE-aligned and min PAGE length to achieve
>> atomic_write_unit_max.
> Anything that just advices the user an it not clear cut and results in
> an error is data loss waiting to happen.  Even more so if it differs
> from device to device.

ok, fine. I suppose that we better enforce it then.

Thanks,
John

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits
  2023-12-13 12:28       ` Ming Lei
@ 2023-12-13 19:01         ` John Garry
  2023-12-14  4:38         ` Martin K. Petersen
  1 sibling, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-13 19:01 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, jaswin,
	bvanassche, Himanshu Madhani

On 13/12/2023 12:28, Ming Lei wrote:
>> For NVMe, we use the logical block size. For physical block size, that can
>> be greater than the logical block size for npwg set, and I don't think it's
>> suitable use that as minimum atomic write unit.
> I highly suspect it is wrong to use logical block size as minimum
> support atomic write unit, given physical block size is supposed to
> be the minimum atomic write unit.

I would tend to agree, but I am still a bit curious on how npwg is used 
to calculate atomic_bs/phys_bs as it seems to be a recommended 
performance-related value. It would hint to me that it is the phys_bs, 
but is that same as atomic min granularity?

> 
>> Anyway, I am not too keen on sanitizing this value in this way.
>>
>>>> +
>>>> +/*
>>>> + * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
>>>> + * atomically to the device.
>>>> + * @q: the request queue for the device
>>>> + * @sectors: must be a power-of-two.
>>>> + */
>>>> +void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
>>>> +					     unsigned int sectors)
>>>> +{
>>>> +	struct queue_limits *limits = &q->limits;
>>>> +
>>>> +	limits->atomic_write_unit_max_sectors = sectors;
>>>> +}
>>>> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
>>> atomic_write_unit_max_sectors should be >= atomic_write_unit_min_sectors.
>>>
>> Again, we rely on the driver to provide sound values. However, as mentioned,
>> we can sanitize.
> Relying on driver to provide sound value is absolutely bad design from API
> viewpoint.

OK, fine, I'll look to revise the API to make it more robust.

Thanks,
John

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits
  2023-12-13  1:25   ` Ming Lei
  2023-12-13  9:13     ` John Garry
@ 2023-12-14  4:34     ` Martin K. Petersen
  2023-12-14 16:12     ` Christoph Hellwig
  2 siblings, 0 replies; 73+ messages in thread
From: Martin K. Petersen @ 2023-12-14  4:34 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	jaswin, bvanassche, Himanshu Madhani


Hi Ming!

>> +	lim->atomic_write_unit_min_sectors = 0;
>> +	lim->atomic_write_unit_max_sectors = 0;
>> +	lim->atomic_write_max_sectors = 0;
>> +	lim->atomic_write_boundary_sectors = 0;
>
> Can we move the four into single structure and setup them in single
> API? Then cross-validation can be done in this API.

Why would we put them in a separate struct? We don't do that for any of
the other queue_limits.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits
  2023-12-13 12:28       ` Ming Lei
  2023-12-13 19:01         ` John Garry
@ 2023-12-14  4:38         ` Martin K. Petersen
  2023-12-14 13:46           ` Ming Lei
  1 sibling, 1 reply; 73+ messages in thread
From: Martin K. Petersen @ 2023-12-14  4:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	jaswin, bvanassche, Himanshu Madhani


Ming,

> Relying on driver to provide sound value is absolutely bad design from
> API viewpoint.

All the other queue_limits are validated by the LLDs. It's challenging
to lift that validation to the block layer since the values reported are
heavily protocol-dependent and thus information is lost if we do it
somewhere else.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 04/16] fs: Increase fmode_t size
  2023-12-13 16:03     ` Christoph Hellwig
@ 2023-12-14  8:56       ` John Garry
  0 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-14  8:56 UTC (permalink / raw)
  To: Christoph Hellwig, Christian Brauner
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	dchinner, jack, linux-block, linux-kernel, linux-nvme, linux-xfs,
	linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei, jaswin,
	bvanassche

> But even without that do we even need to increase it?  There's
> still quite a lot of space after FMODE_EXEC for example.

Right, I can use the space after FMODE_EXEC, which came free after 
removal of FMODE_NDELAY, FMODE_EXCL, and FMODE_WRITE_IOCTL in v6.5.

Thanks,
John



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits
  2023-12-14  4:38         ` Martin K. Petersen
@ 2023-12-14 13:46           ` Ming Lei
  0 siblings, 0 replies; 73+ messages in thread
From: Ming Lei @ 2023-12-14 13:46 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, jaswin,
	bvanassche, Himanshu Madhani, ming.lei

Hi Martin,

On Wed, Dec 13, 2023 at 11:38:10PM -0500, Martin K. Petersen wrote:
> 
> Ming,

On Thu, Dec 14, 2023 at 12:35 PM Martin K. Petersen <martin.petersen@oracle.com> wrote:
>
>
> Hi Ming!
>
> >> +    lim->atomic_write_unit_min_sectors = 0;
> >> +    lim->atomic_write_unit_max_sectors = 0;
> >> +    lim->atomic_write_max_sectors = 0;
> >> +    lim->atomic_write_boundary_sectors = 0;
> >
> > Can we move the four into single structure and setup them in single
> > API? Then cross-validation can be done in this API.
>
> Why would we put them in a separate struct? We don't do that for any of
> the other queue_limits.

All the four limits are for same purpose of supporting atomic-write, and
there can many benefits to define single API to setup atomic parameters:

1) common logic can be put into single place, such as running
cross-validation among them and setting up default value, and it is impossible
to do that by the way in this patch

2) all limits are supposed to setup once by driver in same place, so
doing them in single API actually simplifies driver and block layer, and
API itself becomes less fragile

3) it is easier for trace or troubleshoot

> 
> > Relying on driver to provide sound value is absolutely bad design from
> > API viewpoint.
> 
> All the other queue_limits are validated by the LLDs. It's challenging
> to lift that validation to the block layer since the values reported are
> heavily protocol-dependent and

After atomic limits are put into block layer, it becomes not driver
specific any more, scsi and nvme are going to support it first, sooner
or later, other drivers(dm & md, loop or ublk, ...) may try to support it.

Also in theory, they are not protocol-dependent, usually physical block size is
the minimized atomic-write unit, now John's patches are trying to support
bigger atomic-write unit as scsi/nvme's protocol supports it, and the concept
is actually common in disk. Similar in implementation level too, such as,
for NAND flash based storage, I guess atomic-write should be supported by atomic
update of FTL's LBA/PBA mapping.

> thus information is lost if we do it somewhere else.

Block layer is only focusing on common logic, such as the four limits
added in request queue, which are consumed by block layer and related with
other generic limits(physical block size, max IO size), and it won't be
same with driver's internal validation.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-13 16:27       ` John Garry
@ 2023-12-14 14:37         ` Christoph Hellwig
  2023-12-14 15:46           ` John Garry
  2023-12-18 22:50           ` Keith Busch
  2023-12-19  5:14         ` Darrick J. Wong
  1 sibling, 2 replies; 73+ messages in thread
From: Christoph Hellwig @ 2023-12-14 14:37 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, axboe, kbusch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, jaswin, bvanassche

On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
>>> Are there any patches yet for the change to always use SGLs for transfers
>>> larger than a single PRP?
>> No.

Here is the WIP version.  With that you'd need to make atomic writes
conditional on !ctrl->need_virt_boundary.

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 8ebdfd623e0f78..e04faffd6551fe 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1889,7 +1889,8 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 		blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
 		blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
 	}
-	blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
+	if (q == ctrl->admin_q || ctrl->need_virt_boundary)
+		blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
 	blk_queue_dma_alignment(q, 3);
 	blk_queue_write_cache(q, vwc, vwc);
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index e7411dac00f725..aa98794a3ec53d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -262,6 +262,7 @@ enum nvme_ctrl_flags {
 struct nvme_ctrl {
 	bool comp_seen;
 	bool identified;
+	bool need_virt_boundary;
 	enum nvme_ctrl_state state;
 	spinlock_t lock;
 	struct mutex scan_lock;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 61af7ff1a9d6ba..a8d273b475cb40 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -60,8 +60,7 @@ MODULE_PARM_DESC(max_host_mem_size_mb,
 static unsigned int sgl_threshold = SZ_32K;
 module_param(sgl_threshold, uint, 0644);
 MODULE_PARM_DESC(sgl_threshold,
-		"Use SGLs when average request segment size is larger or equal to "
-		"this size. Use 0 to disable SGLs.");
+		"Use SGLs when > 0. Use 0 to disable SGLs.");
 
 #define NVME_PCI_MIN_QUEUE_SIZE 2
 #define NVME_PCI_MAX_QUEUE_SIZE 4095
@@ -504,23 +503,6 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
 	spin_unlock(&nvmeq->sq_lock);
 }
 
-static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
-				     int nseg)
-{
-	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
-	unsigned int avg_seg_size;
-
-	avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
-
-	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
-		return false;
-	if (!nvmeq->qid)
-		return false;
-	if (!sgl_threshold || avg_seg_size < sgl_threshold)
-		return false;
-	return true;
-}
-
 static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
 {
 	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
@@ -769,12 +751,14 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
 static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		struct nvme_command *cmnd)
 {
+	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+	bool sgl_supported = nvme_ctrl_sgl_supported(&dev->ctrl) &&
+			nvmeq->qid && sgl_threshold;
 	blk_status_t ret = BLK_STS_RESOURCE;
 	int rc;
 
 	if (blk_rq_nr_phys_segments(req) == 1) {
-		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
 		struct bio_vec bv = req_bvec(req);
 
 		if (!is_pci_p2pdma_page(bv.bv_page)) {
@@ -782,8 +766,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 				return nvme_setup_prp_simple(dev, req,
 							     &cmnd->rw, &bv);
 
-			if (nvmeq->qid && sgl_threshold &&
-			    nvme_ctrl_sgl_supported(&dev->ctrl))
+			if (sgl_supported)
 				return nvme_setup_sgl_simple(dev, req,
 							     &cmnd->rw, &bv);
 		}
@@ -806,7 +789,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
 		goto out_free_sg;
 	}
 
-	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
+	if (sgl_supported)
 		ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
 	else
 		ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
@@ -3036,6 +3019,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	result = nvme_init_ctrl_finish(&dev->ctrl, false);
 	if (result)
 		goto out_disable;
+	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
+		dev->ctrl.need_virt_boundary = true;
 
 	nvme_dbbuf_dma_alloc(dev);
 
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 81e2621169e5d3..416a9fbcccfc74 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -838,6 +838,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
 	error = nvme_init_ctrl_finish(&ctrl->ctrl, false);
 	if (error)
 		goto out_quiesce_queue;
+	ctrl->ctrl.need_virt_boundary = true;
 
 	return 0;
 

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-14 14:37         ` Christoph Hellwig
@ 2023-12-14 15:46           ` John Garry
  2023-12-18 22:50           ` Keith Busch
  1 sibling, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-14 15:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei,
	jaswin, bvanassche

On 14/12/2023 14:37, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
>>>> Are there any patches yet for the change to always use SGLs for transfers
>>>> larger than a single PRP?
>>> No.
> Here is the WIP version.  With that you'd need to make atomic writes
> conditional on !ctrl->need_virt_boundary.

Cheers, I gave it a quick spin and on the surface looks ok.

# more /sys/block/nvme0n1/queue/atomic_write_unit_max_bytes
262144
# more /sys/block/nvme0n1/queue/virt_boundary_mask
0


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 01/16] block: Add atomic write operations to request_queue limits
  2023-12-13  1:25   ` Ming Lei
  2023-12-13  9:13     ` John Garry
  2023-12-14  4:34     ` Martin K. Petersen
@ 2023-12-14 16:12     ` Christoph Hellwig
  2 siblings, 0 replies; 73+ messages in thread
From: Christoph Hellwig @ 2023-12-14 16:12 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	jaswin, bvanassche, Himanshu Madhani

On Wed, Dec 13, 2023 at 09:25:38AM +0800, Ming Lei wrote:

<full quote deleted, please only quote what is relevant.


> >  	lim->dma_alignment = 511;
> > +	lim->atomic_write_unit_min_sectors = 0;
> > +	lim->atomic_write_unit_max_sectors = 0;
> > +	lim->atomic_write_max_sectors = 0;
> > +	lim->atomic_write_boundary_sectors = 0;
> 
> Can we move the four into single structure and setup them in single
> API? Then cross-validation can be done in this API.

Please don't be arbitrarily different from all the other limits.  What
we really should be doing is an API that updates all limits at the
same time, and I actually have code for that, I'll just need to finish
it.  I do not want to block this series for it, though.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 08/16] block: Limit atomic write IO size according to atomic_write_max_sectors
  2023-12-12 11:08 ` [PATCH v2 08/16] block: Limit atomic write IO size according to atomic_write_max_sectors John Garry
@ 2023-12-15  2:27   ` Ming Lei
  2023-12-15 13:55     ` John Garry
  0 siblings, 1 reply; 73+ messages in thread
From: Ming Lei @ 2023-12-15  2:27 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, jaswin,
	bvanassche, ming.lei

On Tue, Dec 12, 2023 at 11:08:36AM +0000, John Garry wrote:
> Currently an IO size is limited to the request_queue limits max_sectors.
> Limit the size for an atomic write to queue limit atomic_write_max_sectors
> value.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  block/blk-merge.c | 12 +++++++++++-
>  block/blk.h       |  3 +++
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 0ccc251e22ff..8d4de9253fe9 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -171,7 +171,17 @@ static inline unsigned get_max_io_size(struct bio *bio,
>  {
>  	unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
>  	unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
> -	unsigned max_sectors = lim->max_sectors, start, end;
> +	unsigned max_sectors, start, end;
> +
> +	/*
> +	 * We ignore lim->max_sectors for atomic writes simply because
> +	 * it may less than bio->write_atomic_unit, which we cannot
> +	 * tolerate.
> +	 */
> +	if (bio->bi_opf & REQ_ATOMIC)
> +		max_sectors = lim->atomic_write_max_sectors;
> +	else
> +		max_sectors = lim->max_sectors;

I can understand the trouble for write atomic from bio split, which
may simply split in the max_sectors boundary, however this change is
still too fragile:

1) ->max_sectors may be set from userspace
- so this change simply override userspace setting

2) otherwise ->max_sectors is same with ->max_hw_sectors:

- then something must be wrong in device side or driver side because
->write_atomic_unit conflicts with ->max_hw_sectors, which is supposed
to be figured out before device is setup

3) too big max_sectors may break driver or device, such as nvme-pci
aligns max_hw_sectors with DMA optimized mapping size

And there might be more(better) choices:

1) make sure atomic write limit is respected when userspace updates
->max_sectors

2) when driver finds that atomic write limits conflict with other
existed hardware limits, fail or solve(such as reduce write atomic unit) the
conflict before queue is started; With single write atomic limits update API,
the conflict can be figured out earlier by block layer too.



thanks, 
Ming


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 08/16] block: Limit atomic write IO size according to atomic_write_max_sectors
  2023-12-15  2:27   ` Ming Lei
@ 2023-12-15 13:55     ` John Garry
  0 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-15 13:55 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, bvanassche

On 15/12/2023 02:27, Ming Lei wrote:
> On Tue, Dec 12, 2023 at 11:08:36AM +0000, John Garry wrote:
>> Currently an IO size is limited to the request_queue limits max_sectors.
>> Limit the size for an atomic write to queue limit atomic_write_max_sectors
>> value.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   block/blk-merge.c | 12 +++++++++++-
>>   block/blk.h       |  3 +++
>>   2 files changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 0ccc251e22ff..8d4de9253fe9 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -171,7 +171,17 @@ static inline unsigned get_max_io_size(struct bio *bio,
>>   {
>>   	unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
>>   	unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
>> -	unsigned max_sectors = lim->max_sectors, start, end;
>> +	unsigned max_sectors, start, end;
>> +
>> +	/*
>> +	 * We ignore lim->max_sectors for atomic writes simply because
>> +	 * it may less than bio->write_atomic_unit, which we cannot
>> +	 * tolerate.
>> +	 */
>> +	if (bio->bi_opf & REQ_ATOMIC)
>> +		max_sectors = lim->atomic_write_max_sectors;
>> +	else
>> +		max_sectors = lim->max_sectors;

Please note that I mentioned this issue in the cover letter, so you can 
see some discussion there (if missed).

> 
> I can understand the trouble for write atomic from bio split, which
> may simply split in the max_sectors boundary, however this change is
> still too fragile:
> 
> 1) ->max_sectors may be set from userspace
> - so this change simply override userspace setting

yes

> 
> 2) otherwise ->max_sectors is same with ->max_hw_sectors:
> 
> - then something must be wrong in device side or driver side because
> ->write_atomic_unit conflicts with ->max_hw_sectors, which is supposed
> to be figured out before device is setup
> 

Right, so I think that it is proper to limit atomic_write_unit_max et al 
to max_hw_sectors or whatever DMA engine device limits. I can make that 
change.

> 3) too big max_sectors may break driver or device, such as nvme-pci
> aligns max_hw_sectors with DMA optimized mapping size
> 
> And there might be more(better) choices:
> 
> 1) make sure atomic write limit is respected when userspace updates
> ->max_sectors

My mind has been changed and I would say no and we can treat 
atomic_write_unit_max as special. Indeed, max_sectors and 
atomic_write_unit_max are complementary. If someone sets max_sectors to 
4KB and then tries an atomic write of 16KB then they don't know what 
they are doing.

My original idea was to dynamically limit atomic_unit_unit_max et al to 
max_sectors (so that max_sectors is always respected for atomic writes).

As an alternative, how about we keep the value of atomic_unit_unit_max 
static, but reject an atomic write if it exceeds max_sectors? It's not 
too different than dynamically limiting atomic_unit_unit_max. But as 
mentioned, it may be asking for trouble....

> 
> 2) when driver finds that atomic write limits conflict with other
> existed hardware limits, fail or solve(such as reduce write atomic unit) the
> conflict before queue is started; With single write atomic limits update API,
> the conflict can be figured out earlier by block layer too.

I think that we can do this, but I am not sure what other queue limits 
may conflict (apart from max_sectors / max_sectors_hw). There is max 
segment size, but we just rely on a single PAGE per iovec to evaluate 
atomic_unit_unit_max, so should not be an issue.

Thanks,
John

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-14 14:37         ` Christoph Hellwig
  2023-12-14 15:46           ` John Garry
@ 2023-12-18 22:50           ` Keith Busch
  1 sibling, 0 replies; 73+ messages in thread
From: Keith Busch @ 2023-12-18 22:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: John Garry, axboe, sagi, jejb, martin.petersen, djwong, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei,
	jaswin, bvanassche

On Thu, Dec 14, 2023 at 03:37:09PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
> >>> Are there any patches yet for the change to always use SGLs for transfers
> >>> larger than a single PRP?
> >> No.
> 
> Here is the WIP version.  With that you'd need to make atomic writes
> conditional on !ctrl->need_virt_boundary.

This looks pretty good as-is!
 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 8ebdfd623e0f78..e04faffd6551fe 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1889,7 +1889,8 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
>  		blk_queue_max_hw_sectors(q, ctrl->max_hw_sectors);
>  		blk_queue_max_segments(q, min_t(u32, max_segments, USHRT_MAX));
>  	}
> -	blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
> +	if (q == ctrl->admin_q || ctrl->need_virt_boundary)
> +		blk_queue_virt_boundary(q, NVME_CTRL_PAGE_SIZE - 1);
>  	blk_queue_dma_alignment(q, 3);
>  	blk_queue_write_cache(q, vwc, vwc);
>  }
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index e7411dac00f725..aa98794a3ec53d 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -262,6 +262,7 @@ enum nvme_ctrl_flags {
>  struct nvme_ctrl {
>  	bool comp_seen;
>  	bool identified;
> +	bool need_virt_boundary;
>  	enum nvme_ctrl_state state;
>  	spinlock_t lock;
>  	struct mutex scan_lock;
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 61af7ff1a9d6ba..a8d273b475cb40 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -60,8 +60,7 @@ MODULE_PARM_DESC(max_host_mem_size_mb,
>  static unsigned int sgl_threshold = SZ_32K;
>  module_param(sgl_threshold, uint, 0644);
>  MODULE_PARM_DESC(sgl_threshold,
> -		"Use SGLs when average request segment size is larger or equal to "
> -		"this size. Use 0 to disable SGLs.");
> +		"Use SGLs when > 0. Use 0 to disable SGLs.");
>  
>  #define NVME_PCI_MIN_QUEUE_SIZE 2
>  #define NVME_PCI_MAX_QUEUE_SIZE 4095
> @@ -504,23 +503,6 @@ static void nvme_commit_rqs(struct blk_mq_hw_ctx *hctx)
>  	spin_unlock(&nvmeq->sq_lock);
>  }
>  
> -static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
> -				     int nseg)
> -{
> -	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
> -	unsigned int avg_seg_size;
> -
> -	avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
> -
> -	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
> -		return false;
> -	if (!nvmeq->qid)
> -		return false;
> -	if (!sgl_threshold || avg_seg_size < sgl_threshold)
> -		return false;
> -	return true;
> -}
> -
>  static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
>  {
>  	const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
> @@ -769,12 +751,14 @@ static blk_status_t nvme_setup_sgl_simple(struct nvme_dev *dev,
>  static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>  		struct nvme_command *cmnd)
>  {
> +	struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
>  	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +	bool sgl_supported = nvme_ctrl_sgl_supported(&dev->ctrl) &&
> +			nvmeq->qid && sgl_threshold;
>  	blk_status_t ret = BLK_STS_RESOURCE;
>  	int rc;
>  
>  	if (blk_rq_nr_phys_segments(req) == 1) {
> -		struct nvme_queue *nvmeq = req->mq_hctx->driver_data;
>  		struct bio_vec bv = req_bvec(req);
>  
>  		if (!is_pci_p2pdma_page(bv.bv_page)) {
> @@ -782,8 +766,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>  				return nvme_setup_prp_simple(dev, req,
>  							     &cmnd->rw, &bv);
>  
> -			if (nvmeq->qid && sgl_threshold &&
> -			    nvme_ctrl_sgl_supported(&dev->ctrl))
> +			if (sgl_supported)
>  				return nvme_setup_sgl_simple(dev, req,
>  							     &cmnd->rw, &bv);
>  		}
> @@ -806,7 +789,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
>  		goto out_free_sg;
>  	}
>  
> -	if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
> +	if (sgl_supported)
>  		ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
>  	else
>  		ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
> @@ -3036,6 +3019,8 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	result = nvme_init_ctrl_finish(&dev->ctrl, false);
>  	if (result)
>  		goto out_disable;
> +	if (!nvme_ctrl_sgl_supported(&dev->ctrl))
> +		dev->ctrl.need_virt_boundary = true;
>  
>  	nvme_dbbuf_dma_alloc(dev);
>  
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 81e2621169e5d3..416a9fbcccfc74 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -838,6 +838,7 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
>  	error = nvme_init_ctrl_finish(&ctrl->ctrl, false);
>  	if (error)
>  		goto out_quiesce_queue;
> +	ctrl->ctrl.need_virt_boundary = true;
>  
>  	return 0;
>  

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-13 16:27       ` John Garry
  2023-12-14 14:37         ` Christoph Hellwig
@ 2023-12-19  5:14         ` Darrick J. Wong
  2023-12-19  5:21           ` Christoph Hellwig
  1 sibling, 1 reply; 73+ messages in thread
From: Darrick J. Wong @ 2023-12-19  5:14 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, axboe, kbusch, sagi, jejb, martin.petersen,
	viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, jaswin, bvanassche

On Wed, Dec 13, 2023 at 04:27:35PM +0000, John Garry wrote:
> On 13/12/2023 15:44, Christoph Hellwig wrote:
> > On Wed, Dec 13, 2023 at 09:32:06AM +0000, John Garry wrote:
> > > > > - How to make API extensible for when we have no HW support? In that case,
> > > > >     we would prob not have to follow rule of power-of-2 length et al.
> > > > >     As a possible solution, maybe we can say that atomic writes are
> > > > >     supported for the file via statx, but not set unit_min and max values,
> > > > >     and this means that writes need to be just FS block aligned there.
> > > > I don't think the power of two length is much of a problem to be
> > > > honest, and if we every want to lift it we can still do that easily
> > > > by adding a new flag or limit.
> > > ok, but it would be nice to have some idea on what that flag or limit
> > > change would be.
> > That would require a concrete use case.  The simples thing for a file
> > system that can or does log I/O it would simply be a flag waving all
> > the alignment and size requirements.
> 
> ok...
> 
> > 
> > > > I suspect we need an on-disk flag that forces allocations to be
> > > > aligned to the atomic write limit, in some ways similar how the
> > > > XFS rt flag works.  You'd need to set it on an empty file, and all
> > > > allocations after that are guaranteed to be properly aligned.
> > > Hmmm... so how is this different to the XFS forcealign feature?
> > Maybe not much.  But that's not what it is about - we need a common
> > API for this and not some XFS internal flag.  So if this is something
> > we could support in ext4 as well that would be a good step.  And for
> > btrfs you'd probably want to support something like it in nocow mode
> > if people care enough, or always support atomics and write out of
> > place.
> 
> The forcealign feature was a bit of case of knowing specifically what to do
> for XFS to enable atomic writes.
> 
> But some common method to configure any FS file for atomic writes (if
> supported) would be preferable, right?

/me stumbles back in with plenty of covidbrain to go around.

So ... Christoph, you're asking for a common API for
sysadmins/applications to signal to the filesystem that they want all
data allocations aligned to a given value, right?

e.g. if a nvme device advertises a capability for untorn writes between
$lbasize and 64k, then we need a way to set up each untorn-file with
some alignment between $lbasize and 64k?

or if cxl storage becomes not ung-dly expensive, then we'd want a way to
set up files with an alignment of 2M (x86) or 512M (arm64lol) to take
advantage of PMD mmap mappings?

I guess we could define a fsxattr xflag for FORCEALIGN and declare that
the fsx_extsize field becomes mandates mapping startoff/startblock
alignment if FORCEALIGN is set.

It just happens that since fsxattr comes from XFS then it'll be easy
enough for us to wire that up.  Maybe.

Concretely, xfs would have to have a way to force the
ag/rtgroup/rtextent size to align with the maximum anticipated required
alignment (e.g. 64k) of the device so that groups (and hence per-group
allocations) don't split the alignment.  xfs would also have need to
constrain the per-inode alignment to some factor of the ag size / rt
extent size.

Writing files could use hw acceleration directly if all the factors are
set up correctly; or fall back to COW.  Assuming there's a way to
require that writeback pick up all the folios in a $forcealign chunk for
simultaneous writeout.  (I think this is a lingering bug in the xfs
rtreflink patchset.)  Also, IIRC John mentioned that there might be some
programs that would rather get an EIO than fall back to COW.

ext4 could maybe sort of do this by allowing forcealign alignments up to
the bigalloc size, if bigalloc is enabled.  Assuming bigalloc isn't DOA.
IIRC bigalloc only supports power of 2 factors.  It wouldn't have the
COW problem because it only supports overwrites.

As usual, I dunno about btrfs.

So does that sound like more or less what you're getting at, Christoph?
Or have I misread the entire thing?  (PS: there's been like 1200 fsdevel
messages since I got sick and I've only had sufficient brainpower to
limp the xfs 6.8 patches across the finish line...)

--D

> > 
> > > For XFS, I thought that your idea was to always CoW new extents for
> > > misaligned extents or writes which spanned multiple extents.
> > Well, that is useful for two things:
> > 
> >   - atomic writes on hardware that does not support it
> >   - atomic writes for bufferd I/O
> >   - supporting other sizes / alignments than the strict power of
> >     two above.
> 
> Sure
> 
> > 
> > > Right, so we should limit atomic write queue limits to max_hw_sectors. But
> > > people can still tweak max_sectors, and I am inclined to say that
> > > atomic_write_unit_max et al should be (dynamically) limited to max_sectors
> > > also.
> > Allowing people to tweak it seems to be asking for trouble.
> 
> I tend to agree....
> 
> Let's just limit at max_hw_sectors.
> 
> > 
> > > > have that silly limit.  For NVMe that would require SGL support
> > > > (and some driver changes I've been wanting to make for long where
> > > > we always use SGLs for transfers larger than a single PRP if supported)
> > > If we could avoid dealing with a virt boundary, then that would be nice.
> > > 
> > > Are there any patches yet for the change to always use SGLs for transfers
> > > larger than a single PRP?
> > No.
> 
> As below, if we are going to enforce alignment/size of iovecs elsewhere,
> then I don't need to worry about virt boundary.
> 
> > 
> > > On a related topic, I am not sure about how - or if we even should -
> > > enforce iovec PAGE-alignment or length; rather, the user could just be
> > > advised that iovecs must be PAGE-aligned and min PAGE length to achieve
> > > atomic_write_unit_max.
> > Anything that just advices the user an it not clear cut and results in
> > an error is data loss waiting to happen.  Even more so if it differs
> > from device to device.
> 
> ok, fine. I suppose that we better enforce it then.
> 
> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-19  5:14         ` Darrick J. Wong
@ 2023-12-19  5:21           ` Christoph Hellwig
  2023-12-19 12:41             ` John Garry
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2023-12-19  5:21 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, Christoph Hellwig, axboe, kbusch, sagi, jejb,
	martin.petersen, viro, brauner, dchinner, jack, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-scsi, ming.lei, jaswin, bvanassche

On Mon, Dec 18, 2023 at 09:14:56PM -0800, Darrick J. Wong wrote:
> /me stumbles back in with plenty of covidbrain to go around.
> 
> So ... Christoph, you're asking for a common API for
> sysadmins/applications to signal to the filesystem that they want all
> data allocations aligned to a given value, right?
> 
> e.g. if a nvme device advertises a capability for untorn writes between
> $lbasize and 64k, then we need a way to set up each untorn-file with
> some alignment between $lbasize and 64k?
> 
> or if cxl storage becomes not ung-dly expensive, then we'd want a way to
> set up files with an alignment of 2M (x86) or 512M (arm64lol) to take
> advantage of PMD mmap mappings?

The most important point is to not mix these up.

If we want to use a file for atomic writes I should tell the fs about
it, and preferably in a way that does not require me to know about weird
internal implementation details of every file system.  I really just
want to use atomic writes.  Preferably I'd just start using them after
asking for the limits.  But that's obviously not going to work for
file systems that use the hardware offload and don't otherwise align
to the limit (which would suck for very small files anyway :))

So as a compromise I tell the file system before writing or otherwise
adding any data [1] to file that I want to use atomic writes so that
the fs can take the right decisions.

[1] reflinking data into a such marked file will be ... interesting.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-19  5:21           ` Christoph Hellwig
@ 2023-12-19 12:41             ` John Garry
  2023-12-19 15:17               ` Christoph Hellwig
  0 siblings, 1 reply; 73+ messages in thread
From: John Garry @ 2023-12-19 12:41 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J. Wong
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, viro, brauner,
	dchinner, jack, linux-block, linux-kernel, linux-nvme, linux-xfs,
	linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei, jaswin,
	bvanassche

On 19/12/2023 05:21, Christoph Hellwig wrote:
> On Mon, Dec 18, 2023 at 09:14:56PM -0800, Darrick J. Wong wrote:
>> /me stumbles back in with plenty of covidbrain to go around.
>>
>> So ... Christoph, you're asking for a common API for
>> sysadmins/applications to signal to the filesystem that they want all
>> data allocations aligned to a given value, right?
>>
>> e.g. if a nvme device advertises a capability for untorn writes between
>> $lbasize and 64k, then we need a way to set up each untorn-file with
>> some alignment between $lbasize and 64k?
>>
>> or if cxl storage becomes not ung-dly expensive, then we'd want a way to
>> set up files with an alignment of 2M (x86) or 512M (arm64lol) to take
>> advantage of PMD mmap mappings?
> 
> The most important point is to not mix these up.
> 
> If we want to use a file for atomic writes I should tell the fs about
> it, and preferably in a way that does not require me to know about weird
> internal implementation details of every file system.  I really just
> want to use atomic writes.  Preferably I'd just start using them after
> asking for the limits.  But that's obviously not going to work for
> file systems that use the hardware offload and don't otherwise align
> to the limit (which would suck for very small files anyway :))
> 
> So as a compromise I tell the file system before writing or otherwise
> adding any data [1] to file that I want to use atomic writes so that
> the fs can take the right decisions.
> 
> [1] reflinking data into a such marked file will be ... interesting.
> 

How about something based on fcntl, like below? We will prob also 
require some per-FS flag for enabling atomic writes without HW support. 
That flag might be also useful for XFS for differentiating forcealign 
for atomic writes with just forcealign.

---8<----

diff --git a/fs/fcntl.c b/fs/fcntl.c
index c80a6acad742..d4a50655565a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -313,6 +313,15 @@ static long fcntl_rw_hint(struct file *file, 
unsigned int cmd,
  	}
  }

+static long fcntl_atomic_write(struct file *file, unsigned int cmd,
+			  unsigned long arg)
+{
+	if (file->f_op->enable_atomic_writes)
+		return file->f_op->enable_atomic_writes(file, arg);
+
+	return -EOPNOTSUPP;
+}
+
  static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
  		struct file *filp)
  {
@@ -419,6 +428,9 @@ static long do_fcntl(int fd, unsigned int cmd, 
unsigned long arg,
  	case F_SET_RW_HINT:
  		err = fcntl_rw_hint(filp, cmd, arg);
  		break;
+	case F_SET_ATOMIC_WRITE_SIZE:
+		err = fcntl_atomic_write(filp, cmd, arg);
+		break;
  	default:
  		break;
  	}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e33e5e13b95f..dba206cbe1ab 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1478,6 +1478,46 @@ xfs_file_mmap(
  	return 0;
  }

+STATIC int
+xfs_enable_atomic_writes(
+	struct file	*file,
+	unsigned int unit_max)
+{
+	struct xfs_inode *ip = XFS_I(file_inode(file));
+	struct mnt_idmap *idmap = file_mnt_idmap(file);
+	struct dentry *dentry = file->f_path.dentry;
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	struct block_device	*bdev = target->bt_bdev;
+	struct request_queue	*q = bdev->bd_queue;
+	struct inode		*inode = &ip->i_vnode;
+
+	if (queue_atomic_write_unit_max_bytes(q) == 0) {
+		if (unit_max != 0) {
+			/* We expect unbounded CoW size if no HW support */
+			return -ENOTBLK;
+		}
+		/* Do something for CoW support ... */
+
+		return 0;
+	}
+
+	if (!unit_max)
+		return -EINVAL;
+
+	/* bodge alert */
+	if (inode->i_op->fileattr_set) {
+		struct fileattr fa = {
+			.fsx_extsize = unit_max,
+			.fsx_xflags = FS_XFLAG_EXTSIZE | FS_XFLAG_FORCEALIGN,
+			.fsx_valid = 1,
+		};
+
+		return inode->i_op->fileattr_set(idmap, dentry, &fa);
+	}
+
+	return -EOPNOTSUPP;
+}
+
  const struct file_operations xfs_file_operations = {
  	.llseek		= xfs_file_llseek,
  	.read_iter	= xfs_file_read_iter,
@@ -1498,6 +1538,7 @@ const struct file_operations xfs_file_operations = {
  	.fallocate	= xfs_file_fallocate,
  	.fadvise	= xfs_file_fadvise,
  	.remap_file_range = xfs_file_remap_range,
+	.enable_atomic_writes = xfs_enable_atomic_writes,
  };

  const struct file_operations xfs_dir_file_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98b7a7a8c42e..a715f98057b8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1956,6 +1956,7 @@ struct file_operations {
  	int (*uring_cmd)(struct io_uring_cmd *ioucmd, unsigned int issue_flags);
  	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
  				unsigned int poll_flags);
+	int (*enable_atomic_writes)(struct file *, unsigned int unit_max);
  } __randomize_layout;

  /* Wrap a directory iterator that needs exclusive inode access */
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6c80f96049bd..69780c49622e 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -56,6 +56,8 @@
  #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
  #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)

+#define F_SET_ATOMIC_WRITE_SIZE  (F_LINUX_SPECIFIC_BASE + 15)
+
  /*
   * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
   * used to clear any hints previously set.
-- 


--->8----

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-19 12:41             ` John Garry
@ 2023-12-19 15:17               ` Christoph Hellwig
  2023-12-19 16:53                 ` John Garry
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2023-12-19 15:17 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Darrick J. Wong, axboe, kbusch, sagi, jejb,
	martin.petersen, viro, brauner, dchinner, jack, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-scsi, ming.lei, jaswin, bvanassche

On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
> How about something based on fcntl, like below? We will prob also require 
> some per-FS flag for enabling atomic writes without HW support. That flag 
> might be also useful for XFS for differentiating forcealign for atomic 
> writes with just forcealign.

I would have just exposed it through a user visible flag instead of
adding yet another ioctl/fcntl opcode and yet another method.

And yes, for anything that doesn't always support atomic writes it would
need to be persisted.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-19 15:17               ` Christoph Hellwig
@ 2023-12-19 16:53                 ` John Garry
  2023-12-21  6:50                   ` Christoph Hellwig
  0 siblings, 1 reply; 73+ messages in thread
From: John Garry @ 2023-12-19 16:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, axboe, kbusch, sagi, jejb, martin.petersen,
	viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, jaswin, bvanassche

On 19/12/2023 15:17, Christoph Hellwig wrote:
> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>> How about something based on fcntl, like below? We will prob also require
>> some per-FS flag for enabling atomic writes without HW support. That flag
>> might be also useful for XFS for differentiating forcealign for atomic
>> writes with just forcealign.
> I would have just exposed it through a user visible flag instead of
> adding yet another ioctl/fcntl opcode and yet another method.
> 

Any specific type of flag?

I would suggest a file attribute which we can set via chattr, but that 
is still using an ioctl and would require a new inode flag; but at least 
there is standard userspace support.

> And yes, for anything that doesn't always support atomic writes it would
> need to be persisted.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-19 16:53                 ` John Garry
@ 2023-12-21  6:50                   ` Christoph Hellwig
  2023-12-21  9:49                     ` John Garry
  2024-01-09  9:55                     ` John Garry
  0 siblings, 2 replies; 73+ messages in thread
From: Christoph Hellwig @ 2023-12-21  6:50 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Darrick J. Wong, axboe, kbusch, sagi, jejb,
	martin.petersen, viro, brauner, dchinner, jack, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-scsi, ming.lei, jaswin, bvanassche

On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
> On 19/12/2023 15:17, Christoph Hellwig wrote:
>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>>> How about something based on fcntl, like below? We will prob also require
>>> some per-FS flag for enabling atomic writes without HW support. That flag
>>> might be also useful for XFS for differentiating forcealign for atomic
>>> writes with just forcealign.
>> I would have just exposed it through a user visible flag instead of
>> adding yet another ioctl/fcntl opcode and yet another method.
>>
>
> Any specific type of flag?
>
> I would suggest a file attribute which we can set via chattr, but that is 
> still using an ioctl and would require a new inode flag; but at least there 
> is standard userspace support.

I'd be fine with that, but we're kinda running out of flag there.
That's why I suggested the FS_XFLAG_ instead, which basically works
the same.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-21  6:50                   ` Christoph Hellwig
@ 2023-12-21  9:49                     ` John Garry
  2023-12-21 12:19                       ` Christoph Hellwig
  2024-01-09  9:55                     ` John Garry
  1 sibling, 1 reply; 73+ messages in thread
From: John Garry @ 2023-12-21  9:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, axboe, kbusch, sagi, jejb, martin.petersen,
	viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, jaswin, bvanassche

On 21/12/2023 06:50, Christoph Hellwig wrote:
> On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
>> On 19/12/2023 15:17, Christoph Hellwig wrote:
>>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>>>> How about something based on fcntl, like below? We will prob also require
>>>> some per-FS flag for enabling atomic writes without HW support. That flag
>>>> might be also useful for XFS for differentiating forcealign for atomic
>>>> writes with just forcealign.
>>> I would have just exposed it through a user visible flag instead of
>>> adding yet another ioctl/fcntl opcode and yet another method.
>>>
>>
>> Any specific type of flag?
>>
>> I would suggest a file attribute which we can set via chattr, but that is
>> still using an ioctl and would require a new inode flag; but at least there
>> is standard userspace support.
> 
> I'd be fine with that, but we're kinda running out of flag there.

Yeah, in looking at e2fsprogs they are all used.

> That's why I suggested the FS_XFLAG_ instead, which basically works
> the same.
> 

ok, fine, we can try that out.

On another topic, maybe you can advise..

I noticed the NVMe patch to stop always setting virt boundary (thanks), 
but I am struggling for the wording for iovecs rules. I'd like to reuse 
iov_iter_is_aligned() to enforce any such rule.

I am thinking:
- ubuf / iovecs need to be PAGE-aligned
- each iovec needs to be length of multiple of PAGE_SIZE

But that does not work for total length < PAGE_SIZE.

So then we could have:
- ubuf / iovecs need to be PAGE-aligned
- each iovec needs to be length of multiple of atomic_write_unit_min. If 
total length > PAGE_SIZE, each iovec also needs to be a multiple of 
PAGE_SIZE.

I'd rather something simpler. Maybe it's ok.

Thanks,
John


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-21  9:49                     ` John Garry
@ 2023-12-21 12:19                       ` Christoph Hellwig
  2023-12-21 12:48                         ` John Garry
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2023-12-21 12:19 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Darrick J. Wong, axboe, kbusch, sagi, jejb,
	martin.petersen, viro, brauner, dchinner, jack, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-scsi, ming.lei, jaswin, bvanassche

On Thu, Dec 21, 2023 at 09:49:35AM +0000, John Garry wrote:
> I noticed the NVMe patch to stop always setting virt boundary (thanks), but 
> I am struggling for the wording for iovecs rules. I'd like to reuse 
> iov_iter_is_aligned() to enforce any such rule.
>
> I am thinking:
> - ubuf / iovecs need to be PAGE-aligned
> - each iovec needs to be length of multiple of PAGE_SIZE
>
> But that does not work for total length < PAGE_SIZE.
>
> So then we could have:
> - ubuf / iovecs need to be PAGE-aligned
> - each iovec needs to be length of multiple of atomic_write_unit_min. If 
> total length > PAGE_SIZE, each iovec also needs to be a multiple of 
> PAGE_SIZE.
>
> I'd rather something simpler. Maybe it's ok.

If we decided to not support atomic writes on anything setting a virt
boundary we don't have to care about the alignment of each vector,
and IMHO we should do that as everything else would be a life in
constant pain.  If we really have a use case for atomic writes on
consumer NVMe devices we'll just have to limit it to a single iovec.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-21 12:19                       ` Christoph Hellwig
@ 2023-12-21 12:48                         ` John Garry
  2023-12-21 12:57                           ` Christoph Hellwig
  0 siblings, 1 reply; 73+ messages in thread
From: John Garry @ 2023-12-21 12:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, axboe, kbusch, sagi, jejb, martin.petersen,
	viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, jaswin, bvanassche

On 21/12/2023 12:19, Christoph Hellwig wrote:
>> So then we could have:
>> - ubuf / iovecs need to be PAGE-aligned
>> - each iovec needs to be length of multiple of atomic_write_unit_min. If
>> total length > PAGE_SIZE, each iovec also needs to be a multiple of
>> PAGE_SIZE.
>>
>> I'd rather something simpler. Maybe it's ok.
> If we decided to not support atomic writes on anything setting a virt
> boundary we don't have to care about the alignment of each vector,

ok, I think that alignment is not so important, but we still need to 
consider a minimum length per iovec, such that we will always be able to 
fit a write of length atomic_write_unit_max in a bio.

> and IMHO we should do that as everything else would be a life in
> constant pain.  If we really have a use case for atomic writes on
> consumer NVMe devices we'll just have to limit it to a single iovec.

I'd be more inclined to say that SGL support is compulsory, but I don't 
know if that is limiting atomic writes support to an unsatisfactory 
market segment.



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-21 12:48                         ` John Garry
@ 2023-12-21 12:57                           ` Christoph Hellwig
  2023-12-21 13:18                             ` John Garry
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2023-12-21 12:57 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Darrick J. Wong, axboe, kbusch, sagi, jejb,
	martin.petersen, viro, brauner, dchinner, jack, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-scsi, ming.lei, jaswin, bvanassche

On Thu, Dec 21, 2023 at 12:48:24PM +0000, John Garry wrote:
>>> - ubuf / iovecs need to be PAGE-aligned
>>> - each iovec needs to be length of multiple of atomic_write_unit_min. If
>>> total length > PAGE_SIZE, each iovec also needs to be a multiple of
>>> PAGE_SIZE.
>>>
>>> I'd rather something simpler. Maybe it's ok.
>> If we decided to not support atomic writes on anything setting a virt
>> boundary we don't have to care about the alignment of each vector,
>
> ok, I think that alignment is not so important, but we still need to 
> consider a minimum length per iovec, such that we will always be able to 
> fit a write of length atomic_write_unit_max in a bio.

I don't think you man a minim length per iovec for that, but a maximum
number of iovecs instead.  For SGL-capable devices that would be
BIO_MAX_VECS, otherwise 1.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-21 12:57                           ` Christoph Hellwig
@ 2023-12-21 13:18                             ` John Garry
  2023-12-21 13:22                               ` Christoph Hellwig
  0 siblings, 1 reply; 73+ messages in thread
From: John Garry @ 2023-12-21 13:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, axboe, kbusch, sagi, jejb, martin.petersen,
	viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, jaswin, bvanassche

On 21/12/2023 12:57, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 12:48:24PM +0000, John Garry wrote:
>>>> - ubuf / iovecs need to be PAGE-aligned
>>>> - each iovec needs to be length of multiple of atomic_write_unit_min. If
>>>> total length > PAGE_SIZE, each iovec also needs to be a multiple of
>>>> PAGE_SIZE.
>>>>
>>>> I'd rather something simpler. Maybe it's ok.
>>> If we decided to not support atomic writes on anything setting a virt
>>> boundary we don't have to care about the alignment of each vector,
>>
>> ok, I think that alignment is not so important, but we still need to
>> consider a minimum length per iovec, such that we will always be able to
>> fit a write of length atomic_write_unit_max in a bio.
> 
> I don't think you man a minim length per iovec for that, but a maximum
> number of iovecs instead. 

That would make sense, but I was thinking that if the request queue has 
a limit on segments then we would need to specify a iovec min length.

> For SGL-capable devices that would be
> BIO_MAX_VECS, otherwise 1.

ok, but we would need to advertise that or whatever segment limit. A 
statx field just for that seems a bit inefficient in terms of space.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-21 13:18                             ` John Garry
@ 2023-12-21 13:22                               ` Christoph Hellwig
  2023-12-21 13:56                                 ` John Garry
  2024-01-16 11:35                                 ` John Garry
  0 siblings, 2 replies; 73+ messages in thread
From: Christoph Hellwig @ 2023-12-21 13:22 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Darrick J. Wong, axboe, kbusch, sagi, jejb,
	martin.petersen, viro, brauner, dchinner, jack, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-scsi, ming.lei, jaswin, bvanassche

On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote:
>> For SGL-capable devices that would be
>> BIO_MAX_VECS, otherwise 1.
>
> ok, but we would need to advertise that or whatever segment limit. A statx 
> field just for that seems a bit inefficient in terms of space.

I'd rather not hard code BIO_MAX_VECS in the ABI, which suggest we
want to export is as a field.  Network file systems also might have
their own limits for one reason or another.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-21 13:22                               ` Christoph Hellwig
@ 2023-12-21 13:56                                 ` John Garry
  2024-01-16 11:35                                 ` John Garry
  1 sibling, 0 replies; 73+ messages in thread
From: John Garry @ 2023-12-21 13:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, axboe, kbusch, sagi, jejb, martin.petersen,
	viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, jaswin, bvanassche

On 21/12/2023 13:22, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote:
>>> For SGL-capable devices that would be
>>> BIO_MAX_VECS, otherwise 1.
>>
>> ok, but we would need to advertise that or whatever segment limit. A statx
>> field just for that seems a bit inefficient in terms of space.
> 
> I'd rather not hard code BIO_MAX_VECS in the ABI,

For sure

> which suggest we
> want to export is as a field. 

ok

> Network file systems also might have
> their own limits for one reason or another.

Understood

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-21  6:50                   ` Christoph Hellwig
  2023-12-21  9:49                     ` John Garry
@ 2024-01-09  9:55                     ` John Garry
  2024-01-09 16:02                       ` Christoph Hellwig
  2024-01-09 23:04                       ` Dave Chinner
  1 sibling, 2 replies; 73+ messages in thread
From: John Garry @ 2024-01-09  9:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, axboe, kbusch, sagi, jejb, martin.petersen,
	viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, bvanassche, ojaswin

On 21/12/2023 06:50, Christoph Hellwig wrote:
> On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
>> On 19/12/2023 15:17, Christoph Hellwig wrote:
>>> On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
>>>> How about something based on fcntl, like below? We will prob also require
>>>> some per-FS flag for enabling atomic writes without HW support. That flag
>>>> might be also useful for XFS for differentiating forcealign for atomic
>>>> writes with just forcealign.
>>> I would have just exposed it through a user visible flag instead of
>>> adding yet another ioctl/fcntl opcode and yet another method.
>>>
>> Any specific type of flag?
>>
>> I would suggest a file attribute which we can set via chattr, but that is
>> still using an ioctl and would require a new inode flag; but at least there
>> is standard userspace support.
> I'd be fine with that, but we're kinda running out of flag there.
> That's why I suggested the FS_XFLAG_ instead, which basically works
> the same.

Hi Christoph,

Coming back to this topic... how about this FS_XFLAG_ and fsxattr update:

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index da43810b7485..9ef15fced20c 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -118,7 +118,8 @@ struct fsxattr {
        __u32           fsx_nextents;   /* nextents field value (get)   */
        __u32           fsx_projid;     /* project identifier (get/set) */
        __u32           fsx_cowextsize; /* CoW extsize field value 
(get/set)*/
-       unsigned char   fsx_pad[8];
+       __u32           fsx_atomicwrites_size; /* unit max */
+       unsigned char   fsx_pad[4];
};

/*
@@ -140,6 +141,7 @@ struct fsxattr {
#define FS_XFLAG_FILESTREAM    0x00004000      /* use filestream 
allocator */
#define FS_XFLAG_DAX           0x00008000      /* use DAX for IO */
#define FS_XFLAG_COWEXTSIZE    0x00010000      /* CoW extent size
allocator hint */
+#define FS_XFLAG_ATOMICWRITES  0x00020000
#define FS_XFLAG_HASATTR       0x80000000      /* no DIFLAG for this   */

/* the read-only stuff doesn't really belong here, but any other place is
lines 1-22/22 (END)

Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE 
being set.

So a user can issue:

 >xfs_io -c "atomic-writes 64K" mnt/file
 >xfs_io -c "atomic-writes" mnt/file
[65536] mnt/file

and then:

/xfs_io -c "lsattr" mnt/file
------------------W mnt/file

(W is new flag for atomic writes obvs)

The user will still have to issue statx to get the actual atomic write 
limit for a file, as 'xfs_io -c "atomic-writes"' does not take into 
account any HW/linux block layer atomic write limits.

FS_XFLAG_ATOMICWRITES will force XFS extent size and alignment to 
fsx_atomicwrites_size when we have HW support, so effectively same as 
forcealign.  For no HW support, we still specify a size. In case of 
possible XFS CoW solution for no atomic write HW support, I suppose that 
there would be no size limit in reality, so the specifying the size 
would only be just for userspace experience consistency.

Is this the sort of userspace API which you would like to see?

John

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-09  9:55                     ` John Garry
@ 2024-01-09 16:02                       ` Christoph Hellwig
  2024-01-09 16:52                         ` John Garry
  2024-01-09 23:04                       ` Dave Chinner
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2024-01-09 16:02 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Darrick J. Wong, axboe, kbusch, sagi, jejb,
	martin.petersen, viro, brauner, dchinner, jack, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-scsi, ming.lei, bvanassche, ojaswin

On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote:
> So a user can issue:
>
> >xfs_io -c "atomic-writes 64K" mnt/file
> >xfs_io -c "atomic-writes" mnt/file
> [65536] mnt/file

Let me try to decipher that:

 - the first call sets a 64k fsx_atomicwrites_size size
 - the secon call queries fsx_atomicwrites_size?

> The user will still have to issue statx to get the actual atomic write 
> limit for a file, as 'xfs_io -c "atomic-writes"' does not take into account 
> any HW/linux block layer atomic write limits.

So will the set side never fail?

> Is this the sort of userspace API which you would like to see?

What I had in mind (and that's doesn't mean it's right..) was that
the user just sets a binary flag, and the fs reports the best it
could.  But there might be reasons to do it differently.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-09 16:02                       ` Christoph Hellwig
@ 2024-01-09 16:52                         ` John Garry
  0 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2024-01-09 16:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, axboe, kbusch, sagi, jejb, martin.petersen,
	viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, bvanassche, ojaswin

On 09/01/2024 16:02, Christoph Hellwig wrote:
> On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote:
>> So a user can issue:
>>
>>> xfs_io -c "atomic-writes 64K" mnt/file
>>> xfs_io -c "atomic-writes" mnt/file
>> [65536] mnt/file
> Let me try to decipher that:
> 
>   - the first call sets a 64k fsx_atomicwrites_size size

It should also set FS_XFLAG_ATOMICWRITES for the file. So this step does 
everything to enable atomic writes for that file.

>   - the secon call queries fsx_atomicwrites_size?

Right, I'm just demo'ing how it would look

> 
>> The user will still have to issue statx to get the actual atomic write
>> limit for a file, as 'xfs_io -c "atomic-writes"' does not take into account
>> any HW/linux block layer atomic write limits.
> So will the set side never fail?

It could fail.

Examples of when it could fail could include:

a. If user gave a bad size value. So the size should be a power-of-2 and 
also divisible into the AG size and compatible with any stripe alignment.

b. If the file already had flags set which are incompatible with or not 
supported for atomic writes.

c. the file already has data written. I guess that's obvious.

> 
>> Is this the sort of userspace API which you would like to see?
> What I had in mind (and that's doesn't mean it's right..) was that
> the user just sets a binary flag, and the fs reports the best it
> could.  But there might be reasons to do it differently.

That is what I am trying to do, but I also want to specify a size for 
the atomic write unit max which the user could want. I'd rather not use 
the atomic write unit max from the device for that, as that could be 
huge. However, re-reading a., above, makes me think that the kernel 
should have more input on this, but we still need some input on the max 
from the user...

Thanks,
John


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-09  9:55                     ` John Garry
  2024-01-09 16:02                       ` Christoph Hellwig
@ 2024-01-09 23:04                       ` Dave Chinner
  2024-01-10  8:55                         ` John Garry
  2024-01-10  9:19                         ` Christoph Hellwig
  1 sibling, 2 replies; 73+ messages in thread
From: Dave Chinner @ 2024-01-09 23:04 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Darrick J. Wong, axboe, kbusch, sagi, jejb,
	martin.petersen, viro, brauner, dchinner, jack, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-scsi, ming.lei, bvanassche, ojaswin

On Tue, Jan 09, 2024 at 09:55:24AM +0000, John Garry wrote:
> On 21/12/2023 06:50, Christoph Hellwig wrote:
> > On Tue, Dec 19, 2023 at 04:53:27PM +0000, John Garry wrote:
> > > On 19/12/2023 15:17, Christoph Hellwig wrote:
> > > > On Tue, Dec 19, 2023 at 12:41:37PM +0000, John Garry wrote:
> > > > > How about something based on fcntl, like below? We will prob also require
> > > > > some per-FS flag for enabling atomic writes without HW support. That flag
> > > > > might be also useful for XFS for differentiating forcealign for atomic
> > > > > writes with just forcealign.
> > > > I would have just exposed it through a user visible flag instead of
> > > > adding yet another ioctl/fcntl opcode and yet another method.
> > > > 
> > > Any specific type of flag?
> > > 
> > > I would suggest a file attribute which we can set via chattr, but that is
> > > still using an ioctl and would require a new inode flag; but at least there
> > > is standard userspace support.
> > I'd be fine with that, but we're kinda running out of flag there.
> > That's why I suggested the FS_XFLAG_ instead, which basically works
> > the same.
> 
> Hi Christoph,
> 
> Coming back to this topic... how about this FS_XFLAG_ and fsxattr update:
> 
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index da43810b7485..9ef15fced20c 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -118,7 +118,8 @@ struct fsxattr {
>        __u32           fsx_nextents;   /* nextents field value (get)   */
>        __u32           fsx_projid;     /* project identifier (get/set) */
>        __u32           fsx_cowextsize; /* CoW extsize field value
> (get/set)*/
> -       unsigned char   fsx_pad[8];
> +       __u32           fsx_atomicwrites_size; /* unit max */
> +       unsigned char   fsx_pad[4];
> };
> 
> /*
> @@ -140,6 +141,7 @@ struct fsxattr {
> #define FS_XFLAG_FILESTREAM    0x00004000      /* use filestream allocator
> */
> #define FS_XFLAG_DAX           0x00008000      /* use DAX for IO */
> #define FS_XFLAG_COWEXTSIZE    0x00010000      /* CoW extent size
> allocator hint */
> +#define FS_XFLAG_ATOMICWRITES  0x00020000
> #define FS_XFLAG_HASATTR       0x80000000      /* no DIFLAG for this   */
> 
> /* the read-only stuff doesn't really belong here, but any other place is
> lines 1-22/22 (END)
> 
> Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE being
> set.
> 
> So a user can issue:
> 
> >xfs_io -c "atomic-writes 64K" mnt/file
> >xfs_io -c "atomic-writes" mnt/file
> [65536] mnt/file

Where are you going to store this value in the inode?  It requires a
new field in the inode and so is a change of on-disk format, right?

As it is, I really don't see this as a better solution than the
original generic "force align" flag that simply makes the extent
size hint alignment a hard physical alignment requirement rather
than just a hint. This has multiple uses (DAX PMD alignment is
another), so I just don't see why something that has a single,
application specific API that implements a hard physical alignment
is desirable.

Indeed, the whole reason that extent size hints are so versatile is
that they implement a generic allocation alignment/size function
that can be used for anything your imagination extends to. If they
were implemented as a "only allow RAID stripe aligned/sized
allocation" for the original use case then that functionality would
have been far less useful than it has proven to be over the past
couple of decades.

Hence history teaches us that we should be designing the API around
the generic filesystem function required (hard alignment of physical
extent allocation), not the specific use case that requires that
functionality.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-09 23:04                       ` Dave Chinner
@ 2024-01-10  8:55                         ` John Garry
  2024-01-10  9:19                         ` Christoph Hellwig
  1 sibling, 0 replies; 73+ messages in thread
From: John Garry @ 2024-01-10  8:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, axboe, kbusch, sagi, jejb,
	martin.petersen, viro, brauner, dchinner, jack, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-scsi, ming.lei, bvanassche, ojaswin

On 09/01/2024 23:04, Dave Chinner wrote:
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -118,7 +118,8 @@ struct fsxattr {
>>         __u32           fsx_nextents;   /* nextents field value (get)   */
>>         __u32           fsx_projid;     /* project identifier (get/set) */
>>         __u32           fsx_cowextsize; /* CoW extsize field value
>> (get/set)*/
>> -       unsigned char   fsx_pad[8];
>> +       __u32           fsx_atomicwrites_size; /* unit max */
>> +       unsigned char   fsx_pad[4];
>> };
>>
>> /*
>> @@ -140,6 +141,7 @@ struct fsxattr {
>> #define FS_XFLAG_FILESTREAM    0x00004000      /* use filestream allocator
>> */
>> #define FS_XFLAG_DAX           0x00008000      /* use DAX for IO */
>> #define FS_XFLAG_COWEXTSIZE    0x00010000      /* CoW extent size
>> allocator hint */
>> +#define FS_XFLAG_ATOMICWRITES  0x00020000
>> #define FS_XFLAG_HASATTR       0x80000000      /* no DIFLAG for this   */
>>
>> /* the read-only stuff doesn't really belong here, but any other place is
>> lines 1-22/22 (END)
>>
>> Having FS_XFLAG_ATOMICWRITES set will lead to FMODE_CAN_ATOMIC_WRITE being
>> set.
>>
>> So a user can issue:
>>
>>> xfs_io -c "atomic-writes 64K" mnt/file
>>> xfs_io -c "atomic-writes" mnt/file
>> [65536] mnt/file
> Where are you going to store this value in the inode?  It requires a
> new field in the inode and so is a change of on-disk format, right?

It would require an on-disk format change, unless we can find an 
alternative way to store the value, like:
a. re-use pre-existing extsize or even cowextsize fields and 'xfs_io -c 
"atomic-writes $SIZE"' would update those fields and 
FS_XFLAG_ATOMICWRITES would be incompatible with FS_XFLAG_COWEXTSIZE or 
FS_XFLAG_EXTSIZE
b. require FS_XFLAG_EXTSIZE and extsize be also set to enable atomic 
writes, and extsize is used for atomic write unit max

I'm trying to think of ways to avoid requiring a value, but I don't see 
good options, like:
- make atomic write unit max some compile-time option
- require mkfs stripe alignment/width be set and use that as basis for 
atomic write unit max

We could just use the atomic write unit max which HW provides, but that 
could be 1MB or more and that will hardly give efficient data usage for 
small files. But maybe we don't care about that if we expect this 
feature to only be used on DB files, which can be huge anyway. However I 
still have concerns – we require that value to be fixed, but a disk 
firmware update could increase that value and this could mean we have 
what would be pre-existing mis-aligned extents.

> 
> As it is, I really don't see this as a better solution than the
> original generic "force align" flag that simply makes the extent
> size hint alignment a hard physical alignment requirement rather
> than just a hint. This has multiple uses (DAX PMD alignment is
> another), so I just don't see why something that has a single,
> application specific API that implements a hard physical alignment
> is desirable.

I would still hope that we will support forcealign separately for those 
purposes.

> 
> Indeed, the whole reason that extent size hints are so versatile is
> that they implement a generic allocation alignment/size function
> that can be used for anything your imagination extends to. If they
> were implemented as a "only allow RAID stripe aligned/sized
> allocation" for the original use case then that functionality would
> have been far less useful than it has proven to be over the past
> couple of decades.
> 
> Hence history teaches us that we should be designing the API around
> the generic filesystem function required (hard alignment of physical
> extent allocation), not the specific use case that requires that
> functionality.

I understand your concern. However I am not even sure that forcealign 
even gives us everything we want to enable atomic writes. There is an 
issue where we were required to pre-zero a file prior to issuing atomic 
writes to ensure extents are suitably sized, so FS_XFLAG_ATOMICWRITES 
would make the FS do what is required to avoid that pre-zeroing (but 
that pre-zeroing requirement that does sound like a forcealign issue...)

Furthermore, there was some desire to support atomic writes on block 
devices with no HW support by using a CoW-based solution, and forcealign 
would not be relevant there.

Thanks,
John


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-09 23:04                       ` Dave Chinner
  2024-01-10  8:55                         ` John Garry
@ 2024-01-10  9:19                         ` Christoph Hellwig
  2024-01-11  1:40                           ` Darrick J. Wong
  1 sibling, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2024-01-10  9:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, Christoph Hellwig, Darrick J. Wong, axboe, kbusch,
	sagi, jejb, martin.petersen, viro, brauner, dchinner, jack,
	linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, bvanassche, ojaswin

On Wed, Jan 10, 2024 at 10:04:00AM +1100, Dave Chinner wrote:
> Hence history teaches us that we should be designing the API around
> the generic filesystem function required (hard alignment of physical
> extent allocation), not the specific use case that requires that
> functionality.

I disagree.  The alignment requirement is an artefact of how you
implement atomic writes.  As the fs user I care that I can do atomic
writes on a file and need to query how big the writes can be and
what alignment is required.

The forcealign feature is a sensible fs side implementation of that
if using hardware based atomic writes with alignment requirements,
but it is a really lousy userspace API.

So with John's API proposal for XFS with hardware alignment based atomic
writes we could still use force align.

Requesting atomic writes for an inode will set the forcealign flag
and the extent size hint, and after that it'll report atomic write
capabilities.  Roughly the same implementation, but not an API
tied to an implementation detail.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-10  9:19                         ` Christoph Hellwig
@ 2024-01-11  1:40                           ` Darrick J. Wong
  2024-01-11  5:02                             ` Christoph Hellwig
  0 siblings, 1 reply; 73+ messages in thread
From: Darrick J. Wong @ 2024-01-11  1:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, John Garry, axboe, kbusch, sagi, jejb,
	martin.petersen, viro, brauner, dchinner, jack, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-scsi, ming.lei, bvanassche, ojaswin

On Wed, Jan 10, 2024 at 10:19:29AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 10, 2024 at 10:04:00AM +1100, Dave Chinner wrote:
> > Hence history teaches us that we should be designing the API around
> > the generic filesystem function required (hard alignment of physical
> > extent allocation), not the specific use case that requires that
> > functionality.
> 
> I disagree.  The alignment requirement is an artefact of how you
> implement atomic writes.  As the fs user I care that I can do atomic
> writes on a file and need to query how big the writes can be and
> what alignment is required.
> 
> The forcealign feature is a sensible fs side implementation of that
> if using hardware based atomic writes with alignment requirements,
> but it is a really lousy userspace API.
> 
> So with John's API proposal for XFS with hardware alignment based atomic
> writes we could still use force align.
> 
> Requesting atomic writes for an inode will set the forcealign flag
> and the extent size hint, and after that it'll report atomic write
> capabilities.  Roughly the same implementation, but not an API
> tied to an implementation detail.

Sounds good to me!  So to summarize, this is approximately what
userspace programs would have to do something like this:

struct statx statx;
struct fsxattr fsxattr;
int fd = open('/foofile', O_RDWR | O_DIRECT);

ioctl(fd, FS_IOC_GETXATTR, &fsxattr);

fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */

ioctl(fd, FS_IOC_SETXATTR, &fsxattr);

statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);

if (statx.stx_atomic_write_unit_max >= 16384) {
	pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
	printf("HAPPY DANCE\n");
}

(Assume we bail out on errors.)

--D

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-11  1:40                           ` Darrick J. Wong
@ 2024-01-11  5:02                             ` Christoph Hellwig
  2024-01-11  9:55                               ` John Garry
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2024-01-11  5:02 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Dave Chinner, John Garry, axboe, kbusch, sagi,
	jejb, martin.petersen, viro, brauner, dchinner, jack,
	linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, bvanassche, ojaswin

On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
> struct statx statx;
> struct fsxattr fsxattr;
> int fd = open('/foofile', O_RDWR | O_DIRECT);
> 
> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
> 
> fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
> fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
> 
> ioctl(fd, FS_IOC_SETXATTR, &fsxattr);
> 
> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
> 
> if (statx.stx_atomic_write_unit_max >= 16384) {
> 	pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
> 	printf("HAPPY DANCE\n");
> }

I think this still needs a check if the fs needs alignment for
atomic writes at all. i.e.

struct statx statx;
struct fsxattr fsxattr;
int fd = open('/foofile', O_RDWR | O_DIRECT);

ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
if (statx.stx_atomic_write_unit_max < 16384) {
	bailout();
}

fsxattr.fsx_xflags |= FS_XFLAG_WRITE_ATOMIC;
if (statx.stx_atomic_write_alignment) {
	fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN;
	fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
}
if (ioctl(fd, FS_IOC_SETXATTR, &fsxattr) < 1) {
	bailout();
}

pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
printf("HAPPY DANCE\n");

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-11  5:02                             ` Christoph Hellwig
@ 2024-01-11  9:55                               ` John Garry
  2024-01-11 14:45                                 ` Christoph Hellwig
  0 siblings, 1 reply; 73+ messages in thread
From: John Garry @ 2024-01-11  9:55 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J. Wong
  Cc: Dave Chinner, axboe, kbusch, sagi, jejb, martin.petersen, viro,
	brauner, dchinner, jack, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi, ming.lei,
	bvanassche, ojaswin

On 11/01/2024 05:02, Christoph Hellwig wrote:
> On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
>> struct statx statx;
>> struct fsxattr fsxattr;
>> int fd = open('/foofile', O_RDWR | O_DIRECT);

I'm assuming O_CREAT also.

>>
>> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>>
>> fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN | FS_XFLAG_WRITE_ATOMIC;
>> fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
>>
>> ioctl(fd, FS_IOC_SETXATTR, &fsxattr);
>>
>> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>>
>> if (statx.stx_atomic_write_unit_max >= 16384) {
>> 	pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
>> 	printf("HAPPY DANCE\n");
>> }
> 
> I think this still needs a check if the fs needs alignment for
> atomic writes at all. i.e.
> 
> struct statx statx;
> struct fsxattr fsxattr;
> int fd = open('/foofile', O_RDWR | O_DIRECT);
> 
> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
> if (statx.stx_atomic_write_unit_max < 16384) {
> 	bailout();
> }

How could this value be >= 16384 initially? Would it be from 
pre-configured FS alignment, like XFS RT extsize? Or is this from some 
special CoW-based atomic write support? Or FS block size of 16384?

Incidentally, for consistency only setting FS_XFLAG_WRITE_ATOMIC will 
lead to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC 
is set would it make sense to have statx return 0 for 
STATX_WRITE_ATOMIC. Otherwise the user may be misled to think that it is 
ok to issue an atomic write (when it isn’t).

Thanks,
John

> 
> fsxattr.fsx_xflags |= FS_XFLAG_WRITE_ATOMIC;
> if (statx.stx_atomic_write_alignment) {
> 	fsxattr.fsx_xflags |= FS_XFLAG_FORCEALIGN;
> 	fsxattr.fsx_extsize = 16384; /* only for hardware no-tears writes */
> }
> if (ioctl(fd, FS_IOC_SETXATTR, &fsxattr) < 1) {
> 	bailout();
> }
> 
> pwrite(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);
> printf("HAPPY DANCE\n");
> 



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-11  9:55                               ` John Garry
@ 2024-01-11 14:45                                 ` Christoph Hellwig
  2024-01-11 16:11                                   ` John Garry
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2024-01-11 14:45 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, axboe, kbusch,
	sagi, jejb, martin.petersen, viro, brauner, dchinner, jack,
	linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, bvanassche, ojaswin

On Thu, Jan 11, 2024 at 09:55:36AM +0000, John Garry wrote:
> On 11/01/2024 05:02, Christoph Hellwig wrote:
>> On Wed, Jan 10, 2024 at 05:40:56PM -0800, Darrick J. Wong wrote:
>>> struct statx statx;
>>> struct fsxattr fsxattr;
>>> int fd = open('/foofile', O_RDWR | O_DIRECT);
>
> I'm assuming O_CREAT also.

Yes.

>> I think this still needs a check if the fs needs alignment for
>> atomic writes at all. i.e.
>>
>> struct statx statx;
>> struct fsxattr fsxattr;
>> int fd = open('/foofile', O_RDWR | O_DIRECT);
>>
>> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>> if (statx.stx_atomic_write_unit_max < 16384) {
>> 	bailout();
>> }
>
> How could this value be >= 16384 initially? Would it be from pre-configured 
> FS alignment, like XFS RT extsize? Or is this from some special CoW-based 
> atomic write support? Or FS block size of 16384?

Sorry, this check should not be here at all, we should only check it
later.

> Incidentally, for consistency only setting FS_XFLAG_WRITE_ATOMIC will lead 
> to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC is set 
> would it make sense to have statx return 0 for STATX_WRITE_ATOMIC. 

True.  We might need to report the limits even without that, though.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-11 14:45                                 ` Christoph Hellwig
@ 2024-01-11 16:11                                   ` John Garry
  2024-01-11 16:15                                     ` Christoph Hellwig
  0 siblings, 1 reply; 73+ messages in thread
From: John Garry @ 2024-01-11 16:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Dave Chinner, axboe, kbusch, sagi, jejb,
	martin.petersen, viro, brauner, dchinner, jack, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-scsi, ming.lei, bvanassche, ojaswin


> 
>>> I think this still needs a check if the fs needs alignment for
>>> atomic writes at all. i.e.
>>>
>>> struct statx statx;
>>> struct fsxattr fsxattr;
>>> int fd = open('/foofile', O_RDWR | O_DIRECT);
>>>
>>> ioctl(fd, FS_IOC_GETXATTR, &fsxattr);
>>> statx(fd, "", AT_EMPTY_PATH, STATX_ALL | STATX_WRITE_ATOMIC, &statx);
>>> if (statx.stx_atomic_write_unit_max < 16384) {
>>> 	bailout();
>>> }
>>
>> How could this value be >= 16384 initially? Would it be from pre-configured
>> FS alignment, like XFS RT extsize? Or is this from some special CoW-based
>> atomic write support? Or FS block size of 16384?
> 
> Sorry, this check should not be here at all, we should only check it
> later.
> 
>> Incidentally, for consistency only setting FS_XFLAG_WRITE_ATOMIC will lead
>> to FMODE_CAN_ATOMIC_WRITE being set. So until FS_XFLAG_WRITE_ATOMIC is set
>> would it make sense to have statx return 0 for STATX_WRITE_ATOMIC.
> 
> True.  We might need to report the limits even without that, though.

Could we just error the SETXATTR ioctl when FS_XFLAG_FORCEALIGN is not 
set (and it is required)? The problem is that ioctl reports -EINVAL for 
any such errors, so hard for the user to know the issue...

Thanks,
John


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-11 16:11                                   ` John Garry
@ 2024-01-11 16:15                                     ` Christoph Hellwig
  0 siblings, 0 replies; 73+ messages in thread
From: Christoph Hellwig @ 2024-01-11 16:15 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, axboe, kbusch,
	sagi, jejb, martin.petersen, viro, brauner, dchinner, jack,
	linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, bvanassche, ojaswin

On Thu, Jan 11, 2024 at 04:11:38PM +0000, John Garry wrote:
> Could we just error the SETXATTR ioctl when FS_XFLAG_FORCEALIGN is not set 
> (and it is required)? The problem is that ioctl reports -EINVAL for any 
> such errors, so hard for the user to know the issue...

Sure.  Pick a good unique error code, though.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2023-12-21 13:22                               ` Christoph Hellwig
  2023-12-21 13:56                                 ` John Garry
@ 2024-01-16 11:35                                 ` John Garry
  2024-01-17 15:02                                   ` Christoph Hellwig
  1 sibling, 1 reply; 73+ messages in thread
From: John Garry @ 2024-01-16 11:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, axboe, kbusch, sagi, jejb, martin.petersen,
	viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, jaswin, bvanassche

On 21/12/2023 13:22, Christoph Hellwig wrote:
> On Thu, Dec 21, 2023 at 01:18:33PM +0000, John Garry wrote:
>>> For SGL-capable devices that would be
>>> BIO_MAX_VECS, otherwise 1.
>> ok, but we would need to advertise that or whatever segment limit. A statx
>> field just for that seems a bit inefficient in terms of space.
> I'd rather not hard code BIO_MAX_VECS in the ABI, which suggest we
> want to export is as a field.  Network file systems also might have
> their own limits for one reason or another.

Hi Christoph,

I have been looking at this issue again and I am not sure if telling the 
user the max number of segments allowed is the best option. I’m worried 
that resultant atomic write unit max will be too small.

The background again is that we want to tell the user what the maximum 
atomic write unit size is, such that we can always guarantee to fit the 
write in a single bio. And there would be no iovec length or alignment 
rules.

The max segments value advertised would be min(queue max segments, 
BIO_MAX_VECS), so it would be 256 when the request queue is not limiting.

The worst case scenario for iovec layout (most inefficient) which the 
user could provide would be like .iov_base = 0x...0E00 and .iov_length = 
0x400, which would mean that we would have 2x pages and 2x DMA sg elems 
required for each 1024B-length iovec. I am assuming that we will still 
use the direct IO rule of LBS length and alignment.

As such, we then need to set atomic write unit max = min(queue max 
segments, BIO_MAX_VECS) * LBS. That would mean atomic write unit max 256 
* 512 = 128K (for 512B LBS). For a DMA controller of max segments 64, 
for example, then we would have 32K. These seem too low.

Alternative I'm thinking that we should just limit to 1x iovec always, 
and then atomic write unit max = (min(queue max segments, BIO_MAX_VECS) 
- 1) * PAGE_SIZE [ignoring first/last iovec contents]. It also makes 
support for non-enterprise NVMe drives more straightforward. If someone 
wants, they can introduce support for multi-iovec later, but it would 
prob require some more iovec length/alignment rules.

Please let me know your thoughts.

Thanks,
John


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-16 11:35                                 ` John Garry
@ 2024-01-17 15:02                                   ` Christoph Hellwig
  2024-01-17 16:16                                     ` John Garry
  0 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2024-01-17 15:02 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Darrick J. Wong, axboe, kbusch, sagi, jejb,
	martin.petersen, viro, brauner, dchinner, jack, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-scsi, ming.lei, jaswin, bvanassche

On Tue, Jan 16, 2024 at 11:35:47AM +0000, John Garry wrote:
> As such, we then need to set atomic write unit max = min(queue max 
> segments, BIO_MAX_VECS) * LBS. That would mean atomic write unit max 256 * 
> 512 = 128K (for 512B LBS). For a DMA controller of max segments 64, for 
> example, then we would have 32K. These seem too low.

I don't see how this would work if support multiple sectors.

>
> Alternative I'm thinking that we should just limit to 1x iovec always, and 
> then atomic write unit max = (min(queue max segments, BIO_MAX_VECS) - 1) * 
> PAGE_SIZE [ignoring first/last iovec contents]. It also makes support for 
> non-enterprise NVMe drives more straightforward. If someone wants, they can 
> introduce support for multi-iovec later, but it would prob require some 
> more iovec length/alignment rules.

Supporting just a single iovec initially is fine with me, as extending
that is pretty easy.  Just talk to your potential users that they can
live with it.

I'd probably still advertise the limits even if it currently always is 1.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 00/16] block atomic writes
  2024-01-17 15:02                                   ` Christoph Hellwig
@ 2024-01-17 16:16                                     ` John Garry
  0 siblings, 0 replies; 73+ messages in thread
From: John Garry @ 2024-01-17 16:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, axboe, kbusch, sagi, jejb, martin.petersen,
	viro, brauner, dchinner, jack, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-scsi,
	ming.lei, jaswin, bvanassche

On 17/01/2024 15:02, Christoph Hellwig wrote:
> On Tue, Jan 16, 2024 at 11:35:47AM +0000, John Garry wrote:
>> As such, we then need to set atomic write unit max = min(queue max
>> segments, BIO_MAX_VECS) * LBS. That would mean atomic write unit max 256 *
>> 512 = 128K (for 512B LBS). For a DMA controller of max segments 64, for
>> example, then we would have 32K. These seem too low.
> 
> I don't see how this would work if support multiple sectors.
> 
>>
>> Alternative I'm thinking that we should just limit to 1x iovec always, and
>> then atomic write unit max = (min(queue max segments, BIO_MAX_VECS) - 1) *
>> PAGE_SIZE [ignoring first/last iovec contents]. It also makes support for
>> non-enterprise NVMe drives more straightforward. If someone wants, they can
>> introduce support for multi-iovec later, but it would prob require some
>> more iovec length/alignment rules.
> 
> Supporting just a single iovec initially is fine with me, as extending
> that is pretty easy.  Just talk to your potential users that they can
> live with it.

Yeah, any porting I know about has just been using aio with 
IO_CMD_PWRITE, so would be ok

> 
> I'd probably still advertise the limits even if it currently always is 1.
> 

I suppose that we don't need any special rule until we support > 1, as 
we cannot break anyone already using > 1 :)

Thanks,
John

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v2 05/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support
  2023-12-12 11:08 ` [PATCH v2 05/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
  2023-12-13 13:31   ` Al Viro
@ 2024-01-22  8:29   ` John Garry
  1 sibling, 0 replies; 73+ messages in thread
From: John Garry @ 2024-01-22  8:29 UTC (permalink / raw)
  To: viro, brauner, jack
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-scsi, ming.lei, ojaswin, bvanassche,
	Prasad Singamsetty, dchinner, djwong, axboe, kbusch, hch, sagi,
	jejb, martin.petersen

On 12/12/2023 11:08, John Garry wrote:
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -301,9 +301,12 @@ typedef int __bitwise __kernel_rwf_t;
>   /* per-IO O_APPEND */
>   #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
>   
> +/* Atomic Write */
> +#define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000020)
> +

vfs maintainers, due to 
https://lore.kernel.org/linux-fsdevel/20240119191053.GT22081@brightrain.aerifal.cx/T/#t, 
I now plan to change this value to 0x40.

Thanks,
John

^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2024-01-22  8:30 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-12 11:08 [PATCH v2 00/16] block atomic writes John Garry
2023-12-12 11:08 ` [PATCH v2 01/16] block: Add atomic write operations to request_queue limits John Garry
2023-12-13  1:25   ` Ming Lei
2023-12-13  9:13     ` John Garry
2023-12-13 12:28       ` Ming Lei
2023-12-13 19:01         ` John Garry
2023-12-14  4:38         ` Martin K. Petersen
2023-12-14 13:46           ` Ming Lei
2023-12-14  4:34     ` Martin K. Petersen
2023-12-14 16:12     ` Christoph Hellwig
2023-12-12 11:08 ` [PATCH v2 02/16] block: Limit atomic writes according to bio and queue limits John Garry
2023-12-12 11:08 ` [PATCH v2 03/16] fs/bdev: Add atomic write support info to statx John Garry
2023-12-13 10:24   ` Jan Kara
2023-12-13 11:02     ` John Garry
2023-12-12 11:08 ` [PATCH v2 04/16] fs: Increase fmode_t size John Garry
2023-12-13 11:20   ` Jan Kara
2023-12-13 13:03     ` John Garry
2023-12-13 13:02   ` Christian Brauner
2023-12-13 13:15     ` John Garry
2023-12-13 16:03     ` Christoph Hellwig
2023-12-14  8:56       ` John Garry
2023-12-12 11:08 ` [PATCH v2 05/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
2023-12-13 13:31   ` Al Viro
2023-12-13 16:02     ` John Garry
2024-01-22  8:29   ` John Garry
2023-12-12 11:08 ` [PATCH v2 06/16] block: Add REQ_ATOMIC flag John Garry
2023-12-12 11:08 ` [PATCH v2 07/16] block: Pass blk_queue_get_max_sectors() a request pointer John Garry
2023-12-12 11:08 ` [PATCH v2 08/16] block: Limit atomic write IO size according to atomic_write_max_sectors John Garry
2023-12-15  2:27   ` Ming Lei
2023-12-15 13:55     ` John Garry
2023-12-12 11:08 ` [PATCH v2 09/16] block: Error an attempt to split an atomic write bio John Garry
2023-12-12 11:08 ` [PATCH v2 10/16] block: Add checks to merging of atomic writes John Garry
2023-12-12 11:08 ` [PATCH v2 11/16] block: Add fops atomic write support John Garry
2023-12-12 11:08 ` [PATCH v2 12/16] scsi: sd: Support reading atomic write properties from block limits VPD John Garry
2023-12-12 11:08 ` [PATCH v2 13/16] scsi: sd: Add WRITE_ATOMIC_16 support John Garry
2023-12-12 11:08 ` [PATCH v2 14/16] scsi: scsi_debug: Atomic write support John Garry
2023-12-12 11:08 ` [PATCH v2 15/16] nvme: Support atomic writes John Garry
2023-12-12 11:08 ` [PATCH v2 16/16] nvme: Ensure atomic writes will be executed atomically John Garry
2023-12-12 16:32 ` [PATCH v2 00/16] block atomic writes Christoph Hellwig
2023-12-13  9:32   ` John Garry
2023-12-13 15:44     ` Christoph Hellwig
2023-12-13 16:27       ` John Garry
2023-12-14 14:37         ` Christoph Hellwig
2023-12-14 15:46           ` John Garry
2023-12-18 22:50           ` Keith Busch
2023-12-19  5:14         ` Darrick J. Wong
2023-12-19  5:21           ` Christoph Hellwig
2023-12-19 12:41             ` John Garry
2023-12-19 15:17               ` Christoph Hellwig
2023-12-19 16:53                 ` John Garry
2023-12-21  6:50                   ` Christoph Hellwig
2023-12-21  9:49                     ` John Garry
2023-12-21 12:19                       ` Christoph Hellwig
2023-12-21 12:48                         ` John Garry
2023-12-21 12:57                           ` Christoph Hellwig
2023-12-21 13:18                             ` John Garry
2023-12-21 13:22                               ` Christoph Hellwig
2023-12-21 13:56                                 ` John Garry
2024-01-16 11:35                                 ` John Garry
2024-01-17 15:02                                   ` Christoph Hellwig
2024-01-17 16:16                                     ` John Garry
2024-01-09  9:55                     ` John Garry
2024-01-09 16:02                       ` Christoph Hellwig
2024-01-09 16:52                         ` John Garry
2024-01-09 23:04                       ` Dave Chinner
2024-01-10  8:55                         ` John Garry
2024-01-10  9:19                         ` Christoph Hellwig
2024-01-11  1:40                           ` Darrick J. Wong
2024-01-11  5:02                             ` Christoph Hellwig
2024-01-11  9:55                               ` John Garry
2024-01-11 14:45                                 ` Christoph Hellwig
2024-01-11 16:11                                   ` John Garry
2024-01-11 16:15                                     ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.