* [PATCH RFC 00/16] block atomic writes
@ 2023-05-03 18:38 John Garry
  2023-05-03 18:38 ` [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits John Garry
                   ` (15 more replies)
  0 siblings, 16 replies; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	John Garry

This series introduces a new proposal for implementing atomic writes in
the kernel.

This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically". I
have seen a similar flag for pwritev2() touted on the lists previously.

Only direct IO is supported, and only for block devices and XFS.

The atomic writes feature requires dedicated HW support, like the
SCSI WRITE_ATOMIC_16 command.

The goal here is to provide an interface that allows applications to use
application-specific block sizes larger than the logical block size
reported by the storage device or larger than the filesystem block size
as reported by stat().

With this new interface, application blocks will never be torn or
fractured. On a power failure, for each individual application block, all
or none of the data will be written. A racing atomic write and read will
mean that the read sees all of the old data or all of the new data, but
never a mix of old and new.

Two new fields are added to struct statx - atomic_write_unit_min and
atomic_write_unit_max. These values are always a power-of-two and
indicate the inclusive min and max block size which the userspace
application may use. The application block size must be a power-of-two.

For each individual atomic write, the total length of the write must be a
multiple of this application block size and must also be at a file offset
which is naturally aligned on that block size. Otherwise, the kernel
cannot know the application block size and what sort of splitting into
BIOs is permissible.

The kernel guarantees to write at least each individual application block
atomically. However, there is no guarantee to atomically write all data
for multiple blocks.

As an example of usage, for a 32KB application block size, userspace
may request a 64KB write at 96KB offset, which the kernel will submit
to HW as 2x 32KB individual atomic write operations.
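
To make that example concrete, below is a minimal userspace sketch. It
assumes headers carrying the definitions from this series (RWF_ATOMIC,
STATX_WRITE_ATOMIC and the new statx fields); the device path, block
sizes and buffer alignment are illustrative only:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	/* 2x 32KB application blocks; O_DIRECT needs an aligned buffer */
	static char buf[64 * 1024] __attribute__((aligned(4096)));
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	struct statx stx;
	int fd;

	fd = open("/dev/sda", O_WRONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	/* discover the inclusive min/max application block size */
	if (statx(AT_FDCWD, "/dev/sda", 0, STATX_WRITE_ATOMIC, &stx) ||
	    !(stx.stx_mask & STATX_WRITE_ATOMIC))
		return 1;
	printf("atomic write unit min %u max %u\n",
	       stx.stx_atomic_write_unit_min, stx.stx_atomic_write_unit_max);

	memset(buf, 0xab, sizeof(buf));

	/*
	 * 64KB write at 96KB offset: for a 32KB application block size
	 * the kernel submits this as 2x 32KB atomic write operations.
	 */
	if (pwritev2(fd, &iov, 1, 96 * 1024, RWF_ATOMIC) < 0)
		return 1;

	close(fd);
	return 0;
}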

Since XFS uses iomap and extents may be discontiguous, we must ensure
that extents have specific alignments to support atomic writes. For this,
we add a new experimental variant of fallocate for XFS, fallocate2, which
takes an alignment arg and should align any extents on that value. In
practice, it must be the same value as atomic_write_unit_max for the
backing block device. This allows the user to submit atomic writes which
may span multiple discontiguous extents. This does not fully work yet, as
extents may later change and any new extents will not know about this
initial alignment requirement. Another option is to use XFS realtime
volumes, which do allow alignment to be specified via the extsize arg. In
both cases, we should ensure extents are in the written state prior to
any atomic writes.

Kernel support is added for SCSI (sd.c and scsi_debug) and NVMe.

We also have QEMU NVMe support, which we hope to share in the coming days.

We are sending this as an RFC so that we can share the code prior to LSFMM.

This series is based on v6.3.

Alan Adamson (1):
  nvme: Support atomic writes

Allison Henderson (1):
  xfs: Add support for fallocate2

Himanshu Madhani (2):
  block: Add atomic write operations to request_queue limits
  block: Add REQ_ATOMIC flag

John Garry (10):
  xfs: Support atomic write for statx
  block: Limit atomic writes according to bio and queue limits
  block: Add bdev_find_max_atomic_write_alignment()
  block: Add support for atomic_write_unit
  block: Add blk_validate_atomic_write_op()
  block: Add fops atomic write support
  fs: iomap: Atomic write support
  scsi: sd: Support reading atomic properties from block limits VPD
  scsi: sd: Add WRITE_ATOMIC_16 support
  scsi: scsi_debug: Atomic write support

Prasad Singamsetty (2):
  fs/bdev: Add atomic write support info to statx
  fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support

 Documentation/ABI/stable/sysfs-block |  42 ++
 block/bdev.c                         |  60 +++
 block/bio.c                          |   7 +-
 block/blk-core.c                     |  28 ++
 block/blk-merge.c                    |  84 +++-
 block/blk-settings.c                 |  73 ++++
 block/blk-sysfs.c                    |  33 ++
 block/fops.c                         |  56 ++-
 drivers/nvme/host/core.c             |  33 ++
 drivers/scsi/scsi_debug.c            | 593 +++++++++++++++++++++------
 drivers/scsi/scsi_trace.c            |  22 +
 drivers/scsi/sd.c                    |  54 ++-
 drivers/scsi/sd.h                    |   7 +
 fs/iomap/direct-io.c                 |  72 +++-
 fs/stat.c                            |  10 +
 fs/xfs/Makefile                      |   1 +
 fs/xfs/libxfs/xfs_attr_remote.c      |   2 +-
 fs/xfs/libxfs/xfs_bmap.c             |   9 +-
 fs/xfs/libxfs/xfs_bmap.h             |   4 +-
 fs/xfs/libxfs/xfs_da_btree.c         |   4 +-
 fs/xfs/libxfs/xfs_fs.h               |   1 +
 fs/xfs/xfs_bmap_util.c               |   7 +-
 fs/xfs/xfs_bmap_util.h               |   2 +-
 fs/xfs/xfs_dquot.c                   |   2 +-
 fs/xfs/xfs_file.c                    |  19 +-
 fs/xfs/xfs_fs_staging.c              |  99 +++++
 fs/xfs/xfs_fs_staging.h              |  21 +
 fs/xfs/xfs_ioctl.c                   |   4 +
 fs/xfs/xfs_iomap.c                   |   4 +-
 fs/xfs/xfs_iops.c                    |  10 +
 fs/xfs/xfs_reflink.c                 |   4 +-
 fs/xfs/xfs_rtalloc.c                 |   2 +-
 fs/xfs/xfs_symlink.c                 |   2 +-
 include/linux/blk_types.h            |   4 +
 include/linux/blkdev.h               |  36 ++
 include/linux/fs.h                   |   1 +
 include/linux/stat.h                 |   2 +
 include/scsi/scsi_proto.h            |   1 +
 include/uapi/linux/fs.h              |   5 +-
 include/uapi/linux/stat.h            |   7 +-
 security/security.c                  |   1 +
 tools/include/uapi/linux/fs.h        |   5 +-
 42 files changed, 1257 insertions(+), 176 deletions(-)
 create mode 100644 fs/xfs/xfs_fs_staging.c
 create mode 100644 fs/xfs/xfs_fs_staging.h

-- 
2.31.1



* [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 21:39   ` Dave Chinner
  2023-05-09  0:19   ` Mike Snitzer
  2023-05-03 18:38 ` [PATCH RFC 02/16] fs/bdev: Add atomic write support info to statx John Garry
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	Himanshu Madhani, John Garry

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Add the following limits (a driver usage sketch follows the list):
- atomic_write_boundary
- atomic_write_max_bytes
- atomic_write_unit_max
- atomic_write_unit_min
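
As a rough sketch of how a low-level driver might advertise these limits
at probe time (the function name and values are illustrative only; note
that the unit_min/unit_max setters take sectors, while the other two
setters take bytes):

static void foo_setup_atomic_writes(struct request_queue *q)
{
	/* can atomically write any suitable power-of-two size 4K-64K */
	blk_queue_atomic_write_unit_min(q, 4096 >> SECTOR_SHIFT);
	blk_queue_atomic_write_unit_max(q, 65536 >> SECTOR_SHIFT);
	/* no single atomic write operation may exceed 128K... */
	blk_queue_atomic_write_max_bytes(q, 128 * 1024);
	/* ...nor straddle a 128K boundary in the LBA space */
	blk_queue_atomic_write_boundary(q, 128 * 1024);
}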

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 Documentation/ABI/stable/sysfs-block | 42 +++++++++++++++++++++
 block/blk-settings.c                 | 56 ++++++++++++++++++++++++++++
 block/blk-sysfs.c                    | 33 ++++++++++++++++
 include/linux/blkdev.h               | 23 ++++++++++++
 4 files changed, 154 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 282de3680367..f3ed9890e03b 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -21,6 +21,48 @@ Description:
 		device is offset from the internal allocation unit's
 		natural alignment.
 
+What:		/sys/block/<disk>/atomic_write_max_bytes
+Date:		May 2023
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] This parameter specifies the maximum atomic write
+		size reported by the device. An atomic write operation
+		must not exceed this number of bytes.
+
+
+What:		/sys/block/<disk>/atomic_write_unit_min
+Date:		May 2023
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] This parameter specifies the smallest block which can
+		be written atomically with an atomic write operation. All
+		atomic write operations must begin at an
+		atomic_write_unit_min boundary and must be multiples of
+		atomic_write_unit_min. This value must be a power-of-two.
+
+
+What:		/sys/block/<disk>/atomic_write_unit_max
+Date:		May 2023
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] This parameter defines the largest block which can be
+		written atomically with an atomic write operation. This
+		value must be a multiple of atomic_write_unit_min and must
+		be a power-of-two.
+
+
+What:		/sys/block/<disk>/atomic_write_boundary
+Date:		May 2023
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] A device may need to internally split I/Os which
+		straddle a given logical block address boundary. In that
+		case a single atomic write operation will be processed as
+		one or more sub-operations which each complete atomically.
+		This parameter specifies the size in bytes of the atomic
+		boundary if one is reported by the device. This value must
+		be a power-of-two.
+
 
 What:		/sys/block/<disk>/diskseq
 Date:		February 2021
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 896b4654ab00..e21731715a12 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,9 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->zoned = BLK_ZONED_NONE;
 	lim->zone_write_granularity = 0;
 	lim->dma_alignment = 511;
+	lim->atomic_write_unit_min = lim->atomic_write_unit_max = 1;
+	lim->atomic_write_max_bytes = 512;
+	lim->atomic_write_boundary = 0;
 }
 
 /**
@@ -183,6 +186,59 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/**
+ * blk_queue_atomic_write_max_bytes - set max bytes supported by
+ * the device for atomic write operations.
+ * @q:  the request queue for the device
+ * @size: maximum bytes supported
+ */
+void blk_queue_atomic_write_max_bytes(struct request_queue *q,
+				      unsigned int size)
+{
+	q->limits.atomic_write_max_bytes = size;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
+
+/**
+ * blk_queue_atomic_write_boundary - boundary in the logical block address
+ * space of the device which an atomic write should not cross.
+ * @q:  the request queue for the device
+ * @size: size in bytes. Must be a power-of-two.
+ */
+void blk_queue_atomic_write_boundary(struct request_queue *q,
+				     unsigned int size)
+{
+	q->limits.atomic_write_boundary = size;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_boundary);
+
+/**
+ * blk_queue_atomic_write_unit_min - smallest unit that can be written
+ *				     atomically to the device.
+ * @q:  the request queue for the device
+ * @sectors: must be a power-of-two.
+ */
+void blk_queue_atomic_write_unit_min(struct request_queue *q,
+				     unsigned int sectors)
+{
+	q->limits.atomic_write_unit_min = sectors;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_unit_min);
+
+/**
+ * blk_queue_atomic_write_unit_max - largest unit that can be written
+ * atomically to the device.
+ * @q: the request queue for the device
+ * @sectors: must be a power-of-two.
+ */
+void blk_queue_atomic_write_unit_max(struct request_queue *q,
+				     unsigned int sectors)
+{
+	struct queue_limits *limits = &q->limits;
+	limits->atomic_write_unit_max = sectors;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_unit_max);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f1fce1c7fa44..1025beff2281 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -132,6 +132,30 @@ static ssize_t queue_max_discard_segments_show(struct request_queue *q,
 	return queue_var_show(queue_max_discard_segments(q), page);
 }
 
+static ssize_t queue_atomic_write_max_bytes_show(struct request_queue *q,
+						char *page)
+{
+	return queue_var_show(q->limits.atomic_write_max_bytes, page);
+}
+
+static ssize_t queue_atomic_write_boundary_show(struct request_queue *q,
+						char *page)
+{
+	return queue_var_show(q->limits.atomic_write_boundary, page);
+}
+
+static ssize_t queue_atomic_write_unit_min_show(struct request_queue *q,
+						char *page)
+{
+	return queue_var_show(queue_atomic_write_unit_min(q), page);
+}
+
+static ssize_t queue_atomic_write_unit_max_show(struct request_queue *q,
+						char *page)
+{
+	return queue_var_show(queue_atomic_write_unit_max(q), page);
+}
+
 static ssize_t queue_max_integrity_segments_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(q->limits.max_integrity_segments, page);
@@ -604,6 +628,11 @@ QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes");
 QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes");
 QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
 
+QUEUE_RO_ENTRY(queue_atomic_write_max_bytes, "atomic_write_max_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_boundary, "atomic_write_boundary");
+QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max");
+QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min");
+
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes");
 QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");
@@ -661,6 +690,10 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_max_entry.attr,
 	&queue_discard_max_hw_entry.attr,
 	&queue_discard_zeroes_data_entry.attr,
+	&queue_atomic_write_max_bytes_entry.attr,
+	&queue_atomic_write_boundary_entry.attr,
+	&queue_atomic_write_unit_min_entry.attr,
+	&queue_atomic_write_unit_max_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
 	&queue_zone_append_max_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 941304f17492..6b6f2992338c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -304,6 +304,11 @@ struct queue_limits {
 	unsigned int		discard_alignment;
 	unsigned int		zone_write_granularity;
 
+	unsigned int		atomic_write_boundary;
+	unsigned int		atomic_write_max_bytes;
+	unsigned int		atomic_write_unit_min;
+	unsigned int		atomic_write_unit_max;
+
 	unsigned short		max_segments;
 	unsigned short		max_integrity_segments;
 	unsigned short		max_discard_segments;
@@ -929,6 +934,14 @@ void blk_queue_zone_write_granularity(struct request_queue *q,
 				      unsigned int size);
 extern void blk_queue_alignment_offset(struct request_queue *q,
 				       unsigned int alignment);
+extern void blk_queue_atomic_write_max_bytes(struct request_queue *q,
+					     unsigned int size);
+extern void blk_queue_atomic_write_unit_max(struct request_queue *q,
+					    unsigned int sectors);
+extern void blk_queue_atomic_write_unit_min(struct request_queue *q,
+					    unsigned int sectors);
+extern void blk_queue_atomic_write_boundary(struct request_queue *q,
+					    unsigned int size);
 void disk_update_readahead(struct gendisk *disk);
 extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
 extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
@@ -1331,6 +1344,16 @@ static inline int queue_dma_alignment(const struct request_queue *q)
 	return q ? q->limits.dma_alignment : 511;
 }
 
+static inline unsigned int queue_atomic_write_unit_max(const struct request_queue *q)
+{
+	return q->limits.atomic_write_unit_max << SECTOR_SHIFT;
+}
+
+static inline unsigned int queue_atomic_write_unit_min(const struct request_queue *q)
+{
+	return q->limits.atomic_write_unit_min << SECTOR_SHIFT;
+}
+
 static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
 {
 	return queue_dma_alignment(bdev_get_queue(bdev));
-- 
2.31.1



* [PATCH RFC 02/16] fs/bdev: Add atomic write support info to statx
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
  2023-05-03 18:38 ` [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 21:58   ` Dave Chinner
  2023-05-03 18:38 ` [PATCH RFC 03/16] xfs: Support atomic write for statx John Garry
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	Prasad Singamsetty, John Garry

From: Prasad Singamsetty <prasad.singamsetty@oracle.com>

Extend statx system call to return additional info for atomic write
support if the specified file is a block device.

Add initial support for a block device.

Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/bdev.c              | 21 +++++++++++++++++++++
 fs/stat.c                 | 10 ++++++++++
 include/linux/blkdev.h    |  4 ++++
 include/linux/stat.h      |  2 ++
 include/uapi/linux/stat.h |  7 ++++++-
 5 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/block/bdev.c b/block/bdev.c
index 1795c7d4b99e..6a5fd5abaadc 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1014,3 +1014,24 @@ void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
 
 	blkdev_put_no_open(bdev);
 }
+
+/*
+ * Handle statx for block devices to get properties of WRITE ATOMIC
+ * feature support.
+ */
+void bdev_statx_atomic(struct inode *inode, struct kstat *stat)
+{
+	struct block_device *bdev;
+
+	bdev = blkdev_get_no_open(inode->i_rdev);
+	if (!bdev)
+		return;
+
+	stat->atomic_write_unit_min = queue_atomic_write_unit_min(bdev->bd_queue);
+	stat->atomic_write_unit_max = queue_atomic_write_unit_max(bdev->bd_queue);
+	stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
+	stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
+	stat->result_mask |= STATX_WRITE_ATOMIC;
+
+	blkdev_put_no_open(bdev);
+}
diff --git a/fs/stat.c b/fs/stat.c
index 7c238da22ef0..d20334a0e9ae 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -256,6 +256,14 @@ static int vfs_statx(int dfd, struct filename *filename, int flags,
 			bdev_statx_dioalign(inode, stat);
 	}
 
+	/* Handle STATX_WRITE_ATOMIC for block devices */
+	if (request_mask & STATX_WRITE_ATOMIC) {
+		struct inode *inode = d_backing_inode(path.dentry);
+
+		if (S_ISBLK(inode->i_mode))
+			bdev_statx_atomic(inode, stat);
+	}
+
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
@@ -636,6 +644,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_mnt_id = stat->mnt_id;
 	tmp.stx_dio_mem_align = stat->dio_mem_align;
 	tmp.stx_dio_offset_align = stat->dio_offset_align;
+	tmp.stx_atomic_write_unit_min = stat->atomic_write_unit_min;
+	tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6b6f2992338c..19d33b2897b2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1527,6 +1527,7 @@ int sync_blockdev_range(struct block_device *bdev, loff_t lstart, loff_t lend);
 int sync_blockdev_nowait(struct block_device *bdev);
 void sync_bdevs(bool wait);
 void bdev_statx_dioalign(struct inode *inode, struct kstat *stat);
+void bdev_statx_atomic(struct inode *inode, struct kstat *stat);
 void printk_all_partitions(void);
 #else
 static inline void invalidate_bdev(struct block_device *bdev)
@@ -1546,6 +1547,9 @@ static inline void sync_bdevs(bool wait)
 static inline void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
 {
 }
+static inline void bdev_statx_atomic(struct inode *inode, struct kstat *stat)
+{
+}
 static inline void printk_all_partitions(void)
 {
 }
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 52150570d37a..dfa69ecfaacf 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -53,6 +53,8 @@ struct kstat {
 	u32		dio_mem_align;
 	u32		dio_offset_align;
 	u64		change_cookie;
+	u32		atomic_write_unit_max;
+	u32		atomic_write_unit_min;
 };
 
 /* These definitions are internal to the kernel for now. Mainly used by nfsd. */
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7cab2c65d3d7..c99d7cac2aa6 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -127,7 +127,10 @@ struct statx {
 	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
 	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
 	/* 0xa0 */
-	__u64	__spare3[12];	/* Spare space for future expansion */
+	__u32	stx_atomic_write_unit_max;
+	__u32	stx_atomic_write_unit_min;
+	/* 0xb0 */
+	__u64	__spare3[11];	/* Spare space for future expansion */
 	/* 0x100 */
 };
 
@@ -154,6 +157,7 @@ struct statx {
 #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
 #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
 #define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
+#define STATX_WRITE_ATOMIC	0x00004000U	/* Want/got atomic_write_* fields */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
@@ -189,6 +193,7 @@ struct statx {
 #define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
 #define STATX_ATTR_DAX			0x00200000 /* File is currently in DAX state */
+#define STATX_ATTR_WRITE_ATOMIC		0x00400000 /* File supports atomic write operations */
 
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.31.1



* [PATCH RFC 03/16] xfs: Support atomic write for statx
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
  2023-05-03 18:38 ` [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits John Garry
  2023-05-03 18:38 ` [PATCH RFC 02/16] fs/bdev: Add atomic write support info to statx John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 22:17   ` Dave Chinner
  2023-05-03 18:38 ` [PATCH RFC 04/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	John Garry

Support providing info on atomic write unit min and max.

Darrick Wong originally authored this change.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iops.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 24718adb3c16..e542077704aa 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -614,6 +614,16 @@ xfs_vn_getattr(
 			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
 			stat->dio_offset_align = bdev_logical_block_size(bdev);
 		}
+		if (request_mask & STATX_WRITE_ATOMIC) {
+			struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+			struct block_device	*bdev = target->bt_bdev;
+
+			stat->atomic_write_unit_min = queue_atomic_write_unit_min(bdev->bd_queue);
+			stat->atomic_write_unit_max = queue_atomic_write_unit_max(bdev->bd_queue);
+			stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
+			stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
+			stat->result_mask |= STATX_WRITE_ATOMIC;
+		}
 		fallthrough;
 	default:
 		stat->blksize = xfs_stat_blksize(ip);
-- 
2.31.1



* [PATCH RFC 04/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (2 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 03/16] xfs: Support atomic write for statx John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 18:38 ` [PATCH RFC 05/16] block: Add REQ_ATOMIC flag John Garry
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	Prasad Singamsetty, John Garry

From: Prasad Singamsetty <prasad.singamsetty@oracle.com>

Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
write is to be issued atomically, according to special alignment and
length rules.

For any syscall interface utilizing struct iocb, add IOCB_ATOMIC to the
iocb->ki_flags field to indicate the same.

A call to statx will give the relevant atomic write info:
- atomic_write_unit_min
- atomic_write_unit_max

Both values are a power-of-2.

Applications can avail of the atomic write feature by ensuring that their
data blocks are a power-of-2 in size and also sized between
atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
must also ensure that data blocks are naturally aligned. If these rules
are followed then the kernel will guarantee to write each data block
atomically.

Not following these rules means that there is no guarantee that data
will be written atomically.
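
For illustration, a hypothetical userspace helper (not part of this
series) which checks these rules before issuing a pwritev2(...,
RWF_ATOMIC) call might look like this, with unit_min/unit_max taken from
statx:

#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

static bool atomic_write_ok(off_t pos, size_t len, size_t blksz,
			    unsigned int unit_min, unsigned int unit_max)
{
	/* the application block size must be a power-of-2 in [min, max] */
	if (blksz == 0 || (blksz & (blksz - 1)))
		return false;
	if (blksz < unit_min || blksz > unit_max)
		return false;
	/* total length a multiple of blksz, offset naturally aligned */
	return !(len % blksz) && !(pos % blksz);
}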

Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/linux/fs.h            | 1 +
 include/uapi/linux/fs.h       | 5 ++++-
 tools/include/uapi/linux/fs.h | 5 ++++-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..5bace817c041 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -329,6 +329,7 @@ enum rw_hint {
 #define IOCB_SYNC		(__force int) RWF_SYNC
 #define IOCB_NOWAIT		(__force int) RWF_NOWAIT
 #define IOCB_APPEND		(__force int) RWF_APPEND
+#define IOCB_ATOMIC		(__force int) RWF_ATOMIC
 
 /* non-RWF related bits - start at 16 */
 #define IOCB_EVENTFD		(1 << 16)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..e3b4f5bc6860 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -301,8 +301,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* Atomic Write */
+#define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000020)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_ATOMIC)
 
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index b7b56871029c..e3b4f5bc6860 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -301,8 +301,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* Atomic Write */
+#define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000020)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_ATOMIC)
 
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.31.1



* [PATCH RFC 05/16] block: Add REQ_ATOMIC flag
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (3 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 04/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 18:38 ` [PATCH RFC 06/16] block: Limit atomic writes according to bio and queue limits John Garry
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	Himanshu Madhani, John Garry

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Add flag REQ_ATOMIC, meaning an atomic operation. This should only be
used in conjunction with REQ_OP_WRITE.

We will not add a special "request atomic write" operation, so as to
avoid maintenance effort for an operation which is almost the same as
REQ_OP_WRITE.

This flag was originally proposed by Chris Mason for an atomic writes
proposal some time ago.

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/linux/blk_types.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 99be590f952f..347b52e00322 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -417,6 +417,7 @@ enum req_flag_bits {
 	__REQ_SWAP,		/* swap I/O */
 	__REQ_DRV,		/* for driver use */
 
+	__REQ_ATOMIC,		/* for atomic write operations */
 	/*
 	 * Command specific flags, keep last:
 	 */
@@ -444,6 +445,7 @@ enum req_flag_bits {
 #define REQ_BACKGROUND	(__force blk_opf_t)(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT	(__force blk_opf_t)(1ULL << __REQ_NOWAIT)
 #define REQ_CGROUP_PUNT	(__force blk_opf_t)(1ULL << __REQ_CGROUP_PUNT)
+#define REQ_ATOMIC	(__force blk_opf_t)(1ULL << __REQ_ATOMIC)
 
 #define REQ_NOUNMAP	(__force blk_opf_t)(1ULL << __REQ_NOUNMAP)
 #define REQ_POLLED	(__force blk_opf_t)(1ULL << __REQ_POLLED)
-- 
2.31.1



* [PATCH RFC 06/16] block: Limit atomic writes according to bio and queue limits
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (4 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 05/16] block: Add REQ_ATOMIC flag John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 18:53   ` Keith Busch
  2023-05-03 18:38 ` [PATCH RFC 07/16] block: Add bdev_find_max_atomic_write_alignment() John Garry
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	John Garry

We rely on the block layer always being able to send a bio of size
atomic_write_unit_max without being required to split it due to request
queue or other bio limits. We already know that any bio should have an
alignment of atomic_write_unit_max or lower.

A bio may contain min(BIO_MAX_VECS, limits->max_segments) vectors, and
each vector covers at least PAGE_SIZE, except that the start address may
not be PAGE-aligned; for this case, subtract 1 to give the max guaranteed
count of pages which we may store in a bio, and limit both
atomic_write_unit_min and atomic_write_unit_max to this value.
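
As a worked example, assuming BIO_MAX_VECS = 256, max_segments >= 256 and
4K pages: the guaranteed size is (256 - 1) * 4K = 1020K (2040 sectors),
which rounddown_pow_of_two() reduces to 1024 sectors = 512K, so both unit
limits would be capped at 512K.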

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-settings.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index e21731715a12..f64a2f736cb8 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -212,6 +212,18 @@ void blk_queue_atomic_write_boundary(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_atomic_write_boundary);
 
+static unsigned int blk_queue_max_guaranteed_bio_size(struct queue_limits *limits)
+{
+	unsigned int max_segments = limits->max_segments;
+	unsigned int atomic_write_max_segments =
+				min(BIO_MAX_VECS, max_segments);
+	/* subtract 1 to assume PAGE-misaligned IOV start address */
+	unsigned int size = (atomic_write_max_segments - 1) *
+				(PAGE_SIZE / SECTOR_SIZE);
+
+	return rounddown_pow_of_two(size);
+}
+
 /**
  * blk_queue_atomic_write_unit_min - smallest unit that can be written
  *				     atomically to the device.
@@ -221,7 +233,10 @@ EXPORT_SYMBOL(blk_queue_atomic_write_boundary);
 void blk_queue_atomic_write_unit_min(struct request_queue *q,
 				     unsigned int sectors)
 {
-	q->limits.atomic_write_unit_min = sectors;
+	struct queue_limits *limits = &q->limits;
+	unsigned int guaranteed = blk_queue_max_guaranteed_bio_size(limits);
+
+	limits->atomic_write_unit_min = min(guaranteed, sectors);
 }
 EXPORT_SYMBOL(blk_queue_atomic_write_unit_min);
 
@@ -234,8 +249,10 @@ EXPORT_SYMBOL(blk_queue_atomic_write_unit_min);
 void blk_queue_atomic_write_unit_max(struct request_queue *q,
 				     unsigned int sectors)
 {
-	struct queue_limits *limits = &q->limits;
-	limits->atomic_write_unit_max = sectors;
+	struct queue_limits *limits = &q->limits;
+	unsigned int guaranteed = blk_queue_max_guaranteed_bio_size(limits);
+
+	limits->atomic_write_unit_max = min(guaranteed, sectors);
 }
 EXPORT_SYMBOL(blk_queue_atomic_write_unit_max);
 
-- 
2.31.1



* [PATCH RFC 07/16] block: Add bdev_find_max_atomic_write_alignment()
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (5 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 06/16] block: Limit atomic writes according to bio and queue limits John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 18:38 ` [PATCH RFC 08/16] block: Add support for atomic_write_unit John Garry
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	John Garry

Add a function to find the max alignment of an atomic write for a bdev
when provided with an offset and length.

We should be able to optimise this function later, most especially since
the values involved are powers-of-2.
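
As a worked example of the current implementation: for pos = 96KB and
len = 64KB with atomic_write_unit_max = 64KB, an alignment of 64KB is
rejected since 96KB is not 64KB-aligned, so the function returns 32KB,
the largest power-of-two which divides both the offset and the length.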

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/bdev.c           | 39 +++++++++++++++++++++++++++++++++++++++
 include/linux/blkdev.h |  9 +++++++++
 2 files changed, 48 insertions(+)

diff --git a/block/bdev.c b/block/bdev.c
index 6a5fd5abaadc..3373f0d5cad9 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -46,6 +46,45 @@ struct block_device *I_BDEV(struct inode *inode)
 }
 EXPORT_SYMBOL(I_BDEV);
 
+unsigned int bdev_find_max_atomic_write_alignment(struct block_device *bdev,
+					loff_t pos, unsigned int len)
+{
+	struct request_queue *bd_queue = bdev->bd_queue;
+	struct queue_limits *limits = &bd_queue->limits;
+	unsigned int atomic_write_unit_min = limits->atomic_write_unit_min;
+	unsigned int atomic_write_unit_max = limits->atomic_write_unit_max;
+	unsigned int max_align;
+
+	pos /= SECTOR_SIZE;
+	len /= SECTOR_SIZE;
+
+	max_align = min_not_zero(len, atomic_write_unit_max);
+
+	if (len <= 1)
+		return atomic_write_unit_min * SECTOR_SIZE;
+
+	max_align = rounddown_pow_of_two(max_align);
+	while (1) {
+		unsigned int mod1, mod2;
+
+		if (max_align == 0)
+			return atomic_write_unit_min * SECTOR_SIZE;
+
+		/* This should not happen */
+		if (!is_power_of_2(max_align))
+			goto end;
+
+		mod1 = len % max_align;
+		mod2 = pos % max_align;
+		if (!mod1 && !mod2)
+			break;
+end:
+		max_align /= 2;
+	}
+
+	return max_align * SECTOR_SIZE;
+}
+
 static void bdev_write_inode(struct block_device *bdev)
 {
 	struct inode *inode = bdev->bd_inode;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 19d33b2897b2..96138550928c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1529,6 +1529,8 @@ void sync_bdevs(bool wait);
 void bdev_statx_dioalign(struct inode *inode, struct kstat *stat);
 void bdev_statx_atomic(struct inode *inode, struct kstat *stat);
 void printk_all_partitions(void);
+unsigned int bdev_find_max_atomic_write_alignment(struct block_device *bdev,
+				loff_t pos, unsigned int len);
 #else
 static inline void invalidate_bdev(struct block_device *bdev)
 {
@@ -1553,6 +1555,13 @@ static inline void bdev_statx_atomic(struct inode *inode, struct kstat *stat)
 static inline void printk_all_partitions(void)
 {
 }
+static inline unsigned int bdev_find_max_atomic_write_alignment(
+				struct block_device *bdev,
+				loff_t pos, unsigned int len)
+{
+	return 0;
+}
+
 #endif /* CONFIG_BLOCK */
 
 int fsync_bdev(struct block_device *bdev);
-- 
2.31.1



* [PATCH RFC 08/16] block: Add support for atomic_write_unit
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (6 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 07/16] block: Add bdev_find_max_atomic_write_alignment() John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 18:38 ` [PATCH RFC 09/16] block: Add blk_validate_atomic_write_op() John Garry
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	John Garry

Add bio.atomic_write_unit, which is the minimum size to which we may
split a bio. Any bio needs to be split into a multiple of this size and
also be aligned to this size.

In __bio_iov_iter_get_pages(), use atomic_write_unit to trim a bio to
be a multiple of atomic_write_unit.

In bio_split_rw(), we need to consider splitting as follows:
- For a regular split which does not cross an atomic write boundary, same
  as in __bio_iov_iter_get_pages(), trim to be a multiple of
  atomic_write_unit
- We also need to check for when a bio straddles an atomic write boundary.
  In this case, split to be start/end-aligned with the boundary.

We need to ignore lim->max_sectors since it may be less than
bio->atomic_write_unit, which we cannot tolerate.
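
As an example of the boundary handling, assume atomic_write_boundary =
128K and a 32K bio starting at offset 112K: it straddles the boundary at
128K, so it is split there into two 16K bios, the first ending and the
second starting boundary-aligned.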

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/bio.c               |  7 +++-
 block/blk-merge.c         | 84 ++++++++++++++++++++++++++++++++++-----
 include/linux/blk_types.h |  2 +
 3 files changed, 81 insertions(+), 12 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index fd11614bba4d..fc2f29e1c14c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -247,6 +247,7 @@ void bio_init(struct bio *bio, struct block_device *bdev, struct bio_vec *table,
 	      unsigned short max_vecs, blk_opf_t opf)
 {
 	bio->bi_next = NULL;
+	bio->atomic_write_unit = 0;
 	bio->bi_bdev = bdev;
 	bio->bi_opf = opf;
 	bio->bi_flags = 0;
@@ -815,6 +816,7 @@ static int __bio_clone(struct bio *bio, struct bio *bio_src, gfp_t gfp)
 	bio->bi_ioprio = bio_src->bi_ioprio;
 	bio->bi_iter = bio_src->bi_iter;
 
+	bio->atomic_write_unit = bio_src->atomic_write_unit;
 	if (bio->bi_bdev) {
 		if (bio->bi_bdev == bio_src->bi_bdev &&
 		    bio_flagged(bio_src, BIO_REMAPPED))
@@ -1273,7 +1275,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
-	trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
+	if (bio->atomic_write_unit)
+		trim = size & (bio->atomic_write_unit - 1);
+	else
+		trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
 	iov_iter_revert(iter, trim);
 
 	size -= trim;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6460abdb2426..95ab6b644955 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -171,7 +171,17 @@ static inline unsigned get_max_io_size(struct bio *bio,
 {
 	unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
 	unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
-	unsigned max_sectors = lim->max_sectors, start, end;
+	unsigned max_sectors, start, end;
+
+	/*
+	 * We ignore lim->max_sectors for atomic writes simply because
+	 * it may be less than bio->atomic_write_unit, which we cannot
+	 * tolerate.
+	 */
+	if (bio->bi_opf & REQ_ATOMIC)
+		max_sectors = lim->atomic_write_max_bytes >> SECTOR_SHIFT;
+	else
+		max_sectors = lim->max_sectors;
 
 	if (lim->chunk_sectors) {
 		max_sectors = min(max_sectors,
@@ -256,6 +266,22 @@ static bool bvec_split_segs(const struct queue_limits *lim,
 	return len > 0 || bv->bv_len > max_len;
 }
 
+static bool bio_straddles_boundary(struct bio *bio, unsigned int bytes,
+				   unsigned int boundary)
+{
+	loff_t start = bio->bi_iter.bi_sector << SECTOR_SHIFT;
+	loff_t end = start + bytes;
+	loff_t start_mod = start % boundary;
+	loff_t end_mod = end % boundary;
+
+	if (end - start > boundary)
+		return true;
+	if ((start_mod > end_mod) && (start_mod && end_mod))
+		return true;
+
+	return false;
+}
+
 /**
  * bio_split_rw - split a bio in two bios
  * @bio:  [in] bio to be split
@@ -276,10 +302,15 @@ static bool bvec_split_segs(const struct queue_limits *lim,
  * responsible for ensuring that @bs is only destroyed after processing of the
  * split bio has finished.
  */
 struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
 		unsigned *segs, struct bio_set *bs, unsigned max_bytes)
 {
+	unsigned int atomic_write_boundary = lim->atomic_write_boundary;
+	bool atomic_write = bio->bi_opf & REQ_ATOMIC;
 	struct bio_vec bv, bvprv, *bvprvp = NULL;
+	bool straddles_boundary = false;
 	struct bvec_iter iter;
 	unsigned nsegs = 0, bytes = 0;
 
@@ -291,14 +322,31 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
 		if (bvprvp && bvec_gap_to_prev(lim, bvprvp, bv.bv_offset))
 			goto split;
 
+		if (atomic_write && atomic_write_boundary) {
+			straddles_boundary = bio_straddles_boundary(bio,
+					bytes + bv.bv_len, atomic_write_boundary);
+		}
 		if (nsegs < lim->max_segments &&
 		    bytes + bv.bv_len <= max_bytes &&
-		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
+		    bv.bv_offset + bv.bv_len <= PAGE_SIZE &&
+		    !straddles_boundary) {
 			nsegs++;
 			bytes += bv.bv_len;
 		} else {
-			if (bvec_split_segs(lim, &bv, &nsegs, &bytes,
-					lim->max_segments, max_bytes))
+			bool split_the_segs =
+				bvec_split_segs(lim, &bv, &nsegs, &bytes,
+						lim->max_segments, max_bytes);
+
+			/*
+			 * We may not actually straddle the boundary as we may
+			 * have added fewer bytes than anticipated.
+			 */
+			if (straddles_boundary) {
+				straddles_boundary = bio_straddles_boundary(bio,
+						bytes, atomic_write_boundary);
+			}
+
+			if (split_the_segs || straddles_boundary)
 				goto split;
 		}
 
@@ -321,12 +369,25 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
 
 	*segs = nsegs;
 
-	/*
-	 * Individual bvecs might not be logical block aligned. Round down the
-	 * split size so that each bio is properly block size aligned, even if
-	 * we do not use the full hardware limits.
-	 */
-	bytes = ALIGN_DOWN(bytes, lim->logical_block_size);
+	if (straddles_boundary) {
+		loff_t new_end = (bio->bi_iter.bi_sector << SECTOR_SHIFT) + bytes;
+		unsigned int trim = new_end & (atomic_write_boundary - 1);
+
+		bytes -= trim;
+		new_end = (bio->bi_iter.bi_sector << SECTOR_SHIFT) + bytes;
+		BUG_ON(new_end % atomic_write_boundary);
+	} else if (bio->atomic_write_unit) {
+		unsigned int atomic_write_unit = bio->atomic_write_unit;
+		unsigned int trim = bytes % atomic_write_unit;
+
+		bytes -= trim;
+	} else {
+		/*
+		 * Individual bvecs might not be logical block aligned. Round down the
+		 * split size so that each bio is properly block size aligned, even if
+		 * we do not use the full hardware limits.
+		 */
+		bytes = ALIGN_DOWN(bytes, lim->logical_block_size);
+	}
 
 	/*
 	 * Bio splitting may cause subtle trouble such as hang when doing sync
@@ -355,7 +416,8 @@ struct bio *__bio_split_to_limits(struct bio *bio,
 				  const struct queue_limits *lim,
 				  unsigned int *nr_segs)
 {
-	struct bio_set *bs = &bio->bi_bdev->bd_disk->bio_split;
+	struct block_device *bi_bdev = bio->bi_bdev;
+	struct bio_set *bs = &bi_bdev->bd_disk->bio_split;
 	struct bio *split;
 
 	switch (bio_op(bio)) {
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 347b52e00322..daa44eac9f14 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -303,6 +303,8 @@ struct bio {
 
 	struct bio_set		*bi_pool;
 
+	unsigned int		atomic_write_unit;
+
 	/*
 	 * We can inline a number of vecs at the end of the bio, to avoid
 	 * double allocations for a small number of bio_vecs. This member
-- 
2.31.1



* [PATCH RFC 09/16] block: Add blk_validate_atomic_write_op()
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (7 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 08/16] block: Add support for atomic_write_unit John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 18:38 ` [PATCH RFC 10/16] block: Add fops atomic write support John Garry
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	John Garry

Validate that an atomic write bio's size is a multiple of both
bio->atomic_write_unit and atomic_write_unit_min, and that its start
sector is aligned to atomic_write_unit_min.

Also set REQ_NOMERGE - we do not support merging of atomic writes yet.
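
As an example, with atomic_write_unit_min of 4K (8 sectors) and
bio->atomic_write_unit = 4K, a 12K write starting at sector 8 passes all
three checks, while an 8K write starting at sector 4 fails the sector
alignment check and completes with BLK_STS_IOERR.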

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-core.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 42926e6cb83c..91abf8cc2b62 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -591,6 +591,27 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 	return BLK_STS_OK;
 }
 
+static blk_status_t blk_validate_atomic_write_op(struct request_queue *q,
+						 struct bio *bio)
+{
+	struct queue_limits *limits = &q->limits;
+
+	if (bio->bi_iter.bi_size % bio->atomic_write_unit)
+		return BLK_STS_IOERR;
+
+	if ((bio->bi_iter.bi_size >> SECTOR_SHIFT) %
+	    limits->atomic_write_unit_min)
+		return BLK_STS_IOERR;
+
+	if (bio->bi_iter.bi_sector % limits->atomic_write_unit_min)
+		return BLK_STS_IOERR;
+
+	/* No support to merge yet, so disable */
+	bio->bi_opf |= REQ_NOMERGE;
+
+	return BLK_STS_OK;
+}
+
 static void __submit_bio(struct bio *bio)
 {
 	struct gendisk *disk = bio->bi_bdev->bd_disk;
@@ -770,6 +791,13 @@ void submit_bio_noacct(struct bio *bio)
 		bio_clear_polled(bio);
 
 	switch (bio_op(bio)) {
+	case REQ_OP_WRITE:
+		if (bio->bi_opf & REQ_ATOMIC) {
+			status = blk_validate_atomic_write_op(q, bio);
+			if (status != BLK_STS_OK)
+				goto end_io;
+		}
+		break;
 	case REQ_OP_DISCARD:
 		if (!bdev_max_discard_sectors(bdev))
 			goto not_supported;
-- 
2.31.1



* [PATCH RFC 10/16] block: Add fops atomic write support
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (8 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 09/16] block: Add blk_validate_atomic_write_op() John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 18:38 ` [PATCH RFC 11/16] fs: iomap: Atomic " John Garry
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	John Garry

Add support to set bio->atomic_write_unit and the REQ_ATOMIC flag for
block device direct IO.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/fops.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 51 insertions(+), 5 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index d2e6be4e3d1c..9a2c595cd93d 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -43,8 +43,17 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
 }
 
 static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
-			      struct iov_iter *iter)
+			      struct iov_iter *iter, bool atomic_write)
 {
+	if (atomic_write) {
+		unsigned int atomic_write_unit_min =
+			queue_atomic_write_unit_min(bdev_get_queue(bdev));
+		if (pos % atomic_write_unit_min)
+			return false;
+		if (iov_iter_count(iter) % atomic_write_unit_min)
+			return false;
+	}
+
 	return pos & (bdev_logical_block_size(bdev) - 1) ||
 		!bdev_iter_is_aligned(bdev, iter);
 }
@@ -56,12 +65,21 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 {
 	struct block_device *bdev = iocb->ki_filp->private_data;
 	struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs;
+	bool is_read = iov_iter_rw(iter) == READ;
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
+	unsigned int max_align_bytes = 0;
 	loff_t pos = iocb->ki_pos;
 	bool should_dirty = false;
 	struct bio bio;
 	ssize_t ret;
 
-	if (blkdev_dio_unaligned(bdev, pos, iter))
+	/* iov_iter_count() return value will change later, so calculate now */
+	if (atomic_write) {
+		max_align_bytes = bdev_find_max_atomic_write_alignment(bdev,
+					iocb->ki_pos, iov_iter_count(iter));
+	}
+
+	if (blkdev_dio_unaligned(bdev, pos, iter, atomic_write))
 		return -EINVAL;
 
 	if (nr_pages <= DIO_INLINE_BIO_VECS)
@@ -73,7 +91,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 			return -ENOMEM;
 	}
 
-	if (iov_iter_rw(iter) == READ) {
+	if (is_read) {
 		bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
 		if (user_backed_iter(iter))
 			should_dirty = true;
@@ -82,6 +100,10 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
 	bio.bi_ioprio = iocb->ki_ioprio;
+	if (atomic_write) {
+		bio.bi_opf |= REQ_ATOMIC;
+		bio.atomic_write_unit = max_align_bytes;
+	}
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
@@ -175,11 +197,19 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	struct blkdev_dio *dio;
 	struct bio *bio;
 	bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
+	unsigned int max_align_bytes = 0;
 	loff_t pos = iocb->ki_pos;
 	int ret = 0;
 
-	if (blkdev_dio_unaligned(bdev, pos, iter))
+	/* iov_iter_count() return value will change later, so calculate now */
+	if (atomic_write) {
+		max_align_bytes = bdev_find_max_atomic_write_alignment(bdev,
+					iocb->ki_pos, iov_iter_count(iter));
+	}
+
+	if (blkdev_dio_unaligned(bdev, pos, iter, atomic_write))
 		return -EINVAL;
 
 	if (iocb->ki_flags & IOCB_ALLOC_CACHE)
@@ -214,6 +244,8 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		bio->bi_private = dio;
 		bio->bi_end_io = blkdev_bio_end_io;
 		bio->bi_ioprio = iocb->ki_ioprio;
+		if (atomic_write)
+			bio->atomic_write_unit = max_align_bytes;
 
 		ret = bio_iov_iter_get_pages(bio, iter);
 		if (unlikely(ret)) {
@@ -244,8 +276,11 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 			if (dio->flags & DIO_SHOULD_DIRTY)
 				bio_set_pages_dirty(bio);
 		} else {
+			if (atomic_write)
+				bio->bi_opf |= REQ_ATOMIC;
 			task_io_account_write(bio->bi_iter.bi_size);
 		}
 		dio->size += bio->bi_iter.bi_size;
 		pos += bio->bi_iter.bi_size;
 
@@ -313,14 +348,21 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	struct block_device *bdev = iocb->ki_filp->private_data;
 	bool is_read = iov_iter_rw(iter) == READ;
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
+	unsigned int max_align_bytes = 0;
 	struct blkdev_dio *dio;
 	struct bio *bio;
 	loff_t pos = iocb->ki_pos;
 	int ret = 0;
 
-	if (blkdev_dio_unaligned(bdev, pos, iter))
+	if (blkdev_dio_unaligned(bdev, pos, iter, atomic_write))
 		return -EINVAL;
 
+	if (atomic_write) {
+		max_align_bytes = bdev_find_max_atomic_write_alignment(bdev,
+					iocb->ki_pos, iov_iter_count(iter));
+	}
+
 	if (iocb->ki_flags & IOCB_ALLOC_CACHE)
 		opf |= REQ_ALLOC_CACHE;
 	bio = bio_alloc_bioset(bdev, nr_pages, opf, GFP_KERNEL,
@@ -331,6 +373,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	bio->bi_iter.bi_sector = pos >> SECTOR_SHIFT;
 	bio->bi_end_io = blkdev_bio_end_io_async;
 	bio->bi_ioprio = iocb->ki_ioprio;
+	if (atomic_write)
+		bio->atomic_write_unit = max_align_bytes;
 
 	if (iov_iter_is_bvec(iter)) {
 		/*
@@ -355,6 +399,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 			bio_set_pages_dirty(bio);
 		}
 	} else {
+		if (atomic_write)
+			bio->bi_opf |= REQ_ATOMIC;
 		task_io_account_write(bio->bi_iter.bi_size);
 	}
 
-- 
2.31.1



* [PATCH RFC 11/16] fs: iomap: Atomic write support
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (9 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 10/16] block: Add fops atomic write support John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-04  5:00   ` Dave Chinner
  2023-05-03 18:38 ` [PATCH RFC 12/16] xfs: Add support for fallocate2 John Garry
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	John Garry

Add support to create bios whose bi_sector is aligned to, and whose
bi_size is a multiple of, atomic_write_unit.

When we call iomap_dio_bio_iter() -> bio_iov_iter_get_pages() ->
__bio_iov_iter_get_pages(), we trim the bio to a multiple of
atomic_write_unit.

As such, we expect the iomi start and length to meet the same size and
alignment requirements per iomap_dio_bio_iter() call.

In iomap_dio_bio_iter(), ensure for a non-dsync iocb that the mapping
is neither dirty nor unmapped.
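
As an example, a 64K atomic write at file offset 96K gets
dio->atomic_write_unit = 32K per bdev_find_max_atomic_write_alignment()
(the largest power-of-two dividing both 96K and 64K within the device
limits); if that range maps as two discontiguous 32K extents, each
iomap_dio_bio_iter() call then covers one 32K-sized, 32K-aligned mapping
and produces one compliant bio.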

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/direct-io.c | 72 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 70 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f771001574d0..37c3c926dfd8 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -36,6 +36,8 @@ struct iomap_dio {
 	size_t			done_before;
 	bool			wait_for_completion;
 
+	unsigned int		atomic_write_unit;
+
 	union {
 		/* used during submission and for synchronous completion: */
 		struct {
@@ -229,9 +231,21 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 	return opflags;
 }
 
+/*
+ * Note: For atomic writes, each bio which we create when we iter should have
+ *	 bi_sector aligned to atomic_write_unit and also its bi_size should be
+ *	 a multiple of atomic_write_unit.
+ *	 The call to bio_iov_iter_get_pages() -> __bio_iov_iter_get_pages()
+ *	 should trim the length to a multiple of atomic_write_unit for us.
+ *	 This allows us to split each bio later in the block layer to fit
+ *	 request_queue limit.
+ */
 static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		struct iomap_dio *dio)
 {
+	bool atomic_write = (dio->iocb->ki_flags & IOCB_ATOMIC) &&
+			    (dio->flags & IOMAP_DIO_WRITE);
 	const struct iomap *iomap = &iter->iomap;
 	struct inode *inode = iter->inode;
 	unsigned int fs_block_size = i_blocksize(inode), pad;
@@ -249,6 +263,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
 		return -EINVAL;
 
+
+	if (atomic_write && !iocb_is_dsync(dio->iocb)) {
+		if (iomap->flags & IOMAP_F_DIRTY)
+			return -EIO;
+		if (iomap->type != IOMAP_MAPPED)
+			return -EIO;
+	}
+
 	if (iomap->type == IOMAP_UNWRITTEN) {
 		dio->flags |= IOMAP_DIO_UNWRITTEN;
 		need_zeroout = true;
@@ -318,6 +340,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 					  GFP_KERNEL);
 		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
 		bio->bi_ioprio = dio->iocb->ki_ioprio;
+		if (atomic_write) {
+			bio->bi_opf |= REQ_ATOMIC;
+			bio->atomic_write_unit = dio->atomic_write_unit;
+		}
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
 
@@ -492,6 +518,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		is_sync_kiocb(iocb) || (dio_flags & IOMAP_DIO_FORCE_WAIT);
 	struct blk_plug plug;
 	struct iomap_dio *dio;
+	bool is_read = iov_iter_rw(iter) == READ;
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 
 	if (!iomi.len)
 		return NULL;
@@ -500,6 +528,20 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (!dio)
 		return ERR_PTR(-ENOMEM);
 
+	if (atomic_write) {
+		/*
+		 * Note: This lookup is not correct for a multi-device
+		 *	 scenario; however, for current iomap users the bdev
+		 *	 per iter is fixed, so it "works" for now.
+		 */
+		struct super_block *i_sb = inode->i_sb;
+		struct block_device *bdev = i_sb->s_bdev;
+
+		dio->atomic_write_unit =
+			bdev_find_max_atomic_write_alignment(bdev,
+					iomi.pos, iomi.len);
+	}
+
 	dio->iocb = iocb;
 	atomic_set(&dio->ref, 1);
 	dio->size = 0;
@@ -513,7 +555,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->submit.waiter = current;
 	dio->submit.poll_bio = NULL;
 
-	if (iov_iter_rw(iter) == READ) {
+	if (is_read) {
 		if (iomi.pos >= dio->i_size)
 			goto out_free_dio;
 
@@ -567,7 +609,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (ret)
 		goto out_free_dio;
 
-	if (iov_iter_rw(iter) == WRITE) {
+	if (!is_read) {
 		/*
 		 * Try to invalidate cache pages for the range we are writing.
 		 * If this invalidation fails, let the caller fall back to
@@ -592,6 +634,32 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	blk_start_plug(&plug);
 	while ((ret = iomap_iter(&iomi, ops)) > 0) {
+		if (atomic_write) {
+			const struct iomap *_iomap = &iomi.iomap;
+			loff_t iomi_length = iomap_length(&iomi);
+
+			/*
+			 * Ensure the length and start address are each a
+			 * multiple of atomic_write_unit - this is critical.
+			 * If the length is not a multiple of
+			 * atomic_write_unit, then we cannot create a set of
+			 * bios in iomap_dio_bio_iter() which each have a
+			 * length which is a multiple of atomic_write_unit.
+			 *
+			 * Note: It may be more appropriate to have this check
+			 *	 in iomap_dio_bio_iter().
+			 */
+			if ((iomap_sector(_iomap, iomi.pos) << SECTOR_SHIFT) %
+			    dio->atomic_write_unit) {
+				ret = -EIO;
+				break;
+			}
+
+			if (iomi_length % dio->atomic_write_unit) {
+				ret = -EIO;
+				break;
+			}
+		}
 		iomi.processed = iomap_dio_iter(&iomi, dio);
 
 		/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC 12/16] xfs: Add support for fallocate2
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (10 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 11/16] fs: iomap: Atomic " John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 23:26   ` Dave Chinner
  2023-05-03 18:38 ` [PATCH RFC 13/16] scsi: sd: Support reading atomic properties from block limits VPD John Garry
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	Allison Henderson, Catherine Hoang, John Garry

From: Allison Henderson <allison.henderson@oracle.com>

Add support for the fallocate2 ioctl, which is XFS' own version of fallocate.

Struct xfs_fallocate2 is passed in the ioctl, and xfs_fallocate2.alignment
allows the user to specify required extent alignment. This is key for
atomic write support, as we expect extents to be aligned on
atomic_write_unit_max boundaries.

The alignment flag is not sticky, so further extent mutation will not obey
this original alignment request. In addition, extent lengths should always
be a multiple of atomic_write_unit_max, which they are not yet. So this
really only works in scenarios where we were lucky enough to get a single
extent.

The following is sample usage and C code:

mkfs.xfs -f /dev/sda
mount /dev/sda mnt
xfs_fallocate2 mnt/test_file1.img 0 20971520 262144
filefrag -v mnt/test_file1.img

xfs_fallocate2.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* XFS_IOC_FALLOCATE2 and XFS_FALLOC2_ALIGNED as defined by this patch */

struct xfs_fallocate2 {
	int64_t offset;     /* bytes */
	int64_t length;     /* bytes */
	uint64_t flags;
	uint32_t alignment;  /* bytes */
	uint32_t padding[9];
};

int main(int argc, char **argv) {
	char *file;
	int fd, ret;
	struct xfs_fallocate2 fa = {};

	if (argc != 5) {
		printf("usage: xfs_fallocate2 <file> <offset> <length> <alignment>\n");
		exit(1);
	}

	argv++;
	file = *argv;
	argv++;

	fa.offset = atoi(*argv);
	argv++;

	fa.length = atoi(*argv);
	argv++;

	fa.alignment = atoi(*argv);
	argv++;

	if (fa.alignment)
		fa.flags = XFS_FALLOC2_ALIGNED;

	fd = open(file, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		exit(1);

	ret = ioctl(fd, XFS_IOC_FALLOCATE2, &fa);
	close(fd);

	return ret;
}
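
The sample may be built with something like:

gcc -o xfs_fallocate2 xfs_fallocate2.c

In the invocation above, 0 is the file offset, 20971520 (20MB) the length
and 262144 (256KB) the requested extent alignment.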

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/Makefile                 |  1 +
 fs/xfs/libxfs/xfs_attr_remote.c |  2 +-
 fs/xfs/libxfs/xfs_bmap.c        |  9 ++-
 fs/xfs/libxfs/xfs_bmap.h        |  4 +-
 fs/xfs/libxfs/xfs_da_btree.c    |  4 +-
 fs/xfs/libxfs/xfs_fs.h          |  1 +
 fs/xfs/xfs_bmap_util.c          |  7 ++-
 fs/xfs/xfs_bmap_util.h          |  2 +-
 fs/xfs/xfs_dquot.c              |  2 +-
 fs/xfs/xfs_file.c               | 19 +++++--
 fs/xfs/xfs_fs_staging.c         | 99 +++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fs_staging.h         | 21 +++++++
 fs/xfs/xfs_ioctl.c              |  4 ++
 fs/xfs/xfs_iomap.c              |  4 +-
 fs/xfs/xfs_reflink.c            |  4 +-
 fs/xfs/xfs_rtalloc.c            |  2 +-
 fs/xfs/xfs_symlink.c            |  2 +-
 security/security.c             |  1 +
 18 files changed, 168 insertions(+), 20 deletions(-)
 create mode 100644 fs/xfs/xfs_fs_staging.c
 create mode 100644 fs/xfs/xfs_fs_staging.h

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 92d88dc3c9f7..9b413544d358 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -93,6 +93,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_sysfs.o \
 				   xfs_trans.o \
 				   xfs_xattr.o \
+				   xfs_fs_staging.o \
 				   kmem.o
 
 # low-level transaction/log code
diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index d440393b40eb..c5f190fef1b5 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -615,7 +615,7 @@ xfs_attr_rmtval_set_blk(
 	error = xfs_bmapi_write(args->trans, dp,
 			(xfs_fileoff_t)attr->xattri_lblkno,
 			attr->xattri_blkcnt, XFS_BMAPI_ATTRFORK, args->total,
-			map, &nmap);
+			map, &nmap, 0);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 34de6e6898c4..52a6e2b61228 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3275,7 +3275,9 @@ xfs_bmap_compute_alignments(
 	struct xfs_alloc_arg	*args)
 {
 	struct xfs_mount	*mp = args->mp;
-	xfs_extlen_t		align = 0; /* minimum allocation alignment */
+
+	/* minimum allocation alignment */
+	xfs_extlen_t		align = args->alignment;
 	int			stripe_align = 0;
 
 	/* stripe alignment for allocation is determined by mount parameters */
@@ -3652,6 +3654,6 @@ xfs_bmap_btalloc(
 		.datatype	= ap->datatype,
-		.alignment	= 1,
+		.alignment	= ap->align,
 		.minalignslop	= 0,
 	};
 	xfs_fileoff_t		orig_offset;
 	xfs_extlen_t		orig_length;
@@ -4279,12 +4282,14 @@ xfs_bmapi_write(
 	uint32_t		flags,		/* XFS_BMAPI_... */
 	xfs_extlen_t		total,		/* total blocks needed */
 	struct xfs_bmbt_irec	*mval,		/* output: map values */
-	int			*nmap)		/* i/o: mval size/count */
+	int			*nmap,		/* i/o: mval size/count */
+	xfs_extlen_t		align)		/* required extent alignment */
 {
 	struct xfs_bmalloca	bma = {
 		.tp		= tp,
 		.ip		= ip,
 		.total		= total,
+		.align		= align,
 	};
 	struct xfs_mount	*mp = ip->i_mount;
 	int			whichfork = xfs_bmapi_whichfork(flags);
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index dd08361ca5a6..0573dfc5fa6b 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -26,6 +26,7 @@ struct xfs_bmalloca {
 	xfs_fileoff_t		offset;	/* offset in file filling in */
 	xfs_extlen_t		length;	/* i/o length asked/allocated */
 	xfs_fsblock_t		blkno;	/* starting block of new extent */
+	xfs_extlen_t		align;
 
 	struct xfs_btree_cur	*cur;	/* btree cursor */
 	struct xfs_iext_cursor	icur;	/* incore extent cursor */
@@ -189,7 +190,8 @@ int	xfs_bmapi_read(struct xfs_inode *ip, xfs_fileoff_t bno,
 		int *nmap, uint32_t flags);
 int	xfs_bmapi_write(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t len, uint32_t flags,
-		xfs_extlen_t total, struct xfs_bmbt_irec *mval, int *nmap);
+		xfs_extlen_t total, struct xfs_bmbt_irec *mval, int *nmap,
+		xfs_extlen_t align);
 int	__xfs_bunmapi(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t bno, xfs_filblks_t *rlen, uint32_t flags,
 		xfs_extnum_t nexts);
diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index e576560b46e9..e6581254092f 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2174,7 +2174,7 @@ xfs_da_grow_inode_int(
 	nmap = 1;
 	error = xfs_bmapi_write(tp, dp, *bno, count,
 			xfs_bmapi_aflag(w)|XFS_BMAPI_METADATA|XFS_BMAPI_CONTIG,
-			args->total, &map, &nmap);
+			args->total, &map, &nmap, 0);
 	if (error)
 		return error;
 
@@ -2196,7 +2196,7 @@ xfs_da_grow_inode_int(
 			nmap = min(XFS_BMAP_MAX_NMAP, c);
 			error = xfs_bmapi_write(tp, dp, b, c,
 					xfs_bmapi_aflag(w)|XFS_BMAPI_METADATA,
-					args->total, &mapp[mapi], &nmap);
+					args->total, &mapp[mapi], &nmap, 0);
 			if (error)
 				goto out_free_map;
 			if (nmap < 1)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 1cfd5bc6520a..829316ca01ea 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -831,6 +831,7 @@ struct xfs_scrub_metadata {
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 126, struct xfs_fsop_geom)
 #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
 #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
+#define XFS_IOC_FALLOCATE2	     _IOR ('X', 129, struct xfs_fallocate2)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index a09dd2606479..a0c55af6f051 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -776,10 +776,12 @@ int
 xfs_alloc_file_space(
 	struct xfs_inode	*ip,
 	xfs_off_t		offset,
-	xfs_off_t		len)
+	xfs_off_t		len,
+	xfs_off_t		align)
 {
 	xfs_mount_t		*mp = ip->i_mount;
 	xfs_off_t		count;
+	xfs_filblks_t		align_fsb;
 	xfs_filblks_t		allocated_fsb;
 	xfs_filblks_t		allocatesize_fsb;
 	xfs_extlen_t		extsz, temp;
@@ -811,6 +813,7 @@ xfs_alloc_file_space(
 	nimaps = 1;
 	startoffset_fsb	= XFS_B_TO_FSBT(mp, offset);
 	endoffset_fsb = XFS_B_TO_FSB(mp, offset + count);
+	align_fsb = XFS_B_TO_FSB(mp, align);
 	allocatesize_fsb = endoffset_fsb - startoffset_fsb;
 
 	/*
@@ -872,7 +875,7 @@ xfs_alloc_file_space(
 
 		error = xfs_bmapi_write(tp, ip, startoffset_fsb,
 				allocatesize_fsb, XFS_BMAPI_PREALLOC, 0, imapp,
-				&nimaps);
+				&nimaps, align_fsb);
 		if (error)
 			goto error;
 
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 6888078f5c31..476f610ad617 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -54,7 +54,7 @@ int	xfs_bmap_last_extent(struct xfs_trans *tp, struct xfs_inode *ip,
 
 /* preallocation and hole punch interface */
 int	xfs_alloc_file_space(struct xfs_inode *ip, xfs_off_t offset,
-			     xfs_off_t len);
+			     xfs_off_t len, xfs_off_t align);
 int	xfs_free_file_space(struct xfs_inode *ip, xfs_off_t offset,
 			    xfs_off_t len);
 int	xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 8fb90da89787..475e1a56d1b0 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -328,7 +328,7 @@ xfs_dquot_disk_alloc(
 	/* Create the block mapping. */
 	error = xfs_bmapi_write(tp, quotip, dqp->q_fileoffset,
 			XFS_DQUOT_CLUSTER_SIZE_FSB, XFS_BMAPI_METADATA, 0, &map,
-			&nmaps);
+			&nmaps, 0);
 	if (error)
 		goto err_cancel;
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 705250f9f90a..9b1db42a8d33 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -883,12 +883,13 @@ static inline bool xfs_file_sync_writes(struct file *filp)
 		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
 		 FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE)
 
-STATIC long
-xfs_file_fallocate(
+long
+_xfs_file_fallocate(
 	struct file		*file,
 	int			mode,
 	loff_t			offset,
-	loff_t			len)
+	loff_t			len,
+	loff_t 			alignment)
 {
 	struct inode		*inode = file_inode(file);
 	struct xfs_inode	*ip = XFS_I(inode);
@@ -1035,7 +1036,7 @@ xfs_file_fallocate(
 		}
 
 		if (!xfs_is_always_cow_inode(ip)) {
-			error = xfs_alloc_file_space(ip, offset, len);
+			error = xfs_alloc_file_space(ip, offset, len, alignment);
 			if (error)
 				goto out_unlock;
 		}
@@ -1073,6 +1074,16 @@ xfs_file_fallocate(
 	return error;
 }
 
+STATIC long
+xfs_file_fallocate(
+	struct file		*file,
+	int			mode,
+	loff_t			offset,
+	loff_t			len)
+{
+	return _xfs_file_fallocate(file, mode, offset, len, 0);
+}
+
 STATIC int
 xfs_file_fadvise(
 	struct file	*file,
diff --git a/fs/xfs/xfs_fs_staging.c b/fs/xfs/xfs_fs_staging.c
new file mode 100644
index 000000000000..1d635c0a9f49
--- /dev/null
+++ b/fs/xfs/xfs_fs_staging.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2023 Oracle.  All Rights Reserved.
+ */
+
+#include "xfs.h"
+#include "xfs_fs_staging.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+
+#include "linux/security.h"
+#include "linux/fsnotify.h"
+
+extern long _xfs_file_fallocate(
+	struct file		*file,
+	int			mode,
+	loff_t			offset,
+	loff_t			len,
+	loff_t 			alignment);
+
+int xfs_fallocate2(struct file *filp,
+		   void __user *arg)
+{
+	struct inode		*inode = file_inode(filp);
+	//struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_fallocate2 fallocate2;
+	int ret;
+
+	if (copy_from_user(&fallocate2, arg, sizeof(fallocate2)))
+		return -EFAULT;
+
+	if (fallocate2.flags & XFS_FALLOC2_ALIGNED) {
+		if (!fallocate2.alignment || !is_power_of_2(fallocate2.alignment))
+			return -EINVAL;
+
+		if (fallocate2.offset % fallocate2.alignment)
+			return -EINVAL;
+
+		if (fallocate2.length % fallocate2.alignment)
+			return -EINVAL;
+	} else if (fallocate2.alignment) {
+		return -EINVAL;
+	}
+
+	/* These are all just copied from vfs_fallocate() */
+	if (fallocate2.offset < 0 || fallocate2.length <= 0)
+		return -EINVAL;
+
+	if (!(filp->f_mode & FMODE_WRITE))
+		return -EBADF;
+
+	if (IS_IMMUTABLE(inode))
+		return -EPERM;
+
+	/*
+	 * We cannot allow any fallocate operation on an active swapfile
+	 */
+	if (IS_SWAPFILE(inode))
+		return -ETXTBSY;
+
+	/*
+	 * Revalidate the write permissions, in case security policy has
+	 * changed since the files were opened.
+	 */
+	ret = security_file_permission(filp, MAY_WRITE);
+	if (ret)
+		return ret;
+
+	if (S_ISFIFO(inode->i_mode))
+		return -ESPIPE;
+
+	if (S_ISDIR(inode->i_mode))
+		return -EISDIR;
+
+	if (!S_ISREG(inode->i_mode) && !S_ISBLK(inode->i_mode))
+		return -ENODEV;
+
+	/* Check for wrap through zero too */
+	if (((fallocate2.offset + fallocate2.length) > inode->i_sb->s_maxbytes) ||
+		((fallocate2.offset + fallocate2.length) < 0))
+		return -EFBIG;
+
+	if (!filp->f_op->fallocate)
+		return -EOPNOTSUPP;
+
+	file_start_write(filp);
+	ret = _xfs_file_fallocate(filp, 0, fallocate2.offset, fallocate2.length, fallocate2.alignment);
+
+	if (ret == 0)
+		fsnotify_modify(filp);
+
+	file_end_write(filp);
+
+	return ret;
+}
diff --git a/fs/xfs/xfs_fs_staging.h b/fs/xfs/xfs_fs_staging.h
new file mode 100644
index 000000000000..a82e61063dba
--- /dev/null
+++ b/fs/xfs/xfs_fs_staging.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2023 Oracle.  All Rights Reserved.
+ */
+#ifndef __XFS_FS_STAGING_H__
+#define __XFS_FS_STAGING_H__
+
+struct xfs_fallocate2 {
+	s64 offset;	/* bytes */
+	s64 length;	/* bytes */
+	u64 flags;
+	u32 alignment;	/* bytes */
+	u32 padding[9];	/* pad struct size to 64 bytes */
+};
+
+#define XFS_FALLOC2_ALIGNED (1U << 0)
+
+int xfs_fallocate2(struct file *filp,
+		   void __user *arg);
+
+#endif	/* __XFS_FS_STAGING_H__ */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 55bb01173cde..6e60fce44068 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -4,6 +4,7 @@
  * All Rights Reserved.
  */
 #include "xfs.h"
+#include "xfs_fs_staging.h"
 #include "xfs_fs.h"
 #include "xfs_shared.h"
 #include "xfs_format.h"
@@ -2149,6 +2150,9 @@ xfs_file_ioctl(
 		return error;
 	}
 
+	case XFS_IOC_FALLOCATE2:
+		return xfs_fallocate2(filp, arg);
+
 	default:
 		return -ENOTTY;
 	}
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 285885c308bd..a4389a0c4bf2 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -306,7 +306,7 @@ xfs_iomap_write_direct(
 	 */
 	nimaps = 1;
 	error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb, bmapi_flags, 0,
-				imap, &nimaps);
+				imap, &nimaps, 0);
 	if (error)
 		goto out_trans_cancel;
 
@@ -614,7 +614,7 @@ xfs_iomap_write_unwritten(
 		nimaps = 1;
 		error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
 					XFS_BMAPI_CONVERT, resblks, &imap,
-					&nimaps);
+					&nimaps, 0);
 		if (error)
 			goto error_on_bmapi_transaction;
 
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f5dc46ce9803..a2e5ba6cf7f3 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -420,7 +420,7 @@ xfs_reflink_fill_cow_hole(
 	nimaps = 1;
 	error = xfs_bmapi_write(tp, ip, imap->br_startoff, imap->br_blockcount,
 			XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, 0, cmap,
-			&nimaps);
+			&nimaps, 0);
 	if (error)
 		goto out_trans_cancel;
 
@@ -490,7 +490,7 @@ xfs_reflink_fill_delalloc(
 		error = xfs_bmapi_write(tp, ip, cmap->br_startoff,
 				cmap->br_blockcount,
 				XFS_BMAPI_COWFORK | XFS_BMAPI_PREALLOC, 0,
-				cmap, &nimaps);
+				cmap, &nimaps, 0);
 		if (error)
 			goto out_trans_cancel;
 
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index 16534e9873f6..a57a8a4d8294 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -817,7 +817,7 @@ xfs_growfs_rt_alloc(
 		 */
 		nmap = 1;
 		error = xfs_bmapi_write(tp, ip, oblocks, nblocks - oblocks,
-					XFS_BMAPI_METADATA, 0, &map, &nmap);
+					XFS_BMAPI_METADATA, 0, &map, &nmap, 0);
 		if (!error && nmap < 1)
 			error = -ENOSPC;
 		if (error)
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 85e433df6a3f..2a4524bf34a5 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -269,7 +269,7 @@ xfs_symlink(
 		nmaps = XFS_SYMLINK_MAPS;
 
 		error = xfs_bmapi_write(tp, ip, first_fsb, fs_blocks,
-				  XFS_BMAPI_METADATA, resblks, mval, &nmaps);
+				  XFS_BMAPI_METADATA, resblks, mval, &nmaps, 0);
 		if (error)
 			goto out_trans_cancel;
 
diff --git a/security/security.c b/security/security.c
index cf6cc576736f..d53b1b6c2d59 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1593,6 +1593,7 @@ int security_file_permission(struct file *file, int mask)
 
 	return fsnotify_perm(file, mask);
 }
+EXPORT_SYMBOL(security_file_permission);
 
 int security_file_alloc(struct file *file)
 {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC 13/16] scsi: sd: Support reading atomic properties from block limits VPD
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (11 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 12/16] xfs: Add support for fallocate2 John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 18:38 ` [PATCH RFC 14/16] scsi: sd: Add WRITE_ATOMIC_16 support John Garry
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	John Garry

Read the atomic write properties from the Block Limits VPD page, and use
them to update the block layer request queue atomic write properties,
which are exposed via sysfs.

See sbc4r22 section 6.6.4 - Block limits VPD page.
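
As a worked example (illustrative values only), consider a disk with 512B
logical blocks and 4K physical blocks which reports:

	MAXIMUM ATOMIC TRANSFER LENGTH			= 128
	MAXIMUM ATOMIC TRANSFER LENGTH WITH BOUNDARY	= 64
	MAXIMUM ATOMIC BOUNDARY SIZE			= 32
	ATOMIC GRANULARITY				= 0

sd_config_atomic() would then derive:

	max_atomic = max(rounddown_pow_of_two(128),
			 rounddown_pow_of_two(64))	= 128 blocks
	unit_max   = min(128, rounddown_pow_of_two(32))	= 32 blocks
	unit_min   = physical_block_size_sectors	= 8 blocks
	atomic_write_max_bytes = 128 * 512		= 64K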

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/sd.c | 34 +++++++++++++++++++++++++++++++++-
 drivers/scsi/sd.h |  7 +++++++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 4bb87043e6db..8db8b9389227 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -876,6 +876,30 @@ static blk_status_t sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
 	return scsi_alloc_sgtables(cmd);
 }
 
+static void sd_config_atomic(struct scsi_disk *sdkp)
+{
+	unsigned int logical_block_size = sdkp->device->sector_size;
+	struct request_queue *q = sdkp->disk->queue;
+
+	if (sdkp->max_atomic) {
+		unsigned int physical_block_size_sectors =
+			sdkp->physical_block_size / sdkp->device->sector_size;
+		unsigned int max_atomic = max_t(unsigned int,
+			rounddown_pow_of_two(sdkp->max_atomic),
+			rounddown_pow_of_two(sdkp->max_atomic_with_boundary));
+		unsigned int unit_max = min_t(unsigned int, max_atomic,
+			rounddown_pow_of_two(sdkp->max_atomic_boundary));
+		unsigned int unit_min = sdkp->atomic_granularity ?
+			rounddown_pow_of_two(sdkp->atomic_granularity) :
+			physical_block_size_sectors;
+
+		blk_queue_atomic_write_max_bytes(q, max_atomic * logical_block_size);
+		blk_queue_atomic_write_unit_min(q, unit_min);
+		blk_queue_atomic_write_unit_max(q, unit_max);
+		blk_queue_atomic_write_boundary(q, 0);
+	}
+}
+
 static blk_status_t sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd,
 		bool unmap)
 {
@@ -2922,7 +2946,7 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 		sdkp->max_ws_blocks = (u32)get_unaligned_be64(&vpd->data[36]);
 
 		if (!sdkp->lbpme)
-			goto out;
+			goto read_atomics;
 
 		lba_count = get_unaligned_be32(&vpd->data[20]);
 		desc_count = get_unaligned_be32(&vpd->data[24]);
@@ -2953,6 +2977,14 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 			else
 				sd_config_discard(sdkp, SD_LBP_DISABLE);
 		}
+read_atomics:
+		sdkp->max_atomic = get_unaligned_be32(&vpd->data[44]);
+		sdkp->atomic_alignment = get_unaligned_be32(&vpd->data[48]);
+		sdkp->atomic_granularity = get_unaligned_be32(&vpd->data[52]);
+		sdkp->max_atomic_with_boundary = get_unaligned_be32(&vpd->data[56]);
+		sdkp->max_atomic_boundary = get_unaligned_be32(&vpd->data[60]);
+
+		sd_config_atomic(sdkp);
 	}
 
  out:
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 5eea762f84d1..bca05fbd74df 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -121,6 +121,13 @@ struct scsi_disk {
 	u32		max_unmap_blocks;
 	u32		unmap_granularity;
 	u32		unmap_alignment;
+
+	u32		max_atomic;
+	u32		atomic_alignment;
+	u32		atomic_granularity;
+	u32		max_atomic_with_boundary;
+	u32		max_atomic_boundary;
+
 	u32		index;
 	unsigned int	physical_block_size;
 	unsigned int	max_medium_access_timeouts;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC 14/16] scsi: sd: Add WRITE_ATOMIC_16 support
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (12 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 13/16] scsi: sd: Support reading atomic properties from block limits VPD John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 18:48   ` Bart Van Assche
  2023-05-03 18:38 ` [PATCH RFC 15/16] scsi: scsi_debug: Atomic write support John Garry
  2023-05-03 18:38 ` [PATCH RFC 16/16] nvme: Support atomic writes John Garry
  15 siblings, 1 reply; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	John Garry

Add function sd_setup_atomic_cmnd() to set up a WRITE_ATOMIC_16
CDB for when the REQ_ATOMIC flag is set on the request.

Also add trace info.
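
For reference, the CDB built by sd_setup_atomic_cmnd() below is laid out
as follows (the driver currently zeroes the ATOMIC BOUNDARY field):

	byte  0      opcode (WRITE_ATOMIC_16, 0x9c)
	byte  1      protect | fua flags
	bytes 2-9    logical block address (big-endian 64-bit)
	bytes 10-11  ATOMIC BOUNDARY (big-endian 16-bit, set to 0)
	bytes 12-13  transfer length in blocks (big-endian 16-bit)
	bytes 14-15  group number / control (set to 0)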

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_trace.c | 22 ++++++++++++++++++++++
 drivers/scsi/sd.c         | 20 ++++++++++++++++++++
 include/scsi/scsi_proto.h |  1 +
 3 files changed, 43 insertions(+)

diff --git a/drivers/scsi/scsi_trace.c b/drivers/scsi/scsi_trace.c
index 41a950075913..3e47c4472a80 100644
--- a/drivers/scsi/scsi_trace.c
+++ b/drivers/scsi/scsi_trace.c
@@ -325,6 +325,26 @@ scsi_trace_zbc_out(struct trace_seq *p, unsigned char *cdb, int len)
 	return ret;
 }
 
+static const char *
+scsi_trace_atomic_write16_out(struct trace_seq *p, unsigned char *cdb, int len)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	unsigned int boundary_size;
+	unsigned int nr_blocks;
+	sector_t lba;
+
+	lba = get_unaligned_be64(&cdb[2]);
+	boundary_size = get_unaligned_be16(&cdb[10]);
+	nr_blocks = get_unaligned_be16(&cdb[12]);
+
+	trace_seq_printf(p, "lba=%llu txlen=%u boundary_size=%u",
+			  lba, nr_blocks, boundary_size);
+
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
 static const char *
 scsi_trace_varlen(struct trace_seq *p, unsigned char *cdb, int len)
 {
@@ -385,6 +405,8 @@ scsi_trace_parse_cdb(struct trace_seq *p, unsigned char *cdb, int len)
 		return scsi_trace_zbc_in(p, cdb, len);
 	case ZBC_OUT:
 		return scsi_trace_zbc_out(p, cdb, len);
+	case WRITE_ATOMIC_16:
+		return scsi_trace_atomic_write16_out(p, cdb, len);
 	default:
 		return scsi_trace_misc(p, cdb, len);
 	}
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8db8b9389227..e69473fa2dd7 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1139,6 +1139,23 @@ static blk_status_t sd_setup_rw6_cmnd(struct scsi_cmnd *cmd, bool write,
 	return BLK_STS_OK;
 }
 
+static blk_status_t sd_setup_atomic_cmnd(struct scsi_cmnd *cmd,
+					sector_t lba, unsigned int nr_blocks,
+					unsigned char flags)
+{
+	cmd->cmd_len  = 16;
+	cmd->cmnd[0]  = WRITE_ATOMIC_16;
+	cmd->cmnd[1]  = flags;
+	put_unaligned_be64(lba, &cmd->cmnd[2]);
+	cmd->cmnd[10] = 0;
+	cmd->cmnd[11] = 0;
+	put_unaligned_be16(nr_blocks, &cmd->cmnd[12]);
+	cmd->cmnd[14] = 0;
+	cmd->cmnd[15] = 0;
+
+	return BLK_STS_OK;
+}
+
 static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 {
 	struct request *rq = scsi_cmd_to_rq(cmd);
@@ -1149,6 +1166,7 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	unsigned int nr_blocks = sectors_to_logical(sdp, blk_rq_sectors(rq));
 	unsigned int mask = logical_to_sectors(sdp, 1) - 1;
 	bool write = rq_data_dir(rq) == WRITE;
+	bool atomic_write = !!(rq->cmd_flags & REQ_ATOMIC) && write;
 	unsigned char protect, fua;
 	blk_status_t ret;
 	unsigned int dif;
@@ -1208,6 +1226,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	if (protect && sdkp->protection_type == T10_PI_TYPE2_PROTECTION) {
 		ret = sd_setup_rw32_cmnd(cmd, write, lba, nr_blocks,
 					 protect | fua);
+	} else if (atomic_write) {
+		ret = sd_setup_atomic_cmnd(cmd, lba, nr_blocks, protect | fua);
 	} else if (sdp->use_16_for_rw || (nr_blocks > 0xffff)) {
 		ret = sd_setup_rw16_cmnd(cmd, write, lba, nr_blocks,
 					 protect | fua);
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index fbe5bdfe4d6e..c449be9cba60 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -119,6 +119,7 @@
 #define WRITE_SAME_16	      0x93
 #define ZBC_OUT		      0x94
 #define ZBC_IN		      0x95
+#define WRITE_ATOMIC_16       0x9c
 #define SERVICE_ACTION_BIDIRECTIONAL 0x9d
 #define SERVICE_ACTION_IN_16  0x9e
 #define SERVICE_ACTION_OUT_16 0x9f
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC 15/16] scsi: scsi_debug: Atomic write support
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (13 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 14/16] scsi: sd: Add WRITE_ATOMIC_16 support John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 18:38 ` [PATCH RFC 16/16] nvme: Support atomic writes John Garry
  15 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	John Garry

Add initial support for atomic writes.

As is the standard method, feed device properties via module params, those
being:
- atomic_max_size_blks
- atomic_alignment_blks
- atomic_granularity_blks
- atomic_max_size_with_boundary_blks
- atomic_max_boundary_blks

These just match sbc4r22 section 6.6.4 - Block limits VPD page.

We just support ATOMIC_WRITE_16.
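
For example, to load the driver with atomic writes enabled and the default
8192-block max size and 2-block alignment/granularity made explicit,
something like the following may be used:

modprobe scsi_debug atomic_write=1 atomic_max_size_blks=8192 \
	atomic_alignment_blks=2 atomic_granularity_blks=2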

The major change in the driver is how we lock the device for RW accesses.

Currently the driver uses a per-device lock for accessing device metadata
and "media" data (calls to do_device_access()) atomically for the duration
of the whole read/write command.

This does not suit verifying atomic writes: since currently all
reads/writes are atomic, using atomic writes would prove nothing.

Change the device access model such that regular writes are only atomic on
a per-sector basis, while reads and atomic writes are fully atomic.

As mentioned, since accessing metadata and device media together is atomic,
continue to have regular writes involving metadata - like discard or PI -
be atomic. We can improve this later.

Currently we only support the model where overlapping reads or writes wait
for the current access to complete before an atomic write commences. This
is described in section 4.29.3.2 of the SBC. However, we simplify things
and wait for all accesses to complete (when issuing an atomic write).
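
To illustrate the resulting model (illustrative pseudo-code only,
simplified from the do_device_access() changes below):

	atomic write:	write_lock(macc_data_lck);
			copy all sectors;
			write_unlock(macc_data_lck);

	regular write:	read_lock(macc_data_lck);
			for each sector {
				write_lock(macc_sector_lck);
				copy sector;
				write_unlock(macc_sector_lck);
			}
			read_unlock(macc_data_lck);

	read:		as per regular write, but taking macc_sector_lck
			as a reader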

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_debug.c | 593 +++++++++++++++++++++++++++++---------
 1 file changed, 460 insertions(+), 133 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 776371080762..0555aee30ea1 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -66,6 +66,8 @@ static const char *sdebug_version_date = "20210520";
 
 /* Additional Sense Code (ASC) */
 #define NO_ADDITIONAL_SENSE 0x0
+#define OVERLAP_ATOMIC_COMMAND_ASC 0x0
+#define OVERLAP_ATOMIC_COMMAND_ASCQ 0x23
 #define LOGICAL_UNIT_NOT_READY 0x4
 #define LOGICAL_UNIT_COMMUNICATION_FAILURE 0x8
 #define UNRECOVERED_READ_ERR 0x11
@@ -100,6 +102,7 @@ static const char *sdebug_version_date = "20210520";
 #define READ_BOUNDARY_ASCQ 0x7
 #define ATTEMPT_ACCESS_GAP 0x9
 #define INSUFF_ZONE_ASCQ 0xe
+/* see drivers/scsi/sense_codes.h */
 
 /* Additional Sense Code Qualifier (ASCQ) */
 #define ACK_NAK_TO 0x3
@@ -149,6 +152,12 @@ static const char *sdebug_version_date = "20210520";
 #define DEF_VIRTUAL_GB   0
 #define DEF_VPD_USE_HOSTNO 1
 #define DEF_WRITESAME_LENGTH 0xFFFF
+#define DEF_ATOMIC_WRITE 1
+#define DEF_ATOMIC_MAX_LENGTH 8192
+#define DEF_ATOMIC_ALIGNMENT 2
+#define DEF_ATOMIC_GRANULARITY 2
+#define DEF_ATOMIC_BOUNDARY_MAX_LENGTH (DEF_ATOMIC_MAX_LENGTH)
+#define DEF_ATOMIC_MAX_BOUNDARY 128
 #define DEF_STRICT 0
 #define DEF_STATISTICS false
 #define DEF_SUBMIT_QUEUES 1
@@ -318,7 +327,9 @@ struct sdebug_host_info {
 
 /* There is an xarray of pointers to this struct's objects, one per host */
 struct sdeb_store_info {
-	rwlock_t macc_lck;	/* for atomic media access on this store */
+	rwlock_t macc_data_lck;	/* for media data access on this store */
+	rwlock_t macc_meta_lck;	/* for atomic media meta access on this store */
+	rwlock_t macc_sector_lck;	/* per-sector media data access on this store */
 	u8 *storep;		/* user data storage (ram) */
 	struct t10_pi_tuple *dif_storep; /* protection info */
 	void *map_storep;	/* provisioning map */
@@ -345,12 +356,20 @@ struct sdebug_defer {
 	enum sdeb_defer_type defer_t;
 };
 
+struct sdebug_device_access_info {
+	bool atomic_write;
+	u64 lba;
+	u32 num;
+	struct scsi_cmnd *self;
+};
+
 struct sdebug_queued_cmd {
 	/* corresponding bit set in in_use_bm[] in owning struct sdebug_queue
 	 * instance indicates this slot is in use.
 	 */
 	struct sdebug_defer *sd_dp;
 	struct scsi_cmnd *a_cmnd;
+	struct sdebug_device_access_info *i;
 };
 
 struct sdebug_queue {
@@ -413,7 +432,8 @@ enum sdeb_opcode_index {
 	SDEB_I_PRE_FETCH = 29,		/* 10, 16 */
 	SDEB_I_ZONE_OUT = 30,		/* 0x94+SA; includes no data xfer */
 	SDEB_I_ZONE_IN = 31,		/* 0x95+SA; all have data-in */
-	SDEB_I_LAST_ELEM_P1 = 32,	/* keep this last (previous + 1) */
+	SDEB_I_ATOMIC_WRITE_16 = 32,
+	SDEB_I_LAST_ELEM_P1 = 33,	/* keep this last (previous + 1) */
 };
 
 
@@ -447,7 +467,8 @@ static const unsigned char opcode_ind_arr[256] = {
 	0, 0, 0, SDEB_I_VERIFY,
 	SDEB_I_PRE_FETCH, SDEB_I_SYNC_CACHE, 0, SDEB_I_WRITE_SAME,
 	SDEB_I_ZONE_OUT, SDEB_I_ZONE_IN, 0, 0,
-	0, 0, 0, 0, 0, 0, SDEB_I_SERV_ACT_IN_16, SDEB_I_SERV_ACT_OUT_16,
+	0, 0, 0, 0,
+	SDEB_I_ATOMIC_WRITE_16, 0, SDEB_I_SERV_ACT_IN_16, SDEB_I_SERV_ACT_OUT_16,
 /* 0xa0; 0xa0->0xbf: 12 byte cdbs */
 	SDEB_I_REPORT_LUNS, SDEB_I_ATA_PT, 0, SDEB_I_MAINT_IN,
 	     SDEB_I_MAINT_OUT, 0, 0, 0,
@@ -495,6 +516,7 @@ static int resp_write_buffer(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_sync_cache(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_pre_fetch(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_report_zones(struct scsi_cmnd *, struct sdebug_dev_info *);
+static int resp_atomic_write(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_open_zone(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_close_zone(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_finish_zone(struct scsi_cmnd *, struct sdebug_dev_info *);
@@ -731,6 +753,11 @@ static const struct opcode_info_t opcode_info_arr[SDEB_I_LAST_ELEM_P1 + 1] = {
 	    resp_report_zones, zone_in_iarr, /* ZONE_IN(16), REPORT ZONES) */
 		{16,  0x0 /* SA */, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 		 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xbf, 0xc7} },
+/* 31 */
+	{0, 0x0, 0x0, F_D_OUT | FF_MEDIA_IO,
+	    resp_atomic_write, NULL, /* ATOMIC WRITE 16 */
+		{16,  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff} },
 /* sentinel */
 	{0xff, 0, 0, 0, NULL, NULL,		/* terminating element */
 	    {0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} },
@@ -779,6 +806,12 @@ static unsigned int sdebug_unmap_granularity = DEF_UNMAP_GRANULARITY;
 static unsigned int sdebug_unmap_max_blocks = DEF_UNMAP_MAX_BLOCKS;
 static unsigned int sdebug_unmap_max_desc = DEF_UNMAP_MAX_DESC;
 static unsigned int sdebug_write_same_length = DEF_WRITESAME_LENGTH;
+static unsigned int sdebug_atomic_write = DEF_ATOMIC_WRITE;
+static unsigned int sdebug_atomic_max_size_blks = DEF_ATOMIC_MAX_LENGTH;
+static unsigned int sdebug_atomic_alignment_blks = DEF_ATOMIC_ALIGNMENT;
+static unsigned int sdebug_atomic_granularity_blks = DEF_ATOMIC_GRANULARITY;
+static unsigned int sdebug_atomic_max_size_with_boundary_blks = DEF_ATOMIC_BOUNDARY_MAX_LENGTH;
+static unsigned int sdebug_atomic_max_boundary_blks = DEF_ATOMIC_MAX_BOUNDARY;
 static int sdebug_uuid_ctl = DEF_UUID_CTL;
 static bool sdebug_random = DEF_RANDOM;
 static bool sdebug_per_host_store = DEF_PER_HOST_STORE;
@@ -880,6 +913,11 @@ static inline bool scsi_debug_lbp(void)
 		(sdebug_lbpu || sdebug_lbpws || sdebug_lbpws10);
 }
 
+static inline bool scsi_debug_atomic_write(void)
+{
+	return 0 == sdebug_fake_rw && sdebug_atomic_write;
+}
+
 static void *lba2fake_store(struct sdeb_store_info *sip,
 			    unsigned long long lba)
 {
@@ -1510,6 +1548,14 @@ static int inquiry_vpd_b0(unsigned char *arr)
 	/* Maximum WRITE SAME Length */
 	put_unaligned_be64(sdebug_write_same_length, &arr[32]);
 
+	if (sdebug_atomic_write) {
+		put_unaligned_be32(sdebug_atomic_max_size_blks, &arr[40]);
+		put_unaligned_be32(sdebug_atomic_alignment_blks, &arr[44]);
+		put_unaligned_be32(sdebug_atomic_granularity_blks, &arr[48]);
+		put_unaligned_be32(sdebug_atomic_max_size_with_boundary_blks, &arr[52]);
+		put_unaligned_be32(sdebug_atomic_max_boundary_blks, &arr[56]);
+	}
+
 	return 0x3c; /* Mandatory page length for Logical Block Provisioning */
 }
 
@@ -3011,15 +3057,242 @@ static inline struct sdeb_store_info *devip2sip(struct sdebug_dev_info *devip,
 	return xa_load(per_store_ap, devip->sdbg_host->si_idx);
 }
 
+
+static inline void
+sdeb_read_lock(rwlock_t *lock)
+{
+	if (sdebug_no_rwlock)
+		__acquire(lock);
+	else
+		read_lock(lock);
+}
+
+static inline void
+sdeb_read_unlock(rwlock_t *lock)
+{
+	if (sdebug_no_rwlock)
+		__release(lock);
+	else
+		read_unlock(lock);
+}
+
+static inline void
+sdeb_write_lock(rwlock_t *lock)
+{
+	if (sdebug_no_rwlock)
+		__acquire(lock);
+	else
+		write_lock(lock);
+}
+
+static inline void
+sdeb_write_unlock(rwlock_t *lock)
+{
+	if (sdebug_no_rwlock)
+		__release(lock);
+	else
+		write_unlock(lock);
+}
+
+static inline void
+sdeb_data_read_lock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_read_lock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_read_unlock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_read_unlock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_write_lock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_write_lock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_write_unlock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_write_unlock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_sector_read_lock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_read_lock(&sip->macc_sector_lck);
+}
+
+static inline void
+sdeb_data_sector_read_unlock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_read_unlock(&sip->macc_sector_lck);
+}
+
+static inline void
+sdeb_data_sector_write_lock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_write_lock(&sip->macc_sector_lck);
+}
+
+static inline void
+sdeb_data_sector_write_unlock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_write_unlock(&sip->macc_sector_lck);
+}
+
+/*
+ * Atomic locking:
+ * We simplify the atomic model to allow only 1x atomic
+ * write and many non-atomic reads or writes for all
+ * LBAs.
+ *
+ * An RW lock has similar behaviour:
+ * only 1x writer and many readers.
+ *
+ * So use an RW lock for per-device read and write locking:
+ * an atomic access grabs the lock as a writer and a
+ * non-atomic access grabs the lock as a reader.
+ */
+
+static inline void
+sdeb_data_lock(struct sdeb_store_info *sip, bool atomic_write)
+{
+	if (atomic_write)
+		sdeb_data_write_lock(sip);
+	else
+		sdeb_data_read_lock(sip);
+}
+
+static inline void
+sdeb_data_unlock(struct sdeb_store_info *sip, bool atomic_write)
+{
+	if (atomic_write)
+		sdeb_data_write_unlock(sip);
+	else
+		sdeb_data_read_unlock(sip);
+}
+
+/* Allow many reads but only 1x write per sector */
+static inline void
+sdeb_data_sector_lock(struct sdeb_store_info *sip, bool do_write)
+{
+	if (do_write)
+		sdeb_data_sector_write_lock(sip);
+	else
+		sdeb_data_sector_read_lock(sip);
+}
+
+static inline void
+sdeb_data_sector_unlock(struct sdeb_store_info *sip, bool do_write)
+{
+	if (do_write)
+		sdeb_data_sector_write_unlock(sip);
+	else
+		sdeb_data_sector_read_unlock(sip);
+}
+
+static inline void
+sdeb_meta_read_lock(struct sdeb_store_info *sip)
+{
+	if (sdebug_no_rwlock) {
+		if (sip)
+			__acquire(&sip->macc_meta_lck);
+		else
+			__acquire(&sdeb_fake_rw_lck);
+	} else {
+		if (sip)
+			read_lock(&sip->macc_meta_lck);
+		else
+			read_lock(&sdeb_fake_rw_lck);
+	}
+}
+
+static inline void
+sdeb_meta_read_unlock(struct sdeb_store_info *sip)
+{
+	if (sdebug_no_rwlock) {
+		if (sip)
+			__release(&sip->macc_meta_lck);
+		else
+			__release(&sdeb_fake_rw_lck);
+	} else {
+		if (sip)
+			read_unlock(&sip->macc_meta_lck);
+		else
+			read_unlock(&sdeb_fake_rw_lck);
+	}
+}
+
+static inline void
+sdeb_meta_write_lock(struct sdeb_store_info *sip)
+{
+	if (sdebug_no_rwlock) {
+		if (sip)
+			__acquire(&sip->macc_meta_lck);
+		else
+			__acquire(&sdeb_fake_rw_lck);
+	} else {
+		if (sip)
+			write_lock(&sip->macc_meta_lck);
+		else
+			write_lock(&sdeb_fake_rw_lck);
+	}
+}
+
+static inline void
+sdeb_meta_write_unlock(struct sdeb_store_info *sip)
+{
+	if (sdebug_no_rwlock) {
+		if (sip)
+			__release(&sip->macc_meta_lck);
+		else
+			__release(&sdeb_fake_rw_lck);
+	} else {
+		if (sip)
+			write_unlock(&sip->macc_meta_lck);
+		else
+			write_unlock(&sdeb_fake_rw_lck);
+	}
+}
+
+static struct sdebug_queue *get_queue(struct scsi_cmnd *cmnd);
+
 /* Returns number of bytes copied or -1 if error. */
 static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp,
-			    u32 sg_skip, u64 lba, u32 num, bool do_write)
+			    u32 sg_skip, u64 lba, u32 num, bool do_write,
+			    bool atomic_write)
 {
 	int ret;
-	u64 block, rest = 0;
+	u64 block;
 	enum dma_data_direction dir;
 	struct scsi_data_buffer *sdb = &scp->sdb;
 	u8 *fsp;
+	int i;
+
+	/*
+	 * Even though reads are inherently atomic (in this driver), we expect
+	 * the atomic flag only for writes.
+	 */
+	if (!do_write && atomic_write)
+		return -1;
 
 	if (do_write) {
 		dir = DMA_TO_DEVICE;
@@ -3035,21 +3308,26 @@ static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp,
 	fsp = sip->storep;
 
 	block = do_div(lba, sdebug_store_sectors);
-	if (block + num > sdebug_store_sectors)
-		rest = block + num - sdebug_store_sectors;
 
-	ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
+	/* Only allow 1x atomic write or multiple non-atomic writes at any given time */
+	sdeb_data_lock(sip, atomic_write);
+	for (i = 0; i < num; i++) {
+		/* We shouldn't need to lock for atomic writes, but do it anyway */
+		sdeb_data_sector_lock(sip, do_write);
+		ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 		   fsp + (block * sdebug_sector_size),
-		   (num - rest) * sdebug_sector_size, sg_skip, do_write);
-	if (ret != (num - rest) * sdebug_sector_size)
-		return ret;
-
-	if (rest) {
-		ret += sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
-			    fsp, rest * sdebug_sector_size,
-			    sg_skip + ((num - rest) * sdebug_sector_size),
-			    do_write);
+		   sdebug_sector_size, sg_skip, do_write);
+		sdeb_data_sector_unlock(sip, do_write);
+		if (ret != sdebug_sector_size) {
+			ret += (i * sdebug_sector_size);
+			break;
+		}
+		sg_skip += sdebug_sector_size;
+		if (++block >= sdebug_store_sectors)
+			block = 0;
 	}
+	if (i == num)
+		ret = num * sdebug_sector_size;
+	sdeb_data_unlock(sip, atomic_write);
 
 	return ret;
 }
@@ -3225,70 +3503,6 @@ static int prot_verify_read(struct scsi_cmnd *scp, sector_t start_sec,
 	return ret;
 }
 
-static inline void
-sdeb_read_lock(struct sdeb_store_info *sip)
-{
-	if (sdebug_no_rwlock) {
-		if (sip)
-			__acquire(&sip->macc_lck);
-		else
-			__acquire(&sdeb_fake_rw_lck);
-	} else {
-		if (sip)
-			read_lock(&sip->macc_lck);
-		else
-			read_lock(&sdeb_fake_rw_lck);
-	}
-}
-
-static inline void
-sdeb_read_unlock(struct sdeb_store_info *sip)
-{
-	if (sdebug_no_rwlock) {
-		if (sip)
-			__release(&sip->macc_lck);
-		else
-			__release(&sdeb_fake_rw_lck);
-	} else {
-		if (sip)
-			read_unlock(&sip->macc_lck);
-		else
-			read_unlock(&sdeb_fake_rw_lck);
-	}
-}
-
-static inline void
-sdeb_write_lock(struct sdeb_store_info *sip)
-{
-	if (sdebug_no_rwlock) {
-		if (sip)
-			__acquire(&sip->macc_lck);
-		else
-			__acquire(&sdeb_fake_rw_lck);
-	} else {
-		if (sip)
-			write_lock(&sip->macc_lck);
-		else
-			write_lock(&sdeb_fake_rw_lck);
-	}
-}
-
-static inline void
-sdeb_write_unlock(struct sdeb_store_info *sip)
-{
-	if (sdebug_no_rwlock) {
-		if (sip)
-			__release(&sip->macc_lck);
-		else
-			__release(&sdeb_fake_rw_lck);
-	} else {
-		if (sip)
-			write_unlock(&sip->macc_lck);
-		else
-			write_unlock(&sdeb_fake_rw_lck);
-	}
-}
-
 static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 {
 	bool check_prot;
@@ -3298,6 +3512,7 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 	u64 lba;
 	struct sdeb_store_info *sip = devip2sip(devip, true);
 	u8 *cmd = scp->cmnd;
+	bool meta_data_locked = false;
 
 	switch (cmd[0]) {
 	case READ_16:
@@ -3356,6 +3571,10 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		atomic_set(&sdeb_inject_pending, 0);
 	}
 
+	/*
+	 * When checking device access params, for reads we only check data
+	 * versus what is set at init time, so no need to lock.
+	 */
 	ret = check_device_access_params(scp, lba, num, false);
 	if (ret)
 		return ret;
@@ -3375,29 +3594,33 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		return check_condition_result;
 	}
 
-	sdeb_read_lock(sip);
+	if (sdebug_dev_is_zoned(devip) ||
+	    (sdebug_dix && scsi_prot_sg_count(scp)))  {
+		sdeb_meta_read_lock(sip);
+		meta_data_locked = true;
+	}
 
 	/* DIX + T10 DIF */
 	if (unlikely(sdebug_dix && scsi_prot_sg_count(scp))) {
 		switch (prot_verify_read(scp, lba, num, ei_lba)) {
 		case 1: /* Guard tag error */
 			if (cmd[1] >> 5 != 3) { /* RDPROTECT != 3 */
-				sdeb_read_unlock(sip);
+				sdeb_meta_read_unlock(sip);
 				mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 1);
 				return check_condition_result;
 			} else if (scp->prot_flags & SCSI_PROT_GUARD_CHECK) {
-				sdeb_read_unlock(sip);
+				sdeb_meta_read_unlock(sip);
 				mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 1);
 				return illegal_condition_result;
 			}
 			break;
 		case 3: /* Reference tag error */
 			if (cmd[1] >> 5 != 3) { /* RDPROTECT != 3 */
-				sdeb_read_unlock(sip);
+				sdeb_meta_read_unlock(sip);
 				mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 3);
 				return check_condition_result;
 			} else if (scp->prot_flags & SCSI_PROT_REF_CHECK) {
-				sdeb_read_unlock(sip);
+				sdeb_meta_read_unlock(sip);
 				mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 3);
 				return illegal_condition_result;
 			}
@@ -3405,8 +3628,9 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		}
 	}
 
-	ret = do_device_access(sip, scp, 0, lba, num, false);
-	sdeb_read_unlock(sip);
+	ret = do_device_access(sip, scp, 0, lba, num, false, false);
+	if (meta_data_locked)
+		sdeb_meta_read_unlock(sip);
 	if (unlikely(ret == -1))
 		return DID_ERROR << 16;
 
@@ -3595,6 +3819,7 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 	u64 lba;
 	struct sdeb_store_info *sip = devip2sip(devip, true);
 	u8 *cmd = scp->cmnd;
+	bool meta_data_locked = false;
 
 	switch (cmd[0]) {
 	case WRITE_16:
@@ -3648,10 +3873,17 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 				    "to DIF device\n");
 	}
 
-	sdeb_write_lock(sip);
+	if (sdebug_dev_is_zoned(devip) ||
+	    (sdebug_dix && scsi_prot_sg_count(scp)) ||
+	    scsi_debug_lbp())  {
+		sdeb_meta_write_lock(sip);
+		meta_data_locked = true;
+	}
+
 	ret = check_device_access_params(scp, lba, num, true);
 	if (ret) {
-		sdeb_write_unlock(sip);
+		if (meta_data_locked)
+			sdeb_meta_write_unlock(sip);
 		return ret;
 	}
 
@@ -3660,22 +3892,22 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		switch (prot_verify_write(scp, lba, num, ei_lba)) {
 		case 1: /* Guard tag error */
 			if (scp->prot_flags & SCSI_PROT_GUARD_CHECK) {
-				sdeb_write_unlock(sip);
+				sdeb_meta_write_unlock(sip);
 				mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 1);
 				return illegal_condition_result;
 			} else if (scp->cmnd[1] >> 5 != 3) { /* WRPROTECT != 3 */
-				sdeb_write_unlock(sip);
+				sdeb_meta_write_unlock(sip);
 				mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 1);
 				return check_condition_result;
 			}
 			break;
 		case 3: /* Reference tag error */
 			if (scp->prot_flags & SCSI_PROT_REF_CHECK) {
-				sdeb_write_unlock(sip);
+				sdeb_meta_write_unlock(sip);
 				mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 3);
 				return illegal_condition_result;
 			} else if (scp->cmnd[1] >> 5 != 3) { /* WRPROTECT != 3 */
-				sdeb_write_unlock(sip);
+				sdeb_meta_write_unlock(sip);
 				mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 3);
 				return check_condition_result;
 			}
@@ -3683,13 +3915,16 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		}
 	}
 
-	ret = do_device_access(sip, scp, 0, lba, num, true);
+	ret = do_device_access(sip, scp, 0, lba, num, true, false);
 	if (unlikely(scsi_debug_lbp()))
 		map_region(sip, lba, num);
+
 	/* If ZBC zone then bump its write pointer */
 	if (sdebug_dev_is_zoned(devip))
 		zbc_inc_wp(devip, lba, num);
-	sdeb_write_unlock(sip);
+	if (meta_data_locked)
+		sdeb_meta_write_unlock(sip);
+
 	if (unlikely(-1 == ret))
 		return DID_ERROR << 16;
 	else if (unlikely(sdebug_verbose &&
@@ -3796,7 +4031,8 @@ static int resp_write_scat(struct scsi_cmnd *scp,
 		goto err_out;
 	}
 
-	sdeb_write_lock(sip);
+	/* Just keep it simple and always lock for now */
+	sdeb_meta_write_lock(sip);
 	sg_off = lbdof_blen;
 	/* Spec says Buffer xfer Length field in number of LBs in dout */
 	cum_lb = 0;
@@ -3839,7 +4075,11 @@ static int resp_write_scat(struct scsi_cmnd *scp,
 			}
 		}
 
-		ret = do_device_access(sip, scp, sg_off, lba, num, true);
+		/*
+		 * Write each range atomically, to keep behaviour as close
+		 * as possible to the pre-atomic-writes behaviour.
+		 */
+		ret = do_device_access(sip, scp, sg_off, lba, num, true, true);
 		/* If ZBC zone then bump its write pointer */
 		if (sdebug_dev_is_zoned(devip))
 			zbc_inc_wp(devip, lba, num);
@@ -3878,7 +4118,7 @@ static int resp_write_scat(struct scsi_cmnd *scp,
 	}
 	ret = 0;
 err_out_unlock:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 err_out:
 	kfree(lrdp);
 	return ret;
@@ -3897,14 +4137,16 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
 						scp->device->hostdata, true);
 	u8 *fs1p;
 	u8 *fsp;
+	bool meta_data_locked = false;
 
-	sdeb_write_lock(sip);
+	if (sdebug_dev_is_zoned(devip) || scsi_debug_lbp()) {
+		sdeb_meta_write_lock(sip);
+		meta_data_locked = true;
+	}
 
 	ret = check_device_access_params(scp, lba, num, true);
-	if (ret) {
-		sdeb_write_unlock(sip);
-		return ret;
-	}
+	if (ret)
+		goto out;
 
 	if (unmap && scsi_debug_lbp()) {
 		unmap_region(sip, lba, num);
@@ -3915,6 +4157,7 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
 	/* if ndob then zero 1 logical block, else fetch 1 logical block */
 	fsp = sip->storep;
 	fs1p = fsp + (block * lb_size);
+	sdeb_data_write_lock(sip);
 	if (ndob) {
 		memset(fs1p, 0, lb_size);
 		ret = 0;
@@ -3922,8 +4165,8 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
 		ret = fetch_to_dev_buffer(scp, fs1p, lb_size);
 
 	if (-1 == ret) {
-		sdeb_write_unlock(sip);
-		return DID_ERROR << 16;
+		ret = DID_ERROR << 16;
+		goto out;
 	} else if (sdebug_verbose && !ndob && (ret < lb_size))
 		sdev_printk(KERN_INFO, scp->device,
 			    "%s: %s: lb size=%u, IO sent=%d bytes\n",
@@ -3940,10 +4183,12 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
 	/* If ZBC zone then bump its write pointer */
 	if (sdebug_dev_is_zoned(devip))
 		zbc_inc_wp(devip, lba, num);
+	sdeb_data_write_unlock(sip);
+	ret = 0;
 out:
-	sdeb_write_unlock(sip);
-
-	return 0;
+	if (meta_data_locked)
+		sdeb_meta_write_unlock(sip);
+	return ret;
 }
 
 static int resp_write_same_10(struct scsi_cmnd *scp,
@@ -4086,25 +4331,30 @@ static int resp_comp_write(struct scsi_cmnd *scp,
 		return check_condition_result;
 	}
 
-	sdeb_write_lock(sip);
-
 	ret = do_dout_fetch(scp, dnum, arr);
 	if (ret == -1) {
 		retval = DID_ERROR << 16;
-		goto cleanup;
+		goto cleanup_free;
 	} else if (sdebug_verbose && (ret < (dnum * lb_size)))
 		sdev_printk(KERN_INFO, scp->device, "%s: compare_write: cdb "
 			    "indicated=%u, IO sent=%d bytes\n", my_name,
 			    dnum * lb_size, ret);
+
+	sdeb_data_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 	if (!comp_write_worker(sip, lba, num, arr, false)) {
 		mk_sense_buffer(scp, MISCOMPARE, MISCOMPARE_VERIFY_ASC, 0);
 		retval = check_condition_result;
-		goto cleanup;
+		goto cleanup_unlock;
 	}
+
+	/* Cover sip->map_storep (which map_region() sets) with the data lock */
 	if (scsi_debug_lbp())
 		map_region(sip, lba, num);
-cleanup:
-	sdeb_write_unlock(sip);
+cleanup_unlock:
+	sdeb_meta_write_unlock(sip);
+	sdeb_data_write_unlock(sip);
+cleanup_free:
 	kfree(arr);
 	return retval;
 }
@@ -4148,7 +4398,7 @@ static int resp_unmap(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 
 	desc = (void *)&buf[8];
 
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	for (i = 0 ; i < descriptors ; i++) {
 		unsigned long long lba = get_unaligned_be64(&desc[i].lba);
@@ -4164,7 +4414,7 @@ static int resp_unmap(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 	ret = 0;
 
 out:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	kfree(buf);
 
 	return ret;
@@ -4277,12 +4527,13 @@ static int resp_pre_fetch(struct scsi_cmnd *scp,
 		rest = block + nblks - sdebug_store_sectors;
 
 	/* Try to bring the PRE-FETCH range into CPU's cache */
-	sdeb_read_lock(sip);
+	sdeb_data_read_lock(sip);
 	prefetch_range(fsp + (sdebug_sector_size * block),
 		       (nblks - rest) * sdebug_sector_size);
 	if (rest)
 		prefetch_range(fsp, rest * sdebug_sector_size);
-	sdeb_read_unlock(sip);
+
+	sdeb_data_read_unlock(sip);
 fini:
 	if (cmd[1] & 0x2)
 		res = SDEG_RES_IMMED_MASK;
@@ -4441,7 +4692,7 @@ static int resp_verify(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		return check_condition_result;
 	}
 	/* Not changing store, so only need read access */
-	sdeb_read_lock(sip);
+	sdeb_data_read_lock(sip);
 
 	ret = do_dout_fetch(scp, a_num, arr);
 	if (ret == -1) {
@@ -4463,7 +4714,7 @@ static int resp_verify(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		goto cleanup;
 	}
 cleanup:
-	sdeb_read_unlock(sip);
+	sdeb_data_read_unlock(sip);
 	kfree(arr);
 	return ret;
 }
@@ -4509,7 +4760,7 @@ static int resp_report_zones(struct scsi_cmnd *scp,
 		return check_condition_result;
 	}
 
-	sdeb_read_lock(sip);
+	sdeb_meta_read_lock(sip);
 
 	desc = arr + 64;
 	for (lba = zs_lba; lba < sdebug_capacity;
@@ -4607,11 +4858,68 @@ static int resp_report_zones(struct scsi_cmnd *scp,
 	ret = fill_from_dev_buffer(scp, arr, min_t(u32, alloc_len, rep_len));
 
 fini:
-	sdeb_read_unlock(sip);
+	sdeb_meta_read_unlock(sip);
 	kfree(arr);
 	return ret;
 }
 
+static int resp_atomic_write(struct scsi_cmnd *scp,
+			     struct sdebug_dev_info *devip)
+{
+	struct sdeb_store_info *sip;
+	u8 *cmd = scp->cmnd;
+	u16 boundary, len;
+	u64 lba;
+	int ret;
+
+	if (!scsi_debug_atomic_write()) {
+		mk_sense_invalid_opcode(scp);
+		return check_condition_result;
+	}
+
+	sip = devip2sip(devip, true);
+
+	lba = get_unaligned_be64(cmd + 2);
+	boundary = get_unaligned_be16(cmd + 10);
+	len = get_unaligned_be16(cmd + 12);
+
+	if (sdebug_atomic_alignment_blks && lba % sdebug_atomic_alignment_blks) {
+		/* Does not meet alignment requirement */
+		mk_sense_buffer(scp, ILLEGAL_REQUEST, INVALID_FIELD_IN_CDB, 0);
+		return check_condition_result;
+	}
+
+	if (sdebug_atomic_granularity_blks && len % sdebug_atomic_granularity_blks) {
+		/* Does not meet alignment requirement */
+		mk_sense_buffer(scp, ILLEGAL_REQUEST, INVALID_FIELD_IN_CDB, 0);
+		return check_condition_result;
+	}
+
+	if (boundary > 0) {
+		if (boundary > sdebug_atomic_max_boundary_blks) {
+			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1);
+			return check_condition_result;
+		}
+
+		if (len > sdebug_atomic_max_size_with_boundary_blks) {
+			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1);
+			return check_condition_result;
+		}
+	} else {
+		if (len > sdebug_atomic_max_size_blks) {
+			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1);
+			return check_condition_result;
+		}
+	}
+
+	ret = do_device_access(sip, scp, 0, lba, len, true, true);
+	if (unlikely(ret == -1))
+		return DID_ERROR << 16;
+	if (unlikely(ret != len * sdebug_sector_size))
+		return DID_ERROR << 16;
+	return 0;
+}
+
 /* Logic transplanted from tcmu-runner, file_zbc.c */
 static void zbc_open_all(struct sdebug_dev_info *devip)
 {
@@ -4638,8 +4946,7 @@ static int resp_open_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		mk_sense_invalid_opcode(scp);
 		return check_condition_result;
 	}
-
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	if (all) {
 		/* Check if all closed zones can be open */
@@ -4688,7 +4995,7 @@ static int resp_open_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 
 	zbc_open_zone(devip, zsp, true);
 fini:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	return res;
 }
 
@@ -4715,7 +5022,7 @@ static int resp_close_zone(struct scsi_cmnd *scp,
 		return check_condition_result;
 	}
 
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	if (all) {
 		zbc_close_all(devip);
@@ -4744,7 +5051,7 @@ static int resp_close_zone(struct scsi_cmnd *scp,
 
 	zbc_close_zone(devip, zsp);
 fini:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	return res;
 }
 
@@ -4787,7 +5094,7 @@ static int resp_finish_zone(struct scsi_cmnd *scp,
 		return check_condition_result;
 	}
 
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	if (all) {
 		zbc_finish_all(devip);
@@ -4816,7 +5123,7 @@ static int resp_finish_zone(struct scsi_cmnd *scp,
 
 	zbc_finish_zone(devip, zsp, true);
 fini:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	return res;
 }
 
@@ -4867,7 +5174,7 @@ static int resp_rwp_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		return check_condition_result;
 	}
 
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	if (all) {
 		zbc_rwp_all(devip);
@@ -4895,7 +5202,7 @@ static int resp_rwp_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 
 	zbc_rwp_zone(devip, zsp);
 fini:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	return res;
 }
 
@@ -4962,6 +5269,8 @@ static void sdebug_q_cmd_complete(struct sdebug_defer *sd_dp)
 		retiring = 1;
 
 	sqcp->a_cmnd = NULL;
+	scp->host_scribble = NULL;
+	sqcp->i = NULL;
 	if (unlikely(!test_and_clear_bit(qc_idx, sqp->in_use_bm))) {
 		spin_unlock_irqrestore(&sqp->qc_lock, iflags);
 		pr_err("Unexpected completion\n");
@@ -5717,6 +6026,7 @@ static int schedule_resp(struct scsi_cmnd *cmnd, struct sdebug_dev_info *devip,
 				if (kt <= d) {	/* elapsed duration >= kt */
 					spin_lock_irqsave(&sqp->qc_lock, iflags);
 					sqcp->a_cmnd = NULL;
+					cmnd->host_scribble = NULL;
 					atomic_dec(&devip->num_in_q);
 					clear_bit(k, sqp->in_use_bm);
 					spin_unlock_irqrestore(&sqp->qc_lock, iflags);
@@ -5837,6 +6147,7 @@ module_param_named(lbprz, sdebug_lbprz, int, S_IRUGO);
 module_param_named(lbpu, sdebug_lbpu, int, S_IRUGO);
 module_param_named(lbpws, sdebug_lbpws, int, S_IRUGO);
 module_param_named(lbpws10, sdebug_lbpws10, int, S_IRUGO);
+module_param_named(atomic_write, sdebug_atomic_write, int, S_IRUGO);
 module_param_named(lowest_aligned, sdebug_lowest_aligned, int, S_IRUGO);
 module_param_named(lun_format, sdebug_lun_am_i, int, S_IRUGO | S_IWUSR);
 module_param_named(max_luns, sdebug_max_luns, int, S_IRUGO | S_IWUSR);
@@ -5871,6 +6182,11 @@ module_param_named(unmap_alignment, sdebug_unmap_alignment, int, S_IRUGO);
 module_param_named(unmap_granularity, sdebug_unmap_granularity, int, S_IRUGO);
 module_param_named(unmap_max_blocks, sdebug_unmap_max_blocks, int, S_IRUGO);
 module_param_named(unmap_max_desc, sdebug_unmap_max_desc, int, S_IRUGO);
+module_param_named(atomic_max_size_blks, sdebug_atomic_max_size_blks, int, S_IRUGO);
+module_param_named(atomic_alignment_blks, sdebug_atomic_alignment_blks, int, S_IRUGO);
+module_param_named(atomic_granularity_blks, sdebug_atomic_granularity_blks, int, S_IRUGO);
+module_param_named(atomic_max_size_with_boundary_blks, sdebug_atomic_max_size_with_boundary_blks, int, S_IRUGO);
+module_param_named(atomic_max_boundary_blks, sdebug_atomic_max_boundary_blks, int, S_IRUGO);
 module_param_named(uuid_ctl, sdebug_uuid_ctl, int, S_IRUGO);
 module_param_named(virtual_gb, sdebug_virtual_gb, int, S_IRUGO | S_IWUSR);
 module_param_named(vpd_use_hostno, sdebug_vpd_use_hostno, int,
@@ -5913,6 +6229,7 @@ MODULE_PARM_DESC(lbprz,
 MODULE_PARM_DESC(lbpu, "enable LBP, support UNMAP command (def=0)");
 MODULE_PARM_DESC(lbpws, "enable LBP, support WRITE SAME(16) with UNMAP bit (def=0)");
 MODULE_PARM_DESC(lbpws10, "enable LBP, support WRITE SAME(10) with UNMAP bit (def=0)");
+MODULE_PARM_DESC(atomic_write, "enable ATOMIC WRITE support, support WRITE ATOMIC(16) (def=1)");
 MODULE_PARM_DESC(lowest_aligned, "lowest aligned lba (def=0)");
 MODULE_PARM_DESC(lun_format, "LUN format: 0->peripheral (def); 1 --> flat address method");
 MODULE_PARM_DESC(max_luns, "number of LUNs per target to simulate(def=1)");
@@ -5944,6 +6261,11 @@ MODULE_PARM_DESC(unmap_alignment, "lowest aligned thin provisioning lba (def=0)"
 MODULE_PARM_DESC(unmap_granularity, "thin provisioning granularity in blocks (def=1)");
 MODULE_PARM_DESC(unmap_max_blocks, "max # of blocks can be unmapped in one cmd (def=0xffffffff)");
 MODULE_PARM_DESC(unmap_max_desc, "max # of ranges that can be unmapped in one cmd (def=256)");
+MODULE_PARM_DESC(atomic_max_size_blks, "max # of blocks can be atomically written in one cmd (def=0xff)");
+MODULE_PARM_DESC(atomic_alignment_blks, "minimum alignment of atomic write in blocks (def=2)");
+MODULE_PARM_DESC(atomic_granularity_blks, "minimum granularity of atomic write in blocks (def=2)");
+MODULE_PARM_DESC(atomic_max_size_with_boundary_blks, "max # of blocks can be atomically written in one cmd with boundary set (def=0xff)");
+MODULE_PARM_DESC(atomic_max_boundary_blks, "max # of boundaries per atomic write (def=0)");
 MODULE_PARM_DESC(uuid_ctl,
 		 "1->use uuid for lu name, 0->don't, 2->all use same (def=0)");
 MODULE_PARM_DESC(virtual_gb, "virtual gigabyte (GiB) size (def=0 -> use dev_size_mb)");
@@ -7079,6 +7401,7 @@ static int __init scsi_debug_init(void)
 			goto free_q_arr;
 		}
 	}
+
 	xa_init_flags(per_store_ap, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
 	if (want_store) {
 		idx = sdebug_add_store();
@@ -7279,7 +7602,9 @@ static int sdebug_add_store(void)
 			map_region(sip, 0, 2);
 	}
 
-	rwlock_init(&sip->macc_lck);
+	rwlock_init(&sip->macc_data_lck);
+	rwlock_init(&sip->macc_meta_lck);
+	rwlock_init(&sip->macc_sector_lck);
 	return (int)n_idx;
 err:
 	sdebug_erase_store((int)n_idx, sip);
@@ -7573,6 +7898,8 @@ static int sdebug_blk_mq_poll(struct Scsi_Host *shost, unsigned int queue_num)
 			retiring = true;
 
 		sqcp->a_cmnd = NULL;
+		sqcp->i = NULL;
+		scp->host_scribble = NULL;
 		if (unlikely(!test_and_clear_bit(qc_idx, sqp->in_use_bm))) {
 			pr_err("Unexpected completion sqp %p queue_num=%d qc_idx=%u from %s\n",
 				sqp, queue_num, qc_idx, __func__);
-- 
2.31.1
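
As a usage illustration (not part of the patch; values hypothetical),
an atomic-write-capable scsi_debug device could be set up with the new
parameters like so:

	modprobe scsi_debug atomic_write=1 atomic_granularity_blks=2 \
		atomic_alignment_blks=2 atomic_max_size_blks=0xff

WRITE ATOMIC(16) commands sent to the resulting device are then checked
against these limits in resp_atomic_write().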


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH RFC 16/16] nvme: Support atomic writes
  2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
                   ` (14 preceding siblings ...)
  2023-05-03 18:38 ` [PATCH RFC 15/16] scsi: scsi_debug: Atomic write support John Garry
@ 2023-05-03 18:38 ` John Garry
  2023-05-03 18:49   ` Bart Van Assche
  15 siblings, 1 reply; 50+ messages in thread
From: John Garry @ 2023-05-03 18:38 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	Alan Adamson, John Garry

From: Alan Adamson <alan.adamson@oracle.com>

Support reading atomic write registers to fill in request_queue
properties.

Use the following method to calculate the limits:
atomic_write_max_bytes = flp2(NAWUPF ?: AWUPF)
atomic_write_unit_min = logical_block_size
atomic_write_unit_max = flp2(NAWUPF ?: AWUPF)
atomic_write_boundary = NABSPF
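
A sketch of that calculation, for illustration only (reading flp2() as
rounddown_pow_of_two(), with bs the logical block size in bytes and
NAWUPF/AWUPF 0's based block counts per the NVMe spec):

	u32 awun = id->nawupf ? le16_to_cpu(id->nawupf) + 1
			      : ns->ctrl->subsys->awupf + 1;
	u32 atomic_bs = rounddown_pow_of_two(awun * bs);	/* bytes */
	u32 boundary = le16_to_cpu(id->nabspf) ?
			(le16_to_cpu(id->nabspf) + 1) * bs : 0;	/* bytes */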

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/nvme/host/core.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d6a9bac91a4c..289561915ad3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1879,6 +1879,39 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	blk_queue_io_min(disk->queue, phys_bs);
 	blk_queue_io_opt(disk->queue, io_opt);
 
+	atomic_bs = rounddown_pow_of_two(atomic_bs);
+	if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf) {
+		if (id->nabo) {
+			dev_err(ns->ctrl->device, "Support atomic NABO=%x\n",
+				id->nabo);
+		} else {
+			u32 boundary = 0;
+
+			if (le16_to_cpu(id->nabspf))
+				boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
+
+			if (!(boundary & (boundary - 1))) {
+				blk_queue_atomic_write_max_bytes(disk->queue,
+							atomic_bs);
+				blk_queue_atomic_write_unit_min(disk->queue, 1);
+				blk_queue_atomic_write_unit_max(disk->queue,
+					atomic_bs / bs);
+				blk_queue_atomic_write_boundary(disk->queue,
+								boundary);
+			} else {
+				dev_err(ns->ctrl->device, "Unsupported atomic boundary=0x%x\n",
+					boundary);
+			}
+		}
+	} else if (ns->ctrl->subsys->awupf) {
+		blk_queue_atomic_write_max_bytes(disk->queue,
+				atomic_bs);
+		blk_queue_atomic_write_unit_min(disk->queue, 1);
+		blk_queue_atomic_write_unit_max(disk->queue,
+				atomic_bs / bs);
+		blk_queue_atomic_write_boundary(disk->queue, 0);
+	}
+
 	/*
 	 * Register a metadata profile for PI, or the plain non-integrity NVMe
 	 * metadata masquerading as Type 0 if supported, otherwise reject block
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 14/16] scsi: sd: Add WRITE_ATOMIC_16 support
  2023-05-03 18:38 ` [PATCH RFC 14/16] scsi: sd: Add WRITE_ATOMIC_16 support John Garry
@ 2023-05-03 18:48   ` Bart Van Assche
  2023-05-04  8:17     ` John Garry
  0 siblings, 1 reply; 50+ messages in thread
From: Bart Van Assche @ 2023-05-03 18:48 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, martin.petersen, djwong,
	viro, brauner, dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge

On 5/3/23 11:38, John Garry wrote:
> +static blk_status_t sd_setup_atomic_cmnd(struct scsi_cmnd *cmd,
> +					sector_t lba, unsigned int nr_blocks,
> +					unsigned char flags)
> +{
> +	cmd->cmd_len  = 16;
> +	cmd->cmnd[0]  = WRITE_ATOMIC_16;
> +	cmd->cmnd[1]  = flags;
> +	put_unaligned_be64(lba, &cmd->cmnd[2]);
> +	cmd->cmnd[10] = 0;
> +	cmd->cmnd[11] = 0;
> +	put_unaligned_be16(nr_blocks, &cmd->cmnd[12]);
> +	cmd->cmnd[14] = 0;
> +	cmd->cmnd[15] = 0;
> +
> +	return BLK_STS_OK;
> +}

A single space in front of the assignment operator please.

> +
>   static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>   {
>   	struct request *rq = scsi_cmd_to_rq(cmd);
> @@ -1149,6 +1166,7 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>   	unsigned int nr_blocks = sectors_to_logical(sdp, blk_rq_sectors(rq));
>   	unsigned int mask = logical_to_sectors(sdp, 1) - 1;
>   	bool write = rq_data_dir(rq) == WRITE;
> +	bool atomic_write = !!(rq->cmd_flags & REQ_ATOMIC) && write;

Isn't the !! superfluous in the above expression? I have not yet seen 
any other kernel code where a flag test is used in a boolean expression 
and where !! occurs in front of the flag test.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 16/16] nvme: Support atomic writes
  2023-05-03 18:38 ` [PATCH RFC 16/16] nvme: Support atomic writes John Garry
@ 2023-05-03 18:49   ` Bart Van Assche
  2023-05-04  8:19     ` John Garry
  0 siblings, 1 reply; 50+ messages in thread
From: Bart Van Assche @ 2023-05-03 18:49 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, martin.petersen, djwong,
	viro, brauner, dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	Alan Adamson

On 5/3/23 11:38, John Garry wrote:
> +			if (!(boundary & (boundary - 1))) {

Please use is_power_of_2() instead of open-coding it.

Thanks,

Bart.
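
A minimal sketch of that substitution, for illustration. Note that
is_power_of_2() (from <linux/log2.h>) returns false for 0, whereas the
open-coded test accepts 0, so a zero boundary has to remain an explicit
case:

	if (boundary == 0 || is_power_of_2(boundary)) {
		/* boundary acceptable: set the atomic write limits */
	} else {
		dev_err(ns->ctrl->device,
			"Unsupported atomic boundary=0x%x\n", boundary);
	}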

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 06/16] block: Limit atomic writes according to bio and queue limits
  2023-05-03 18:38 ` [PATCH RFC 06/16] block: Limit atomic writes according to bio and queue limits John Garry
@ 2023-05-03 18:53   ` Keith Busch
  2023-05-04  8:24     ` John Garry
  0 siblings, 1 reply; 50+ messages in thread
From: Keith Busch @ 2023-05-03 18:53 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge

On Wed, May 03, 2023 at 06:38:11PM +0000, John Garry wrote:
> +	unsigned int size = (atomic_write_max_segments - 1) *
> +				(PAGE_SIZE / SECTOR_SIZE);

Maybe use PAGE_SECTORS instead of recalculating it.
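
A sketch of that simplification (PAGE_SECTORS, defined in
<linux/blkdev.h>, is PAGE_SIZE / SECTOR_SIZE as a sector count):

	unsigned int size = (atomic_write_max_segments - 1) * PAGE_SECTORS;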

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-03 18:38 ` [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits John Garry
@ 2023-05-03 21:39   ` Dave Chinner
  2023-05-04 18:14     ` John Garry
  2023-05-09  0:19   ` Mike Snitzer
  1 sibling, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2023-05-03 21:39 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Himanshu Madhani

On Wed, May 03, 2023 at 06:38:06PM +0000, John Garry wrote:
> From: Himanshu Madhani <himanshu.madhani@oracle.com>
> 
> Add the following limits:
> - atomic_write_boundary
> - atomic_write_max_bytes
> - atomic_write_unit_max
> - atomic_write_unit_min
> 
> Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  Documentation/ABI/stable/sysfs-block | 42 +++++++++++++++++++++
>  block/blk-settings.c                 | 56 ++++++++++++++++++++++++++++
>  block/blk-sysfs.c                    | 33 ++++++++++++++++
>  include/linux/blkdev.h               | 23 ++++++++++++
>  4 files changed, 154 insertions(+)
> 
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index 282de3680367..f3ed9890e03b 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -21,6 +21,48 @@ Description:
>  		device is offset from the internal allocation unit's
>  		natural alignment.
>  
> +What:		/sys/block/<disk>/atomic_write_max_bytes
> +Date:		May 2023
> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> +Description:
> +		[RO] This parameter specifies the maximum atomic write
> +		size reported by the device. An atomic write operation
> +		must not exceed this number of bytes.
> +
> +
> +What:		/sys/block/<disk>/atomic_write_unit_min
> +Date:		May 2023
> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> +Description:
> +		[RO] This parameter specifies the smallest block which can
> +		be written atomically with an atomic write operation. All
> +		atomic write operations must begin at a
> +		atomic_write_unit_min boundary and must be multiples of
> +		atomic_write_unit_min. This value must be a power-of-two.

What units is this defined to use? Bytes?

> +
> +
> +What:		/sys/block/<disk>/atomic_write_unit_max
> +Date:		January 2023
> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> +Description:
> +		[RO] This parameter defines the largest block which can be
> +		written atomically with an atomic write operation. This
> +		value must be a multiple of atomic_write_unit_min and must
> +		be a power-of-two.

Same question. Also, how is this different to
atomic_write_max_bytes?

> +
> +
> +What:		/sys/block/<disk>/atomic_write_boundary
> +Date:		May 2023
> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> +Description:
> +		[RO] A device may need to internally split I/Os which
> +		straddle a given logical block address boundary. In that
> +		case a single atomic write operation will be processed as
> +		one or more sub-operations which each complete atomically.
> +		This parameter specifies the size in bytes of the atomic
> +		boundary if one is reported by the device. This value must
> +		be a power-of-two.

How are users/filesystems supposed to use this?

> +
>  
>  What:		/sys/block/<disk>/diskseq
>  Date:		February 2021
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 896b4654ab00..e21731715a12 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -59,6 +59,9 @@ void blk_set_default_limits(struct queue_limits *lim)
>  	lim->zoned = BLK_ZONED_NONE;
>  	lim->zone_write_granularity = 0;
>  	lim->dma_alignment = 511;
> +	lim->atomic_write_unit_min = lim->atomic_write_unit_max = 1;

A value of "1" isn't obviously a power of 2, nor does it tell me
what units these values use.

> +	lim->atomic_write_max_bytes = 512;
> +	lim->atomic_write_boundary = 0;

The behaviour when the value is zero is not defined by the syfs
description above.

>  }
>  
>  /**
> @@ -183,6 +186,59 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
>  }
>  EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>  
> +/**
> + * blk_queue_atomic_write_max_bytes - set max bytes supported by
> + * the device for atomic write operations.
> + * @q:  the request queue for the device
> + * @size: maximum bytes supported
> + */
> +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
> +				      unsigned int size)
> +{
> +	q->limits.atomic_write_max_bytes = size;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
> +
> +/**
> + * blk_queue_atomic_write_boundary - Device's logical block address space
> + * which an atomic write should not cross.

I have no idea what "logical block address space which an atomic
write should not cross" means, especially as the unit is in bytes
and not in sectors (which are the units LBAs are expressed in).

> + * @q:  the request queue for the device
> + * @size: size in bytes. Must be a power-of-two.
> + */
> +void blk_queue_atomic_write_boundary(struct request_queue *q,
> +				     unsigned int size)
> +{
> +	q->limits.atomic_write_boundary = size;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_boundary);
> +
> +/**
> + * blk_queue_atomic_write_unit_min - smallest unit that can be written
> + *				     atomically to the device.
> + * @q:  the request queue for the device
> + * @sectors: must be a power-of-two.
> + */
> +void blk_queue_atomic_write_unit_min(struct request_queue *q,
> +				     unsigned int sectors)
> +{
> +	q->limits.atomic_write_unit_min = sectors;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_min);

Oh, these are sectors?

What size sector? Are we talking about fixed size 512 byte basic
block units, or are we talking about physical device sector sizes
(e.g. 4kB, maybe larger in future?)

These really should be in bytes, as they are directly exposed to
userspace applications via statx and applications will have no idea
what the sector size actually is without having to query the block
device directly...

> +
> +/*
> + * blk_queue_atomic_write_unit_max - largest unit that can be written
> + * atomically to the device.
> + * @q: the reqeust queue for the device
> + * @sectors: must be a power-of-two.
> + */
> +void blk_queue_atomic_write_unit_max(struct request_queue *q,
> +				     unsigned int sectors)
> +{
> +	struct queue_limits *limits = &q->limits;
> +	limits->atomic_write_unit_max = sectors;
> +}
> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_max);
> +
>  /**
>   * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
>   * @q:  the request queue for the device
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index f1fce1c7fa44..1025beff2281 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -132,6 +132,30 @@ static ssize_t queue_max_discard_segments_show(struct request_queue *q,
>  	return queue_var_show(queue_max_discard_segments(q), page);
>  }
>  
> +static ssize_t queue_atomic_write_max_bytes_show(struct request_queue *q,
> +						char *page)
> +{
> +	return queue_var_show(q->limits.atomic_write_max_bytes, page);
> +}
> +
> +static ssize_t queue_atomic_write_boundary_show(struct request_queue *q,
> +						char *page)
> +{
> +	return queue_var_show(q->limits.atomic_write_boundary, page);
> +}
> +
> +static ssize_t queue_atomic_write_unit_min_show(struct request_queue *q,
> +						char *page)
> +{
> +	return queue_var_show(queue_atomic_write_unit_min(q), page);
> +}
> +
> +static ssize_t queue_atomic_write_unit_max_show(struct request_queue *q,
> +						char *page)
> +{
> +	return queue_var_show(queue_atomic_write_unit_max(q), page);
> +}
> +
>  static ssize_t queue_max_integrity_segments_show(struct request_queue *q, char *page)
>  {
>  	return queue_var_show(q->limits.max_integrity_segments, page);
> @@ -604,6 +628,11 @@ QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes");
>  QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes");
>  QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
>  
> +QUEUE_RO_ENTRY(queue_atomic_write_max_bytes, "atomic_write_max_bytes");
> +QUEUE_RO_ENTRY(queue_atomic_write_boundary, "atomic_write_boundary");
> +QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max");
> +QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min");
> +
>  QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
>  QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes");
>  QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");
> @@ -661,6 +690,10 @@ static struct attribute *queue_attrs[] = {
>  	&queue_discard_max_entry.attr,
>  	&queue_discard_max_hw_entry.attr,
>  	&queue_discard_zeroes_data_entry.attr,
> +	&queue_atomic_write_max_bytes_entry.attr,
> +	&queue_atomic_write_boundary_entry.attr,
> +	&queue_atomic_write_unit_min_entry.attr,
> +	&queue_atomic_write_unit_max_entry.attr,
>  	&queue_write_same_max_entry.attr,
>  	&queue_write_zeroes_max_entry.attr,
>  	&queue_zone_append_max_entry.attr,
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 941304f17492..6b6f2992338c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -304,6 +304,11 @@ struct queue_limits {
>  	unsigned int		discard_alignment;
>  	unsigned int		zone_write_granularity;
>  
> +	unsigned int		atomic_write_boundary;
> +	unsigned int		atomic_write_max_bytes;
> +	unsigned int		atomic_write_unit_min;
> +	unsigned int		atomic_write_unit_max;
> +
>  	unsigned short		max_segments;
>  	unsigned short		max_integrity_segments;
>  	unsigned short		max_discard_segments;
> @@ -929,6 +934,14 @@ void blk_queue_zone_write_granularity(struct request_queue *q,
>  				      unsigned int size);
>  extern void blk_queue_alignment_offset(struct request_queue *q,
>  				       unsigned int alignment);
> +extern void blk_queue_atomic_write_max_bytes(struct request_queue *q,
> +					     unsigned int size);
> +extern void blk_queue_atomic_write_unit_max(struct request_queue *q,
> +					    unsigned int sectors);
> +extern void blk_queue_atomic_write_unit_min(struct request_queue *q,
> +					    unsigned int sectors);
> +extern void blk_queue_atomic_write_boundary(struct request_queue *q,
> +					    unsigned int size);
>  void disk_update_readahead(struct gendisk *disk);
>  extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
>  extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
> @@ -1331,6 +1344,16 @@ static inline int queue_dma_alignment(const struct request_queue *q)
>  	return q ? q->limits.dma_alignment : 511;
>  }
>  
> +static inline unsigned int queue_atomic_write_unit_max(const struct request_queue *q)
> +{
> +	return q->limits.atomic_write_unit_max << SECTOR_SHIFT;
> +}
> +
> +static inline unsigned int queue_atomic_write_unit_min(const struct request_queue *q)
> +{
> +	return q->limits.atomic_write_unit_min << SECTOR_SHIFT;
> +}

Ah, what? This undocumented interface reports "unit limits" in
bytes, but it's not using the physical device sector size to convert
between sector units and bytes. This really needs some more
documentation and work to make it present all units consistently and
not result in confusion when devices have 4kB sector sizes and not
512 byte sectors...

Also, I think all the byte ranges should support full 64 bit values,
otherwise there will be silent overflows in converting 32 bit sector
counts to byte ranges. And, eventually, something will want to do
larger than 4GB atomic IOs.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 02/16] fs/bdev: Add atomic write support info to statx
  2023-05-03 18:38 ` [PATCH RFC 02/16] fs/bdev: Add atomic write support info to statx John Garry
@ 2023-05-03 21:58   ` Dave Chinner
  2023-05-04  8:45     ` John Garry
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2023-05-03 21:58 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Prasad Singamsetty

On Wed, May 03, 2023 at 06:38:07PM +0000, John Garry wrote:
> From: Prasad Singamsetty <prasad.singamsetty@oracle.com>
> 
> Extend the statx system call to return additional info for atomic write
> support if the specified file is a block device.
> 
> Add initial support for a block device.
> 
> Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  block/bdev.c              | 21 +++++++++++++++++++++
>  fs/stat.c                 | 10 ++++++++++
>  include/linux/blkdev.h    |  4 ++++
>  include/linux/stat.h      |  2 ++
>  include/uapi/linux/stat.h |  7 ++++++-
>  5 files changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 1795c7d4b99e..6a5fd5abaadc 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -1014,3 +1014,24 @@ void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
>  
>  	blkdev_put_no_open(bdev);
>  }
> +
> +/*
> + * Handle statx for block devices to get properties of WRITE ATOMIC
> + * feature support.
> + */
> +void bdev_statx_atomic(struct inode *inode, struct kstat *stat)
> +{
> +	struct block_device *bdev;
> +
> +	bdev = blkdev_get_no_open(inode->i_rdev);
> +	if (!bdev)
> +		return;
> +
> +	stat->atomic_write_unit_min = queue_atomic_write_unit_min(bdev->bd_queue);
> +	stat->atomic_write_unit_max = queue_atomic_write_unit_max(bdev->bd_queue);
> +	stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
> +	stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
> +	stat->result_mask |= STATX_WRITE_ATOMIC;
> +
> +	blkdev_put_no_open(bdev);
> +}
> diff --git a/fs/stat.c b/fs/stat.c
> index 7c238da22ef0..d20334a0e9ae 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -256,6 +256,14 @@ static int vfs_statx(int dfd, struct filename *filename, int flags,
>  			bdev_statx_dioalign(inode, stat);
>  	}
>  
> +	/* Handle STATX_WRITE_ATOMIC for block devices */
> +	if (request_mask & STATX_WRITE_ATOMIC) {
> +		struct inode *inode = d_backing_inode(path.dentry);
> +
> +		if (S_ISBLK(inode->i_mode))
> +			bdev_statx_atomic(inode, stat);
> +	}

This duplicates STATX_DIOALIGN bdev handling.

Really, the bdev attribute handling should be completely factored
out of vfs_statx() - blockdevs are not the common fastpath for stat
> operations. Something like:

	/*
	 * If this is a block device inode, override the filesystem
	 * attributes with the block device specific parameters
	 * that need to be obtained from the bdev backing inode.
	 */
	if (S_ISBLK(d_backing_inode(path.dentry)->i_mode))
		bdev_statx(path.dentry, stat);

And then all the overrides can go in the one function that doesn't
need to repeatedly check S_ISBLK()....


> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 6b6f2992338c..19d33b2897b2 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1527,6 +1527,7 @@ int sync_blockdev_range(struct block_device *bdev, loff_t lstart, loff_t lend);
>  int sync_blockdev_nowait(struct block_device *bdev);
>  void sync_bdevs(bool wait);
>  void bdev_statx_dioalign(struct inode *inode, struct kstat *stat);
> +void bdev_statx_atomic(struct inode *inode, struct kstat *stat);
>  void printk_all_partitions(void);
>  #else
>  static inline void invalidate_bdev(struct block_device *bdev)
> @@ -1546,6 +1547,9 @@ static inline void sync_bdevs(bool wait)
>  static inline void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
>  {
>  }
> +static inline void bdev_statx_atomic(struct inode *inode, struct kstat *stat)
> +{
> +}
>  static inline void printk_all_partitions(void)
>  {
>  }

That also gets rid of the need for all these fine grained exports
out of the bdev code for statx....

> diff --git a/include/linux/stat.h b/include/linux/stat.h
> index 52150570d37a..dfa69ecfaacf 100644
> --- a/include/linux/stat.h
> +++ b/include/linux/stat.h
> @@ -53,6 +53,8 @@ struct kstat {
>  	u32		dio_mem_align;
>  	u32		dio_offset_align;
>  	u64		change_cookie;
> +	u32		atomic_write_unit_max;
> +	u32		atomic_write_unit_min;
>  };
>  
>  /* These definitions are internal to the kernel for now. Mainly used by nfsd. */
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 7cab2c65d3d7..c99d7cac2aa6 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -127,7 +127,10 @@ struct statx {
>  	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
>  	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
>  	/* 0xa0 */
> -	__u64	__spare3[12];	/* Spare space for future expansion */
> +	__u32	stx_atomic_write_unit_max;
> +	__u32	stx_atomic_write_unit_min;
> +	/* 0xb0 */
> +	__u64	__spare3[11];	/* Spare space for future expansion */
>  	/* 0x100 */
>  };

No documentation on what units these are in. Is there a statx() man
page update for this addition?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 03/16] xfs: Support atomic write for statx
  2023-05-03 18:38 ` [PATCH RFC 03/16] xfs: Support atomic write for statx John Garry
@ 2023-05-03 22:17   ` Dave Chinner
  2023-05-05 22:10     ` Darrick J. Wong
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2023-05-03 22:17 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge

On Wed, May 03, 2023 at 06:38:08PM +0000, John Garry wrote:
> Support providing info on atomic write unit min and max.
> 
> Darrick Wong originally authored this change.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_iops.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 24718adb3c16..e542077704aa 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -614,6 +614,16 @@ xfs_vn_getattr(
>  			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
>  			stat->dio_offset_align = bdev_logical_block_size(bdev);
>  		}
> +		if (request_mask & STATX_WRITE_ATOMIC) {
> +			struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> +			struct block_device	*bdev = target->bt_bdev;
> +
> +			stat->atomic_write_unit_min = queue_atomic_write_unit_min(bdev->bd_queue);
> +			stat->atomic_write_unit_max = queue_atomic_write_unit_max(bdev->bd_queue);

I'm not sure this is right.

Given that we may have a 4kB physical sector device, XFS will not
allow IOs smaller than physical sector size. The initial values of
queue_atomic_write_unit_min/max() will be (1 << SECTOR_SHIFT) which
is 512 bytes. IOs done with 4kB sector size devices will fail in
this case.

Further, XFS has a software sector size - it can define the sector
size for the filesystem to be 4KB on a 512 byte sector device. And
in that case, the filesystem will reject 512 byte sized/aligned IOs
as they are smaller than the filesystem sector size (i.e. a config
that prevents sub-physical sector IO for 512 logical/4kB physical
devices).

There may other filesystem constraints - realtime devices have fixed
minimum allocation sizes which may be larger than atomic write
limits, which means that IO completion needs to split extents into
multiple unwritten/written extents, extent size hints might be in
use meaning we have different allocation alignment constraints to
atomic write constraints, stripe alignment of extent allocation may
throw out atomic write alignment, etc.

These are all solvable, but we need to make sure here that the
filesystem constraints are taken into account here, not just the
block device limits.

As such, it is probably better to query these limits at filesystem
mount time and add them to the xfs buftarg (same as we do for
logical and physical sector sizes) and then use the xfs buftarg
values rather than having to go all the way to the device queue
here. That way we can ensure at mount time that atomic write limits
don't conflict with logical/physical IO limits, and we can further
constrain atomic limits during mount without always having to
recalculate those limits from first principles on every stat()
call...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 12/16] xfs: Add support for fallocate2
  2023-05-03 18:38 ` [PATCH RFC 12/16] xfs: Add support for fallocate2 John Garry
@ 2023-05-03 23:26   ` Dave Chinner
  2023-05-05 22:23     ` Darrick J. Wong
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2023-05-03 23:26 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Allison Henderson, Catherine Hoang

On Wed, May 03, 2023 at 06:38:17PM +0000, John Garry wrote:
> From: Allison Henderson <allison.henderson@oracle.com>
> 
> Add support for fallocate2 ioctl, which is xfs' own version of fallocate.
> Struct xfs_fallocate2 is passed in the ioctl, and xfs_fallocate2.alignment
> allows the user to specify required extent alignment. This is key for
> atomic write support, as we expect extents to be aligned on
> atomic_write_unit_max boundaries.

This approach of adding filesystem specific ioctls for minor behavioural
modifiers to existing syscalls is not a sustainable development
model.

If we want fallocate() operations to apply filesystem atomic write
constraints to operations, then add a new modifier flag to
fallocate(), say FALLOC_FL_ATOMIC. The filesystem can then
look up it's atomic write alignment constraints and apply them to
the operation being performed appropriately.

> The alignment flag is not sticky, so further extent mutation will not
> obey this original alignment request.

IOWs, you want the specific allocation to behave exactly as if an
extent size hint of the given alignment had been set on that inode.
Which could be done with:

	ioctl(FS_IOC_FSGETXATTR, &fsx)
	old_extsize = fsx.fsx_extsize;
	fsx.fsx_extsize = atomic_align_size;
	ioctl(FS_IOC_FSSETXATTR, &fsx)
	fallocate(....)
	fsx.fsx_extsize = old_extsize;
	ioctl(FS_IOC_FSSETXATTR, &fsx)

Yeah, messy, but if an application is going to use atomic writes,
then setting an extent size hint of the atomic write granularity the
application will use at file create time makes a whole lot of sense.
This will largely guarantee that any allocation will be aligned to
atomic IO constraints even when non atomic IO operations are
performed on that inode. Hence when the application needs to do an
atomic IO, it's not going to fail because previous allocation was
not correctly aligned.

All that we'd then need to do for atomic IO is ensure that we fail
the allocation early if we can't allocate fully sized and aligned
extents rather than falling back to unaligned extents when there are
no large enough contiguous free spaces for aligned extents to be
allocated. i.e. when RWF_ATOMIC or FALLOC_FL_ATOMIC are set by the
application...
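
For illustration, from userspace the suggested interface would look
something like this (FALLOC_FL_ATOMIC is hypothetical, not an existing
uapi flag):

	/* preallocate 16MiB, extent-aligned for atomic writes */
	int ret = fallocate(fd, FALLOC_FL_ATOMIC, 0, 16 * 1024 * 1024);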

> In addition, extent lengths should
> always be a multiple of atomic_write_unit_max,

Yup, that's what extent size hint based allocation does - it rounds
both down and up to hint alignment...

....

> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 34de6e6898c4..52a6e2b61228 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3275,7 +3275,9 @@ xfs_bmap_compute_alignments(
>  	struct xfs_alloc_arg	*args)
>  {
>  	struct xfs_mount	*mp = args->mp;
> -	xfs_extlen_t		align = 0; /* minimum allocation alignment */
> +
> +	/* minimum allocation alignment */
> +	xfs_extlen_t		align = args->alignment;
>  	int			stripe_align = 0;


This doesn't do what you think it should. For one, it will get
overwritten by extent size hints that are set, hence the user will
not get the alignment they expected in that case.

Secondly, args->alignment is an internal alignment control for
stripe alignment used later in the allocator when doing file
extenstion allocations.  Overloading it to pass a user alignment
here means that initial data allocations will have alignments set
without actually having set up the allocator parameters for aligned
allocation correctly.

This will lead to unexpected allocation failure as the filesystem
fills as the reservations needed for allocation to succeed won't
match what is actually required for allocation to succeed. It will
also cause problematic behaviour for fallback allocation algorithms
that expect only to be called with args->alignment = 1...

>  	/* stripe alignment for allocation is determined by mount parameters */
> @@ -3652,6 +3654,7 @@ xfs_bmap_btalloc(
>  		.datatype	= ap->datatype,
>  		.alignment	= 1,
>  		.minalignslop	= 0,
> +		.alignment	= ap->align,
>  	};
>  	xfs_fileoff_t		orig_offset;
>  	xfs_extlen_t		orig_length;

> @@ -4279,12 +4282,14 @@ xfs_bmapi_write(
>  	uint32_t		flags,		/* XFS_BMAPI_... */
>  	xfs_extlen_t		total,		/* total blocks needed */
>  	struct xfs_bmbt_irec	*mval,		/* output: map values */
> -	int			*nmap)		/* i/o: mval size/count */
> +	int			*nmap,
> +	xfs_extlen_t		align)		/* i/o: mval size/count */


As per above - IMO this is not the right way to specify aligment for
atomic IO. A XFS_BMAPI_ATOMIC flag is probably the right thing to
add from the caller - this also communicates the specific allocation
failure behaviour required, too.

Then xfs_bmap_compute_alignments() can pull the alignment
from the relevant buftarg similar to how it already pulls preset
alignments for extent size hints and/or realtime devices. And then
the allocator can attempt exact aligned allocation for maxlen, then
if that fails an exact aligned allocation for minlen, and if both of
those fail then we return ENOSPC without attempting any unaligned
allocations...

This also gets rid of the need to pass another parameter to
xfs_bmapi_write(), and it's trivial to plumb into the XFS iomap and
fallocate code paths....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 11/16] fs: iomap: Atomic write support
  2023-05-03 18:38 ` [PATCH RFC 11/16] fs: iomap: Atomic " John Garry
@ 2023-05-04  5:00   ` Dave Chinner
  2023-05-05 21:19     ` Darrick J. Wong
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2023-05-04  5:00 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge

On Wed, May 03, 2023 at 06:38:16PM +0000, John Garry wrote:
> Add support to create bio's whose bi_sector and bi_size are aligned to and
> multiple of atomic_write_unit, respectively.
> 
> When we call iomap_dio_bio_iter() -> bio_iov_iter_get_pages() ->
> __bio_iov_iter_get_pages(), we trim the bio to a multiple of
> atomic_write_unit.
> 
> As such, we expect the iomi start and length to have same size and
> alignment requirements per iomap_dio_bio_iter() call.
> 
> In iomap_dio_bio_iter(), ensure that for a non-dsync iocb that the mapping
> is not dirty nor unmapped.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/iomap/direct-io.c | 72 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 70 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f771001574d0..37c3c926dfd8 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -36,6 +36,8 @@ struct iomap_dio {
>  	size_t			done_before;
>  	bool			wait_for_completion;
>  
> +	unsigned int atomic_write_unit;
> +
>  	union {
>  		/* used during submission and for synchronous completion: */
>  		struct {
> @@ -229,9 +231,21 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>  	return opflags;
>  }
>  
> +
> +/*
> + * Note: For atomic writes, each bio which we create when we iter should have
> + *	 bi_sector aligned to atomic_write_unit and also its bi_size should be
> + *	 a multiple of atomic_write_unit.
> + *	 The call to bio_iov_iter_get_pages() -> __bio_iov_iter_get_pages()
> + *	 should trim the length to a multiple of atomic_write_unit for us.
> + *	 This allows us to split each bio later in the block layer to fit
> + *	 request_queue limit.
> + */
>  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		struct iomap_dio *dio)
>  {
> +	bool atomic_write = (dio->iocb->ki_flags & IOCB_ATOMIC) &&
> +			    (dio->flags & IOMAP_DIO_WRITE);
>  	const struct iomap *iomap = &iter->iomap;
>  	struct inode *inode = iter->inode;
>  	unsigned int fs_block_size = i_blocksize(inode), pad;
> @@ -249,6 +263,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>  		return -EINVAL;
>  
> +
> +	if (atomic_write && !iocb_is_dsync(dio->iocb)) {
> +		if (iomap->flags & IOMAP_F_DIRTY)
> +			return -EIO;
> +		if (iomap->type != IOMAP_MAPPED)
> +			return -EIO;
> +	}

IDGI. If the iomap had space allocated for this dio iteration,
then IOMAP_F_DIRTY will be set and it is likely (guaranteed for XFS)
that the iomap type will be IOMAP_UNWRITTEN. Indeed, if we are doing
a write into preallocated space (i.e. from fallocate()) then this
will cause -EIO on all RWF_ATOMIC IO to that file unless RWF_DSYNC
is also used.

"For a power fail, for each individual application block, all or
none of the data to be written."

Ok, does this means RWF_ATOMIC still needs fdatasync() to guarantee
that the data makes it to stable storage? And the result is
undefined until fdatasync() is run, but the device will guarantee
that either all or none of the data will be on stable storage
prior to the next device cache flush completing?

i.e. does REQ_ATOMIC imply REQ_FUA, or does it require a separate
device cache flush to commit the atomic IO to stable storage?

What about ordering - do the devices guarantee strict ordering of
REQ_ATOMIC writes? i.e. if atomic write N is seen on disk, then all
the previous atomic writes up to N will also be seen on disk? If
not, how does the application and filesystem guarantee persistence
of completed atomic writes?

i.e. If we still need a post-IO device cache flush to guarantee
persistence and/or ordering of RWF_ATOMIC IOs, then the above code
makes no sense - we'll still need fdatasync() to provide persistence
checkpoints and that means we ensure metadata is also up to date
at those checkpoints.

I need someone to put down in writing exactly what the data
integrity, ordering and persistence semantics of REQ_ATOMIC are
before I can really comment any further. From my perspective as a
filesystem developer, this is the single most important set of
behaviours that need to be documented, as this determines how
everything else interacts with atomic writes....

>  	if (iomap->type == IOMAP_UNWRITTEN) {
>  		dio->flags |= IOMAP_DIO_UNWRITTEN;
>  		need_zeroout = true;
> @@ -318,6 +340,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  					  GFP_KERNEL);
>  		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>  		bio->bi_ioprio = dio->iocb->ki_ioprio;
> +		if (atomic_write) {
> +			bio->bi_opf |= REQ_ATOMIC;
> +			bio->atomic_write_unit = dio->atomic_write_unit;
> +		}
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>  
> @@ -492,6 +518,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		is_sync_kiocb(iocb) || (dio_flags & IOMAP_DIO_FORCE_WAIT);
>  	struct blk_plug plug;
>  	struct iomap_dio *dio;
> +	bool is_read = iov_iter_rw(iter) == READ;
> +	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
>  
>  	if (!iomi.len)
>  		return NULL;
> @@ -500,6 +528,20 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (!dio)
>  		return ERR_PTR(-ENOMEM);
>  
> +	if (atomic_write) {
> +		/*
> +		 * Note: This lookup is not proper for a multi-device scenario,
> +		 *	 however for current iomap users, the bdev per iter
> +		 *	 will be fixed, so "works" for now.
> +		 */
> +		struct super_block *i_sb = inode->i_sb;
> +		struct block_device *bdev = i_sb->s_bdev;
> +
> +		dio->atomic_write_unit =
> +			bdev_find_max_atomic_write_alignment(bdev,
> +					iomi.pos, iomi.len);
> +	}

This will break atomic IO to XFS realtime devices. The device we are
doing IO to is iomap->bdev, we should never be using sb->s_bdev in
the iomap code.  Of course, at this point in __iomap_dio_rw() we
don't have an iomap so this "alignment constraint" can't be done
correctly at this point in the IO path.

However, even ignoring the bdev source, I think this is completely
wrong. Passing a *file* offset to the underlying block device so the
block device can return a device alignment constraint for IO is not
valid. We don't know how that file offset/length is going to be
mapped to the underlying block device until we ask the filesystem
for an iomap covering the file range, so we can't possibly know what
the device IO alignment of the user request will be until we have an
iomap for it.

At which point, the "which block device should we ask for alignment
constraints" question is moot, because we now have an iomap and can
use iomap->bdev....

> @@ -592,6 +634,32 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	blk_start_plug(&plug);
>  	while ((ret = iomap_iter(&iomi, ops)) > 0) {
> +		if (atomic_write) {
> +			const struct iomap *_iomap = &iomi.iomap;
> +			loff_t iomi_length = iomap_length(&iomi);
> +
> +			/*
> +			 * Ensure length and start address is a multiple of
> +			 * atomic_write_unit - this is critical. If the length
> +			 * is not a multiple of atomic_write_unit, then we
> +			 * cannot create a set of bio's in iomap_dio_bio_iter()
> +			 * who are each a length which is a multiple of
> +			 * atomic_write_unit.
> +			 *
> +			 * Note: It may be more appropriate to have this check
> +			 *	 in iomap_dio_bio_iter()
> +			 */
> +			if ((iomap_sector(_iomap, iomi.pos) << SECTOR_SHIFT) %
> +			    dio->atomic_write_unit) {
> +				ret = -EIO;
> +				break;
> +			}
> +
> +			if (iomi_length % dio->atomic_write_unit) {
> +				ret = -EIO;
> +				break;
> +			}

This looks wrong - the length of the mapped extent could be shorter
than the max atomic write size returned by
bdev_find_max_atomic_write_alignment() but the iomap could still be aligned
to the minimum atomic write unit supported. At this point, we reject
the IO with -EIO, even though it could have been done as an atomic
write, just a shorter one than the user requested.

That said, I don't think we can call a user IO that is being
sliced and diced into multiple individual IOs "atomic". "Atomic"
implies all-or-none behaviour - slicing up a large DIO into smaller
individual bios means the bios can be submitted and completed out of
order. If we then get a power failure, the application's "atomic"
IO can appear on disk as only being partially complete - it violates
the "all or none" semantics of "atomic IO".

Hence I think that we should be rejecting RWF_ATOMIC IOs that are
larger than the maximum atomic write unit or cannot be dispatched in
a single IO, e.g. when the filesystem has allocated multiple minimum-aligned
extents and so a max len atomic write IO over that range must be
broken up into multiple smaller IOs.

We should be doing max atomic write size rejection high up in the IO
path (e.g. filesystem ->write_iter() method) before we get anywhere
near the DIO path, and we should be rejecting atomic write IOs in
the DIO path during the ->iomap_begin() mapping callback if we can't
map the entire atomic IO to a single aligned filesystem extent.
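
For illustration, such an early check in the filesystem write path
might look like this (the xfs_atomic_write_max() helper is hypothetical):

	/* in ->write_iter(), before entering the DIO path */
	if ((iocb->ki_flags & IOCB_ATOMIC) &&
	    iov_iter_count(from) > xfs_atomic_write_max(ip))
		return -EINVAL;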

i.e. the alignment checks and constraints need to be applied by the
filesystem mapping code, not the layer that packs the pages into the
bio as directed by the filesystem mapping....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 14/16] scsi: sd: Add WRITE_ATOMIC_16 support
  2023-05-03 18:48   ` Bart Van Assche
@ 2023-05-04  8:17     ` John Garry
  0 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-04  8:17 UTC (permalink / raw)
  To: Bart Van Assche, axboe, kbusch, hch, sagi, martin.petersen,
	djwong, viro, brauner, dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge

On 03/05/2023 19:48, Bart Van Assche wrote:
> On 5/3/23 11:38, John Garry wrote:
>> +static blk_status_t sd_setup_atomic_cmnd(struct scsi_cmnd *cmd,
>> +                    sector_t lba, unsigned int nr_blocks,
>> +                    unsigned char flags)
>> +{
>> +    cmd->cmd_len  = 16;
>> +    cmd->cmnd[0]  = WRITE_ATOMIC_16;
>> +    cmd->cmnd[1]  = flags;
>> +    put_unaligned_be64(lba, &cmd->cmnd[2]);
>> +    cmd->cmnd[10] = 0;
>> +    cmd->cmnd[11] = 0;
>> +    put_unaligned_be16(nr_blocks, &cmd->cmnd[12]);
>> +    cmd->cmnd[14] = 0;
>> +    cmd->cmnd[15] = 0;
>> +
>> +    return BLK_STS_OK;
>> +}
> 
> A single space in front of the assignment operator please.

ok

> 
>> +
>>   static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>>   {
>>       struct request *rq = scsi_cmd_to_rq(cmd);
>> @@ -1149,6 +1166,7 @@ static blk_status_t 
>> sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>>       unsigned int nr_blocks = sectors_to_logical(sdp, 
>> blk_rq_sectors(rq));
>>       unsigned int mask = logical_to_sectors(sdp, 1) - 1;
>>       bool write = rq_data_dir(rq) == WRITE;
>> +    bool atomic_write = !!(rq->cmd_flags & REQ_ATOMIC) && write;
> 
> Isn't the !! superfluous in the above expression? I have not yet seen 
> any other kernel code where a flag test is used in a boolean expression 
> and where !! occurs in front of the flag test.

So you mean that with && the (rq->cmd_flags & REQ_ATOMIC) test is 
implicitly converted to a bool anyway, making the !! redundant. Fine, I 
can change that.
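
i.e. a sketch of the simplified form:

	bool atomic_write = (rq->cmd_flags & REQ_ATOMIC) && write;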

Thanks,
John


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 16/16] nvme: Support atomic writes
  2023-05-03 18:49   ` Bart Van Assche
@ 2023-05-04  8:19     ` John Garry
  0 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-04  8:19 UTC (permalink / raw)
  To: Bart Van Assche, axboe, kbusch, hch, sagi, martin.petersen,
	djwong, viro, brauner, dchinner, jejb
  Cc: linux-block, linux-kernel, linux-nvme, linux-scsi, linux-xfs,
	linux-fsdevel, linux-security-module, paul, jmorris, serge,
	Alan Adamson

On 03/05/2023 19:49, Bart Van Assche wrote:
> On 5/3/23 11:38, John Garry wrote:
>> +            if (!(boundary & (boundary - 1))) {
> 
> Please use is_power_of_2() instead of open-coding it.

Sure, that can be changed, thanks

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 06/16] block: Limit atomic writes according to bio and queue limits
  2023-05-03 18:53   ` Keith Busch
@ 2023-05-04  8:24     ` John Garry
  0 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-04  8:24 UTC (permalink / raw)
  To: Keith Busch
  Cc: axboe, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge

On 03/05/2023 19:53, Keith Busch wrote:
> On Wed, May 03, 2023 at 06:38:11PM +0000, John Garry wrote:
>> +	unsigned int size = (atomic_write_max_segments - 1) *
>> +				(PAGE_SIZE / SECTOR_SIZE);
> Maybe use PAGE_SECTORS instead of recalculating it.

ok, that simplifies it a bit, but I still have a doubt about whether the 
calculation I use for the guaranteed amount of data which can fit in a 
bio, without ever requiring a split against the queue limits, is 
correct...

Thanks,
John

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 02/16] fs/bdev: Add atomic write support info to statx
  2023-05-03 21:58   ` Dave Chinner
@ 2023-05-04  8:45     ` John Garry
  2023-05-04 22:40       ` Dave Chinner
  0 siblings, 1 reply; 50+ messages in thread
From: John Garry @ 2023-05-04  8:45 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Prasad Singamsetty

On 03/05/2023 22:58, Dave Chinner wrote:

Hi Dave,

>> +	/* Handle STATX_WRITE_ATOMIC for block devices */
>> +	if (request_mask & STATX_WRITE_ATOMIC) {
>> +		struct inode *inode = d_backing_inode(path.dentry);
>> +
>> +		if (S_ISBLK(inode->i_mode))
>> +			bdev_statx_atomic(inode, stat);
>> +	}
> This duplicates STATX_DIOALIGN bdev handling.
> 
> Really, the bdev attribute handling should be completely factored
> out of vfs_statx() - blockdevs are not the common fastpath for stat
> operations. Somthing like:
> 
> 	/*
> 	 * If this is a block device inode, override the filesystem
> 	 * attributes with the block device specific parameters
> 	 * that need to be obtained from the bdev backing inode.
> 	 */
> 	if (S_ISBLK(d_backing_inode(path.dentry)->i_mode))
> 		bdev_statx(path.dentry, stat);
> 
> And then all the overrides can go in the one function that doesn't
> need to repeatedly check S_ISBLK()....

ok, that looks like a reasonable idea. We'll look to make that change.

> 
> 
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index 6b6f2992338c..19d33b2897b2 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -1527,6 +1527,7 @@ int sync_blockdev_range(struct block_device *bdev, loff_t lstart, loff_t lend);
>>   int sync_blockdev_nowait(struct block_device *bdev);
>>   void sync_bdevs(bool wait);
>>   void bdev_statx_dioalign(struct inode *inode, struct kstat *stat);
>> +void bdev_statx_atomic(struct inode *inode, struct kstat *stat);
>>   void printk_all_partitions(void);
>>   #else
>>   static inline void invalidate_bdev(struct block_device *bdev)
>> @@ -1546,6 +1547,9 @@ static inline void sync_bdevs(bool wait)
>>   static inline void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
>>   {
>>   }
>> +static inline void bdev_statx_atomic(struct inode *inode, struct kstat *stat)
>> +{
>> +}
>>   static inline void printk_all_partitions(void)
>>   {
>>   }
> That also gets rid of the need for all these fine grained exports
> out of the bdev code for statx....

Sure

> 
>> diff --git a/include/linux/stat.h b/include/linux/stat.h
>> index 52150570d37a..dfa69ecfaacf 100644
>> --- a/include/linux/stat.h
>> +++ b/include/linux/stat.h
>> @@ -53,6 +53,8 @@ struct kstat {
>>   	u32		dio_mem_align;
>>   	u32		dio_offset_align;
>>   	u64		change_cookie;
>> +	u32		atomic_write_unit_max;
>> +	u32		atomic_write_unit_min;
>>   };
>>   
>>   /* These definitions are internal to the kernel for now. Mainly used by nfsd. */
>> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
>> index 7cab2c65d3d7..c99d7cac2aa6 100644
>> --- a/include/uapi/linux/stat.h
>> +++ b/include/uapi/linux/stat.h
>> @@ -127,7 +127,10 @@ struct statx {
>>   	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
>>   	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
>>   	/* 0xa0 */
>> -	__u64	__spare3[12];	/* Spare space for future expansion */
>> +	__u32	stx_atomic_write_unit_max;
>> +	__u32	stx_atomic_write_unit_min;
>> +	/* 0xb0 */
>> +	__u64	__spare3[11];	/* Spare space for future expansion */
>>   	/* 0x100 */
>>   };
> No documentation on what units these are in.

It's in bytes; we're really just continuing the pattern of what we do 
for dio. I think that it would be reasonable to limit these to u32.

> Is there a statx() man
> page update for this addition?

No, not yet. Is it normally expected to provide a proposed man page 
update in parallel? Or somewhat later, when the kernel API change has 
some appreciable level of agreement?

Thanks,
John



* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-03 21:39   ` Dave Chinner
@ 2023-05-04 18:14     ` John Garry
  2023-05-04 22:26       ` Dave Chinner
  0 siblings, 1 reply; 50+ messages in thread
From: John Garry @ 2023-05-04 18:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Himanshu Madhani

Hi Dave,

>> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
>> index 282de3680367..f3ed9890e03b 100644
>> --- a/Documentation/ABI/stable/sysfs-block
>> +++ b/Documentation/ABI/stable/sysfs-block
>> @@ -21,6 +21,48 @@ Description:
>>   		device is offset from the internal allocation unit's
>>   		natural alignment.
>>   
>> +What:		/sys/block/<disk>/atomic_write_max_bytes
>> +Date:		May 2023
>> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
>> +Description:
>> +		[RO] This parameter specifies the maximum atomic write
>> +		size reported by the device. An atomic write operation
>> +		must not exceed this number of bytes.
>> +
>> +
>> +What:		/sys/block/<disk>/atomic_write_unit_min
>> +Date:		May 2023
>> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
>> +Description:
>> +		[RO] This parameter specifies the smallest block which can
>> +		be written atomically with an atomic write operation. All
>> +		atomic write operations must begin at an
>> +		atomic_write_unit_min boundary and must be multiples of
>> +		atomic_write_unit_min. This value must be a power-of-two.
> 
> What units is this defined to use? Bytes?

Bytes

> 
>> +
>> +
>> +What:		/sys/block/<disk>/atomic_write_unit_max
>> +Date:		January 2023
>> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
>> +Description:
>> +		[RO] This parameter defines the largest block which can be
>> +		written atomically with an atomic write operation. This
>> +		value must be a multiple of atomic_write_unit_min and must
>> +		be a power-of-two.
> 
> Same question. Also, how is this different to
> atomic_write_max_bytes?

Again, this is bytes. We can add "bytes" to the name of these other 
files if people think it's better. Unfortunately request_queue sysfs 
file naming isn't consistent here to begin with.

atomic_write_unit_max is the largest application block size which we 
can support, while atomic_write_max_bytes is the max size of an atomic 
operation which the HW supports.

From your review of the iomap patch, I assume that you now realise that 
we are proposing a write which may include multiple application data 
blocks (each limited in size to atomic_write_unit_max), and that the 
limit on the total size of that write is atomic_write_max_bytes.

User applications should only pay attention to what we return from 
statx, that being atomic_write_unit_min and atomic_write_unit_max.

atomic_write_max_bytes and atomic_write_boundary are only relevant to 
the block layer.
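
For example, a minimal userspace query of those two statx fields might
look like the following, assuming this series' uapi header updates are
installed (the STATX_WRITE_ATOMIC mask value below is a placeholder,
not a released constant):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

#ifndef STATX_WRITE_ATOMIC
#define STATX_WRITE_ATOMIC 0x00004000U	/* placeholder value */
#endif

int main(int argc, char **argv)
{
	struct statx stx;

	if (argc < 2 ||
	    statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx)) {
		perror("statx");
		return 1;
	}
	/* the only two limits a userspace application needs */
	printf("atomic_write_unit_min=%u atomic_write_unit_max=%u\n",
	       stx.stx_atomic_write_unit_min,
	       stx.stx_atomic_write_unit_max);
	return 0;
}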

> 
>> +
>> +
>> +What:		/sys/block/<disk>/atomic_write_boundary
>> +Date:		May 2023
>> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
>> +Description:
>> +		[RO] A device may need to internally split I/Os which
>> +		straddle a given logical block address boundary. In that
>> +		case a single atomic write operation will be processed as
>> +		one or more sub-operations which each complete atomically.
>> +		This parameter specifies the size in bytes of the atomic
>> +		boundary if one is reported by the device. This value must
>> +		be a power-of-two.
> 
> How are users/filesystems supposed to use this?

As above, this is not relevant to the user.

> 
>> +
>>   
>>   What:		/sys/block/<disk>/diskseq
>>   Date:		February 2021
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 896b4654ab00..e21731715a12 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -59,6 +59,9 @@ void blk_set_default_limits(struct queue_limits *lim)
>>   	lim->zoned = BLK_ZONED_NONE;
>>   	lim->zone_write_granularity = 0;
>>   	lim->dma_alignment = 511;
>> +	lim->atomic_write_unit_min = lim->atomic_write_unit_max = 1;
> 
> A value of "1" isn't obviously a power of 2, nor does it tell me
> what units these values use.

I think that we should store these in bytes.

> 
>> +	lim->atomic_write_max_bytes = 512;
>> +	lim->atomic_write_boundary = 0;
> 
> The behaviour when the value is zero is not defined by the syfs
> description above.

I'll add it. A value of zero means no atomic boundary.

> 
>>   }
>>   
>>   /**
>> @@ -183,6 +186,59 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
>>   }
>>   EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>>   
>> +/**
>> + * blk_queue_atomic_write_max_bytes - set max bytes supported by
>> + * the device for atomic write operations.
>> + * @q:  the request queue for the device
>> + * @size: maximum bytes supported
>> + */
>> +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
>> +				      unsigned int size)
>> +{
>> +	q->limits.atomic_write_max_bytes = size;
>> +}
>> +EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
>> +
>> +/**
>> + * blk_queue_atomic_write_boundary - Device's logical block address space
>> + * which an atomic write should not cross.
> 
> I have no idea what "logical block address space which an atomic
> write should not cross" means, especially as the unit is in bytes
> and not in sectors (which are the units LBAs are expressed in).

It means that an atomic operation which straddles the atomic boundary is 
not guaranteed to be atomic by the device, so we should (must) not cross 
it to maintain atomic behaviour for an application block. That's one 
reason that we have all these size and alignment rules.
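
As a sketch of that rule (helper name illustrative, not from the patch
set): with a power-of-two boundary in bytes, a write straddles the
boundary exactly when its first and last bytes fall in different
boundary-sized chunks:

#include <stdbool.h>
#include <stdint.h>

/* boundary == 0 means the device reports no atomic boundary */
static bool crosses_atomic_boundary(uint64_t start, uint64_t len,
				    uint64_t boundary)
{
	if (!boundary || !len)
		return false;
	/* power-of-two boundary: mask the low bits to get the
	 * boundary-chunk index of the first and last byte */
	return (start & ~(boundary - 1)) !=
	       ((start + len - 1) & ~(boundary - 1));
}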

> 
>> + * @q:  the request queue for the device
>> + * @size: size in bytes. Must be a power-of-two.
>> + */
>> +void blk_queue_atomic_write_boundary(struct request_queue *q,
>> +				     unsigned int size)
>> +{
>> +	q->limits.atomic_write_boundary = size;
>> +}
>> +EXPORT_SYMBOL(blk_queue_atomic_write_boundary);
>> +
>> +/**
>> + * blk_queue_atomic_write_unit_min - smallest unit that can be written
>> + *				     atomically to the device.
>> + * @q:  the request queue for the device
>> + * @sectors: must be a power-of-two.
>> + */
>> +void blk_queue_atomic_write_unit_min(struct request_queue *q,
>> +				     unsigned int sectors)
>> +{
>> +	q->limits.atomic_write_unit_min = sectors;
>> +}
>> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_min);
> 
> Oh, these are sectors?

Again, we'll change to bytes.

> 
> What size sector? Are we talking about fixed size 512 byte basic
> block units,

Normally we would be referring to a fixed-size 512 byte basic
block unit

> or are we talking about physical device sector sizes
> (e.g. 4kB, maybe larger in future?)
> 
> These really should be in bytes, as they are directly exposed to
> userspace applications via statx and applications will have no idea
> what the sector size actually is without having to query the block
> device directly...

ok

> 
>> +
>> +/*
>> + * blk_queue_atomic_write_unit_max - largest unit that can be written
>> + * atomically to the device.
>> + * @q: the request queue for the device
>> + * @sectors: must be a power-of-two.
>> + */
>> +void blk_queue_atomic_write_unit_max(struct request_queue *q,
>> +				     unsigned int sectors)
>> +{
>> +	struct queue_limits *limits = &q->limits;
>> +	limits->atomic_write_unit_max = sectors;
>> +}
>> +EXPORT_SYMBOL(blk_queue_atomic_write_unit_max);
>> +
>>   /**
>>    * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
>>    * @q:  the request queue for the device
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index f1fce1c7fa44..1025beff2281 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -132,6 +132,30 @@ static ssize_t queue_max_discard_segments_show(struct request_queue *q,
>>   	return queue_var_show(queue_max_discard_segments(q), page);
>>   }
>>   

...

>>   
>> +static inline unsigned int queue_atomic_write_unit_max(const struct request_queue *q)
>> +{
>> +	return q->limits.atomic_write_unit_max << SECTOR_SHIFT;
>> +}
>> +
>> +static inline unsigned int queue_atomic_write_unit_min(const struct request_queue *q)
>> +{
>> +	return q->limits.atomic_write_unit_min << SECTOR_SHIFT;
>> +}
> 
> Ah, what? This undocumented interface reports "unit limits" in
> bytes, but it's not using the physical device sector size to convert
> between sector units and bytes. This really needs some more
> documentation and work to make it present all units consistently and
> not result in confusion when devices have 4kB sector sizes and not
> 512 byte sectors...

ok, we'll look to fix this up to give a coherent and clear interface.

> 
> Also, I think all the byte ranges should support full 64 bit values,
> otherwise there will be silent overflows in converting 32 bit sector
> counts to byte ranges. And, eventually, something will want to do
> larger than 4GB atomic IOs
> 

ok, we can do that, but it would also then make the statx fields 64b. 
I'm fine with that if it is wise to do so - I don't want to wastefully 
use up an extra 2 x 32b in struct statx.

Thanks,
John



* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-04 18:14     ` John Garry
@ 2023-05-04 22:26       ` Dave Chinner
  2023-05-05  7:54         ` John Garry
  2023-05-05 22:47         ` Eric Biggers
  0 siblings, 2 replies; 50+ messages in thread
From: Dave Chinner @ 2023-05-04 22:26 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Himanshu Madhani

On Thu, May 04, 2023 at 07:14:29PM +0100, John Garry wrote:
> Hi Dave,
> 
> > > diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> > > index 282de3680367..f3ed9890e03b 100644
> > > --- a/Documentation/ABI/stable/sysfs-block
> > > +++ b/Documentation/ABI/stable/sysfs-block
> > > @@ -21,6 +21,48 @@ Description:
> > >   		device is offset from the internal allocation unit's
> > >   		natural alignment.
> > > +What:		/sys/block/<disk>/atomic_write_max_bytes
> > > +Date:		May 2023
> > > +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> > > +Description:
> > > +		[RO] This parameter specifies the maximum atomic write
> > > +		size reported by the device. An atomic write operation
> > > +		must not exceed this number of bytes.
> > > +
> > > +
> > > +What:		/sys/block/<disk>/atomic_write_unit_min
> > > +Date:		May 2023
> > > +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> > > +Description:
> > > +		[RO] This parameter specifies the smallest block which can
> > > +		be written atomically with an atomic write operation. All
> > > +		atomic write operations must begin at an
> > > +		atomic_write_unit_min boundary and must be multiples of
> > > +		atomic_write_unit_min. This value must be a power-of-two.
> > 
> > What units is this defined to use? Bytes?
> 
> Bytes
> 
> > 
> > > +
> > > +
> > > +What:		/sys/block/<disk>/atomic_write_unit_max
> > > +Date:		January 2023
> > > +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> > > +Description:
> > > +		[RO] This parameter defines the largest block which can be
> > > +		written atomically with an atomic write operation. This
> > > +		value must be a multiple of atomic_write_unit_min and must
> > > +		be a power-of-two.
> > 
> > Same question. Also, how is this different to
> > atomic_write_max_bytes?
> 
> Again, this is bytes. We can add "bytes" to the name of these other files if
> people think it's better. Unfortunately request_queue sysfs file naming
> isn't consistent here to begin with.
> 
> atomic_write_unit_max is largest application block size which we can
> support, while atomic_write_max_bytes is the max size of an atomic operation
> which the HW supports.

Why are these different? If the hardware supports 128kB atomic
writes, why limit applications to something smaller?

> From your review of the iomap patch, I assume that you now realise that we
> are proposing a write which may include multiple application data blocks
> (each limited in size to atomic_write_unit_max), and the limit in total size
> of that write is atomic_write_max_bytes.

I still don't get it - you haven't explained why/what an application
atomic block write might be, nor why the block device should be
determining the size of application data blocks, etc.  If the block
device can do 128kB atomic writes, why wouldn't the device allow the
application to do 128kB atomic writes if they've aligned the atomic
write correctly?

What happens when we get hardware that can do atomic writes at any
alignment, of any size up to atomic_write_max_bytes? Because this
interface defines atomic writes as "must be a multiple of 2 of
atomic_write_unit_min" then hardware that can do atomic writes of
any size can not be effectively utilised by this interface....

> user applications should only pay attention to what we return from statx,
> that being atomic_write_unit_min and atomic_write_unit_max.
> 
> atomic_write_max_bytes and atomic_write_boundary is only relevant to the
> block layer.

If applications can issue multi-atomic_write_unit_max-block
writes as a single, non-atomic, multi-bio RWF_ATOMIC pwritev2() IO
and such IO is constrained to atomic_write_max_bytes, then
atomic_write_max_bytes is most definitely relevant to user
applications.


> > > +What:		/sys/block/<disk>/atomic_write_boundary
> > > +Date:		May 2023
> > > +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> > > +Description:
> > > +		[RO] A device may need to internally split I/Os which
> > > +		straddle a given logical block address boundary. In that
> > > +		case a single atomic write operation will be processed as
> > > +		one or more sub-operations which each complete atomically.
> > > +		This parameter specifies the size in bytes of the atomic
> > > +		boundary if one is reported by the device. This value must
> > > +		be a power-of-two.
> > 
> > How are users/filesystems supposed to use this?
> 
> As above, this is not relevant to the user.

Applications will greatly care if their atomic IO gets split into
multiple IOs whose persistence order is undefined. I think it also
matters for filesystems when it comes to allocation, because we are
going to have to be very careful not to have extents straddle ranges
that will cause an atomic write to be split.

e.g. how does this work with striped devices? e.g. we have a stripe
unit of 16kB, but the devices support atomic_write_unit_max = 32kB.
Instantly, we have a configuration where atomic writes need to be
split at 16kB boundaries, and so the maximum atomic write size that
can be supported is actually 16kB - the stripe unit of RAID device.

This means the filesystem must, at minimum, align all allocations
for atomic IO to 16kB stripe unit alignment, and must not allow
atomic IOs that are not stripe unit aligned or sized to proceed
because they can't be processed as an atomic IO....


> > >   /**
> > > @@ -183,6 +186,59 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
> > >   }
> > >   EXPORT_SYMBOL(blk_queue_max_discard_sectors);
> > > +/**
> > > + * blk_queue_atomic_write_max_bytes - set max bytes supported by
> > > + * the device for atomic write operations.
> > > + * @q:  the request queue for the device
> > > + * @size: maximum bytes supported
> > > + */
> > > +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
> > > +				      unsigned int size)
> > > +{
> > > +	q->limits.atomic_write_max_bytes = size;
> > > +}
> > > +EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
> > > +
> > > +/**
> > > + * blk_queue_atomic_write_boundary - Device's logical block address space
> > > + * which an atomic write should not cross.
> > 
> > I have no idea what "logical block address space which an atomic
> > write should not cross" means, especially as the unit is in bytes
> > and not in sectors (which are the units LBAs are expressed in).
> 
> It means that an atomic operation which straddles the atomic boundary is not
> guaranteed to be atomic by the device, so we should (must) not cross it to
> maintain atomic behaviour for an application block. That's one reason that
> we have all these size and alignment rules.

Yes, that much is obvious. What I have no idea about is what
this means in practice. When is this ever going to be non-zero, and
what should be we doing at the filesystem allocation level when it
is non-zero to ensure that allocations for atomic writes never cross
such a boundary. i.e. how do we prevent applications from ever
needing this functionality to be triggered? i.e. so the filesystem
can guarantee a single RWF_ATOMIC user IO is actually dispatched
as a single REQ_ATOMIC IO....

> ...
> 
> > > +static inline unsigned int queue_atomic_write_unit_max(const struct request_queue *q)
> > > +{
> > > +	return q->limits.atomic_write_unit_max << SECTOR_SHIFT;
> > > +}
> > > +
> > > +static inline unsigned int queue_atomic_write_unit_min(const struct request_queue *q)
> > > +{
> > > +	return q->limits.atomic_write_unit_min << SECTOR_SHIFT;
> > > +}
> > 
> > Ah, what? This undocumented interface reports "unit limits" in
> > bytes, but it's not using the physical device sector size to convert
> > between sector units and bytes. This really needs some more
> > documentation and work to make it present all units consistently and
> > not result in confusion when devices have 4kB sector sizes and not
> > 512 byte sectors...
> 
> ok, we'll look to fix this up to give a coherent and clear interface.
> 
> > 
> > Also, I think all the byte ranges should support full 64 bit values,
> > otherwise there will be silent overflows in converting 32 bit sector
> > counts to byte ranges. And, eventually, something will want to do
> > larger than 4GB atomic IOs
> > 
> 
> ok, we can do that, but it would also then make the statx fields 64b. I'm fine with
> that if it is wise to do so - I don't want to wastefully use up an
> extra 2 x 32b in struct statx.

Why do we need specific variables for DIO atomic write alignment
limits? We already have direct IO alignment and size constraints in statx(),
so why wouldn't we just reuse those variables when the user requests
atomic limits for DIO?

i.e. if STATX_DIOALIGN is set, we return normal DIO alignment
constraints. If STATX_DIOALIGN_ATOMIC is set, we return the atomic
DIO alignment requirements in those variables.....

Yes, we probably need the dio max size to be added to statx for
this. Historically speaking, I wanted statx to support this in the
first place because that's what we were already giving userspace
with XFS_IOC_DIOINFO and we already knew that atomic IO when it came
along would require a bound maximum IO size much smaller than normal
DIO limits.  i.e.:

struct dioattr {
        __u32           d_mem;          /* data buffer memory alignment */
        __u32           d_miniosz;      /* min xfer size                */
        __u32           d_maxiosz;      /* max xfer size                */
};

where d_miniosz defined the alignment and size constraints for DIOs.

If we simply document that STATX_DIOALIGN_ATOMIC returns minimum
(unit) atomic IO size and alignment in statx->dio_offset_align (as
per STATX_DIOALIGN) and the maximum atomic IO size in
statx->dio_max_iosize, then we don't burn up anywhere near as much
space in the statx structure....
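
As a sketch of how that reuse might look from userspace:
STATX_DIOALIGN_ATOMIC is a hypothetical mask from this discussion, not
an existing uapi value, and the maximum size would still need one new
field:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

/* hypothetical mask value, for illustration only */
#define STATX_DIOALIGN_ATOMIC	0x00008000U

int main(void)
{
	struct statx stx;

	if (statx(AT_FDCWD, "testfile", 0, STATX_DIOALIGN_ATOMIC, &stx))
		return 1;
	/* under this proposal the existing field is repurposed to hold
	 * the minimum (unit) atomic IO size and alignment */
	printf("atomic unit: %u\n", stx.stx_dio_offset_align);
	/* the max atomic IO size would come from one new field, e.g. a
	 * hypothetical stx_dio_max_iosize */
	return 0;
}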

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 02/16] fs/bdev: Add atomic write support info to statx
  2023-05-04  8:45     ` John Garry
@ 2023-05-04 22:40       ` Dave Chinner
  2023-05-05  8:01         ` John Garry
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2023-05-04 22:40 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Prasad Singamsetty

On Thu, May 04, 2023 at 09:45:50AM +0100, John Garry wrote:
> On 03/05/2023 22:58, Dave Chinner wrote:
> > Is there a statx() man
> > page update for this addition?
> 
> No, not yet. Is it normally expected to provide a proposed man page update
> in parallel? Or somewhat later, when the kernel API change has some
> appreciable level of agreement?

Normally we ask for man page updates to be presented at the same
time, as the man page defines the user interface that is being
implemented. In this case, we need updates for the pwritev2() man
page to document RWF_ATOMIC semantics, and the statx() man page to
document what the variables being exposed mean w.r.t. RWF_ATOMIC.

The pwritev2() man page is probably the most important one right now
- it needs to explain the guarantees that RWF_ATOMIC is supposed to
provide w.r.t. data integrity, IO ordering, persistence, etc.
Indeed, it will need to explain exactly how this "multi-atomic-unit
multi-bio non-atomic RWF_ATOMIC" IO thing can be used safely and
reliably, especially w.r.t. IO ordering and persistence guarantees
in the face of crashes and power failures. Not to mention
documenting error conditions specific to RWF_ATOMIC...

It's all well and good to have some implementation, but without
actually defining and documenting the *guarantees* that RWF_ATOMIC
provides userspace it is completely useless for application
developers. And from the perspective of a reviewer, without the
documentation stating what the infrastructure actually guarantees
applications, we can't determine if the implementation being
presented is fit for purpose....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-04 22:26       ` Dave Chinner
@ 2023-05-05  7:54         ` John Garry
  2023-05-05 22:00           ` Darrick J. Wong
  2023-05-05 23:18           ` Dave Chinner
  2023-05-05 22:47         ` Eric Biggers
  1 sibling, 2 replies; 50+ messages in thread
From: John Garry @ 2023-05-05  7:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Himanshu Madhani

On 04/05/2023 23:26, Dave Chinner wrote:

Hi Dave,

>> atomic_write_unit_max is largest application block size which we can
>> support, while atomic_write_max_bytes is the max size of an atomic operation
>> which the HW supports.
> Why are these different? If the hardware supports 128kB atomic
> writes, why limit applications to something smaller?

Two reasons:
a. If you see patch 6/16, we need to apply a limit on 
atomic_write_unit_max from what is guaranteed to fit in a bio without 
it being required to be split when submitted.

Consider iomap generating an atomic write bio for a single userspace 
block and submitting it to the block layer - if the block layer needs to split 
due to block driver request_queue limits, like max_segments, then we're 
in trouble. So we need to limit atomic_write_unit_max such that this 
will not occur. That same limit should not apply to atomic_write_max_bytes.

b. For NVMe, atomic_write_unit_max and atomic_write_max_bytes which the 
host reports will be the same (ignoring a.).

However for SCSI they may be different. SCSI has its own concept of 
boundary and it is relevant here. This is confusing as it is very 
different from NVMe boundary. NVMe is a media boundary really. For SCSI, 
a boundary is a sub-segment at which the device may split an atomic write 
operation. For a SCSI device which only supports this boundary mode of 
operation, we limit atomic_write_unit_max to the max boundary segment 
size (such that we don't get splitting of an atomic write by the device) 
and then limit atomic_write_max_bytes to what is known in the spec as 
"maximum atomic transfer length with boundary". So in this device mode 
of operation, atomic_write_max_bytes and atomic_write_unit_max should be 
different.

> 
>> From your review of the iomap patch, I assume that you now realise that we
>> are proposing a write which may include multiple application data blocks
>> (each limited in size to atomic_write_unit_max), and the limit in total size
>> of that write is atomic_write_max_bytes.
> I still don't get it - you haven't explained why/what an application
> atomic block write might be, nor why the block device should be
> determining the size of application data blocks, etc.  If the block
> device can do 128kB atomic writes, why wouldn't the device allow the
> application to do 128kB atomic writes if they've aligned the atomic
> write correctly?

An application block needs to be:
- sized at a power-of-two
- sized between atomic_write_unit_min and atomic_write_unit_max, inclusive
- naturally aligned

Please consider that the application does not explicitly tell the kernel 
the size of its data blocks, it's implied from the size of the write and 
file offset. So, assuming that userspace follows the rules properly when 
issuing a write, the kernel may deduce the application block size and 
ensure only that each individual user data block is not split.

If userspace wants a guarantee of no splitting at all in its write, then 
it may issue a write for a single userspace data block, e.g. if the 
userspace block size is 16KB, then a write at a file offset aligned to 
16KB with a total write size of 16KB will be guaranteed to be written 
atomically by the device, as in the sketch below.
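
A minimal sketch of that single-block case, assuming this series'
RWF_ATOMIC flag (the flag value below is a placeholder, and buf must
also meet the usual direct IO memory alignment, e.g. from
posix_memalign()):

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* placeholder value */
#endif

#define BLKSZ	16384	/* the application's block size */

/* Offset and length are both multiples of BLKSZ, so the kernel can
 * infer the application block size; a single block is guaranteed to
 * reach the device as one atomic operation. */
static ssize_t write_app_block(int fd, void *buf, off_t block_no)
{
	struct iovec iov = { .iov_base = buf, .iov_len = BLKSZ };

	return pwritev2(fd, &iov, 1, block_no * BLKSZ, RWF_ATOMIC);
}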

> 
> What happens when we get hardware that can do atomic writes at any
> alignment, of any size up to atomic_write_max_bytes? Because this
> interface defines atomic writes as "must be a multiple of 2 of
> atomic_write_unit_min" then hardware that can do atomic writes of
> any size can not be effectively utilised by this interface....
> 
>> user applications should only pay attention to what we return from statx,
>> that being atomic_write_unit_min and atomic_write_unit_max.
>>
>> atomic_write_max_bytes and atomic_write_boundary is only relevant to the
>> block layer.
> If applications can issue multi-atomic_write_unit_max-block
> writes as a single, non-atomic, multi-bio RWF_ATOMIC pwritev2() IO
> and such IO is constrained to atomic_write_max_bytes, then
> atomic_write_max_bytes is most definitely relevant to user
> applications.

But we still do not guarantee that a multi-atomic_write_unit_max-block 
write issued as a single, non-atomic, multi-bio RWF_ATOMIC pwritev2() 
IO, constrained to atomic_write_max_bytes, will be written atomically 
by the device.

Three things may happen in the kernel:
- we may need to split due to atomic boundary
- we may need to split due to the write spanning discontig extents
- atomic_write_max_bytes may be much larger than what we could fit in a 
bio, so we may need multiple bios

And maybe more which does not come to mind.

So I am not sure what value there is in reporting atomic_write_max_bytes 
to the user. The description would need to be something like "we 
guarantee that if the total write length is greater than 
atomic_write_max_bytes, then all data will never be submitted to the 
device atomically. Otherwise it might be".

> 
> 
>>>> +What:		/sys/block/<disk>/atomic_write_boundary
>>>> +Date:		May 2023
>>>> +Contact:	Himanshu Madhani<himanshu.madhani@oracle.com>
>>>> +Description:
>>>> +		[RO] A device may need to internally split I/Os which
>>>> +		straddle a given logical block address boundary. In that
>>>> +		case a single atomic write operation will be processed as
>>>> +		one or more sub-operations which each complete atomically.
>>>> +		This parameter specifies the size in bytes of the atomic
>>>> +		boundary if one is reported by the device. This value must
>>>> +		be a power-of-two.
>>> How are users/filesystems supposed to use this?
>> As above, this is not relevant to the user.
> Applications will greatly care if their atomic IO gets split into
> multiple IOs whose persistence order is undefined.

Sure, so maybe then we need to define and support persistence ordering 
rules. But still, any atomic_write_boundary is already taken into 
account when we report atomic_write_unit_min and atomic_write_unit_max 
to the user.

> I think it also
> matters for filesystems when it comes to allocation, because we are
> going to have to be very careful not to have extents straddle ranges
> that will cause an atomic write to be split.

Note that block drivers need to ensure that they report the following:
- atomic_write_unit_max is a power-of-2
- atomic_write_boundary is a power-of-2 (and naturally it would need to 
be greater than or equal to atomic_write_unit_max)
[sidenote: I actually think that atomic_write_boundary needs to be just 
a multiple of atomic_write_unit_max, but let's stick with these rules 
for the moment]

As such, if we split a write due to a boundary, we would still always be 
able to split such that we don't need to split an individual userspace 
data block.
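
A small sketch of why that holds (names illustrative): unit and
boundary are both powers of two with boundary >= unit, so every
boundary multiple is also a unit multiple, and cutting at the next
boundary never lands inside a user data block:

#include <assert.h>
#include <stdint.h>

static uint64_t split_at_boundary(uint64_t start, uint64_t unit,
				  uint64_t boundary)
{
	uint64_t cut;

	/* the rules above: powers of two, boundary >= unit */
	assert((unit & (unit - 1)) == 0);
	assert((boundary & (boundary - 1)) == 0 && boundary >= unit);

	cut = (start | (boundary - 1)) + 1;	/* next boundary after start */
	assert(cut % unit == 0);	/* the cut is always unit-aligned */
	return cut;
}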

> 
> e.g. how does this work with striped devices? e.g. we have a stripe
> unit of 16kB, but the devices support atomic_write_unit_max = 32kB.
> Instantly, we have a configuration where atomic writes need to be
> split at 16kB boundaries, and so the maximum atomic write size that
> can be supported is actually 16kB - the stripe unit of RAID device.

OK, so in that case, I think that we would need to limit the reported 
atomic_write_unit_max value to the stripe value in a RAID config.
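
A sketch of that clamping (helper name illustrative):

#include <linux/log2.h>
#include <linux/minmax.h>

static unsigned int clamp_atomic_unit_max(unsigned int dev_unit_max,
					  unsigned int stripe_unit)
{
	unsigned int lim = min(dev_unit_max, stripe_unit);

	/* round down to a power-of-two, as the interface requires */
	return lim ? rounddown_pow_of_two(lim) : 0;
}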

> 
> This means the filesystem must, at minimum, align all allocations
> for atomic IO to 16kB stripe unit alignment, and must not allow
> atomic IOs that are not stripe unit aligned or sized to proceed
> because they can't be processed as an atomic IO....

As above. Martin may be able to comment more on this.

> 
> 
>>>>    /**
>>>> @@ -183,6 +186,59 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
>>>>    }
>>>>    EXPORT_SYMBOL(blk_queue_max_discard_sectors);
>>>> +/**
>>>> + * blk_queue_atomic_write_max_bytes - set max bytes supported by
>>>> + * the device for atomic write operations.
>>>> + * @q:  the request queue for the device
>>>> + * @size: maximum bytes supported
>>>> + */
>>>> +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
>>>> +				      unsigned int size)
>>>> +{
>>>> +	q->limits.atomic_write_max_bytes = size;
>>>> +}
>>>> +EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
>>>> +
>>>> +/**
>>>> + * blk_queue_atomic_write_boundary - Device's logical block address space
>>>> + * which an atomic write should not cross.
>>> I have no idea what "logical block address space which an atomic
>>> write should not cross" means, especially as the unit is in bytes
>>> and not in sectors (which are the units LBAs are expressed in).
>> It means that an atomic operation which straddles the atomic boundary is not
>> guaranteed to be atomic by the device, so we should (must) not cross it to
>> maintain atomic behaviour for an application block. That's one reason that
>> we have all these size and alignment rules.
> Yes, That much is obvious. What I have no idea diea about is what
> this means in practice. When is this ever going to be non-zero, and
> what should be we doing at the filesystem allocation level when it
> is non-zero to ensure that allocations for atomic writes never cross
> such a boundary. i.e. how do we prevent applications from ever
> needing this functionality to be triggered? i.e. so the filesystem
> can guarantee a single RWF_ATOMIC user IO is actually dispatched
> as a single REQ_ATOMIC IO....

We only guarantee that a single user data block will not be split. So to 
avoid any splitting at all, all you can do is write a single user data 
block. That's the best which we can offer.

As mentioned earlier, atomic boundary is only relevant to NVMe. If the 
device reports an atomic boundary which is not compliant with the 
rules, then we cannot support atomic writes for that device.

> 
>> ...
>>
>>>> +static inline unsigned int queue_atomic_write_unit_max(const struct request_queue *q)
>>>> +{
>>>> +	return q->limits.atomic_write_unit_max << SECTOR_SHIFT;
>>>> +}
>>>> +
>>>> +static inline unsigned int queue_atomic_write_unit_min(const struct request_queue *q)
>>>> +{
>>>> +	return q->limits.atomic_write_unit_min << SECTOR_SHIFT;
>>>> +}
>>> Ah, what? This undocumented interface reports "unit limits" in
>>> bytes, but it's not using the physical device sector size to convert
>>> between sector units and bytes. This really needs some more
>>> documentation and work to make it present all units consistently and
>>> not result in confusion when devices have 4kB sector sizes and not
>>> 512 byte sectors...
>> ok, we'll look to fix this up to give a coherent and clear interface.
>>
>>> Also, I think all the byte ranges should support full 64 bit values,
>>> otherwise there will be silent overflows in converting 32 bit sector
>>> counts to byte ranges. And, eventually, something will want to do
>>> larger than 4GB atomic IOs
>>>
> ok, we can do that, but it would also then make the statx fields 64b. I'm fine with
> that if it is wise to do so - I don't want to wastefully use up an
>> extra 2 x 32b in struct statx.
> Why do we need specific variables for DIO atomic write alignment
> limits?

I guess that we don't

> We already have direct IO alignment and size constraints in statx(),
> so why wouldn't we just reuse those variables when the user requests
> atomic limits for DIO?
> 
> i.e. if STATX_DIOALIGN is set, we return normal DIO alignment
> constraints. If STATX_DIOALIGN_ATOMIC is set, we return the atomic
> DIO alignment requirements in those variables.....
> 
> Yes, we probably need the dio max size to be added to statx for
> this. Historically speaking, I wanted statx to support this in the
> first place because that's what we were already giving userspace
> with XFS_IOC_DIOINFO and we already knew that atomic IO when it came
> along would require a bound maximum IO size much smaller than normal
> DIO limits.  i.e.:
> 
> struct dioattr {
>          __u32           d_mem;          /* data buffer memory alignment */
>          __u32           d_miniosz;      /* min xfer size                */
>          __u32           d_maxiosz;      /* max xfer size                */
> };
> 
> where d_miniosz defined the alignment and size constraints for DIOs.
> 
> If we simply document that STATX_DIOALIGN_ATOMIC returns minimum
> (unit) atomic IO size and alignment in statx->dio_offset_align (as
> per STATX_DIOALIGN) and the maximum atomic IO size in
> statx->dio_max_iosize, then we don't burn up anywhere near as much
> space in the statx structure....

ok, so you are saying to unionize them, right? That would seem 
reasonable to me.

Thanks,
John



* Re: [PATCH RFC 02/16] fs/bdev: Add atomic write support info to statx
  2023-05-04 22:40       ` Dave Chinner
@ 2023-05-05  8:01         ` John Garry
  2023-05-05 22:04           ` Darrick J. Wong
  0 siblings, 1 reply; 50+ messages in thread
From: John Garry @ 2023-05-05  8:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Prasad Singamsetty

On 04/05/2023 23:40, Dave Chinner wrote:

Hi Dave,

>> No, not yet. Is it normally expected to provide a proposed man page update
>> in parallel? Or somewhat later, when the kernel API change has some
>> appreciable level of agreement?
> Normally we ask for man page updates to be presented at the same
> time, as the man page defines the user interface that is being
> implemented. In this case, we need updates for the pwritev2() man
> page to document RWF_ATOMIC semantics, and the statx() man page to
> document what the variables being exposed mean w.r.t. RWF_ATOMIC.
> 
> The pwritev2() man page is probably the most important one right now
> - it needs to explain the guarantees that RWF_ATOMIC is supposed to
> provide w.r.t. data integrity, IO ordering, persistence, etc.
> Indeed, it will need to explain exactly how this "multi-atomic-unit
> multi-bio non-atomic RWF_ATOMIC" IO thing can be used safely and
> reliably, especially w.r.t. IO ordering and persistence guarantees
> in the face of crashes and power failures. Not to mention
> documenting error conditions specific to RWF_ATOMIC...
> 
> It's all well and good to have some implementation, but without
> actually defining and documenting the*guarantees*  that RWF_ATOMIC
> provides userspace it is completely useless for application
> developers. And from the perspective of a reviewer, without the
> documentation stating what the infrastructure actually guarantees
> applications, we can't determine if the implementation being
> presented is fit for purpose....

ok, understood. Obviously from the discussion so far there are many 
details which the user needs to know about how to use this interface 
and what to expect.

We'll look to start working on those man page details now.

Thanks,
John


* Re: [PATCH RFC 11/16] fs: iomap: Atomic write support
  2023-05-04  5:00   ` Dave Chinner
@ 2023-05-05 21:19     ` Darrick J. Wong
  2023-05-05 23:56       ` Dave Chinner
  0 siblings, 1 reply; 50+ messages in thread
From: Darrick J. Wong @ 2023-05-05 21:19 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, axboe, kbusch, hch, sagi, martin.petersen, viro,
	brauner, dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge

On Thu, May 04, 2023 at 03:00:06PM +1000, Dave Chinner wrote:
> On Wed, May 03, 2023 at 06:38:16PM +0000, John Garry wrote:
> > Add support to create bio's whose bi_sector and bi_size are aligned to and
> > a multiple of atomic_write_unit, respectively.
> > 
> > When we call iomap_dio_bio_iter() -> bio_iov_iter_get_pages() ->
> > __bio_iov_iter_get_pages(), we trim the bio to a multiple of
> > atomic_write_unit.
> > 
> > As such, we expect the iomi start and length to have the same size and
> > alignment requirements per iomap_dio_bio_iter() call.
> > 
> > In iomap_dio_bio_iter(), ensure that for a non-dsync iocb that the mapping
> > is not dirty nor unmapped.
> > 
> > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > ---
> >  fs/iomap/direct-io.c | 72 ++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 70 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index f771001574d0..37c3c926dfd8 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -36,6 +36,8 @@ struct iomap_dio {
> >  	size_t			done_before;
> >  	bool			wait_for_completion;
> >  
> > +	unsigned int atomic_write_unit;
> > +
> >  	union {
> >  		/* used during submission and for synchronous completion: */
> >  		struct {
> > @@ -229,9 +231,21 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> >  	return opflags;
> >  }
> >  
> > +
> > +/*
> > + * Note: For atomic writes, each bio which we create when we iter should have
> > + *	 bi_sector aligned to atomic_write_unit and also its bi_size should be
> > + *	 a multiple of atomic_write_unit.
> > + *	 The call to bio_iov_iter_get_pages() -> __bio_iov_iter_get_pages()
> > + *	 should trim the length to a multiple of atomic_write_unit for us.
> > + *	 This allows us to split each bio later in the block layer to fit
> > + *	 request_queue limit.
> > + */
> >  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> >  		struct iomap_dio *dio)
> >  {
> > +	bool atomic_write = (dio->iocb->ki_flags & IOCB_ATOMIC) &&
> > +			    (dio->flags & IOMAP_DIO_WRITE);
> >  	const struct iomap *iomap = &iter->iomap;
> >  	struct inode *inode = iter->inode;
> >  	unsigned int fs_block_size = i_blocksize(inode), pad;
> > @@ -249,6 +263,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> >  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> >  		return -EINVAL;
> >  
> > +
> > +	if (atomic_write && !iocb_is_dsync(dio->iocb)) {
> > +		if (iomap->flags & IOMAP_F_DIRTY)
> > +			return -EIO;
> > +		if (iomap->type != IOMAP_MAPPED)
> > +			return -EIO;
> > +	}
> 
> IDGI. If the iomap had space allocated for this dio iteration,
> then IOMAP_F_DIRTY will be set and it is likely (guaranteed for XFS)
> that the iomap type will be IOMAP_UNWRITTEN. Indeed, if we are doing
> a write into preallocated space (i.e. from fallocate()) then this
> will cause -EIO on all RWF_ATOMIC IO to that file unless RWF_DSYNC
> is also used.
> 
> "For a power fail, for each individual application block, all or
> none of the data to be written."
> 
> Ok, does this means RWF_ATOMIC still needs fdatasync() to guarantee
> that the data makes it to stable storage? And the result is
> undefined until fdatasync() is run, but the device will guarantee
> that either all or none of the data will be on stable storage
> prior to the next device cache flush completing?
> 
> i.e. does REQ_ATOMIC imply REQ_FUA, or does it require a separate
> device cache flush to commit the atomic IO to stable storage?

From the SCSI and NVME device information that I've been presented, it
sounds like an explicit cache flush or FUA is required to persist the
data.

> What about ordering - do the devices guarantee strict ordering of
> REQ_ATOMIC writes? i.e. if atomic write N is seen on disk, then all
> the previous atomic writes up to N will also be seen on disk? If
> not, how does the application and filesystem guarantee persistence
> of completed atomic writes?

I /think/ the applications have to ensure ordering themselves.  If Y
cannot appear before X is persisted, then the application must wait for
the ack for X, flush the cache, and only then send Y.
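
A userspace sketch of that discipline, using fdatasync() as the cache
flush and taking the RWF_ATOMIC flag value as a given from this series:

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>

/* If Y must never be visible before X is persisted: wait for X to
 * complete, force it to stable storage, then issue Y. */
static int write_x_then_y(int fd, struct iovec *x, off_t x_off,
			  struct iovec *y, off_t y_off, int rwf_atomic)
{
	if (pwritev2(fd, x, 1, x_off, rwf_atomic) < 0)
		return -1;
	if (fdatasync(fd) < 0)		/* X is now stable */
		return -1;
	return pwritev2(fd, y, 1, y_off, rwf_atomic) < 0 ? -1 : 0;
}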

> i.e. If we still need a post-IO device cache flush to guarantee
> persistence and/or ordering of RWF_ATOMIC IOs, then the above code
> makes no sense - we'll still need fdatasync() to provide persistence
> checkpoints and that means we ensure metadata is also up to date
> at those checkpoints.

I'll let the block layer developers weigh in on this, but I /think/ this
means that we require RWF_DSYNC for atomic block writes to written
mappings, and RWF_SYNC if iomap_begin gives us an unwritten/hole/dirty
mapping.

> I need someone to put down in writing exactly what the data
> integrity, ordering and persistence semantics of REQ_ATOMIC are
> before I can really comment any further. From my perspective as a
> filesystem developer, this is the single most important set of
> behaviours that need to be documented, as this determines how
> everything else interacts with atomic writes....
> 
> >  	if (iomap->type == IOMAP_UNWRITTEN) {
> >  		dio->flags |= IOMAP_DIO_UNWRITTEN;
> >  		need_zeroout = true;
> > @@ -318,6 +340,10 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> >  					  GFP_KERNEL);
> >  		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
> >  		bio->bi_ioprio = dio->iocb->ki_ioprio;
> > +		if (atomic_write) {
> > +			bio->bi_opf |= REQ_ATOMIC;
> > +			bio->atomic_write_unit = dio->atomic_write_unit;
> > +		}
> >  		bio->bi_private = dio;
> >  		bio->bi_end_io = iomap_dio_bio_end_io;
> >  
> > @@ -492,6 +518,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  		is_sync_kiocb(iocb) || (dio_flags & IOMAP_DIO_FORCE_WAIT);
> >  	struct blk_plug plug;
> >  	struct iomap_dio *dio;
> > +	bool is_read = iov_iter_rw(iter) == READ;
> > +	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
> >  
> >  	if (!iomi.len)
> >  		return NULL;
> > @@ -500,6 +528,20 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  	if (!dio)
> >  		return ERR_PTR(-ENOMEM);
> >  
> > +	if (atomic_write) {
> > +		/*
> > +		 * Note: This lookup is not proper for a multi-device scenario,
> > +		 *	 however for current iomap users, the bdev per iter
> > +		 *	 will be fixed, so "works" for now.
> > +		 */
> > +		struct super_block *i_sb = inode->i_sb;
> > +		struct block_device *bdev = i_sb->s_bdev;
> > +
> > +		dio->atomic_write_unit =
> > +			bdev_find_max_atomic_write_alignment(bdev,
> > +					iomi.pos, iomi.len);
> > +	}
> 
> This will break atomic IO to XFS realtime devices. The device we are
> doing IO to is iomap->bdev, we should never be using sb->s_bdev in
> the iomap code.  Of course, at this point in __iomap_dio_rw() we
> don't have an iomap so this "alignment constraint" can't be done
> correctly at this point in the IO path.

(Agreed.)

> However, even ignoring the bdev source, I think this is completely
> wrong. Passing a *file* offset to the underlying block device so the
> block device can return a device alignment constraint for IO is not
> valid. We don't know how that file offset/length is going to be
> mapped to the underlying block device until we ask the filesystem
> for an iomap covering the file range, so we can't possibly know what
> the device IO alignment of the user request will be until we have an
> iomap for it.

(Agreed.)

> At which point, the "which block device should we ask for alignment
> constraints" question is moot, because we now have an iomap and can
> use iomap->bdev....
> 
> > @@ -592,6 +634,32 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  
> >  	blk_start_plug(&plug);
> >  	while ((ret = iomap_iter(&iomi, ops)) > 0) {
> > +		if (atomic_write) {
> > +			const struct iomap *_iomap = &iomi.iomap;
> > +			loff_t iomi_length = iomap_length(&iomi);
> > +
> > +			/*
> > +			 * Ensure length and start address is a multiple of
> > +			 * atomic_write_unit - this is critical. If the length
> > +			 * is not a multiple of atomic_write_unit, then we
> > +			 * cannot create a set of bio's in iomap_dio_bio_iter()
> > +			 * who are each a length which is a multiple of
> > +			 * atomic_write_unit.
> > +			 *
> > +			 * Note: It may be more appropriate to have this check
> > +			 *	 in iomap_dio_bio_iter()
> > +			 */
> > +			if ((iomap_sector(_iomap, iomi.pos) << SECTOR_SHIFT) %

The file offset (and by extension the position) are not important for
deciding if we can issue an atomic write.  Only the mapped LBA space on
the underlying device is important.

IOWs, if we have a disk that can write a 64k aligned block atomically,
iomap only has to check that iomap->addr is aligned to a 64k boundary.
If that space happens to be mapped to file offset 57k, then it is indeed
possible to perform a 64k atomic write to the file starting at offset
57k and ending at offset 121k, right?

Now, obviously, nobody will ever do that, but my point is that no
changes to iomap_dio_rw are necessary -- only the alignment of the
mapping returned by ->iomap_begin requires checking.
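
A sketch of that single check (helper name illustrative), keying only
off the mapped disk address, where unit is a power-of-two:

#include <linux/iomap.h>

static bool iomap_atomic_write_aligned(const struct iomap *iomap,
				       unsigned int unit)
{
	/* iomap->addr is the disk byte address of the mapping; the
	 * file offset plays no part in the alignment decision */
	return iomap->type == IOMAP_MAPPED &&
	       !(iomap->addr & (unit - 1));
}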

> > +			    dio->atomic_write_unit) {
> > +				ret = -EIO;
> > +				break;
> > +			}
> > +
> > +			if (iomi_length % dio->atomic_write_unit) {
> > +				ret = -EIO;
> > +				break;
> > +			}
> 
> This looks wrong - the length of the mapped extent could be shorter
> than the max atomic write size returned by
> bdev_find_max_atomic_write_alignment() but the iomap could still be aligned
> to the minimum atomic write unit supported. At this point, we reject
> the IO with -EIO, even though it could have been done as an atomic
> write, just a shorter one than the user requested.
> 
> That said, I don't think we can call a user IO that is being
> sliced and diced into multiple individual IOs "atomic". "Atomic"
> implies all-or-none behaviour - slicing up a large DIO into smaller
> individual bios means the bios can be submitted and completed out of
> order. If we then get a power failure, the application's "atomic"
> IO can appear on disk as only being partially complete - it violates
> the "all or none" semantics of "atomic IO".

This "you can write multiple atomic units but you can't know which ones
completed" behavior is the part I dislike the most about the entire
feature.

> Hence I think that we should be rejecting RWF_ATOMIC IOs that are
> larger than the maximum atomic write unit or cannot be dispatched in
> a single IO e.g. filesystem has allocated multiple minimum aligned
> extents and so a max len atomic write IO over that range must be
> broken up into multiple smaller IOs.
> 
> We should be doing max atomic write size rejection high up in the IO
> path (e.g. filesystem ->write_iter() method) before we get anywhere
> near the DIO path, and we should be rejecting atomic write IOs in
> the DIO path during the ->iomap_begin() mapping callback if we can't
> map the entire atomic IO to a single aligned filesystem extent.
> 
> i.e. the alignment checks and constraints need to be applied by the
> filesystem mapping code, not the layer that packs the pages into the
> bio as directed by the filesystem mapping....

Hmm.  I think I see what you're saying here -- iomap should communicate
to ->iomap_begin that we want to perform an atomic write, and there had
better be either (a) a properly aligned mapping all ready to go; or (b)
the fs must perform an aligned allocation and map that in, or return no
mapping so the write fails.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-05  7:54         ` John Garry
@ 2023-05-05 22:00           ` Darrick J. Wong
  2023-05-07  1:59             ` Martin K. Petersen
  2023-05-05 23:18           ` Dave Chinner
  1 sibling, 1 reply; 50+ messages in thread
From: Darrick J. Wong @ 2023-05-05 22:00 UTC (permalink / raw)
  To: John Garry
  Cc: Dave Chinner, axboe, kbusch, hch, sagi, martin.petersen, viro,
	brauner, dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Himanshu Madhani

On Fri, May 05, 2023 at 08:54:12AM +0100, John Garry wrote:
> On 04/05/2023 23:26, Dave Chinner wrote:
> 
> Hi Dave,
> 
> > > atomic_write_unit_max is largest application block size which we can
> > > support, while atomic_write_max_bytes is the max size of an atomic operation
> > > which the HW supports.
> > Why are these different? If the hardware supports 128kB atomic
> > writes, why limit applications to something smaller?
> 
> Two reasons:
> a. If you see patch 6/16, we need to apply a limit on atomic_write_unit_max
> from what is guaranteed we can fit in a bio without it being required to be
> split when submitted.
> 
> Consider iomap generates an atomic write bio for a single userspace block
> and submits to the block layer - if the block layer needs to split due to
> block driver request_queue limits, like max_segments, then we're in trouble.
> So we need to limit atomic_write_unit_max such that this will not occur.
> That same limit should not apply to atomic_write_max_bytes.
> 
> b. For NVMe, atomic_write_unit_max and atomic_write_max_bytes which the host
> reports will be the same (ignoring a.).
> 
> However for SCSI they may be different. SCSI has its own concept of boundary
> and it is relevant here. This is confusing as it is very different from NVMe
> boundary. NVMe is a media boundary really. For SCSI, a boundary is a
> sub-segment which the device may split an atomic write operation. For a SCSI
> device which only supports this boundary mode of operation, we limit
> atomic_write_unit_max to the max boundary segment size (such that we don't
> get splitting of an atomic write by the device) and then limit
> atomic_write_max_bytes to what is known in the spec as "maximum atomic
> transfer length with boundary". So in this device mode of operation,
> atomic_write_max_bytes and atomic_write_unit_max should be different.

Hmm, maybe some concrete examples would be useful here?  I find the
queue limits stuff pretty confusing too.

Could a SCSI device could advertise 512b LBAs, 4096b physical blocks, a
64k atomic_write_unit_max, and a 1MB maximum transfer length
(atomic_write_max_bytes)?  And does that mean that application software
can send one 64k-aligned write and expect it either to be persisted
completely or not at all?

And, does that mean that the application can send up to 16 of these
64k-aligned blocks as a single 1MB IO and expect that each of those 16
blocks will either be persisted entirely or not at all?  There doesn't
seem to be any means for the device to report /which/ of the 16 were
persisted, which is disappointing.  But maybe the application encodes
LSNs and can tell after the fact that something went wrong, and recover?

If the same device reports a 2048b atomic_write_unit_min, does that mean
that I can send between 2 and 64k of data as a single atomic write and
that's ok?  I assume that this weird situation (512b LBA, 4k physical,
2k atomic unit min) requires some fancy RMW but that the device is
prepared to cr^Wpersist that correctly?

What if the device also advertises a 128k atomic_write_boundary?
That means that a 2k atomic block write will fail if it starts at 127k,
but if it starts at 126k then that's ok.  Right?
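
Working those numbers through the straddle rule (helper illustrative,
mirroring the boundary check discussed earlier in the thread):

#include <assert.h>
#include <stdint.h>

static int straddles(uint64_t start, uint64_t len, uint64_t boundary)
{
	return (start & ~(boundary - 1)) !=
	       ((start + len - 1) & ~(boundary - 1));
}

int main(void)
{
	/* a 2k write at 127k crosses the 128k boundary: not allowed */
	assert(straddles(127 * 1024, 2048, 128 * 1024));
	/* a 2k write at 126k ends at 128k - 1: fine */
	assert(!straddles(126 * 1024, 2048, 128 * 1024));
	return 0;
}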

As for avoiding splits in the block layer, I guess that also means that
someone needs to reduce atomic_write_unit_max and atomic_write_boundary
if (say) some sysadmin decides to create a raid0 of these devices with a
32k stripe size?

It sounds like NVME is simpler in that it would report 64k for both the
max unit and the max transfer length?  And for the 1M write I mentioned
above, the application must send 16 individual writes?

(Did I get all that correctly?)

> > 
> > > From your review of the iomap patch, I assume that you now realise that we
> > > are proposing a write which may include multiple application data blocks
> > > (each limited in size to atomic_write_unit_max), and the limit in total size
> > > of that write is atomic_write_max_bytes.
> > I still don't get it - you haven't explained why/what an application
> > atomic block write might be, nor why the block device should be
> > determining the size of application data blocks, etc.  If the block
> > device can do 128kB atomic writes, why wouldn't the device allow the
> > application to do 128kB atomic writes if they've aligned the atomic
> > write correctly?
> 
> An application block needs to be:
> - sized at a power-of-two
> - sized between atomic_write_unit_min and atomic_write_unit_max, inclusive
> - naturally aligned
> 
> Please consider that the application does not explicitly tell the kernel the
> size of its data blocks, it's implied from the size of the write and file
> offset. So, assuming that userspace follows the rules properly when issuing
> a write, the kernel may deduce the application block size and ensure only
> that each individual user data block is not split.
> 
> If userspace wants a guarantee of no splitting of all in its write, then it
> may issue a write for a single userspace data block, e.g. userspace block
> size is 16KB, then write at a file offset aligned to 16KB and a total write
> size of 16KB will be guaranteed to be written atomically by the device.

I'm ... not sure what the userspace block size is?

With my app developer hat on, the simplest mental model of this is that
if I want to persist a blob of data that is larger than one device LBA,
then atomic_write_unit_min <= blob size <= atomic_write_unit_max must be
true, and the LBA range for the write cannot cross an atomic_write_boundary.

Does that sound right?

Going back to my sample device above, the XFS buffer cache could write
individual 4k filesystem metadata blocks using REQ_ATOMIC because 4k is
between the atomic write unit min/max, 4k metadata blocks will never
cross a 128k boundary, and we'd never have to worry about torn writes
in metadata ever again?

Furthermore, if I want to persist a bunch of blobs in a contiguous LBA
range and atomic_write_max_bytes > atomic_write_unit_max, then I can do
that with a single direct write?  I'm assuming that the blobs in the
middle of the range must all be exactly atomic_write_unit_max bytes in
size?  And I had better be prepared to (I guess) re-read the entire
range after the system goes down to find out if any of them did or did
not persist?

(This part sounds like a PITA.)

> > 
> > What happens we we get hardware that can do atomic writes at any
> > alignment, of any size up to atomic_write_max_bytes? Because this
> > interface defines atomic writes as "must be a power-of-two multiple of
> > atomic_write_unit_min" then hardware that can do atomic writes of
> > any size can not be effectively utilised by this interface....
> > 
> > > user applications should only pay attention to what we return from statx,
> > > that being atomic_write_unit_min and atomic_write_unit_max.
> > > 
> > > atomic_write_max_bytes and atomic_write_boundary is only relevant to the
> > > block layer.
> > If applications can issue multi-atomic_write_unit_max-block
> > writes as a single, non-atomic, multi-bio RWF_ATOMIC pwritev2() IO
> > and such IO is constrained to atomic_write_max_bytes, then
> > atomic_write_max_bytes is most definitely relevant to user
> > applications.
> 
> But we still do not guarantee that multi-atomic_write_unit_max-block writes
> as a single, non-atomic, multi-bio RWF_ATOMIC pwritev2() IO and such IO is
> constrained to atomic_write_max_bytes will be written atomically by the
> device.
> 
> Three things may happen in the kernel:
> - we may need to split due to atomic boundary
> - we may need to split due to the write spanning discontig extents
> - atomic_write_max_bytes may be much larger than what we could fit in a bio,
> so may need multiple bios
> 
> And maybe more cases which do not come to mind.
> 
> So I am not sure what value there is in reporting atomic_write_max_bytes to
> the user. The description would need to be something like "we guarantee that
> if the total write length is greater than atomic_write_max_bytes, then all
> data will never be submitted to the device atomically. Otherwise it might
> be".
> 
> > 
> > 
> > > > > +What:		/sys/block/<disk>/atomic_write_boundary
> > > > > +Date:		May 2023
> > > > > +Contact:	Himanshu Madhani<himanshu.madhani@oracle.com>
> > > > > +Description:
> > > > > +		[RO] A device may need to internally split I/Os which
> > > > > +		straddle a given logical block address boundary. In that
> > > > > +		case a single atomic write operation will be processed as
> > > > > +		one or more sub-operations which each complete atomically.
> > > > > +		This parameter specifies the size in bytes of the atomic
> > > > > +		boundary if one is reported by the device. This value must
> > > > > +		be a power-of-two.
> > > > How are users/filesystems supposed to use this?
> > > As above, this is not relevant to the user.
> > Applications will greatly care if their atomic IO gets split into
> > multiple IOs whose persistence order is undefined.
> 
> Sure, so maybe then we need to define and support persistence ordering
> rules. But still, any atomic_write_boundary is already taken into account
> when we report atomic_write_unit_min and atomic_write_unit_max to the user.
> 
> > I think it also
> > matters for filesystems when it comes to allocation, because we are
> > going to have to be very careful not to have extents straddle ranges
> > that will cause an atomic write to be split.
> 
> Note that block drivers need to ensure that they report the following:
> - atomic_write_unit_max is a power-of-2
> - atomic_write_boundary is a power-of-2 (and naturally it would need to be
> greater than or equal to atomic_write_unit_max)
> [sidenote: I actually think that atomic_write_boundary needs to be just a
> multiple of atomic_write_unit_max, but let's stick with these rules for the
> moment]
> 
> As such, if we split a write due to a boundary, we would still always be
> able to split such that we don't need to split an individual userspace data
> block.

...but only if userspace data blocks (whatever those are) don't
themselves cross an atomic_write_boundary.

> > 
> > e.g. how does this work with striped devices? e.g. we have a stripe
> > unit of 16kB, but the devices support atomic_write_unit_max = 32kB.
> > Instantly, we have a configuration where atomic writes need to be
> > split at 16kB boundaries, and so the maximum atomic write size that
> > can be supported is actually 16kB - the stripe unit of RAID device.
> 
> OK, so in that case, I think that we would need to limit the reported
> atomic_write_unit_max value to the stripe value in a RAID config.
> 
> > 
> > This means the filesystem must, at minimum, align all allocations
> > for atomic IO to 16kB stripe unit alignment, and must not allow
> > atomic IOs that are not stripe unit aligned or sized to proceed
> > because they can't be processed as an atomic IO....
> 
> As above. Martin may be able to comment more on this.
> 
> > 
> > 
> > > > >    /**
> > > > > @@ -183,6 +186,59 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
> > > > >    }
> > > > >    EXPORT_SYMBOL(blk_queue_max_discard_sectors);
> > > > > +/**
> > > > > + * blk_queue_atomic_write_max_bytes - set max bytes supported by
> > > > > + * the device for atomic write operations.
> > > > > + * @q:  the request queue for the device
> > > > > + * @size: maximum bytes supported
> > > > > + */
> > > > > +void blk_queue_atomic_write_max_bytes(struct request_queue *q,
> > > > > +				      unsigned int size)
> > > > > +{
> > > > > +	q->limits.atomic_write_max_bytes = size;
> > > > > +}
> > > > > +EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
> > > > > +
> > > > > +/**
> > > > > + * blk_queue_atomic_write_boundary - Device's logical block address space
> > > > > + * which an atomic write should not cross.
> > > > I have no idea what "logical block address space which an atomic
> > > > write should not cross" means, especially as the unit is in bytes
> > > > and not in sectors (which are the units LBAs are expressed in).
> > > It means that an atomic operation which straddles the atomic boundary is not
> > > guaranteed to be atomic by the device, so we should (must) not cross it to
> > > maintain atomic behaviour for an application block. That's one reason that
> > > we have all these size and alignment rules.
> > Yes, that much is obvious. What I have no idea about is what
> > this means in practice. When is this ever going to be non-zero, and
> > what should we be doing at the filesystem allocation level when it
> > is non-zero to ensure that allocations for atomic writes never cross
> > such a boundary. i.e. how do we prevent applications from ever
> > needing this functionality to be triggered? i.e. so the filesystem
> > can guarantee a single RWF_ATOMIC user IO is actually dispatched
> > as a single REQ_ATOMIC IO....
> 
> We only guarantee that a single user data block will not be split. So to
> avoid any splitting at all, all you can do is write a single user data
> block. That's the best which we can offer.
> 
> As mentioned earlier, atomic boundary is only relevant to NVMe. If the
> device reports an atomic boundary which is not compliant with these
> rules, then we cannot support atomic writes for that device.

I guess here that any device advertising a atomic_write_boundary > 0
internally splits its LBA address space into chunks of that size and can
only persist full chunks.  The descriptions of how flash storage work
would seem to fit that description to me.  <shrug>

> > 
> > > ...
> > > 
> > > > > +static inline unsigned int queue_atomic_write_unit_max(const struct request_queue *q)
> > > > > +{
> > > > > +	return q->limits.atomic_write_unit_max << SECTOR_SHIFT;
> > > > > +}
> > > > > +
> > > > > +static inline unsigned int queue_atomic_write_unit_min(const struct request_queue *q)
> > > > > +{
> > > > > +	return q->limits.atomic_write_unit_min << SECTOR_SHIFT;
> > > > > +}
> > > > Ah, what? This undocumented interface reports "unit limits" in
> > > > bytes, but it's not using the physical device sector size to convert
> > > > between sector units and bytes. This really needs some more
> > > > documentation and work to make it present all units consistently and
> > > > not result in confusion when devices have 4kB sector sizes and not
> > > > 512 byte sectors...
> > > ok, we'll look to fix this up to give a coherent and clear interface.
> > > 
> > > > Also, I think all the byte ranges should support full 64 bit values,
> > > > otherwise there will be silent overflows in converting 32 bit sector
> > > > counts to byte ranges. And, eventually, something will want to do
> > > > larger than 4GB atomic IOs
> > > > 
> > > ok, we can do that but would also then make statx field 64b. I'm fine with
> > > that if it is wise to do so - I don't want to wastefully use up an
> > > extra 2 x 32b in struct statx.
> > Why do we need specific variables for DIO atomic write alignment
> > limits?
> 
> I guess that we don't
> 
> > We already have direct IO alignment and size constraints in statx(),
> > so why wouldn't we just reuse those variables when the user requests
> > atomic limits for DIO?
> > 
> > i.e. if STATX_DIOALIGN is set, we return normal DIO alignment
> > constraints. If STATX_DIOALIGN_ATOMIC is set, we return the atomic
> > DIO alignment requirements in those variables.....
> > 
> > Yes, we probably need the dio max size to be added to statx for
> > this. Historically speaking, I wanted statx to support this in the
> > first place because that's what we were already giving userspace
> > with XFS_IOC_DIOINFO and we already knew that atomic IO when it came
> > along would require a bound maximum IO size much smaller than normal
> > DIO limits.  i.e.:
> > 
> > struct dioattr {
> >          __u32           d_mem;          /* data buffer memory alignment */
> >          __u32           d_miniosz;      /* min xfer size                */
> >          __u32           d_maxiosz;      /* max xfer size                */
> > };
> > 
> > where d_miniosz defined the alignment and size constraints for DIOs.
> > 
> > If we simply document that STATX_DIOALIGN_ATOMIC returns minimum
> > (unit) atomic IO size and alignment in statx->dio_offset_align (as
> > per STATX_DIOALIGN) and the maximum atomic IO size in
> > statx->dio_max_iosize, then we don't burn up anywhere near as much
> > space in the statx structure....
> 
> ok, so you are saying to unionize them, right? That would seem reasonable to
> me.
> Thanks,
> John
> 


* Re: [PATCH RFC 02/16] fs/bdev: Add atomic write support info to statx
  2023-05-05  8:01         ` John Garry
@ 2023-05-05 22:04           ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2023-05-05 22:04 UTC (permalink / raw)
  To: John Garry
  Cc: Dave Chinner, axboe, kbusch, hch, sagi, martin.petersen, viro,
	brauner, dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Prasad Singamsetty

On Fri, May 05, 2023 at 09:01:58AM +0100, John Garry wrote:
> On 04/05/2023 23:40, Dave Chinner wrote:
> 
> Hi Dave,
> 
> > > No, not yet. Is it normally expected to provide a proposed man page update
> > > in parallel? Or somewhat later, when the kernel API change has some
> > > appreciable level of agreement?
> > Normally we ask for man page updates to be presented at the same
> > time, as the man page defines the user interface that is being
> > implemented. In this case, we need updates for the pwritev2() man
> > page to document RWF_ATOMIC semantics, and the statx() man page to
> > document what the variables being exposed mean w.r.t. RWF_ATOMIC.
> > 
> > The pwritev2() man page is probably the most important one right now
> > - it needs to explain the guarantees that RWF_ATOMIC is supposed to
> > provide w.r.t. data integrity, IO ordering, persistence, etc.
> > Indeed, it will need to explain exactly how this "multi-atomic-unit
> > multi-bio non-atomic RWF_ATOMIC" IO thing can be used safely and
> > reliably, especially w.r.t. IO ordering and persistence guarantees
> > in the face of crashes and power failures. Not to mention
> > documenting error conditions specific to RWF_ATOMIC...
> > 
> > It's all well and good to have some implementation, but without
> > actually defining and documenting the *guarantees* that RWF_ATOMIC
> > provides userspace it is completely useless for application
> > developers. And from the perspective of a reviewer, without the
> > documentation stating what the infrastructure actually guarantees
> > applications, we can't determine if the implementation being
> > presented is fit for purpose....
> 
> ok, understood. Obviously from any discussion so far there are many details
> which the user needs to know about how to use this interface and what to
> expect.
> 
> We'll look to start working on those man page details now.

Agreed.  The manpage contents are what needs to get worked on at LSFMM
where you'll have various block/fs/storage device people in the same
room with which to discuss various issues and try to smooth out the
misunderstandings.

(Also: I've decided to cancel my in-person attendance due to a sudden
health issue.   I'll still be in the room, just virtually now. :()

--D

> Thanks,
> John


* Re: [PATCH RFC 03/16] xfs: Support atomic write for statx
  2023-05-03 22:17   ` Dave Chinner
@ 2023-05-05 22:10     ` Darrick J. Wong
  0 siblings, 0 replies; 50+ messages in thread
From: Darrick J. Wong @ 2023-05-05 22:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, axboe, kbusch, hch, sagi, martin.petersen, viro,
	brauner, dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge

On Thu, May 04, 2023 at 08:17:49AM +1000, Dave Chinner wrote:
> On Wed, May 03, 2023 at 06:38:08PM +0000, John Garry wrote:
> > Support providing info on atomic write unit min and max.
> > 
> > Darrick Wong originally authored this change.
> > 
> > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > ---
> >  fs/xfs/xfs_iops.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 24718adb3c16..e542077704aa 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -614,6 +614,16 @@ xfs_vn_getattr(
> >  			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
> >  			stat->dio_offset_align = bdev_logical_block_size(bdev);
> >  		}
> > +		if (request_mask & STATX_WRITE_ATOMIC) {
> > +			struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> > +			struct block_device	*bdev = target->bt_bdev;
> > +
> > +			stat->atomic_write_unit_min = queue_atomic_write_unit_min(bdev->bd_queue);
> > +			stat->atomic_write_unit_max = queue_atomic_write_unit_max(bdev->bd_queue);
> 
> I'm not sure this is right.
> 
> Given that we may have a 4kB physical sector device, XFS will not
> allow IOs smaller than physical sector size. The initial values of
> queue_atomic_write_unit_min/max() will be (1 << SECTOR_SHIFT) which
> is 512 bytes. IOs done with 4kB sector size devices will fail in
> this case.
> 
> Further, XFS has a software sector size - it can define the sector
> size for the filesystem to be 4KB on a 512 byte sector device. And
> in that case, the filesystem will reject 512 byte sized/aligned IOs
> as they are smaller than the filesystem sector size (i.e. a config
> that prevents sub-physical sector IO for 512 logical/4kB physical
> devices).

Yep.  I'd forgotten about those.

> There may other filesystem constraints - realtime devices have fixed
> minimum allocation sizes which may be larger than atomic write
> limits, which means that IO completion needs to split extents into
> multiple unwritten/written extents, extent size hints might be in
> use meaning we have different allocation alignment constraints to
> atomic write constraints, stripe alignment of extent allocation may
> throw out atomic write alignment, etc.
> 
> These are all solvable, but we need to make sure here that the
> filesystem constraints are taken into account here, not just the
> block device limits.
> 
> As such, it is probably better to query these limits at filesystem
> mount time and add them to the xfs buftarg (same as we do for
> logical and physical sector sizes) and then use the xfs buftarg

I'm not sure that's right either.  device mapper can switch the
underlying storage out from under us, yes?  That would be a dirty thing
to do in my book, but I've long wondered if we need to be more resilient
to that kind of evilness.

> values rather than having to go all the way to the device queue
> here. That way we can ensure at mount time that atomic write limits
> don't conflict with logical/physical IO limits, and we can further
> constrain atomic limits during mount without always having to
> recalculate those limits from first principles on every stat()
> call...

With Christoph's recent patchset to allow block devices to call back
into filesystems, we could add one for "device queue limits changed"
that would cause recomputation of those elements, solving what I was
just mumbling about above.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH RFC 12/16] xfs: Add support for fallocate2
  2023-05-03 23:26   ` Dave Chinner
@ 2023-05-05 22:23     ` Darrick J. Wong
  2023-05-05 23:42       ` Dave Chinner
  0 siblings, 1 reply; 50+ messages in thread
From: Darrick J. Wong @ 2023-05-05 22:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, axboe, kbusch, hch, sagi, martin.petersen, viro,
	brauner, dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Allison Henderson, Catherine Hoang

On Thu, May 04, 2023 at 09:26:16AM +1000, Dave Chinner wrote:
> On Wed, May 03, 2023 at 06:38:17PM +0000, John Garry wrote:
> > From: Allison Henderson <allison.henderson@oracle.com>
> > 
> > Add support for fallocate2 ioctl, which is xfs' own version of fallocate.
> > Struct xfs_fallocate2 is passed in the ioctl, and xfs_fallocate2.alignment
> > allows the user to specify required extent alignment. This is key for
> > atomic write support, as we expect extents to be aligned on
> > atomic_write_unit_max boundaries.
> 
> This approach of adding filesystem specific ioctls for minor behavioural
> modifiers to existing syscalls is not a sustainable development
> model.

To be fair to John and Allison, I told them to shove all the new UAPI
bits into a xfs_fs_staging.h because of that conversation you and
Catherine and I had a month or two ago (the fsuuid ioctls) about putting
new interfaces in an obviously marked staging file, using that to
prototype and discover the interface that we really wanted, and only
then talk about hoisting it to the VFS.

Hence this fallocate2 because we weren't sure if syscalls for aligned
allocations should explicitly define the alignment or get it from the
extent size hint, if there should be an explicit flag mandating aligned
allocation, etc.

> If we want fallocate() operations to apply filesystem atomic write
> constraints to operations, then add a new modifier flag to
> fallocate(), say FALLOC_FL_ATOMIC. The filesystem can then
> > look up its atomic write alignment constraints and apply them to
> the operation being performed appropriately.
> 
> > The alignment flag is not sticky, so further extent mutation will not
> > obey this original alignment request.
> 
> IOWs, you want the specific allocation to behave exactly as if an
> extent size hint of the given alignment had been set on that inode.
> Which could be done with:
> 
> 	ioctl(FS_IOC_FSGETXATTR, &fsx)
> 	old_extsize = fsx.fsx_extsize;
> 	fsx.fsx_extsize = atomic_align_size;
> 	ioctl(FS_IOC_FSSETXATTR, &fsx)

Eww, multiple threads doing fallocates can clobber each other here.

> 	fallocate(....)
> 	fsx.fsx_extsize = old_extsize;
> 	ioctl(FS_IOC_FSSETXATTR, &fsx)

Also, you can't set extsize if the data fork has any mappings in it,
so you can't set the old value.  But perhaps it's not so bad to expect
that programs will set this up once and not change the underlying
storage?

I'm not actually sure why you can't change the extent size hint.  Why is
that?

> Yeah, messy, but if an application is going to use atomic writes,
> then setting an extent size hint of the atomic write granularity the
> application will use at file create time makes a whole lot of sense.
> This will largely guarantee that any allocation will be aligned to
> atomic IO constraints even when non atomic IO operations are
> performed on that inode. Hence when the application needs to do an
> atomic IO, it's not going to fail because previous allocation was
> not correctly aligned.
> 
> All that we'd then need to do for atomic IO is ensure that we fail
> the allocation early if we can't allocate fully sized and aligned
> extents rather than falling back to unaligned extents when there are
> no large enough contiguous free spaces for aligned extents to be
> allocated. i.e. when RWF_ATOMIC or FALLOC_FL_ATOMIC are set by the
> application...

Right.

> 
> > In addition, extent lengths should
> > always be a multiple of atomic_write_unit_max,
> 
> Yup, that's what extent size hint based allocation does - it rounds
> both down and up to hint alignment...
> 
> ....
> 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 34de6e6898c4..52a6e2b61228 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -3275,7 +3275,9 @@ xfs_bmap_compute_alignments(
> >  	struct xfs_alloc_arg	*args)
> >  {
> >  	struct xfs_mount	*mp = args->mp;
> > -	xfs_extlen_t		align = 0; /* minimum allocation alignment */
> > +
> > +	/* minimum allocation alignment */
> > +	xfs_extlen_t		align = args->alignment;
> >  	int			stripe_align = 0;
> 
> 
> This doesn't do what you think it should. For one, it will get
> overwritten by extent size hints that are set, hence the user will
> not get the alignment they expected in that case.
> 
> Secondly, args->alignment is an internal alignment control for
> stripe alignment used later in the allocator when doing file
> extenstion allocations.  Overloading it to pass a user alignment
> here means that initial data allocations will have alignments set
> without actually having set up the allocator parameters for aligned
> allocation correctly.
> 
> This will lead to unexpected allocation failures as the filesystem
> fills, as the reservations needed for allocation to succeed won't
> match what is actually required for allocation to succeed. It will
> also cause problematic behaviour for fallback allocation algorithms
> that expect only to be called with args->alignment = 1...
> 
> >  	/* stripe alignment for allocation is determined by mount parameters */
> > @@ -3652,6 +3654,7 @@ xfs_bmap_btalloc(
> >  		.datatype	= ap->datatype,
> >  		.alignment	= 1,
> >  		.minalignslop	= 0,
> > +		.alignment	= ap->align,
> >  	};
> >  	xfs_fileoff_t		orig_offset;
> >  	xfs_extlen_t		orig_length;
> 
> > @@ -4279,12 +4282,14 @@ xfs_bmapi_write(
> >  	uint32_t		flags,		/* XFS_BMAPI_... */
> >  	xfs_extlen_t		total,		/* total blocks needed */
> >  	struct xfs_bmbt_irec	*mval,		/* output: map values */
> > -	int			*nmap)		/* i/o: mval size/count */
> > +	int			*nmap,
> > +	xfs_extlen_t		align)		/* i/o: mval size/count */
> 
> 
> As per above - IMO this is not the right way to specify aligment for
> atomic IO. A XFS_BMAPI_ATOMIC flag is probably the right thing to
> add from the caller - this also communicates the specific allocation
> failure behaviour required, too.
> 
> Then xfs_bmap_compute_alignments() can pull the alignment
> from the relevant buftarg similar to how it already pulls preset
> alignments for extent size hints and/or realtime devices. And then
> the allocator can attempt exact aligned allocation for maxlen, then
> if that fails an exact aligned allocation for minlen, and if both of
> those fail then we return ENOSPC without attempting any unaligned
> allocations...
> 
> This also gets rid of the need to pass another parameter to
> xfs_bmapi_write(), and it's trivial to plumb into the XFS iomap and
> fallocate code paths....

I too prefer a XFS_BMAPI_ALLOC_ALIGNED flag to all this extra plumbing,
having now seen the extra plumbing.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-04 22:26       ` Dave Chinner
  2023-05-05  7:54         ` John Garry
@ 2023-05-05 22:47         ` Eric Biggers
  2023-05-05 23:31           ` Dave Chinner
  1 sibling, 1 reply; 50+ messages in thread
From: Eric Biggers @ 2023-05-05 22:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, axboe, kbusch, hch, sagi, martin.petersen, djwong,
	viro, brauner, dchinner, jejb, linux-block, linux-kernel,
	linux-nvme, linux-scsi, linux-xfs, linux-fsdevel,
	linux-security-module, paul, jmorris, serge, Himanshu Madhani

On Fri, May 05, 2023 at 08:26:23AM +1000, Dave Chinner wrote:
> > ok, we can do that but would also then make statx field 64b. I'm fine with
> > > that if it is wise to do so - I don't want to wastefully use up an
> > extra 2 x 32b in struct statx.
> 
> Why do we need specific variables for DIO atomic write alignment
> limits? We already have direct IO alignment and size constraints in statx(),
> so why wouldn't we just reuse those variables when the user requests
> atomic limits for DIO?
> 
> i.e. if STATX_DIOALIGN is set, we return normal DIO alignment
> constraints. If STATX_DIOALIGN_ATOMIC is set, we return the atomic
> DIO alignment requirements in those variables.....
> 
> Yes, we probably need the dio max size to be added to statx for
> this. Historically speaking, I wanted statx to support this in the
> first place because that's what we were already giving userspace
> with XFS_IOC_DIOINFO and we already knew that atomic IO when it came
> along would require a bound maximum IO size much smaller than normal
> DIO limits.  i.e.:
> 
> struct dioattr {
>         __u32           d_mem;          /* data buffer memory alignment */
>         __u32           d_miniosz;      /* min xfer size                */
>         __u32           d_maxiosz;      /* max xfer size                */
> };
> 
> where d_miniosz defined the alignment and size constraints for DIOs.
> 
> If we simply document that STATX_DIOALIGN_ATOMIC returns minimum
> (unit) atomic IO size and alignment in statx->dio_offset_align (as
> per STATX_DIOALIGN) and the maximum atomic IO size in
> statx->dio_max_iosize, then we don't burn up anywhere near as much
> space in the statx structure....

I don't think that's how statx() is meant to work.  The request mask is a bitmask, and the user can
request an arbitrary combination of different items.  For example, the user could request both
STATX_DIOALIGN and STATX_WRITE_ATOMIC at the same time.  That doesn't work if different items share
the same fields.

- Eric


* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-05  7:54         ` John Garry
  2023-05-05 22:00           ` Darrick J. Wong
@ 2023-05-05 23:18           ` Dave Chinner
  2023-05-06  9:38             ` John Garry
  2023-05-07  2:35             ` Martin K. Petersen
  1 sibling, 2 replies; 50+ messages in thread
From: Dave Chinner @ 2023-05-05 23:18 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Himanshu Madhani

On Fri, May 05, 2023 at 08:54:12AM +0100, John Garry wrote:
> On 04/05/2023 23:26, Dave Chinner wrote:
> 
> Hi Dave,
> 
> > > atomic_write_unit_max is the largest application block size which we can
> > > support, while atomic_write_max_bytes is the max size of an atomic operation
> > > which the HW supports.
> > Why are these different? If the hardware supports 128kB atomic
> > writes, why limit applications to something smaller?
> 
> Two reasons:
> a. If you see patch 6/16, we need to apply a limit on atomic_write_unit_max
> from what is guaranteed we can fit in a bio without it being required to be
> split when submitted.

Yes, that's obvious behaviour for an atomic IO.

> Consider iomap generates an atomic write bio for a single userspace block
> and submits to the block layer - if the block layer needs to split due to
> block driver request_queue limits, like max_segments, then we're in trouble.
> So we need to limit atomic_write_unit_max such that this will not occur.
> That same limit should not apply to atomic_write_max_bytes.

Except the block layer doesn't provide any mechanism to do
REQ_ATOMIC IOs larger than atomic_write_unit_max. So in what case
will atomic_write_max_bytes > atomic_write_unit_max ever be
relevant to anyone?

> b. For NVMe, atomic_write_unit_max and atomic_write_max_bytes which the host
> reports will be the same (ignoring a.).
> 
> However for SCSI they may be different. SCSI has its own concept of boundary
> and it is relevant here. This is confusing, as it is very different from the
> NVMe boundary - the NVMe boundary is really a media boundary. For SCSI, a
> boundary is a sub-segment at which the device may split an atomic write
> operation. For a SCSI
> device which only supports this boundary mode of operation, we limit
> atomic_write_unit_max to the max boundary segment size (such that we don't
> get splitting of an atomic write by the device) and then limit
> atomic_write_max_bytes to what is known in the spec as "maximum atomic
> transfer length with boundary". So in this device mode of operation,
> atomic_write_max_bytes and atomic_write_unit_max should be different.

But if the application is limited to atomic_write_unit_max sized
IOs, and that is always less than or equal to the size of the atomic
write boundary, why does the block layer even need to care about
this whacky quirk of the SCSI protocol implementation?

The block layer shouldn't even need to be aware that SCSI can split
"atomic" IOs into smaller individual IOs that result in the larger
requested IO being non-atomic. The SCSI layer should just expose
"write with boundary" as the max atomic IO size it supports to the
block layer. 

At this point, both atomic_write_max_bytes and atomic write
boundary size are completely irrelevant to anything in the block
layer or above. If userspace is limited to atomic_write_unit_max IO
sizes and it is enforced at the ->write_iter() layer, then the block
layer will never need to split REQ_ATOMIC bios because the entire
stack has already stated that it guarantees atomic_write_unit_max
bios will not get split....
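
i.e. a gate at the very top of the IO path, something like this
sketch (limit names invented; the power-of-two and natural alignment
rules are the ones this series proposes):

  /* in ->write_iter(): reject atomic IO that the stack cannot
   * guarantee to submit as a single atomic unit */
  if (iocb->ki_flags & IOCB_ATOMIC) {
          size_t len = iov_iter_count(from);

          if (len < atomic_write_unit_min ||
              len > atomic_write_unit_max ||
              !is_power_of_2(len) ||
              !IS_ALIGNED(iocb->ki_pos, len))
                  return -EINVAL;
  }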

In what cases does hardware that supports atomic_write_max_bytes >
atomic_write_unit_max actually be useful? I can see one situation,
and one situation only: merging adjacent small REQ_ATOMIC write
requests into single larger IOs before issuing them to the hardware.

This is exactly the sort of optimisation the block layers should be
doing - it fits perfectly with the SCSI "write with boundary"
behaviour - the merged bios can be split by the hardware at the
point where they were merged by the block layer, and everything is
fine because they are independent IOs, not a single RWF_ATOMIC IO
from userspace. And for NVMe, it allows IOs from small atomic write
limits (because, say, 16kB RAID stripe unit) to be merged into
larger atomic IOs with no penalty...
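
As a sketch, the merge rule would be no more than something like this
(helper invented for illustration, boundary handling elided):

  /*
   * Two independently submitted REQ_ATOMIC writes may be merged as
   * long as the combined request still fits within the device's
   * larger atomic transfer limit; the device is then free to split
   * at the original merge points (the SCSI boundary) without
   * breaking the atomicity of either original IO.
   */
  static bool atomic_writes_can_merge(struct request *rq, struct bio *bio,
                                      unsigned int max_bytes)
  {
          return blk_rq_bytes(rq) + bio->bi_iter.bi_size <= max_bytes;
  }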


> > >  From your review on the iomap patch, I assume that now you realise that we
> > > are proposing a write which may include multiple application data blocks
> > > (each limited in size to atomic_write_unit_max), and the limit in total size
> > > of that write is atomic_write_max_bytes.
> > I still don't get it - you haven't explained why/what an application
> > atomic block write might be, nor why the block device should be
> > determining the size of application data blocks, etc.  If the block
> > device can do 128kB atomic writes, why wouldn't the device allow the
> > application to do 128kB atomic writes if they've aligned the atomic
> > write correctly?
> 
> An application block needs to be:
> - sized at a power-of-two
> - sized between atomic_write_unit_min and atomic_write_unit_max, inclusive
> - naturally aligned
> 
> Please consider that the application does not explicitly tell the kernel the
> size of its data blocks, it's implied from the size of the write and file
> offset. So, assuming that userspace follows the rules properly when issuing
> a write, the kernel may deduce the application block size and ensure only
> that each individual user data block is not split.

That's just *gross*. The kernel has no business assuming anything
about the data layout inside an IO request. The kernel cannot assume
that the application uses a single IO size for atomic writes when it
explicitly provides a range of IO sizes that the application can use.

e.g. min unit = 4kB, max unit = 128kB allows IO sizes of 4kB, 8kB,
16kB, 32kB, 64kB and 128kB. How does the kernel infer what that
application data block size is based on a 32kB atomic write vs a
128kB atomic write?

The kernel can't use file offset alignment to infer application
block size, either. e.g. a 16kB write at 128kB could be a single
16kB data block, it could be 2x8kB data blocks, or it could be 4x4kB
data blocks - they all follow the rules you set above. So how does
the kernel know that for two of these cases it is safe to split the
IO at 8kB, and for one it isn't safe at all?

AFAICS, there is no way the kernel can accurately derive this sort
of information, so any assumptions that the "kernel can infer the
application data layout" to split IOs correctly simply won't work.
And that's very important because we are talking about operations that
provide data persistence guarantees....

> If userspace wants a guarantee of no splitting at all in its write, then it
> may issue a write for a single userspace data block, e.g. userspace block
> size is 16KB, then write at a file offset aligned to 16KB and a total write
> size of 16KB will be guaranteed to be written atomically by the device.

Exactly what has "userspace block size" got to do with the kernel
providing a guarantee that a RWF_ATOMIC write of a 16kB buffer at
offset 16kB will be written atomically?

> > What happens we we get hardware that can do atomic writes at any
> > alignment, of any size up to atomic_write_max_bytes? Because this
> > interface defines atomic writes as "must be a power-of-two multiple of
> > atomic_write_unit_min" then hardware that can do atomic writes of
> > any size can not be effectively utilised by this interface....
> > 
> > > user applications should only pay attention to what we return from statx,
> > > that being atomic_write_unit_min and atomic_write_unit_max.
> > > 
> > > atomic_write_max_bytes and atomic_write_boundary is only relevant to the
> > > block layer.
> > If applications can issue multi-atomic_write_unit_max-block
> > writes as a single, non-atomic, multi-bio RWF_ATOMIC pwritev2() IO
> > and such IO is constrained to atomic_write_max_bytes, then
> > atomic_write_max_bytes is most definitely relevant to user
> > applications.
> 
> But we still do not guarantee that multi-atomic_write_unit_max-block writes
> as a single, non-atomic, multi-bio RWF_ATOMIC pwritev2() IO and such IO is
> constrained to atomic_write_max_bytes will be written atomically by the
> device.
>
> Three things may happen in the kernel:
> - we may need to split due to atomic boundary

Block layer rejects the IO - cannot be performed atomically.

> - we may need to split due to the write spanning discontig extents

Filesystem rejects the IO - cannot be performed atomically.

> - atomic_write_max_bytes may be much larger than what we could fit in a bio,
> so may need multiple bios

Filesystem/blockdev rejects the IO - cannot be performed atomically.

> And maybe more cases which do not come to mind.

Relevant layer rejects the IO - cannot be performed atomically.

> So I am not sure what value there is in reporting atomic_write_max_bytes to
> the user. The description would need to be something like "we guarantee that
> if the total write length is greater than atomic_write_max_bytes, then all
> data will never be submitted to the device atomically. Otherwise it might
> be".

Exactly my point - there's a change of guarantee that the kernel
provides userspace at that point, and hence application developers
need to know it exists and, likely, be able to discover that
threshold programatically.

But this, to me, is just another symptom of what I see as the
wider issue here: trying to allow RWF_ATOMIC IO to do more than a
*single atomic IO*.

This reeks of premature API optimisation. We should make
RWF_ATOMIC do one thing, and one thing only: guaranteed single
atomic IO submission.

It doesn't matter what data userspace is trying to write atomically;
it only matters that the kernel submits the write as a single atomic
unit to the hardware which then guarantees that it completes the
whole IO as a single atomic unit.

What functionality the hardware can provide is largely irrelevant
here; it's the IO semantics that we guarantee userspace that matter.
The kernel interface needs to have simple, well defined behaviour
and provide clear data persistence guarantees.

Once we have that, we can optimise both the applications and the
kernel implementation around that behaviour and guarantees. e.g.
adjacent IO merging (either in the application or in the block
layer), using AIO/io_uring with completion to submission ordering,
etc.

There are many well known IO optimisation techniques that do not
require the kernel to infer or assume the format of the data in the
user buffers as this current API does. Make the API simple and hard
to get wrong first, then optimise from there....



> > We already have direct IO alignment and size constraints in statx(),
> > so why wouldn't we just reuse those variables when the user requests
> > atomic limits for DIO?
> > 
> > i.e. if STATX_DIOALIGN is set, we return normal DIO alignment
> > constraints. If STATX_DIOALIGN_ATOMIC is set, we return the atomic
> > DIO alignment requirements in those variables.....
> > 
> > Yes, we probably need the dio max size to be added to statx for
> > this. Historically speaking, I wanted statx to support this in the
> > first place because that's what we were already giving userspace
> > with XFS_IOC_DIOINFO and we already knew that atomic IO when it came
> > along would require a bound maximum IO size much smaller than normal
> > DIO limits.  i.e.:
> > 
> > struct dioattr {
> >          __u32           d_mem;          /* data buffer memory alignment */
> >          __u32           d_miniosz;      /* min xfer size                */
> >          __u32           d_maxiosz;      /* max xfer size                */
> > };
> > 
> > where d_miniosz defined the alignment and size constraints for DIOs.
> > 
> > If we simply document that STATX_DIOALIGN_ATOMIC returns minimum
> > (unit) atomic IO size and alignment in statx->dio_offset_align (as
> > per STATX_DIOALIGN) and the maximum atomic IO size in
> > statx->dio_max_iosize, then we don't burn up anywhere near as much
> > space in the statx structure....
> 
> ok, so you are saying to unionize them, right? That would seem reasonable to
> me.

No, I don't recommend unionising them. RWF_ATOMIC only applies to
direct IO, so if the application ask for ATOMIC DIO limits, we put
the atomic dio limits in the dio limits variables rather than the
looser non-atomic dio limits......

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-05 22:47         ` Eric Biggers
@ 2023-05-05 23:31           ` Dave Chinner
  2023-05-06  0:08             ` Eric Biggers
  0 siblings, 1 reply; 50+ messages in thread
From: Dave Chinner @ 2023-05-05 23:31 UTC (permalink / raw)
  To: Eric Biggers
  Cc: John Garry, axboe, kbusch, hch, sagi, martin.petersen, djwong,
	viro, brauner, dchinner, jejb, linux-block, linux-kernel,
	linux-nvme, linux-scsi, linux-xfs, linux-fsdevel,
	linux-security-module, paul, jmorris, serge, Himanshu Madhani

On Fri, May 05, 2023 at 10:47:19PM +0000, Eric Biggers wrote:
> On Fri, May 05, 2023 at 08:26:23AM +1000, Dave Chinner wrote:
> > > ok, we can do that but would also then make statx field 64b. I'm fine with
> > > that if it is wise to do so - I don't want to wastefully use up an
> > > extra 2 x 32b in struct statx.
> > 
> > Why do we need specific variables for DIO atomic write alignment
> > limits? We already have direct IO alignment and size constraints in statx(),
> > so why wouldn't we just reuse those variables when the user requests
> > atomic limits for DIO?
> > 
> > i.e. if STATX_DIOALIGN is set, we return normal DIO alignment
> > constraints. If STATX_DIOALIGN_ATOMIC is set, we return the atomic
> > DIO alignment requirements in those variables.....
> > 
> > Yes, we probably need the dio max size to be added to statx for
> > this. Historically speaking, I wanted statx to support this in the
> > first place because that's what we were already giving userspace
> > with XFS_IOC_DIOINFO and we already knew that atomic IO when it came
> > along would require a bound maximum IO size much smaller than normal
> > DIO limits.  i.e.:
> > 
> > struct dioattr {
> >         __u32           d_mem;          /* data buffer memory alignment */
> >         __u32           d_miniosz;      /* min xfer size                */
> >         __u32           d_maxiosz;      /* max xfer size                */
> > };
> > 
> > where d_miniosz defined the alignment and size constraints for DIOs.
> > 
> > If we simply document that STATX_DIOALIGN_ATOMIC returns minimum
> > (unit) atomic IO size and alignment in statx->dio_offset_align (as
> > per STATX_DIOALIGN) and the maximum atomic IO size in
> > statx->dio_max_iosize, then we don't burn up anywhere near as much
> > space in the statx structure....
> 
> I don't think that's how statx() is meant to work.  The request mask is a bitmask, and the user can
> request an arbitrary combination of different items.  For example, the user could request both
> STATX_DIOALIGN and STATX_WRITE_ATOMIC at the same time.  That doesn't work if different items share
> the same fields.

Sure it does - what is contained in the field on return is defined
by the result mask. In this case, whatever the filesystem puts in
the DIO fields will match which flag it asserts in the result mask.

i.e. if the application wants RWF_ATOMIC and so asks for STATX_DIOALIGN |
STATX_DIOALIGN_ATOMIC in the request mask then:

- if the filesystem does not support RWF_ATOMIC it fills in the
>   normal DIO alignment values and puts STATX_DIOALIGN in the result
  mask.

  Now the application knows that it can't use RWF_ATOMIC, and it
  doesn't need to do another statx() call to get the dio alignment
  values it needs.

- if the filesystem supports RWF_ATOMIC, it fills in the values with
  the atomic DIO constraints and puts STATX_DIOALIGN_ATOMIC in the
  result mask.

  Now the application knows it can use RWF_ATOMIC and has the atomic
  DIO constraints in the dio alignment fields returned.

This uses the request/result masks exactly as intended, yes?
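
On the application side, that protocol would look something like this
sketch (STATX_DIOALIGN_ATOMIC being the flag proposed here, not an
existing UAPI constant):

  struct statx stx;

  statx(fd, "", AT_EMPTY_PATH,
        STATX_DIOALIGN | STATX_DIOALIGN_ATOMIC, &stx);

  if (stx.stx_mask & STATX_DIOALIGN_ATOMIC) {
          /* dio fields hold atomic constraints; RWF_ATOMIC usable */
  } else if (stx.stx_mask & STATX_DIOALIGN) {
          /* RWF_ATOMIC unsupported; normal DIO limits returned */
  }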

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 12/16] xfs: Add support for fallocate2
  2023-05-05 22:23     ` Darrick J. Wong
@ 2023-05-05 23:42       ` Dave Chinner
  0 siblings, 0 replies; 50+ messages in thread
From: Dave Chinner @ 2023-05-05 23:42 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, axboe, kbusch, hch, sagi, martin.petersen, viro,
	brauner, dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Allison Henderson, Catherine Hoang

On Fri, May 05, 2023 at 03:23:33PM -0700, Darrick J. Wong wrote:
> On Thu, May 04, 2023 at 09:26:16AM +1000, Dave Chinner wrote:
> > On Wed, May 03, 2023 at 06:38:17PM +0000, John Garry wrote:
> > If we want fallocate() operations to apply filesystem atomic write
> > constraints to operations, then add a new modifier flag to
> > fallocate(), say FALLOC_FL_ATOMIC. The filesystem can then
> > look up it's atomic write alignment constraints and apply them to
> > the operation being performed appropriately.
> > 
> > > The alignment flag is not sticky, so further extent mutation will not
> > > obey this original alignment request.
> > 
> > IOWs, you want the specific allocation to behave exactly as if an
> > extent size hint of the given alignment had been set on that inode.
> > Which could be done with:
> > 
> > 	ioctl(FS_IOC_FSGETXATTR, &fsx)
> > 	old_extsize = fsx.fsx_extsize;
> > 	fsx.fsx_extsize = atomic_align_size;
> > 	ioctl(FS_IOC_FSSETXATTR, &fsx)
> 
> Eww, multiple threads doing fallocates can clobber each other here.

Sure, this was just an example of how the same behaviour could be
achieved without the new ioctl. Locking and other trivialities were
left as an exercise for the reader.

> 
> > 	fallocate(....)
> > 	fsx.fsx_extsize = old_extsize;
> > 	ioctl(FS_IOC_FSSETXATTR, &fsx)
> 
> Also, you can't set extsize if the data fork has any mappings in it,
> so you can't set the old value.  But perhaps it's not so bad to expect
> that programs will set this up once and not change the underlying
> storage?
> 
> I'm not actually sure why you can't change the extent size hint.  Why is
> that?

Hysterical raisins, I think.

IIUC, it was largely to do with the fact that pre-existing
allocation could not be realigned, so once allocation has been done
the extent size hint can't guarantee extent size hint aligned/sized
extents are actually allocated for the file.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 11/16] fs: iomap: Atomic write support
  2023-05-05 21:19     ` Darrick J. Wong
@ 2023-05-05 23:56       ` Dave Chinner
  0 siblings, 0 replies; 50+ messages in thread
From: Dave Chinner @ 2023-05-05 23:56 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, axboe, kbusch, hch, sagi, martin.petersen, viro,
	brauner, dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge

On Fri, May 05, 2023 at 02:19:28PM -0700, Darrick J. Wong wrote:
> On Thu, May 04, 2023 at 03:00:06PM +1000, Dave Chinner wrote:
> > On Wed, May 03, 2023 at 06:38:16PM +0000, John Garry wrote:
> > > Add support to create bio's whose bi_sector and bi_size are aligned to and
> > > a multiple of atomic_write_unit, respectively.
> > > 
> > > When we call iomap_dio_bio_iter() -> bio_iov_iter_get_pages() ->
> > > __bio_iov_iter_get_pages(), we trim the bio to a multiple of
> > > atomic_write_unit.
> > > 
> > > As such, we expect the iomi start and length to have same size and
> > > alignment requirements per iomap_dio_bio_iter() call.
> > > 
> > > In iomap_dio_bio_iter(), ensure that, for a non-dsync iocb, the mapping
> > > is not dirty nor unmapped.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >  fs/iomap/direct-io.c | 72 ++++++++++++++++++++++++++++++++++++++++++--
> > >  1 file changed, 70 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index f771001574d0..37c3c926dfd8 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -36,6 +36,8 @@ struct iomap_dio {
> > >  	size_t			done_before;
> > >  	bool			wait_for_completion;
> > >  
> > > +	unsigned int atomic_write_unit;
> > > +
> > >  	union {
> > >  		/* used during submission and for synchronous completion: */
> > >  		struct {
> > > @@ -229,9 +231,21 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> > >  	return opflags;
> > >  }
> > >  
> > > +
> > > +/*
> > > + * Note: For atomic writes, each bio which we create when we iter should have
> > > + *	 bi_sector aligned to atomic_write_unit and also its bi_size should be
> > > + *	 a multiple of atomic_write_unit.
> > > + *	 The call to bio_iov_iter_get_pages() -> __bio_iov_iter_get_pages()
> > > + *	 should trim the length to a multiple of atomic_write_unit for us.
> > > + *	 This allows us to split each bio later in the block layer to fit
> > > + *	 request_queue limit.
> > > + */
> > >  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > >  		struct iomap_dio *dio)
> > >  {
> > > +	bool atomic_write = (dio->iocb->ki_flags & IOCB_ATOMIC) &&
> > > +			    (dio->flags & IOMAP_DIO_WRITE);
> > >  	const struct iomap *iomap = &iter->iomap;
> > >  	struct inode *inode = iter->inode;
> > >  	unsigned int fs_block_size = i_blocksize(inode), pad;
> > > @@ -249,6 +263,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > >  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > >  		return -EINVAL;
> > >  
> > > +
> > > +	if (atomic_write && !iocb_is_dsync(dio->iocb)) {
> > > +		if (iomap->flags & IOMAP_F_DIRTY)
> > > +			return -EIO;
> > > +		if (iomap->type != IOMAP_MAPPED)
> > > +			return -EIO;
> > > +	}
> > 
> > IDGI. If the iomap had space allocated for this dio iteration,
> > then IOMAP_F_DIRTY will be set and it is likely (guaranteed for XFS)
> > that the iomap type will be IOMAP_UNWRITTEN. Indeed, if we are doing
> > a write into preallocated space (i.e. from fallocate()) then this
> > will cause -EIO on all RWF_ATOMIC IO to that file unless RWF_DSYNC
> > is also used.
> > 
> > "For a power fail, for each individual application block, all or
> > none of the data to be written."
> > 
> > Ok, does this mean RWF_ATOMIC still needs fdatasync() to guarantee
> > that the data makes it to stable storage? And the result is
> > undefined until fdatasync() is run, but the device will guarantee
> > that either all or none of the data will be on stable storage
> > prior to the next device cache flush completing?
> > 
> > i.e. does REQ_ATOMIC imply REQ_FUA, or does it require a separate
> > device cache flush to commit the atomic IO to stable storage?
> 
> From the SCSI and NVME device information that I've been presented, it
> sounds like an explicit cache flush or FUA is required to persist the
> data.

Ok, that makes it sound like RWF_ATOMIC has the same data integrity
semantics as normal DIO submission. i.e. the application has to
specify data integrity requirements and/or provide integrity
checkpoints itself.

> > What about ordering - do the devices guarantee strict ordering of
> > REQ_ATOMIC writes? i.e. if atomic write N is seen on disk, then all
> > the previous atomic writes up to N will also be seen on disk? If
> > not, how does the application and filesystem guarantee persistence
> > of completed atomic writes?
> 
> I /think/ the applications have to ensure ordering themselves.  If Y
> cannot appear before X is persisted, then the application must wait for
> the ack for X, flush the cache, and only then send Y.
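
i.e. in application terms, a sketch of the discipline would be:

  /* Y must not be observable before X: persist X first */
  pwritev2(fd, &iov_x, 1, off_x, RWF_ATOMIC);
  fdatasync(fd);                  /* flush the device cache */
  pwritev2(fd, &iov_y, 1, off_y, RWF_ATOMIC);

(iov_x/iov_y and the offsets being whatever the application writes.)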

Right, I'd expect that completion-to-submission ordering is required
with RWF_ATOMIC the same way it is required for normal DIO, but I've
been around long enough to know that we can't make assumptions about
data integrity semantics...

> > i.e. If we still need a post-IO device cache flush to guarantee
> > persistence and/or ordering of RWF_ATOMIC IOs, then the above code
> > makes no sense - we'll still need fdatasync() to provide persistence
> > checkpoints and that means we ensure metadata is also up to date
> > at those checkpoints.
> 
> I'll let the block layer developers weigh in on this, but I /think/ this
> means that we require RWF_DSYNC for atomic block writes to written
> mappings, and RWF_SYNC if iomap_begin gives us an unwritten/hole/dirty
> mapping.

RWF_DSYNC is functionally the same as RWF_OSYNC. The only difference
is that RWF_OSYNC considers timestamps as dirty metadata, whilst
RWF_DSYNC doesn't. Hence I don't think there's any functional
difference w.r.t. data integrity by using OSYNC vs DSYNC...

> > > @@ -592,6 +634,32 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >  
> > >  	blk_start_plug(&plug);
> > >  	while ((ret = iomap_iter(&iomi, ops)) > 0) {
> > > +		if (atomic_write) {
> > > +			const struct iomap *_iomap = &iomi.iomap;
> > > +			loff_t iomi_length = iomap_length(&iomi);
> > > +
> > > +			/*
> > > +			 * Ensure length and start address is a multiple of
> > > +			 * atomic_write_unit - this is critical. If the length
> > > +			 * is not a multiple of atomic_write_unit, then we
> > > +			 * cannot create a set of bio's in iomap_dio_bio_iter()
> > > +			 * who are each a length which is a multiple of
> > > +			 * atomic_write_unit.
> > > +			 *
> > > +			 * Note: It may be more appropriate to have this check
> > > +			 *	 in iomap_dio_bio_iter()
> > > +			 */
> > > +			if ((iomap_sector(_iomap, iomi.pos) << SECTOR_SHIFT) %
> 
> The file offset (and by extension the position) are not important for
> deciding if we can issue an atomic write.  Only the mapped LBA space on
> the underlying device is important.
> 
> IOWs, if we have a disk that can write a 64k aligned block atomically,
> iomap only has to check that iomap->addr is aligned to a 64k boundary.
> If that space happens to be mapped to file offset 57k, then it is indeed
> possible to perform a 64k atomic write to the file starting at offset
> 57k and ending at offset 121k, right?

Yup, that was kinda what I was implying in pointing out that file
offset does not reflect device IO alignment...
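
In ->iomap_begin() terms the check would be roughly (sketch; the
unit variable is illustrative):

  /* judge alignment on the mapped disk address, not on the file
   * offset the write happens to land at */
  if (!IS_ALIGNED(iomap->addr, atomic_write_unit))
          return -EINVAL;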

> > Hence I think that we should be rejecting RWF_ATOMIC IOs that are
> > larger than the maximum atomic write unit or cannot be dispatched in
> > a single IO e.g. filesystem has allocated multiple minimum aligned
> > extents and so a max len atomic write IO over that range must be
> > broken up into multiple smaller IOs.
> > 
> > We should be doing max atomic write size rejection high up in the IO
> > path (e.g. filesystem ->write_iter() method) before we get anywhere
> > near the DIO path, and we should be rejecting atomic write IOs in
> > the DIO path during the ->iomap_begin() mapping callback if we can't
> > map the entire atomic IO to a single aligned filesystem extent.
> > 
> > i.e. the alignment checks and constraints need to be applied by the
> > filesystem mapping code, not the layer that packs the pages into the
> > bio as directed by the filesystem mapping....
> 
> Hmm.  I think I see what you're saying here -- iomap should communicate
> to ->iomap_begin that we want to perform an atomic write, and there had
> better be either (a) a properly aligned mapping all ready to go; or (b)
> the fs must perform an aligned allocation and map that in, or return no
> mapping so the write fails.

Exactly. This is how IOCB_NOWAIT works, too - we can reject it high
up in the IO path if we can't get locks, and then if we have to do
allocation in ->iomap_begin because there is no mapping available we
reject the IO there.

Hence I think we should use the same constraint checking model for
RWF_ATOMIC - the constraints are slightly different, but the layers
at which we can first resolve the various constraints are exactly
the same...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-05 23:31           ` Dave Chinner
@ 2023-05-06  0:08             ` Eric Biggers
  0 siblings, 0 replies; 50+ messages in thread
From: Eric Biggers @ 2023-05-06  0:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, axboe, kbusch, hch, sagi, martin.petersen, djwong,
	viro, brauner, dchinner, jejb, linux-block, linux-kernel,
	linux-nvme, linux-scsi, linux-xfs, linux-fsdevel,
	linux-security-module, paul, jmorris, serge, Himanshu Madhani

On Sat, May 06, 2023 at 09:31:52AM +1000, Dave Chinner wrote:
> On Fri, May 05, 2023 at 10:47:19PM +0000, Eric Biggers wrote:
> > On Fri, May 05, 2023 at 08:26:23AM +1000, Dave Chinner wrote:
> > > > ok, we can do that, but it would also then make the statx field
> > > > 64b. I'm fine with that if it is wise to do so - I don't want to
> > > > wastefully use up an extra 2 x 32b in struct statx.
> > > 
> > > Why do we need specific variables for DIO atomic write alignment
> > > limits? We already have direct IO alignment and size constraints in statx(),
> > > so why wouldn't we just reuse those variables when the user requests
> > > atomic limits for DIO?
> > > 
> > > i.e. if STATX_DIOALIGN is set, we return normal DIO alignment
> > > constraints. If STATX_DIOALIGN_ATOMIC is set, we return the atomic
> > > DIO alignment requirements in those variables.....
> > > 
> > > Yes, we probably need the dio max size to be added to statx for
> > > this. Historically speaking, I wanted statx to support this in the
> > > first place because that's what we were already giving userspace
> > > with XFS_IOC_DIOINFO and we already knew that atomic IO when it came
> > > along would require a bound maximum IO size much smaller than normal
> > > DIO limits.  i.e.:
> > > 
> > > struct dioattr {
> > >         __u32           d_mem;          /* data buffer memory alignment */
> > >         __u32           d_miniosz;      /* min xfer size                */
> > >         __u32           d_maxiosz;      /* max xfer size                */
> > > };
> > > 
> > > where d_miniosz defined the alignment and size constraints for DIOs.
> > > 
> > > If we simply document that STATX_DIOALIGN_ATOMIC returns minimum
> > > (unit) atomic IO size and alignment in statx->dio_offset_align (as
> > > per STATX_DIOALIGN) and the maximum atomic IO size in
> > > statx->dio_max_iosize, then we don't burn up anywhere near as much
> > > space in the statx structure....
> > 
> > I don't think that's how statx() is meant to work.  The request mask
> > is a bitmask, and the user can request an arbitrary combination of
> > different items.  For example, the user could request both
> > STATX_DIOALIGN and STATX_WRITE_ATOMIC at the same time.  That doesn't
> > work if different items share the same fields.
> 
> Sure it does - what is contained in the field on return is defined
> by the result mask. In this case, whatever the filesystem puts in
> the DIO fields will match which flag it asserts in the result mask.
> 
> i.e. if the application wants RWF_ATOMIC and so asks for STATX_DIOALIGN |
> STATX_DIOALIGN_ATOMIC in the request mask then:
> 
> - if the filesystem does not support RWF_ATOMIC it fills in the
>   normal DIO alignment values and puts STATX_DIOALIGN in the result
>   mask.
> 
>   Now the application knows that it can't use RWF_ATOMIC, and it
>   doesn't need to do another statx() call to get the dio alignment
>   values it needs.
> 
> - if the filesystem supports RWF_ATOMIC, it fills in the values with
>   the atomic DIO constraints and puts STATX_DIOALIGN_ATOMIC in the
>   result mask.
> 
>   Now the application knows it can use RWF_ATOMIC and has the atomic
>   DIO constraints in the dio alignment fields returned.
> 
> This uses the request/result masks exactly as intended, yes?
> 

We could certainly implement some scheme like that, but I don't think that was
how statx() was intended to work.  I think that each bit in the mask was
intended to correspond to an independent piece of information.
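
For example, with independent fields an application could request both
pieces of information in one call.  A minimal sketch, assuming the
STATX_WRITE_ATOMIC name used above and the atomic_write_unit_* statx
fields proposed in this series (setup_dio()/setup_atomic() are just
illustrative placeholders):

	struct statx stx;

	if (statx(fd, "", AT_EMPTY_PATH,
		  STATX_DIOALIGN | STATX_WRITE_ATOMIC, &stx) < 0)
		return -1;

	if (stx.stx_mask & STATX_DIOALIGN)
		/* existing DIO constraints, unchanged */
		setup_dio(stx.stx_dio_mem_align, stx.stx_dio_offset_align);

	if (stx.stx_mask & STATX_WRITE_ATOMIC)
		/* hypothetical fields carrying the atomic limits */
		setup_atomic(stx.stx_atomic_write_unit_min,
			     stx.stx_atomic_write_unit_max);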

- Eric

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-05 23:18           ` Dave Chinner
@ 2023-05-06  9:38             ` John Garry
  2023-05-07  2:35             ` Martin K. Petersen
  1 sibling, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-06  9:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Himanshu Madhani

On 06/05/2023 00:18, Dave Chinner wrote:

Hi Dave,

>> Consider iomap generates an atomic write bio for a single userspace block
>> and submits to the block layer - if the block layer needs to split due to
>> block driver request_queue limits, like max_segments, then we're in trouble.
>> So we need to limit atomic_write_unit_max such that this will not occur.
>> That same limit should not apply to atomic_write_max_bytes.
> Except the block layer doesn't provide any mechanism to do
> REQ_ATOMIC IOs larger than atomic_write_unit_max. So in what case
> will atomic_write_max_bytes > atomic_write_unit_max ever be
> relevant to anyone?

atomic_write_max_bytes is only relevant to the block layer.

Consider userspace does a single pwritev2 RWF_ATOMIC call of size 1M and 
offset aligned to 8K, which is 128x 8k userspace data blocks. It so 
happens that this write spans 2x extents right in the middle (for 
simplicity of the example), so iomap produces 2x bios, each of size 
512K. In this example atomic_write_max_bytes is 256K. So when those 2x 
bios are submitted to the block layer, they each need to be split into 
2x 256K bios - so we end up with a total of 4x 256K bios being submitted 
to the device. Splitting into these 4x 256K bios will satisfy the 
guarantee to not split any 8k data blocks.

When the kernel handles this pwritev2 RWF_ATOMIC call, it deduces the 
userspace block size, but does not always split into multiple bios each 
of that size. When iomap or block fops creates a bio, it will still fill 
the bio as large as it can, but also it needs to trim such that it is 
always a multiple of the userspace block size. And when we submit this 
bio, the block layer will only split it when it needs to, e.g. when bio 
exceeds atomic_write_max_bytes in size, or crosses a boundary, or any 
other reason you can see in bio_split_rw(). Please see patch 8/16 for 
how this is done.
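
For reference, the userspace side of that example is a single call - a
sketch, assuming the RWF_ATOMIC flag proposed in this series:

	/* 1M write of 128x 8K blocks at an 8K-aligned offset */
	struct iovec iov = {
		.iov_base = buf,	/* suitably aligned for O_DIRECT */
		.iov_len  = 1024 * 1024,
	};
	ssize_t ret = pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);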

> 
>> b. For NVMe, atomic_write_unit_max and atomic_write_max_bytes which the host
>> reports will be the same (ignoring a.).
>>
>> However for SCSI they may be different. SCSI has its own concept of boundary
>> and it is relevant here. This is confusing as it is very different from NVMe
>> boundary. NVMe is a media boundary really. For SCSI, a boundary is a
>> sub-segment at which the device may split an atomic write operation. For a SCSI
>> device which only supports this boundary mode of operation, we limit
>> atomic_write_unit_max to the max boundary segment size (such that we don't
>> get splitting of an atomic write by the device) and then limit
>> atomic_write_max_bytes to what is known in the spec as "maximum atomic
>> transfer length with boundary". So in this device mode of operation,
>> atomic_write_max_bytes and atomic_write_unit_max should be different.
> But if the application is limited to atomic_write_unit_max sized
> IOs, and that is always less than or equal to the size of the atomic
> write boundary, why does the block layer even need to care about
> this whacky quirk of the SCSI protocol implementation?

The block layer just exposes some atomic write queue limits for the 
block device driver to set. We tried to make the limits generic so that 
they fit the orthogonal features and wackiness of both NVMe and SCSI.

> 
> The block layer shouldn't even need to be aware that SCSI can split
> "atomic" IOs into smaller individual IOs that result in the larger
> requested IO being non-atomic. The SCSI layer should just expose
> "write with boundary" as the max atomic IO size it supports to the
> block layer.
> 
> At this point, both atomic_write_max_bytes and atomic write
> boundary size are completely irrelevant to anything in the block
> layer or above. If userspace is limited to atomic_write_unit_max IO
> sizes and it is enforced at the ->write_iter() layer, then the block
> layer will never need to split REQ_ATOMIC bios because the entire
> stack has already stated that it guarantees atomic_write_unit_max
> bios will not get split....

Please refer to my first point on this.

> 
> In what cases is hardware that supports atomic_write_max_bytes >
> atomic_write_unit_max actually useful?

This is only possible for SCSI. As I mentioned before, SCSI supports its 
own boundary feature, and it is very different from NVMe. For SCSI, it 
is a sub-segment at which a device may split an atomic write operation.

So consider the example where the max boundary size of the SCSI device 
is 8K, but the max atomic length with boundary is 256K. This would mean 
that an atomic write operation for the device may span up to 32x 8K 
segments. Each of the sub-segments may atomically complete in the 
device. So, in terms of fitting our atomic write request_queue limits 
for this device, atomic_write_max_bytes would be 256K and 
atomic_write_unit_max would be 8K.
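
To put that in code, the sd.c derivation would be roughly (a sketch
only; the spec names are from the SBC Block Limits VPD page, the
variable names are assumed):

	/* boundary-only SCSI device: sketch of the two queue limits */
	atomic_write_unit_max  = max_atomic_boundary_size * lbs;
	atomic_write_max_bytes = max_atomic_xfer_len_with_boundary * lbs;

where lbs is the logical block size.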

> I can see one situation,
> and one situation only: merging adjacent small REQ_ATOMIC write
> requests into single larger IOs before issuing them to the hardware.
> 
> This is exactly the sort of optimisation the block layers should be
> doing - it fits perfectly with the SCSI "write with boundary"
> behaviour - the merged bios can be split by the hardware at the
> point where they were merged by the block layer, and everything is
> fine because the are independent IOs, not a single RWF_ATOMIC IO
> from userspace. And for NVMe, it allows IOs from small atomic write
> limits (because, say, 16kB RAID stripe unit) to be merged into
> larger atomic IOs with no penalty...

Yes, that is another scenario. FWIW, we disallow merging currently, but 
it should be possible to support it.

> 
> 
>>>>   From your review on the iomap patch, I assume that now you realise that we
>>>> are proposing a write which may include multiple application data blocks
>>>> (each limited in size to atomic_write_unit_max), and the limit in total size
>>>> of that write is atomic_write_max_bytes.
>>> I still don't get it - you haven't explained why/what an application
>>> atomic block write might be, nor why the block device should be
>>> determining the size of application data blocks, etc.  If the block
>>> device can do 128kB atomic writes, why wouldn't the device allow the
>>> application to do 128kB atomic writes if they've aligned the atomic
>>> write correctly?
>> An application block needs to be:
>> - sized at a power-of-two
>> - sized between atomic_write_unit_min and atomic_write_unit_max, inclusive
>> - naturally aligned
>>
>> Please consider that the application does not explicitly tell the kernel the
>> size of its data blocks, it's implied from the size of the write and file
>> offset. So, assuming that userspace follows the rules properly when issuing
>> a write, the kernel may deduce the application block size and ensure only
>> that each individual user data block is not split.
> That's just *gross*. The kernel has no business assuming anything
> about the data layout inside an IO request. The kernel cannot assume
> that the application uses a single IO size for atomic writes when it
> explicitly provides a range of IO sizes that the application can use.
> 
> e.g. min unit = 4kB, max unit = 128kB allows IO sizes of 4kB, 8kB,
> 16kB, 32kB, 64kB and 128kB. How does the kernel infer what that
> application data block size is based on a 32kB atomic write vs a
> 128kB atomic write?

From the requirement that userspace always naturally aligns writes to 
its own block size.

If a write so happens to be sized and aligned such that it could be 16x 
4k, or 8x 8k, or 4x 16K, or 2x 32K, or 1x 64K, then the kernel will 
always assume the largest size, i.e. we will assume 64K in this example, 
and always submit a 64K atomic write operation to the device.

We're coming at this from the database perspective, which generally uses 
fixed block sizes.
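
A sketch of that deduction (a hypothetical helper, not the code in this
series; assumes len != 0 and that the IO has already been validated
against atomic_write_unit_min):

	/* largest power-of-two unit implied by the write's pos and len */
	static unsigned int deduced_write_unit(loff_t pos, size_t len,
					       unsigned int unit_max)
	{
		u64 align = 1ULL << __ffs64((u64)pos | (u64)len);

		return min_t(u64, align, unit_max);
	}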

> 
> The kernel can't use file offset alignment to infer application
> block size, either. e.g. a 16kB write at 128kB could be a single
> 16kB data block, it could be 2x8kB data blocks, or it could be 4x4kB
> data blocks - they all follow the rules you set above. So how does
> the kernel know that for two of these cases it is safe to split the
> IO at 8kB, and for one it isn't safe at all?

As above, the kernel assumes the largest block size which fits the 
length and alignment of the write. In doing so, it will always satisfy 
the requirement to atomically write userspace data blocks.

> 
> AFAICS, there is no way the kernel can accurately derive this sort
> of information, so any assumptions that the "kernel can infer the
> application data layout" to split IOs correctly simply won't work.
> And that very important because we are talking about operations that
> provide data persistence guarantees....

What I describe is not ideal, and it would be nice for userspace to be 
able to explicitly tell the kernel its block size, so as to avoid any 
doubt and catch any userspace misbehavior.

Could there be a way to do this for both block devices and FS files, 
like fcntl?

> 
>> If userspace wants a guarantee of no splitting at all in its write, then it
>> may issue a write for a single userspace data block, e.g. userspace block
>> size is 16KB, then write at a file offset aligned to 16KB and a total write
>> size of 16KB will be guaranteed to be written atomically by the device.
> Exactly what has "userspace block size" got to do with the kernel
> providing a guarantee that a RWF_ATOMIC write of a 16kB buffer at
> offset 16kB will be written atomically?

Please let me know if I have not explained this well enough.

> 
>>> What happens when we get hardware that can do atomic writes at any
>>> alignment, of any size up to atomic_write_max_bytes? Because this
>>> interface defines atomic writes as "must be a multiple of 2 of
>>> atomic_write_unit_min" then hardware that can do atomic writes of
>>> any size can not be effectively utilised by this interface....
>>>
>>>> user applications should only pay attention to what we return from statx,
>>>> that being atomic_write_unit_min and atomic_write_unit_max.
>>>>
>>>> atomic_write_max_bytes and atomic_write_boundary is only relevant to the
>>>> block layer.
>>> If applications can issue multi-atomic_write_unit_max-block
>>> writes as a single, non-atomic, multi-bio RWF_ATOMIC pwritev2() IO
>>> and such IO is constrained to atomic_write_max_bytes, then
>>> atomic_write_max_bytes is most definitely relevant to user
>>> applications.
>> But we still do not guarantee that multi-atomic_write_unit_max-block writes
>> as a single, non-atomic, multi-bio RWF_ATOMIC pwritev2() IO and such IO is
>> constrained to atomic_write_max_bytes will be written atomically by the
>> device.
>>
>> Three things may happen in the kernel:
>> - we may need to split due to atomic boundary
> Block layer rejects the IO - cannot be performed atomically.

But what good is that to the user?

As far as the user is concerned, they tried to write some data and the 
kernel rejected it. They don't know why. Even worse, there is nothing 
they can do about it, apart from trying smaller-sized writes, which may 
not suit them.

In an ideal world we would not have atomic write boundaries or discontig 
FS extents or bio limits, but that is not what we have.

> 
>> - we may need to split due to the write spanning discontig extents
> Filesystem rejects the IO - cannot be performed atomically.
> 
>> - atomic_write_max_bytes may be much larger than what we could fit in a bio,
>> so may need multiple bios
> Filesystem/blockdev rejects the IO - cannot be performed atomically.
> 
>> And maybe more cases which do not come to mind.
> Relevant layer rejects the IO - cannot be performed atomically.
> 
>> So I am not sure what value there is in reporting atomic_write_max_bytes to
>> the user. The description would need to be something like "we guarantee that
>> if the total write length is greater than atomic_write_max_bytes, then all
>> data will never be submitted to the device atomically. Otherwise it might
>> be".
> Exactly my point - there's a change of guarantee that the kernel
> provides userspace at that point, and hence application developers
> need to know it exists and, likely, be able to discover that
> threshold programmatically.

hmmm, ok, if you think it would help the userspace programmer.

> 
> But this, to me, is a just another symptom of what I see as the
> wider issue here: trying to allow RWF_ATOMIC IO to do more than a
> *single atomic IO*.
> 
> This reeks of premature API optimisation. We should make
> RWF_ATOMIC do one thing, and one thing only: guaranteed single
> atomic IO submission.
> 
> It doesn't matter what data userspace is trying to write atomically;
> it only matters that the kernel submits the write as a single atomic
> unit to the hardware which then guarantees that it completes the
> whole IO as a single atomic unit.
> 
> What functionality the hardware can provide is largely irrelevant
> here; it's the IO semantics that we guarantee userspace that matter.
> The kernel interface needs to have simple, well defined behaviour
> and provide clear data persistence guarantees.

ok, I'm hearing you. So we could just say to the user: the rule is that 
you are allowed to submit a power-of-2 sized write of size between 
atomic_write_unit_min and atomic_write_unit_max and it must be naturally 
aligned. Nothing else is permitted for RWF_ATOMIC.
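
i.e. something like this (a sketch of the rule only, not the patch
code):

	/* the only writes RWF_ATOMIC would accept under this model */
	static bool valid_atomic_write(loff_t pos, size_t len,
				       unsigned int unit_min,
				       unsigned int unit_max)
	{
		return is_power_of_2(len) &&
		       len >= unit_min && len <= unit_max &&
		       IS_ALIGNED(pos, len);	/* naturally aligned */
	}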

But then for a 10M write of 8K blocks, we have userspace issuing 1280x 
pwritev2() calls, which isn't efficient. However, if for example 
atomic_write_unit_max was 128K, the userspace app could issue 80x 
pwritev2() calls. Still, not ideal. The block layer could be merging a 
lot of those writes, but we still have overhead per pwritev2(). And 
block layer merging takes many CPU cycles also.

In our proposal, we're just giving userspace the option to write a large 
range of blocks in a single pwritev2() call, hopefully improving 
performance. Most often, we would not be getting any splitting - 
splitting should only really happen at the extremes, so it should give 
good performance. We don't have performance figures yet, though, to 
support this point.

> 
> Once we have that, we can optimise both the applications and the
> kernel implementation around that behaviour and guarantees. e.g.
> adjacent IO merging (either in the application or in the block
> layer), using AIO/io_uring with completion to submission ordering,
> etc.
> 
> There are many well known IO optimisation techniques that do not
> require the kernel to infer or assume the format of the data in the
> user buffers as this current API does. Make the API simple and hard
> to get wrong first, then optimise from there....
> 

OK, so these could help. We need performance figures...

> 
> 
>>> We already have direct IO alignment and size constraints in statx(),
>>> so why wouldn't we just reuse those variables when the user requests
>>> atomic limits for DIO?
>>>
>>> i.e. if STATX_DIOALIGN is set, we return normal DIO alignment
>>> constraints. If STATX_DIOALIGN_ATOMIC is set, we return the atomic
>>> DIO alignment requirements in those variables.....
>>>
>>> Yes, we probably need the dio max size to be added to statx for
>>> this. Historically speaking, I wanted statx to support this in the
>>> first place because that's what we were already giving userspace
>>> with XFS_IOC_DIOINFO and we already knew that atomic IO when it came
>>> along would require a bound maximum IO size much smaller than normal
>>> DIO limits.  i.e.:
>>>
>>> struct dioattr {
>>>           __u32           d_mem;          /* data buffer memory alignment */
>>>           __u32           d_miniosz;      /* min xfer size                */
>>>           __u32           d_maxiosz;      /* max xfer size                */
>>> };
>>>
>>> where d_miniosz defined the alignment and size constraints for DIOs.
>>>
>>> If we simply document that STATX_DIOALIGN_ATOMIC returns minimum
>>> (unit) atomic IO size and alignment in statx->dio_offset_align (as
>>> per STATX_DIOALIGN) and the maximum atomic IO size in
>>> statx->dio_max_iosize, then we don't burn up anywhere near as much
>>> space in the statx structure....
>> ok, so you are saying to unionize them, right? That would seem reasonable to
>> me.
> No, I don't recommend unionising them.

ah, ok, I assumed that we would not support asking for both 
STATX_DIOALIGN and STATX_DIOATOMIC. I should have read your proposal 
more closely.

> RWF_ATOMIC only applies to
> direct IO, so if the application ask for ATOMIC DIO limits, we put
> the atomic dio limits in the dio limits variables rather than the
> looser non-atomic dio limits......

I see. TBH, I'm not sure about this and will let other experts comment.

Thanks,
John


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-05 22:00           ` Darrick J. Wong
@ 2023-05-07  1:59             ` Martin K. Petersen
  0 siblings, 0 replies; 50+ messages in thread
From: Martin K. Petersen @ 2023-05-07  1:59 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, Dave Chinner, axboe, kbusch, hch, sagi,
	martin.petersen, viro, brauner, dchinner, jejb, linux-block,
	linux-kernel, linux-nvme, linux-scsi, linux-xfs, linux-fsdevel,
	linux-security-module, paul, jmorris, serge, Himanshu Madhani


Darrick,

> Could a SCSI device advertise 512b LBAs, 4096b physical blocks,
> a 64k atomic_write_unit_max, and a 1MB maximum transfer length
> (atomic_write_max_bytes)?

Yes.

> And does that mean that application software can send one 64k-aligned
> write and expect it either to be persisted completely or not at all?

Yes.

> And, does that mean that the application can send up to 16 of these
> 64k-aligned blocks as a single 1MB IO and expect that each of those 16
> blocks will either be persisted entirely or not at all?

Yes.

> There doesn't seem to be any means for the device to report /which/ of
> the 16 were persisted, which is disappointing. But maybe the
> application encodes LSNs and can tell after the fact that something
> went wrong, and recover?

Correct. Although we traditionally haven't had too much fun with partial
completion for sequential I/O either.

> If the same device reports a 2048b atomic_write_unit_min, does that mean
> that I can send between 2 and 64k of data as a single atomic write and
> that's ok?  I assume that this weird situation (512b LBA, 4k physical,
> 2k atomic unit min) requires some fancy RMW but that the device is
> prepared to cr^Wpersist that correctly?

Yes.

It would not make much sense for a device to report a minimum atomic
granularity smaller than the reported physical block size. But in theory
it could.

> What if the device also advertises a 128k atomic_write_boundary?
> That means that a 2k atomic block write will fail if it starts at 127k,
> but if it starts at 126k then that's ok.  Right?

Correct.
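
In LBA terms the rule is simply (a sketch):

	/* true if [lba, lba + nr) crosses an atomic write boundary */
	static bool crosses_boundary(u64 lba, u32 nr, u32 boundary)
	{
		if (!boundary)
			return false;
		return lba / boundary != (lba + nr - 1) / boundary;
	}

With a 128k boundary (256 512b LBAs), a 2k write at 127k spans LBAs
254-257 and crosses; the same write at 126k spans LBAs 252-255 and does
not.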

> As for avoiding splits in the block layer, I guess that also means that
> someone needs to reduce atomic_write_unit_max and atomic_write_boundary
> if (say) some sysadmin decides to create a raid0 of these devices with a
> 32k stripe size?

Correct. Atomic limits will need to be stacked for MD and DM like we do
with the remaining queue limits.

> It sounds like NVME is simpler in that it would report 64k for both the
> max unit and the max transfer length?  And for the 1M write I mentioned
> above, the application must send 16 individual writes?

Correct.

> With my app developer hat on, the simplest mental model of this is that
> if I want to persist a blob of data that is larger than one device LBA,
> then atomic_write_unit_min <= blob size <= atomic_write_unit_max must be
> true, and the LBA range for the write cannot cross a atomic_write_boundary.
>
> Does that sound right?

Yep.

> Going back to my sample device above, the XFS buffer cache could write
> individual 4k filesystem metadata blocks using REQ_ATOMIC because 4k is
> between the atomic write unit min/max, 4k metadata blocks will never
> cross a 128k boundary, and we'd never have to worry about torn writes
> in metadata ever again?

Correct.

> Furthermore, if I want to persist a bunch of blobs in a contiguous LBA
> range and atomic_write_max_bytes > atomic_write_unit_max, then I can do
> that with a single direct write?

Yes.

> I'm assuming that the blobs in the middle of the range must all be
> exactly atomic_write_unit_max bytes in size?

If you care about each blob being written atomically, yes.

> And I had better be prepared to (I guess) re-read the entire range
> after the system goes down to find out if any of them did or did not
> persist?

If you crash or get an I/O error, then yes. There is no way to inquire
which blobs were written. Just like we don't know which LBAs were
written if the OS crashes in the middle of a regular write operation.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-05 23:18           ` Dave Chinner
  2023-05-06  9:38             ` John Garry
@ 2023-05-07  2:35             ` Martin K. Petersen
  1 sibling, 0 replies; 50+ messages in thread
From: Martin K. Petersen @ 2023-05-07  2:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, axboe, kbusch, hch, sagi, martin.petersen, djwong,
	viro, brauner, dchinner, jejb, linux-block, linux-kernel,
	linux-nvme, linux-scsi, linux-xfs, linux-fsdevel,
	linux-security-module, paul, jmorris, serge, Himanshu Madhani


Dave,

> But if the application is limited to atomic_write_unit_max sized
> IOs, and that is always less than or equal to the size of the atomic
> write boundary, why does the block layer even need to care about
> this whacky quirk of the SCSI protocol implementation?

Dealing with boundaries is mainly an NVMe issue. NVMe boundaries are
fixed in LBA space. SCSI boundaries are per-I/O.

> In what cases is hardware that supports atomic_write_max_bytes >
> atomic_write_unit_max actually useful?

The common case is a database using 16K blocks and wanting to do 1M
writes for performance reasons.

> There are many well known IO optimisation techniques that do not
> require the kernel to infer or assume the format of the data in the
> user buffers as this current API does. Make the API simple and hard
> to get wrong first, then optimise from there....

We discussed whether it made sense to have an explicit interface to set
an "application" block size when creating a file. I am not against it,
but our experience is that it doesn't buy you anything over what the
careful alignment of powers-of-two provides. As long as everything is
properly aligned, there is no need for the kernel to infer or assume
anything. It's the application's business what it is doing inside the
file.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-03 18:38 ` [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits John Garry
  2023-05-03 21:39   ` Dave Chinner
@ 2023-05-09  0:19   ` Mike Snitzer
  2023-05-17 17:02     ` John Garry
  1 sibling, 1 reply; 50+ messages in thread
From: Mike Snitzer @ 2023-05-09  0:19 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Himanshu Madhani, dm-devel

On Wed, May 3, 2023 at 2:40 PM John Garry <john.g.garry@oracle.com> wrote:
>
> From: Himanshu Madhani <himanshu.madhani@oracle.com>
>
> Add the following limits:
> - atomic_write_boundary
> - atomic_write_max_bytes
> - atomic_write_unit_max
> - atomic_write_unit_min
>
> Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  Documentation/ABI/stable/sysfs-block | 42 +++++++++++++++++++++
>  block/blk-settings.c                 | 56 ++++++++++++++++++++++++++++
>  block/blk-sysfs.c                    | 33 ++++++++++++++++
>  include/linux/blkdev.h               | 23 ++++++++++++
>  4 files changed, 154 insertions(+)
>

...

> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 896b4654ab00..e21731715a12 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -59,6 +59,9 @@ void blk_set_default_limits(struct queue_limits *lim)
>         lim->zoned = BLK_ZONED_NONE;
>         lim->zone_write_granularity = 0;
>         lim->dma_alignment = 511;
> +       lim->atomic_write_unit_min = lim->atomic_write_unit_max = 1;
> +       lim->atomic_write_max_bytes = 512;
> +       lim->atomic_write_boundary = 0;
>  }

Not seeing required changes to blk_set_stacking_limits() nor blk_stack_limits().

Sorry to remind you of DM and MD limits stacking requirements. ;)

Mike

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits
  2023-05-09  0:19   ` Mike Snitzer
@ 2023-05-17 17:02     ` John Garry
  0 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2023-05-17 17:02 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: axboe, kbusch, hch, sagi, martin.petersen, djwong, viro, brauner,
	dchinner, jejb, linux-block, linux-kernel, linux-nvme,
	linux-scsi, linux-xfs, linux-fsdevel, linux-security-module,
	paul, jmorris, serge, Himanshu Madhani, dm-devel

On 09/05/2023 01:19, Mike Snitzer wrote:
> On Wed, May 3, 2023 at 2:40 PM John Garry <john.g.garry@oracle.com> wrote:
>>
>> From: Himanshu Madhani <himanshu.madhani@oracle.com>
>>
>> Add the following limits:
>> - atomic_write_boundary
>> - atomic_write_max_bytes
>> - atomic_write_unit_max
>> - atomic_write_unit_min
>>
>> Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   Documentation/ABI/stable/sysfs-block | 42 +++++++++++++++++++++
>>   block/blk-settings.c                 | 56 ++++++++++++++++++++++++++++
>>   block/blk-sysfs.c                    | 33 ++++++++++++++++
>>   include/linux/blkdev.h               | 23 ++++++++++++
>>   4 files changed, 154 insertions(+)
>>
> 
> ...
> 
>> diff --git a/block/blk-settings.c b/block/blk-settings.c
>> index 896b4654ab00..e21731715a12 100644
>> --- a/block/blk-settings.c
>> +++ b/block/blk-settings.c
>> @@ -59,6 +59,9 @@ void blk_set_default_limits(struct queue_limits *lim)
>>          lim->zoned = BLK_ZONED_NONE;
>>          lim->zone_write_granularity = 0;
>>          lim->dma_alignment = 511;
>> +       lim->atomic_write_unit_min = lim->atomic_write_unit_max = 1;
>> +       lim->atomic_write_max_bytes = 512;
>> +       lim->atomic_write_boundary = 0;
>>   }
> 
> Not seeing required changes to blk_set_stacking_limits() nor blk_stack_limits().
> 
> Sorry to remind you of DM and MD limits stacking requirements. ;)
> 

Hi Mike,

Sorry for the slow response.

The idea is that initially we would not be adding stacked device 
support, so we can leave the atomic defaults at the min unit we always 
consider atomic, i.e. the logical block size/fixed 512B sector size.
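
If/when we do add it, I'd expect the stacking to mirror the existing
queue limits - an untested sketch of possible blk_stack_limits()
additions, not part of this series:

	t->atomic_write_unit_min = max(t->atomic_write_unit_min,
				       b->atomic_write_unit_min);
	t->atomic_write_unit_max = min_not_zero(t->atomic_write_unit_max,
						b->atomic_write_unit_max);
	t->atomic_write_max_bytes = min_not_zero(t->atomic_write_max_bytes,
						 b->atomic_write_max_bytes);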

Thanks,
John


^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2023-05-17 17:08 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-03 18:38 [PATCH RFC 00/16] block atomic writes John Garry
2023-05-03 18:38 ` [PATCH RFC 01/16] block: Add atomic write operations to request_queue limits John Garry
2023-05-03 21:39   ` Dave Chinner
2023-05-04 18:14     ` John Garry
2023-05-04 22:26       ` Dave Chinner
2023-05-05  7:54         ` John Garry
2023-05-05 22:00           ` Darrick J. Wong
2023-05-07  1:59             ` Martin K. Petersen
2023-05-05 23:18           ` Dave Chinner
2023-05-06  9:38             ` John Garry
2023-05-07  2:35             ` Martin K. Petersen
2023-05-05 22:47         ` Eric Biggers
2023-05-05 23:31           ` Dave Chinner
2023-05-06  0:08             ` Eric Biggers
2023-05-09  0:19   ` Mike Snitzer
2023-05-17 17:02     ` John Garry
2023-05-03 18:38 ` [PATCH RFC 02/16] fs/bdev: Add atomic write support info to statx John Garry
2023-05-03 21:58   ` Dave Chinner
2023-05-04  8:45     ` John Garry
2023-05-04 22:40       ` Dave Chinner
2023-05-05  8:01         ` John Garry
2023-05-05 22:04           ` Darrick J. Wong
2023-05-03 18:38 ` [PATCH RFC 03/16] xfs: Support atomic write for statx John Garry
2023-05-03 22:17   ` Dave Chinner
2023-05-05 22:10     ` Darrick J. Wong
2023-05-03 18:38 ` [PATCH RFC 04/16] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
2023-05-03 18:38 ` [PATCH RFC 05/16] block: Add REQ_ATOMIC flag John Garry
2023-05-03 18:38 ` [PATCH RFC 06/16] block: Limit atomic writes according to bio and queue limits John Garry
2023-05-03 18:53   ` Keith Busch
2023-05-04  8:24     ` John Garry
2023-05-03 18:38 ` [PATCH RFC 07/16] block: Add bdev_find_max_atomic_write_alignment() John Garry
2023-05-03 18:38 ` [PATCH RFC 08/16] block: Add support for atomic_write_unit John Garry
2023-05-03 18:38 ` [PATCH RFC 09/16] block: Add blk_validate_atomic_write_op() John Garry
2023-05-03 18:38 ` [PATCH RFC 10/16] block: Add fops atomic write support John Garry
2023-05-03 18:38 ` [PATCH RFC 11/16] fs: iomap: Atomic " John Garry
2023-05-04  5:00   ` Dave Chinner
2023-05-05 21:19     ` Darrick J. Wong
2023-05-05 23:56       ` Dave Chinner
2023-05-03 18:38 ` [PATCH RFC 12/16] xfs: Add support for fallocate2 John Garry
2023-05-03 23:26   ` Dave Chinner
2023-05-05 22:23     ` Darrick J. Wong
2023-05-05 23:42       ` Dave Chinner
2023-05-03 18:38 ` [PATCH RFC 13/16] scsi: sd: Support reading atomic properties from block limits VPD John Garry
2023-05-03 18:38 ` [PATCH RFC 14/16] scsi: sd: Add WRITE_ATOMIC_16 support John Garry
2023-05-03 18:48   ` Bart Van Assche
2023-05-04  8:17     ` John Garry
2023-05-03 18:38 ` [PATCH RFC 15/16] scsi: scsi_debug: Atomic write support John Garry
2023-05-03 18:38 ` [PATCH RFC 16/16] nvme: Support atomic writes John Garry
2023-05-03 18:49   ` Bart Van Assche
2023-05-04  8:19     ` John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).