* [PATCH 00/21] block atomic writes
@ 2023-09-29 10:27 John Garry
  2023-09-29 10:27 ` [PATCH 01/21] block: Add atomic write operations to request_queue limits John Garry
                   ` (21 more replies)
  0 siblings, 22 replies; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

This series introduces a proposal for implementing atomic writes in the
kernel for torn-write protection.

This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically".

Only direct IO is supported, and only for block devices and XFS.

The atomic writes feature requires dedicated HW support, such as the
SCSI WRITE_ATOMIC_16 command.

A man pages update has been posted at:
https://lore.kernel.org/linux-api/20230929093717.2972367-1-john.g.garry@oracle.com/T/#t

The goal here is to provide an interface that allows applications to use
application-specific block sizes larger than the logical block size
reported by the storage device or larger than the filesystem block size as
reported by stat().

With this new interface, application blocks will never be torn or
fractured when written. On a power failure, for each individual application
block, either all or none of the data will be written. A racing atomic
write and read will mean that the read sees either all the old data or all
the new data, but never a mix of old and new.

Two new fields are added to struct statx - atomic_write_unit_min and
atomic_write_unit_max. For each individual atomic write, the total length
of the write must be between atomic_write_unit_min and
atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
at a naturally-aligned offset in the file with respect to the write length.
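
As a minimal userspace sketch of the intended flow (illustrative only: it
assumes headers carrying the STATX_WRITE_ATOMIC/stx_atomic_write_unit_*
additions from patch 3 and the RWF_ATOMIC flag from patch 4, and "buf" is
a suitably aligned O_DIRECT buffer; error handling omitted):

  struct statx stx = { };
  struct iovec iov;
  int fd = open("/dev/sdb", O_RDWR | O_DIRECT);

  /* discover the atomic write limits of the block device */
  statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC, &stx);

  /* write one application block: a power-of-2 length within
     [stx_atomic_write_unit_min, stx_atomic_write_unit_max], at a file
     offset naturally aligned to that length */
  iov.iov_base = buf;
  iov.iov_len = stx.stx_atomic_write_unit_min;
  pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);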

For XFS, we must ensure extent alignment with the userspace block size.
XFS supports an extent size hint, but it must be ensured that the hint is
always honoured. For this, a new flag - forcealign - is added to instruct
the XFS block allocator to always honour the extent size hint.

The user would typically set the extent size hint at the userspace
block size to support atomic writes.
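
For illustration, one way to do that from userspace, assuming the
FS_XFLAG_FORCEALIGN xflag lands as proposed in patch 12 (the 64 KiB value
is only an example):

  struct fsxattr fsx;

  ioctl(fd, FS_IOC_FSGETXATTR, &fsx);
  fsx.fsx_extsize = 64 * 1024;	/* userspace block size */
  fsx.fsx_xflags |= FS_XFLAG_EXTSIZE | FS_XFLAG_FORCEALIGN;
  ioctl(fd, FS_IOC_FSSETXATTR, &fsx);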

The atomic_write_unit_{min, max} values from statx on an XFS file will
consider both the backing bdev atomic_write_unit_{min, max} values and
the extent alignment for the file.

Kernel support is added for SCSI sd.c, scsi_debug and NVMe.

xfsprogs update for forcealign is at:
https://lore.kernel.org/linux-xfs/20230929095342.2976587-1-john.g.garry@oracle.com/T/#t

This series is based on v6.6-rc3.

Major changes since RFC (https://lore.kernel.org/linux-scsi/20230503183821.1473305-1-john.g.garry@oracle.com/):
- Add XFS forcealign feature
- Only allow writing a single userspace block

Alan Adamson (1):
  nvme: Support atomic writes

Darrick J. Wong (3):
  fs: xfs: Introduce FORCEALIGN inode flag
  fs: xfs: Make file data allocations observe the 'forcealign' flag
  fs: xfs: Enable file data forcealign feature

Himanshu Madhani (2):
  block: Add atomic write operations to request_queue limits
  block: Add REQ_ATOMIC flag

John Garry (13):
  block: Limit atomic writes according to bio and queue limits
  block: Pass blk_queue_get_max_sectors() a request pointer
  block: Limit atomic write IO size according to
    atomic_write_max_sectors
  block: Error an attempt to split an atomic write bio
  block: Add checks to merging of atomic writes
  block: Add fops atomic write support
  fs: xfs: Don't use low-space allocator for alignment > 1
  fs: xfs: Support atomic write for statx
  fs: iomap: Atomic write support
  fs: xfs: iomap atomic write support
  scsi: sd: Support reading atomic properties from block limits VPD
  scsi: sd: Add WRITE_ATOMIC_16 support
  scsi: scsi_debug: Atomic write support

Prasad Singamsetty (2):
  fs/bdev: Add atomic write support info to statx
  fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support

 Documentation/ABI/stable/sysfs-block |  42 ++
 block/bdev.c                         |  33 +-
 block/blk-merge.c                    |  92 ++++-
 block/blk-mq.c                       |   2 +-
 block/blk-settings.c                 |  76 ++++
 block/blk-sysfs.c                    |  33 ++
 block/blk.h                          |   9 +-
 block/fops.c                         |  42 +-
 drivers/nvme/host/core.c             |  29 ++
 drivers/scsi/scsi_debug.c            | 587 +++++++++++++++++++++------
 drivers/scsi/scsi_trace.c            |  22 +
 drivers/scsi/sd.c                    |  57 ++-
 drivers/scsi/sd.h                    |   7 +
 fs/iomap/direct-io.c                 |  26 +-
 fs/iomap/trace.h                     |   3 +-
 fs/stat.c                            |  15 +-
 fs/xfs/libxfs/xfs_bmap.c             |  26 +-
 fs/xfs/libxfs/xfs_format.h           |   9 +-
 fs/xfs/libxfs/xfs_inode_buf.c        |  40 ++
 fs/xfs/libxfs/xfs_inode_buf.h        |   3 +
 fs/xfs/libxfs/xfs_sb.c               |   3 +
 fs/xfs/xfs_inode.c                   |  12 +
 fs/xfs/xfs_inode.h                   |   5 +
 fs/xfs/xfs_ioctl.c                   |  18 +
 fs/xfs/xfs_iomap.c                   |  40 +-
 fs/xfs/xfs_iops.c                    |  51 +++
 fs/xfs/xfs_iops.h                    |   4 +
 fs/xfs/xfs_mount.h                   |   2 +
 fs/xfs/xfs_super.c                   |   4 +
 include/linux/blk_types.h            |   2 +
 include/linux/blkdev.h               |  37 +-
 include/linux/fs.h                   |   1 +
 include/linux/iomap.h                |   1 +
 include/linux/stat.h                 |   2 +
 include/scsi/scsi_proto.h            |   1 +
 include/trace/events/scsi.h          |   1 +
 include/uapi/linux/fs.h              |   7 +-
 include/uapi/linux/stat.h            |   7 +-
 38 files changed, 1179 insertions(+), 172 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 124+ messages in thread

* [PATCH 01/21] block: Add atomic write operations to request_queue limits
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-10-03 16:40   ` Bart Van Assche
  2023-11-09 15:10   ` Christoph Hellwig
  2023-09-29 10:27 ` [PATCH 02/21] block: Limit atomic writes according to bio and queue limits John Garry
                   ` (20 subsequent siblings)
  21 siblings, 2 replies; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, Himanshu Madhani, John Garry

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Add the following limits:
- atomic_write_boundary_bytes
- atomic_write_max_bytes
- atomic_write_unit_max_bytes
- atomic_write_unit_min_bytes

All atomic write limits are initialised to 0 to indicate no atomic write
support. Stacked devices are not supported for now.
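
A minimal sketch of how a (non-stacked) driver might publish these limits
using the helpers added below; the queue pointer and the 4 KiB / 64 KiB
values are purely illustrative:

	blk_queue_atomic_write_unit_min_sectors(q, SZ_4K >> SECTOR_SHIFT);
	blk_queue_atomic_write_unit_max_sectors(q, SZ_64K >> SECTOR_SHIFT);
	blk_queue_atomic_write_max_bytes(q, SZ_64K);
	blk_queue_atomic_write_boundary_bytes(q, 0);	/* no boundary */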

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
#jpg: Heavy rewrite
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 Documentation/ABI/stable/sysfs-block | 42 +++++++++++++++++++
 block/blk-settings.c                 | 60 ++++++++++++++++++++++++++++
 block/blk-sysfs.c                    | 33 +++++++++++++++
 include/linux/blkdev.h               | 33 +++++++++++++++
 4 files changed, 168 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
index 1fe9a553c37b..05df7f74cbc1 100644
--- a/Documentation/ABI/stable/sysfs-block
+++ b/Documentation/ABI/stable/sysfs-block
@@ -21,6 +21,48 @@ Description:
 		device is offset from the internal allocation unit's
 		natural alignment.
 
+What:		/sys/block/<disk>/atomic_write_max_bytes
+Date:		May 2023
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] This parameter specifies the maximum atomic write
+		size reported by the device. An atomic write operation
+		must not exceed this number of bytes.
+
+
+What:		/sys/block/<disk>/atomic_write_unit_min_bytes
+Date:		May 2023
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] This parameter specifies the smallest block which can
+		be written atomically with an atomic write operation. All
+		atomic write operations must begin at an
+		atomic_write_unit_min boundary and must be multiples of
+		atomic_write_unit_min. This value must be a power-of-two.
+
+
+What:		/sys/block/<disk>/atomic_write_unit_max_bytes
+Date:		January 2023
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] This parameter defines the largest block which can be
+		written atomically with an atomic write operation. This
+		value must be a multiple of atomic_write_unit_min and must
+		be a power-of-two.
+
+
+What:		/sys/block/<disk>/atomic_write_boundary_bytes
+Date:		May 2023
+Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
+Description:
+		[RO] A device may need to internally split I/Os which
+		straddle a given logical block address boundary. In that
+		case a single atomic write operation will be processed as
+		one or more sub-operations which each complete atomically.
+		This parameter specifies the size in bytes of the atomic
+		boundary if one is reported by the device. This value must
+		be a power-of-two.
+
 
 What:		/sys/block/<disk>/diskseq
 Date:		February 2021
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 0046b447268f..d151be394c98 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -59,6 +59,10 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->zoned = BLK_ZONED_NONE;
 	lim->zone_write_granularity = 0;
 	lim->dma_alignment = 511;
+	lim->atomic_write_unit_min_sectors = 0;
+	lim->atomic_write_unit_max_sectors = 0;
+	lim->atomic_write_max_sectors = 0;
+	lim->atomic_write_boundary_sectors = 0;
 }
 
 /**
@@ -183,6 +187,62 @@ void blk_queue_max_discard_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_discard_sectors);
 
+/**
+ * blk_queue_atomic_write_max_bytes - set max bytes supported by
+ * the device for atomic write operations.
+ * @q:  the request queue for the device
+ * @bytes: maximum bytes supported
+ */
+void blk_queue_atomic_write_max_bytes(struct request_queue *q,
+				      unsigned int bytes)
+{
+	q->limits.atomic_write_max_sectors = bytes >> SECTOR_SHIFT;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_max_bytes);
+
+/**
+ * blk_queue_atomic_write_boundary_bytes - set the boundary in the device's
+ * logical block address space which an atomic write should not cross.
+ * @q:  the request queue for the device
+ * @bytes: must be a power-of-two.
+ */
+void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
+					   unsigned int bytes)
+{
+	q->limits.atomic_write_boundary_sectors = bytes >> SECTOR_SHIFT;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);
+
+/**
+ * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
+ * atomically to the device.
+ * @q:  the request queue for the device
+ * @sectors: must be a power-of-two.
+ */
+void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
+					     unsigned int sectors)
+{
+	struct queue_limits *limits = &q->limits;
+
+	limits->atomic_write_unit_min_sectors = sectors;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);
+
+/**
+ * blk_queue_atomic_write_unit_max_sectors - largest unit that can be written
+ * atomically to the device.
+ * @q: the request queue for the device
+ * @sectors: must be a power-of-two.
+ */
+void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
+					     unsigned int sectors)
+{
+	struct queue_limits *limits = &q->limits;
+
+	limits->atomic_write_unit_max_sectors = sectors;
+}
+EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
+
 /**
  * blk_queue_max_secure_erase_sectors - set max sectors for a secure erase
  * @q:  the request queue for the device
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 63e481262336..c193a04d7df7 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -118,6 +118,30 @@ static ssize_t queue_max_discard_segments_show(struct request_queue *q,
 	return queue_var_show(queue_max_discard_segments(q), page);
 }
 
+static ssize_t queue_atomic_write_max_bytes_show(struct request_queue *q,
+						char *page)
+{
+	return queue_var_show(queue_atomic_write_max_bytes(q), page);
+}
+
+static ssize_t queue_atomic_write_boundary_show(struct request_queue *q,
+						char *page)
+{
+	return queue_var_show(queue_atomic_write_boundary_bytes(q), page);
+}
+
+static ssize_t queue_atomic_write_unit_min_show(struct request_queue *q,
+						char *page)
+{
+	return queue_var_show(queue_atomic_write_unit_min_bytes(q), page);
+}
+
+static ssize_t queue_atomic_write_unit_max_show(struct request_queue *q,
+						char *page)
+{
+	return queue_var_show(queue_atomic_write_unit_max_bytes(q), page);
+}
+
 static ssize_t queue_max_integrity_segments_show(struct request_queue *q, char *page)
 {
 	return queue_var_show(q->limits.max_integrity_segments, page);
@@ -507,6 +531,11 @@ QUEUE_RO_ENTRY(queue_discard_max_hw, "discard_max_hw_bytes");
 QUEUE_RW_ENTRY(queue_discard_max, "discard_max_bytes");
 QUEUE_RO_ENTRY(queue_discard_zeroes_data, "discard_zeroes_data");
 
+QUEUE_RO_ENTRY(queue_atomic_write_max_bytes, "atomic_write_max_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_boundary, "atomic_write_boundary_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_unit_max, "atomic_write_unit_max_bytes");
+QUEUE_RO_ENTRY(queue_atomic_write_unit_min, "atomic_write_unit_min_bytes");
+
 QUEUE_RO_ENTRY(queue_write_same_max, "write_same_max_bytes");
 QUEUE_RO_ENTRY(queue_write_zeroes_max, "write_zeroes_max_bytes");
 QUEUE_RO_ENTRY(queue_zone_append_max, "zone_append_max_bytes");
@@ -633,6 +662,10 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_max_entry.attr,
 	&queue_discard_max_hw_entry.attr,
 	&queue_discard_zeroes_data_entry.attr,
+	&queue_atomic_write_max_bytes_entry.attr,
+	&queue_atomic_write_boundary_entry.attr,
+	&queue_atomic_write_unit_min_entry.attr,
+	&queue_atomic_write_unit_max_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
 	&queue_zone_append_max_entry.attr,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index eef450f25982..c10e47bdb34f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -309,6 +309,11 @@ struct queue_limits {
 	unsigned int		discard_alignment;
 	unsigned int		zone_write_granularity;
 
+	unsigned int		atomic_write_boundary_sectors;
+	unsigned int		atomic_write_max_sectors;
+	unsigned int		atomic_write_unit_min_sectors;
+	unsigned int		atomic_write_unit_max_sectors;
+
 	unsigned short		max_segments;
 	unsigned short		max_integrity_segments;
 	unsigned short		max_discard_segments;
@@ -908,6 +913,14 @@ void blk_queue_zone_write_granularity(struct request_queue *q,
 				      unsigned int size);
 extern void blk_queue_alignment_offset(struct request_queue *q,
 				       unsigned int alignment);
+extern void blk_queue_atomic_write_max_bytes(struct request_queue *q,
+					     unsigned int bytes);
+extern void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
+					    unsigned int sectors);
+extern void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
+					    unsigned int sectors);
+extern void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
+					    unsigned int bytes);
 void disk_update_readahead(struct gendisk *disk);
 extern void blk_limits_io_min(struct queue_limits *limits, unsigned int min);
 extern void blk_queue_io_min(struct request_queue *q, unsigned int min);
@@ -1312,6 +1325,26 @@ static inline int queue_dma_alignment(const struct request_queue *q)
 	return q ? q->limits.dma_alignment : 511;
 }
 
+static inline unsigned int queue_atomic_write_unit_max_bytes(const struct request_queue *q)
+{
+	return q->limits.atomic_write_unit_max_sectors << SECTOR_SHIFT;
+}
+
+static inline unsigned int queue_atomic_write_unit_min_bytes(const struct request_queue *q)
+{
+	return q->limits.atomic_write_unit_min_sectors << SECTOR_SHIFT;
+}
+
+static inline unsigned int queue_atomic_write_boundary_bytes(const struct request_queue *q)
+{
+	return q->limits.atomic_write_boundary_sectors << SECTOR_SHIFT;
+}
+
+static inline unsigned int queue_atomic_write_max_bytes(const struct request_queue *q)
+{
+	return q->limits.atomic_write_max_sectors << SECTOR_SHIFT;
+}
+
 static inline unsigned int bdev_dma_alignment(struct block_device *bdev)
 {
 	return queue_dma_alignment(bdev_get_queue(bdev));
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 02/21] block: Limit atomic writes according to bio and queue limits
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
  2023-09-29 10:27 ` [PATCH 01/21] block: Add atomic write operations to request_queue limits John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-11-09 15:13   ` Christoph Hellwig
  2023-12-04  3:19   ` Ming Lei
  2023-09-29 10:27 ` [PATCH 03/21] fs/bdev: Add atomic write support info to statx John Garry
                   ` (19 subsequent siblings)
  21 siblings, 2 replies; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

We rely on the block layer always being able to send a bio of size
atomic_write_unit_max without being required to split it due to request
queue or other bio limits.

A bio may contain at most min(BIO_MAX_VECS, limits->max_segments) vectors,
and in the worst case each vector covers only the device logical block
size, due to the direct IO alignment requirement.
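
For example (illustrative numbers only): with BIO_MAX_VECS = 256,
limits->max_segments = 128 and a 512-byte logical block size, the
guaranteed bio size is min(256, 128) * 512 = 64 KiB, so
atomic_write_unit_max would be capped at 64 KiB for such a queue,
regardless of what the device itself advertises.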

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-settings.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index d151be394c98..57d487a00c64 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -213,6 +213,18 @@ void blk_queue_atomic_write_boundary_bytes(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_atomic_write_boundary_bytes);
 
+static unsigned int blk_queue_max_guaranteed_bio_size_sectors(
+					struct request_queue *q)
+{
+	struct queue_limits *limits = &q->limits;
+	unsigned int max_segments = min_t(unsigned int, BIO_MAX_VECS,
+					limits->max_segments);
+	/*  Limit according to dev sector size as we only support direct-io */
+	unsigned int limit = max_segments * queue_logical_block_size(q);
+
+	return rounddown_pow_of_two(limit >> SECTOR_SHIFT);
+}
+
 /**
  * blk_queue_atomic_write_unit_min_sectors - smallest unit that can be written
  * atomically to the device.
@@ -223,8 +235,10 @@ void blk_queue_atomic_write_unit_min_sectors(struct request_queue *q,
 					     unsigned int sectors)
 {
 	struct queue_limits *limits = &q->limits;
+	unsigned int guaranteed_sectors =
+		blk_queue_max_guaranteed_bio_size_sectors(q);
 
-	limits->atomic_write_unit_min_sectors = sectors;
+	limits->atomic_write_unit_min_sectors = min(guaranteed_sectors, sectors);
 }
 EXPORT_SYMBOL(blk_queue_atomic_write_unit_min_sectors);
 
@@ -238,8 +252,10 @@ void blk_queue_atomic_write_unit_max_sectors(struct request_queue *q,
 					     unsigned int sectors)
 {
 	struct queue_limits *limits = &q->limits;
+	unsigned int guaranteed_sectors =
+		blk_queue_max_guaranteed_bio_size_sectors(q);
 
-	limits->atomic_write_unit_max_sectors = sectors;
+	limits->atomic_write_unit_max_sectors = min(guaranteed_sectors, sectors);
 }
 EXPORT_SYMBOL(blk_queue_atomic_write_unit_max_sectors);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 03/21] fs/bdev: Add atomic write support info to statx
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
  2023-09-29 10:27 ` [PATCH 01/21] block: Add atomic write operations to request_queue limits John Garry
  2023-09-29 10:27 ` [PATCH 02/21] block: Limit atomic writes according to bio and queue limits John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-09-29 22:49   ` Eric Biggers
  2023-09-29 10:27 ` [PATCH 04/21] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
                   ` (18 subsequent siblings)
  21 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, Prasad Singamsetty, John Garry

From: Prasad Singamsetty <prasad.singamsetty@oracle.com>

Extend the statx system call to return additional info for atomic write
support if the specified file is a block device.

Add initial support for a block device.

Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/bdev.c              | 33 +++++++++++++++++++++++----------
 fs/stat.c                 | 15 ++++++++-------
 include/linux/blkdev.h    |  4 ++--
 include/linux/stat.h      |  2 ++
 include/uapi/linux/stat.h |  7 ++++++-
 5 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index f3b13aa1b7d4..037a3d9ecbcb 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1028,24 +1028,37 @@ void sync_bdevs(bool wait)
 	iput(old_inode);
 }
 
+#define BDEV_STATX_SUPPORTED_MSK (STATX_DIOALIGN | STATX_WRITE_ATOMIC)
+
 /*
- * Handle STATX_DIOALIGN for block devices.
- *
- * Note that the inode passed to this is the inode of a block device node file,
- * not the block device's internal inode.  Therefore it is *not* valid to use
- * I_BDEV() here; the block device has to be looked up by i_rdev instead.
+ * Handle STATX_{DIOALIGN, WRITE_ATOMIC} for block devices.
  */
-void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
+void bdev_statx(struct dentry *dentry, struct kstat *stat, u32 request_mask)
 {
 	struct block_device *bdev;
 
-	bdev = blkdev_get_no_open(inode->i_rdev);
+	if (!(request_mask & BDEV_STATX_SUPPORTED_MSK))
+		return;
+
+	bdev = blkdev_get_no_open(d_backing_inode(dentry)->i_rdev);
 	if (!bdev)
 		return;
 
-	stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
-	stat->dio_offset_align = bdev_logical_block_size(bdev);
-	stat->result_mask |= STATX_DIOALIGN;
+	if (request_mask & STATX_DIOALIGN) {
+		stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
+		stat->dio_offset_align = bdev_logical_block_size(bdev);
+		stat->result_mask |= STATX_DIOALIGN;
+	}
+
+	if (request_mask & STATX_WRITE_ATOMIC) {
+		stat->atomic_write_unit_min =
+			queue_atomic_write_unit_min_bytes(bdev->bd_queue);
+		stat->atomic_write_unit_max =
+			queue_atomic_write_unit_max_bytes(bdev->bd_queue);
+		stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
+		stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
+		stat->result_mask |= STATX_WRITE_ATOMIC;
+	}
 
 	blkdev_put_no_open(bdev);
 }
diff --git a/fs/stat.c b/fs/stat.c
index d43a5cc1bfa4..b840e58f41fa 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -250,13 +250,12 @@ static int vfs_statx(int dfd, struct filename *filename, int flags,
 		stat->attributes |= STATX_ATTR_MOUNT_ROOT;
 	stat->attributes_mask |= STATX_ATTR_MOUNT_ROOT;
 
-	/* Handle STATX_DIOALIGN for block devices. */
-	if (request_mask & STATX_DIOALIGN) {
-		struct inode *inode = d_backing_inode(path.dentry);
-
-		if (S_ISBLK(inode->i_mode))
-			bdev_statx_dioalign(inode, stat);
-	}
+	/* If this is a block device inode, override the filesystem
+	 * attributes with the block device specific parameters
+	 * that need to be obtained from the bdev backing inode
+	 */
+	if (S_ISBLK(d_backing_inode(path.dentry)->i_mode))
+		bdev_statx(path.dentry, stat, request_mask);
 
 	path_put(&path);
 	if (retry_estale(error, lookup_flags)) {
@@ -649,6 +648,8 @@ cp_statx(const struct kstat *stat, struct statx __user *buffer)
 	tmp.stx_mnt_id = stat->mnt_id;
 	tmp.stx_dio_mem_align = stat->dio_mem_align;
 	tmp.stx_dio_offset_align = stat->dio_offset_align;
+	tmp.stx_atomic_write_unit_min = stat->atomic_write_unit_min;
+	tmp.stx_atomic_write_unit_max = stat->atomic_write_unit_max;
 
 	return copy_to_user(buffer, &tmp, sizeof(tmp)) ? -EFAULT : 0;
 }
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c10e47bdb34f..f70988083734 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1533,7 +1533,7 @@ int sync_blockdev(struct block_device *bdev);
 int sync_blockdev_range(struct block_device *bdev, loff_t lstart, loff_t lend);
 int sync_blockdev_nowait(struct block_device *bdev);
 void sync_bdevs(bool wait);
-void bdev_statx_dioalign(struct inode *inode, struct kstat *stat);
+void bdev_statx(struct dentry *dentry, struct kstat *stat, u32 request_mask);
 void printk_all_partitions(void);
 int __init early_lookup_bdev(const char *pathname, dev_t *dev);
 #else
@@ -1551,7 +1551,7 @@ static inline int sync_blockdev_nowait(struct block_device *bdev)
 static inline void sync_bdevs(bool wait)
 {
 }
-static inline void bdev_statx_dioalign(struct inode *inode, struct kstat *stat)
+static inline void bdev_statx(struct dentry *dentry, struct kstat *stat, u32 request_mask)
 {
 }
 static inline void printk_all_partitions(void)
diff --git a/include/linux/stat.h b/include/linux/stat.h
index 52150570d37a..dfa69ecfaacf 100644
--- a/include/linux/stat.h
+++ b/include/linux/stat.h
@@ -53,6 +53,8 @@ struct kstat {
 	u32		dio_mem_align;
 	u32		dio_offset_align;
 	u64		change_cookie;
+	u32		atomic_write_unit_max;
+	u32		atomic_write_unit_min;
 };
 
 /* These definitions are internal to the kernel for now. Mainly used by nfsd. */
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7cab2c65d3d7..c99d7cac2aa6 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -127,7 +127,10 @@ struct statx {
 	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
 	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
 	/* 0xa0 */
-	__u64	__spare3[12];	/* Spare space for future expansion */
+	__u32	stx_atomic_write_unit_max;
+	__u32	stx_atomic_write_unit_min;
+	/* 0xb0 */
+	__u64	__spare3[11];	/* Spare space for future expansion */
 	/* 0x100 */
 };
 
@@ -154,6 +157,7 @@ struct statx {
 #define STATX_BTIME		0x00000800U	/* Want/got stx_btime */
 #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
 #define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
+#define STATX_WRITE_ATOMIC	0x00004000U	/* Want/got atomic_write_* fields */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
@@ -189,6 +193,7 @@ struct statx {
 #define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
 #define STATX_ATTR_DAX			0x00200000 /* File is currently in DAX state */
+#define STATX_ATTR_WRITE_ATOMIC		0x00400000 /* File supports atomic write operations */
 
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 04/21] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (2 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 03/21] fs/bdev: Add atomic write support info to statx John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-10-06 18:15   ` Jeremy Bongio
  2023-09-29 10:27 ` [PATCH 05/21] block: Add REQ_ATOMIC flag John Garry
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, Prasad Singamsetty, John Garry

From: Prasad Singamsetty <prasad.singamsetty@oracle.com>

Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
write is to be issued with torn write prevention, according to special
alignment and length rules.

Torn write prevention means that in the event of a power failure or any
other HW failure, all or none of the data will be committed to storage,
but never a mix of old and new.

For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
iocb->ki_flags field to indicate the same.

A call to statx will give the relevant atomic write info:
- atomic_write_unit_min
- atomic_write_unit_max

Both values are a power-of-2.

Applications can avail of the atomic write feature by ensuring that the
total length of a write is a power-of-2 in size and also sized between
atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
must ensure that the write is at a naturally-aligned offset in the file
with respect to the total write length.

Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/linux/fs.h      | 1 +
 include/uapi/linux/fs.h | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b528f063e8ff..898952dee8eb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -328,6 +328,7 @@ enum rw_hint {
 #define IOCB_SYNC		(__force int) RWF_SYNC
 #define IOCB_NOWAIT		(__force int) RWF_NOWAIT
 #define IOCB_APPEND		(__force int) RWF_APPEND
+#define IOCB_ATOMIC		(__force int) RWF_ATOMIC
 
 /* non-RWF related bits - start at 16 */
 #define IOCB_EVENTFD		(1 << 16)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..e3b4f5bc6860 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -301,8 +301,11 @@ typedef int __bitwise __kernel_rwf_t;
 /* per-IO O_APPEND */
 #define RWF_APPEND	((__force __kernel_rwf_t)0x00000010)
 
+/* Atomic Write */
+#define RWF_ATOMIC	((__force __kernel_rwf_t)0x00000020)
+
 /* mask of flags supported by the kernel */
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
-			 RWF_APPEND)
+			 RWF_APPEND | RWF_ATOMIC)
 
 #endif /* _UAPI_LINUX_FS_H */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 05/21] block: Add REQ_ATOMIC flag
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (3 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 04/21] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-09-29 10:27 ` [PATCH 06/21] block: Pass blk_queue_get_max_sectors() a request pointer John Garry
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, Himanshu Madhani, John Garry

From: Himanshu Madhani <himanshu.madhani@oracle.com>

Add flag REQ_ATOMIC, meaning an atomic operation. This should only be
used in conjunction with REQ_OP_WRITE.

We will not add a special "request atomic write" operation, so as to
avoid maintenance effort for an operation which is almost the same as
REQ_OP_WRITE.
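
A minimal sketch of how a submitter would tag such a write (the real
wiring is added in the fops patch later in this series; the bio allocation
shown here is only illustrative):

	/* an atomic write is an ordinary write bio with REQ_ATOMIC set */
	bio = bio_alloc(bdev, nr_pages, REQ_OP_WRITE | REQ_ATOMIC, GFP_KERNEL);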

Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/linux/blk_types.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index d5c5e59ddbd2..4ef5ca64adb4 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -422,6 +422,7 @@ enum req_flag_bits {
 	__REQ_DRV,		/* for driver use */
 	__REQ_FS_PRIVATE,	/* for file system (submitter) use */
 
+	__REQ_ATOMIC,		/* for atomic write operations */
 	/*
 	 * Command specific flags, keep last:
 	 */
@@ -448,6 +449,7 @@ enum req_flag_bits {
 #define REQ_RAHEAD	(__force blk_opf_t)(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND	(__force blk_opf_t)(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT	(__force blk_opf_t)(1ULL << __REQ_NOWAIT)
+#define REQ_ATOMIC	(__force blk_opf_t)(1ULL << __REQ_ATOMIC)
 #define REQ_POLLED	(__force blk_opf_t)(1ULL << __REQ_POLLED)
 #define REQ_ALLOC_CACHE	(__force blk_opf_t)(1ULL << __REQ_ALLOC_CACHE)
 #define REQ_SWAP	(__force blk_opf_t)(1ULL << __REQ_SWAP)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 06/21] block: Pass blk_queue_get_max_sectors() a request pointer
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (4 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 05/21] block: Add REQ_ATOMIC flag John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-09-29 10:27 ` [PATCH 07/21] block: Limit atomic write IO size according to atomic_write_max_sectors John Garry
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

Currently blk_queue_get_max_sectors() is passed an enum req_op, which does
not work for atomic writes. This is because an atomic write has a different
max sectors value to a regular write, and we need rq->cmd_flags to know
that we have an atomic write, so pass the request pointer, which has all
the information available.

Also use rq->cmd_flags instead of rq->bio->bi_opf when possible.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-merge.c | 3 ++-
 block/blk-mq.c    | 2 +-
 block/blk.h       | 6 ++++--
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 65e75efa9bd3..0ccc251e22ff 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -596,7 +596,8 @@ static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
 	if (blk_rq_is_passthrough(rq))
 		return q->limits.max_hw_sectors;
 
-	max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
+	max_sectors = blk_queue_get_max_sectors(rq);
+
 	if (!q->limits.chunk_sectors ||
 	    req_op(rq) == REQ_OP_DISCARD ||
 	    req_op(rq) == REQ_OP_SECURE_ERASE)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1fafd54dce3c..21661778bdf0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3031,7 +3031,7 @@ void blk_mq_submit_bio(struct bio *bio)
 blk_status_t blk_insert_cloned_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	unsigned int max_sectors = blk_queue_get_max_sectors(q, req_op(rq));
+	unsigned int max_sectors = blk_queue_get_max_sectors(rq);
 	unsigned int max_segments = blk_rq_get_max_segments(rq);
 	blk_status_t ret;
 
diff --git a/block/blk.h b/block/blk.h
index 08a358bc0919..94e330e9c853 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -166,9 +166,11 @@ static inline unsigned int blk_rq_get_max_segments(struct request *rq)
 	return queue_max_segments(rq->q);
 }
 
-static inline unsigned int blk_queue_get_max_sectors(struct request_queue *q,
-						     enum req_op op)
+static inline unsigned int blk_queue_get_max_sectors(struct request *rq)
 {
+	struct request_queue *q = rq->q;
+	enum req_op op = req_op(rq);
+
 	if (unlikely(op == REQ_OP_DISCARD || op == REQ_OP_SECURE_ERASE))
 		return min(q->limits.max_discard_sectors,
 			   UINT_MAX >> SECTOR_SHIFT);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 07/21] block: Limit atomic write IO size according to atomic_write_max_sectors
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (5 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 06/21] block: Pass blk_queue_get_max_sectors() a request pointer John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-09-29 10:27 ` [PATCH 08/21] block: Error an attempt to split an atomic write bio John Garry
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

Currently an IO size is limited to the request_queue limit max_sectors.
Limit the size of an atomic write to the queue limit
atomic_write_max_sectors value.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
note: atomic_write_max_sectors should probably be limited to max_sectors,
      which it currently isn't
 block/blk-merge.c | 12 +++++++++++-
 block/blk.h       |  3 +++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 0ccc251e22ff..8d4de9253fe9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -171,7 +171,17 @@ static inline unsigned get_max_io_size(struct bio *bio,
 {
 	unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
 	unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
-	unsigned max_sectors = lim->max_sectors, start, end;
+	unsigned max_sectors, start, end;
+
+	/*
+	 * We ignore lim->max_sectors for atomic writes simply because
+	 * it may be less than bio->write_atomic_unit, which we cannot
+	 * tolerate.
+	 */
+	if (bio->bi_opf & REQ_ATOMIC)
+		max_sectors = lim->atomic_write_max_sectors;
+	else
+		max_sectors = lim->max_sectors;
 
 	if (lim->chunk_sectors) {
 		max_sectors = min(max_sectors,
diff --git a/block/blk.h b/block/blk.h
index 94e330e9c853..6f6cd5b1830a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -178,6 +178,9 @@ static inline unsigned int blk_queue_get_max_sectors(struct request *rq)
 	if (unlikely(op == REQ_OP_WRITE_ZEROES))
 		return q->limits.max_write_zeroes_sectors;
 
+	if (rq->cmd_flags & REQ_ATOMIC)
+		return q->limits.atomic_write_max_sectors;
+
 	return q->limits.max_sectors;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 08/21] block: Error an attempt to split an atomic write bio
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (6 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 07/21] block: Limit atomic write IO size according to atomic_write_max_sectors John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-09-29 10:27 ` [PATCH 09/21] block: Add checks to merging of atomic writes John Garry
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

As the name suggests, we should not be splitting these.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-merge.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8d4de9253fe9..bc21f8ff4842 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -319,6 +319,11 @@ struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
 	*segs = nsegs;
 	return NULL;
 split:
+	if (bio->bi_opf & REQ_ATOMIC) {
+		bio->bi_status = BLK_STS_IOERR;
+		bio_endio(bio);
+		return ERR_PTR(-EIO);
+	}
 	/*
 	 * We can't sanely support splitting for a REQ_NOWAIT bio. End it
 	 * with EAGAIN if splitting is required and return an error pointer.
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 09/21] block: Add checks to merging of atomic writes
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (7 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 08/21] block: Error an attempt to split an atomic write bio John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-09-30 13:40   ` kernel test robot
  2023-09-29 10:27 ` [PATCH 10/21] block: Add fops atomic write support John Garry
                   ` (12 subsequent siblings)
  21 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

For atomic writes we allow merging, but we must adhere to some additional
rules:
- Only allow merging of atomic writes with other atomic writes
- Ensure that the merged IO would not cross an atomic write boundary, if
  any

We already ensure that we don't exceed the atomic write size limit in
get_max_io_size().
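
For example (illustrative): with a 64 KiB atomic write boundary, an atomic
write request covering bytes 32 KiB to 48 KiB may not be merged with a
following 32 KiB atomic write bio, since the combined 32 KiB to 80 KiB
range would straddle the boundary at 64 KiB.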

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/blk-merge.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index bc21f8ff4842..5dc850924e29 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -18,6 +18,23 @@
 #include "blk-rq-qos.h"
 #include "blk-throttle.h"
 
+static bool bio_straddles_atomic_write_boundary(loff_t bi_sector,
+				unsigned int bi_size,
+				unsigned int boundary)
+{
+	loff_t start = bi_sector << SECTOR_SHIFT;
+	loff_t end = start + bi_size;
+	loff_t start_mod = start % boundary;
+	loff_t end_mod = end % boundary;
+
+	if (end - start > boundary)
+		return true;
+	if ((start_mod > end_mod) && (start_mod && end_mod))
+		return true;
+
+	return false;
+}
+
 static inline void bio_get_first_bvec(struct bio *bio, struct bio_vec *bv)
 {
 	*bv = mp_bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
@@ -664,6 +681,18 @@ int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs)
 		return 0;
 	}
 
+	if (req->cmd_flags & REQ_ATOMIC) {
+		unsigned int atomic_write_boundary_bytes =
+				queue_atomic_write_boundary_bytes(req->q);
+
+		if (atomic_write_boundary_bytes &&
+		    bio_straddles_atomic_write_boundary(req->__sector,
+				bio->bi_iter.bi_size + blk_rq_bytes(req),
+				atomic_write_boundary_bytes)) {
+			return 0;
+		}
+	}
+
 	return ll_new_hw_segment(req, bio, nr_segs);
 }
 
@@ -683,6 +712,19 @@ static int ll_front_merge_fn(struct request *req, struct bio *bio,
 		return 0;
 	}
 
+	if (req->cmd_flags & REQ_ATOMIC) {
+		unsigned int atomic_write_boundary_bytes =
+				queue_atomic_write_boundary_bytes(req->q);
+
+		if (atomic_write_boundary_bytes &&
+		    bio_straddles_atomic_write_boundary(
+				bio->bi_iter.bi_sector,
+				bio->bi_iter.bi_size + blk_rq_bytes(req),
+				atomic_write_boundary_bytes)) {
+			return 0;
+		}
+	}
+
 	return ll_new_hw_segment(req, bio, nr_segs);
 }
 
@@ -719,6 +761,18 @@ static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
 	    blk_rq_get_max_sectors(req, blk_rq_pos(req)))
 		return 0;
 
+	if (req->cmd_flags & REQ_ATOMIC) {
+		unsigned int atomic_write_boundary_bytes =
+				queue_atomic_write_boundary_bytes(req->q);
+
+		if (atomic_write_boundary_bytes &&
+		    bio_straddles_atomic_write_boundary(req->__sector,
+				blk_rq_bytes(req) + blk_rq_bytes(next),
+				atomic_write_boundary_bytes)) {
+			return 0;
+		}
+	}
+
 	total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;
 	if (total_phys_segments > blk_rq_get_max_segments(req))
 		return 0;
@@ -814,6 +868,18 @@ static enum elv_merge blk_try_req_merge(struct request *req,
 	return ELEVATOR_NO_MERGE;
 }
 
+static bool blk_atomic_write_mergeable_rq_bio(struct request *rq,
+					      struct bio *bio)
+{
+	return (rq->cmd_flags & REQ_ATOMIC) == (bio->bi_opf & REQ_ATOMIC);
+}
+
+static bool blk_atomic_write_mergeable_rqs(struct request *rq,
+					   struct request *next)
+{
+	return (rq->cmd_flags & REQ_ATOMIC) == (next->cmd_flags & REQ_ATOMIC);
+}
+
 /*
  * For non-mq, this has to be called with the request spinlock acquired.
  * For mq with scheduling, the appropriate queue wide lock should be held.
@@ -833,6 +899,9 @@ static struct request *attempt_merge(struct request_queue *q,
 	if (req->ioprio != next->ioprio)
 		return NULL;
 
+	if (!blk_atomic_write_mergeable_rqs(req, next))
+		return NULL;
+
 	/*
 	 * If we are allowed to merge, then append bio list
 	 * from next to rq and release next. merge_requests_fn
@@ -960,6 +1029,9 @@ bool blk_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (rq->ioprio != bio_prio(bio))
 		return false;
 
+	if (blk_atomic_write_mergeable_rq_bio(rq, bio) == false)
+		return false;
+
 	return true;
 }
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 10/21] block: Add fops atomic write support
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (8 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 09/21] block: Add checks to merging of atomic writes John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-09-29 17:51   ` Bart Van Assche
  2023-12-04  2:30   ` Ming Lei
  2023-09-29 10:27 ` [PATCH 11/21] fs: xfs: Don't use low-space allocator for alignment > 1 John Garry
                   ` (11 subsequent siblings)
  21 siblings, 2 replies; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

Add support for atomic writes, as follows:
- Ensure that the IO follows all the atomic write rules, such as being
  naturally aligned (see the example below)
- Set REQ_ATOMIC
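
For example (illustrative values): with atomic_write_unit_min = 4 KiB and
atomic_write_unit_max = 64 KiB, a 16 KiB write at offset 32 KiB is
accepted, whereas a 16 KiB write at offset 8 KiB is rejected because the
offset is not naturally aligned to the write length.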

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/fops.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/block/fops.c b/block/fops.c
index acff3d5d22d4..516669ad69e5 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -41,6 +41,29 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
 		!bdev_iter_is_aligned(bdev, iter);
 }
 
+static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
+			      struct iov_iter *iter)
+{
+	unsigned int atomic_write_unit_min_bytes =
+			queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
+	unsigned int atomic_write_unit_max_bytes =
+			queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
+
+	if (!atomic_write_unit_min_bytes)
+		return false;
+	if (pos % atomic_write_unit_min_bytes)
+		return false;
+	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
+		return false;
+	if (!is_power_of_2(iov_iter_count(iter)))
+		return false;
+	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
+		return false;
+	if (pos % iov_iter_count(iter))
+		return false;
+	return true;
+}
+
 #define DIO_INLINE_BIO_VECS 4
 
 static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
@@ -48,6 +71,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 {
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
 	struct bio_vec inline_vecs[DIO_INLINE_BIO_VECS], *vecs;
+	bool is_read = iov_iter_rw(iter) == READ;
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 	loff_t pos = iocb->ki_pos;
 	bool should_dirty = false;
 	struct bio bio;
@@ -56,6 +81,9 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	if (blkdev_dio_unaligned(bdev, pos, iter))
 		return -EINVAL;
 
+	if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
+		return -EINVAL;
+
 	if (nr_pages <= DIO_INLINE_BIO_VECS)
 		vecs = inline_vecs;
 	else {
@@ -65,7 +93,7 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 			return -ENOMEM;
 	}
 
-	if (iov_iter_rw(iter) == READ) {
+	if (is_read) {
 		bio_init(&bio, bdev, vecs, nr_pages, REQ_OP_READ);
 		if (user_backed_iter(iter))
 			should_dirty = true;
@@ -74,6 +102,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	}
 	bio.bi_iter.bi_sector = pos >> SECTOR_SHIFT;
 	bio.bi_ioprio = iocb->ki_ioprio;
+	if (atomic_write)
+		bio.bi_opf |= REQ_ATOMIC;
 
 	ret = bio_iov_iter_get_pages(&bio, iter);
 	if (unlikely(ret))
@@ -167,10 +197,14 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	struct blkdev_dio *dio;
 	struct bio *bio;
 	bool is_read = (iov_iter_rw(iter) == READ), is_sync;
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
 	loff_t pos = iocb->ki_pos;
 	int ret = 0;
 
+	if (atomic_write)
+		return -EINVAL;
+
 	if (blkdev_dio_unaligned(bdev, pos, iter))
 		return -EINVAL;
 
@@ -305,6 +339,7 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
 	bool is_read = iov_iter_rw(iter) == READ;
 	blk_opf_t opf = is_read ? REQ_OP_READ : dio_bio_write_op(iocb);
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 	struct blkdev_dio *dio;
 	struct bio *bio;
 	loff_t pos = iocb->ki_pos;
@@ -313,6 +348,9 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	if (blkdev_dio_unaligned(bdev, pos, iter))
 		return -EINVAL;
 
+	if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
+		return -EINVAL;
+
 	if (iocb->ki_flags & IOCB_ALLOC_CACHE)
 		opf |= REQ_ALLOC_CACHE;
 	bio = bio_alloc_bioset(bdev, nr_pages, opf, GFP_KERNEL,
@@ -347,6 +385,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 			bio_set_pages_dirty(bio);
 		}
 	} else {
+		if (atomic_write)
+			bio->bi_opf |= REQ_ATOMIC;
 		task_io_account_write(bio->bi_iter.bi_size);
 	}
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 11/21] fs: xfs: Don't use low-space allocator for alignment > 1
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (9 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 10/21] block: Add fops atomic write support John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-10-03  1:16   ` Dave Chinner
  2023-09-29 10:27 ` [PATCH 12/21] fs: xfs: Introduce FORCEALIGN inode flag John Garry
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

The low-space allocator doesn't honour the alignment requirement, so don't
even attempt to use it when we have an alignment requirement.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 30c931b38853..328134c22104 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3569,6 +3569,10 @@ xfs_bmap_btalloc_low_space(
 {
 	int			error;
 
+	/* The allocator doesn't honour args->alignment */
+	if (args->alignment > 1)
+		return 0;
+
 	if (args->minlen > ap->minlen) {
 		args->minlen = ap->minlen;
 		error = xfs_alloc_vextent_start_ag(args, ap->blkno);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 12/21] fs: xfs: Introduce FORCEALIGN inode flag
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (10 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 11/21] fs: xfs: Don't use low-space allocator for alignment > 1 John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-11-09 15:24   ` Christoph Hellwig
  2023-09-29 10:27 ` [PATCH 13/21] fs: xfs: Make file data allocations observe the 'forcealign' flag John Garry
                   ` (9 subsequent siblings)
  21 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

From: "Darrick J. Wong" <djwong@kernel.org>

Add a new inode flag to require that all file data extent mappings must
be aligned (both the file offset range and the allocated space itself)
to the extent size hint.  Having a separate COW extent size hint is no
longer allowed.

The goal here is to enable sysadmins and users to mandate that all space
mappings in a file must have a startoff/blockcount that are aligned to
(say) a 2MB alignment and that the startblock/blockcount will follow the
same alignment.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Co-developed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h    |  6 +++++-
 fs/xfs/libxfs/xfs_inode_buf.c | 40 +++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_inode_buf.h |  3 +++
 fs/xfs/libxfs/xfs_sb.c        |  3 +++
 fs/xfs/xfs_inode.c            | 12 +++++++++++
 fs/xfs/xfs_inode.h            |  5 +++++
 fs/xfs/xfs_ioctl.c            | 18 ++++++++++++++++
 fs/xfs/xfs_mount.h            |  2 ++
 fs/xfs/xfs_super.c            |  4 ++++
 include/uapi/linux/fs.h       |  2 ++
 10 files changed, 94 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 371dc07233e0..d718b73f48ca 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -353,6 +353,7 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
 #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_FORCEALIGN (1 << 30)	/* aligned file data extents */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
@@ -1069,16 +1070,19 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
 #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
 #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define XFS_DIFLAG2_FORCEALIGN_BIT 5
 
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
 #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
 #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
 #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
+#define XFS_DIFLAG2_FORCEALIGN	(1 << XFS_DIFLAG2_FORCEALIGN_BIT)
 
 #define XFS_DIFLAG2_ANY \
 	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
-	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
+	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_FORCEALIGN)
 
 static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
 {
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index a35781577cad..0c4d492c9363 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -605,6 +605,14 @@ xfs_dinode_verify(
 	    !xfs_has_bigtime(mp))
 		return __this_address;
 
+	if (flags2 & XFS_DIFLAG2_FORCEALIGN) {
+		fa = xfs_inode_validate_forcealign(mp, mode, flags,
+				be32_to_cpu(dip->di_extsize),
+				be32_to_cpu(dip->di_cowextsize));
+		if (fa)
+			return fa;
+	}
+
 	return NULL;
 }
 
@@ -772,3 +780,35 @@ xfs_inode_validate_cowextsize(
 
 	return NULL;
 }
+
+/* Validate the forcealign inode flag */
+xfs_failaddr_t
+xfs_inode_validate_forcealign(
+	struct xfs_mount	*mp,
+	uint16_t		mode,
+	uint16_t		flags,
+	uint32_t		extsize,
+	uint32_t		cowextsize)
+{
+	/* superblock rocompat feature flag */
+	if (!xfs_has_forcealign(mp))
+		return __this_address;
+
+	/* Only regular files and directories */
+	if (!S_ISDIR(mode) && !S_ISREG(mode))
+		return __this_address;
+
+	/* Doesn't apply to realtime files */
+	if (flags & XFS_DIFLAG_REALTIME)
+		return __this_address;
+
+	/* Requires a nonzero extent size hint */
+	if (extsize == 0)
+		return __this_address;
+
+	/* Requires no cow extent size hint */
+	if (cowextsize != 0)
+		return __this_address;
+
+	return NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 585ed5a110af..50db17d22b68 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -33,6 +33,9 @@ xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
 xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp,
 		uint32_t cowextsize, uint16_t mode, uint16_t flags,
 		uint64_t flags2);
+xfs_failaddr_t xfs_inode_validate_forcealign(struct xfs_mount *mp,
+		uint16_t mode, uint16_t flags, uint32_t extsize,
+		uint32_t cowextsize);
 
 static inline uint64_t xfs_inode_encode_bigtime(struct timespec64 tv)
 {
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 6264daaab37b..c1be74222c70 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -162,6 +162,9 @@ xfs_sb_version_to_features(
 		features |= XFS_FEAT_REFLINK;
 	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
 		features |= XFS_FEAT_INOBTCNT;
+	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
+		features |= XFS_FEAT_FORCEALIGN;
+
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
 		features |= XFS_FEAT_FTYPE;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index f94f7b374041..3fbfb052c778 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -634,6 +634,8 @@ xfs_ip2xflags(
 			flags |= FS_XFLAG_DAX;
 		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
 			flags |= FS_XFLAG_COWEXTSIZE;
+		if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+			flags |= FS_XFLAG_FORCEALIGN;
 	}
 
 	if (xfs_inode_has_attr_fork(ip))
@@ -761,6 +763,8 @@ xfs_inode_inherit_flags2(
 	}
 	if (pip->i_diflags2 & XFS_DIFLAG2_DAX)
 		ip->i_diflags2 |= XFS_DIFLAG2_DAX;
+	if (pip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)
+		ip->i_diflags2 |= XFS_DIFLAG2_FORCEALIGN;
 
 	/* Don't let invalid cowextsize hints propagate. */
 	failaddr = xfs_inode_validate_cowextsize(ip->i_mount, ip->i_cowextsize,
@@ -769,6 +773,14 @@ xfs_inode_inherit_flags2(
 		ip->i_diflags2 &= ~XFS_DIFLAG2_COWEXTSIZE;
 		ip->i_cowextsize = 0;
 	}
+
+	if (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN) {
+		failaddr = xfs_inode_validate_forcealign(ip->i_mount,
+				VFS_I(ip)->i_mode, ip->i_diflags, ip->i_extsize,
+				ip->i_cowextsize);
+		if (failaddr)
+			ip->i_diflags2 &= ~XFS_DIFLAG2_FORCEALIGN;
+	}
 }
 
 /*
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0c5bdb91152e..f66a57085908 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -305,6 +305,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
 	return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
 }
 
+static inline bool xfs_inode_forcealign(struct xfs_inode *ip)
+{
+	return ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN;
+}
+
 /*
  * Return the buftarg used for data allocations on a given inode.
  */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 55bb01173cde..4c147def835f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1109,6 +1109,8 @@ xfs_flags2diflags2(
 		di_flags2 |= XFS_DIFLAG2_DAX;
 	if (xflags & FS_XFLAG_COWEXTSIZE)
 		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+	if (xflags & FS_XFLAG_FORCEALIGN)
+		di_flags2 |= XFS_DIFLAG2_FORCEALIGN;
 
 	return di_flags2;
 }
@@ -1143,6 +1145,22 @@ xfs_ioctl_setattr_xflags(
 	if (i_flags2 && !xfs_has_v3inodes(mp))
 		return -EINVAL;
 
+	/*
+	 * Force-align requires a nonzero extent size hint and a zero cow
+	 * extent size hint.  It doesn't apply to realtime files.
+	 */
+	if (fa->fsx_xflags & FS_XFLAG_FORCEALIGN) {
+		if (!xfs_has_forcealign(mp))
+			return -EINVAL;
+		if (fa->fsx_xflags & FS_XFLAG_COWEXTSIZE)
+			return -EINVAL;
+		if (!(fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
+					FS_XFLAG_EXTSZINHERIT)))
+			return -EINVAL;
+		if (fa->fsx_xflags & FS_XFLAG_REALTIME)
+			return -EINVAL;
+	}
+
 	ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
 	ip->i_diflags2 = i_flags2;
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index d19cca099bc3..a7553b8bcc64 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -287,6 +287,7 @@ typedef struct xfs_mount {
 #define XFS_FEAT_BIGTIME	(1ULL << 24)	/* large timestamps */
 #define XFS_FEAT_NEEDSREPAIR	(1ULL << 25)	/* needs xfs_repair */
 #define XFS_FEAT_NREXT64	(1ULL << 26)	/* large extent counters */
+#define XFS_FEAT_FORCEALIGN	(1ULL << 27)	/* aligned file data extents */
 
 /* Mount features */
 #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
@@ -350,6 +351,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT)
 __XFS_HAS_FEAT(bigtime, BIGTIME)
 __XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
 __XFS_HAS_FEAT(large_extent_counts, NREXT64)
+__XFS_HAS_FEAT(forcealign, FORCEALIGN)
 
 /*
  * Mount features
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 819a3568b28f..1bbb25df23a7 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1703,6 +1703,10 @@ xfs_fs_fill_super(
 		mp->m_features &= ~XFS_FEAT_DISCARD;
 	}
 
+	if (xfs_has_forcealign(mp))
+		xfs_warn(mp,
+"EXPERIMENTAL forced data extent alignment feature in use. Use at your own risk!");
+
 	if (xfs_has_reflink(mp)) {
 		if (mp->m_sb.sb_rblocks) {
 			xfs_alert(mp,
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index e3b4f5bc6860..cbfb09bc1717 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -140,6 +140,8 @@ struct fsxattr {
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
+/* data extent mappings for regular files must be aligned to extent size hint */
+#define FS_XFLAG_FORCEALIGN	0x00020000
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is
-- 
2.31.1



* [PATCH 13/21] fs: xfs: Make file data allocations observe the 'forcealign' flag
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (11 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 12/21] fs: xfs: Introduce FORCEALIGN inode flag John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-10-03  1:42   ` Dave Chinner
  2023-09-29 10:27 ` [PATCH 14/21] fs: xfs: Enable file data forcealign feature John Garry
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

From: "Darrick J. Wong" <djwong@kernel.org>

The existing extsize hint code already does the work of expanding file
range mapping requests so that the range is aligned to the hint value.
Now add the code needed to guarantee that the space allocations
themselves are also always aligned.

XXX: still need to check all this with reflink
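
As a rough illustration of how an application would opt in (not part of
this patch; the helper below is only a sketch and assumes the
FS_XFLAG_FORCEALIGN definition added earlier in the series), the extent
size hint and the forcealign flag are set together through the existing
fsxattr ioctls:

  /*
   * Sketch only: mark a file for forced extent alignment at, e.g., a
   * 16 KiB granularity.  The ioctl path added in the previous patch
   * expects a nonzero extent size hint together with the new
   * FS_XFLAG_FORCEALIGN flag, and no CoW extent size hint.
   */
  #include <sys/ioctl.h>
  #include <linux/fs.h>

  static int set_forcealign(int fd, unsigned int alignment_bytes)
  {
          struct fsxattr fsx;

          if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                  return -1;

          fsx.fsx_xflags |= FS_XFLAG_EXTSIZE | FS_XFLAG_FORCEALIGN;
          fsx.fsx_extsize = alignment_bytes;      /* e.g. 16384 */

          return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
  }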

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Co-developed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
 fs/xfs/xfs_iomap.c       |  4 +++-
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 328134c22104..6c864dc0a6ff 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3328,6 +3328,19 @@ xfs_bmap_compute_alignments(
 		align = xfs_get_cowextsz_hint(ap->ip);
 	else if (ap->datatype & XFS_ALLOC_USERDATA)
 		align = xfs_get_extsz_hint(ap->ip);
+
+	/*
+	 * xfs_get_cowextsz_hint() returns the extsz hint when forcealign is
+	 * set, since forcealign and cowextsz_hint are mutually exclusive.
+	 */
+	if (xfs_inode_forcealign(ap->ip) && align) {
+		args->alignment = align;
+		if (stripe_align % align)
+			stripe_align = align;
+	} else {
+		args->alignment = 1;
+	}
+
 	if (align) {
 		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
 					ap->eof, 0, ap->conv, &ap->offset,
@@ -3423,7 +3436,6 @@ xfs_bmap_exact_minlen_extent_alloc(
 	args.minlen = args.maxlen = ap->minlen;
 	args.total = ap->total;
 
-	args.alignment = 1;
 	args.minalignslop = 0;
 
 	args.minleft = ap->minleft;
@@ -3469,6 +3481,7 @@ xfs_bmap_btalloc_at_eof(
 {
 	struct xfs_mount	*mp = args->mp;
 	struct xfs_perag	*caller_pag = args->pag;
+	int			orig_alignment = args->alignment;
 	int			error;
 
 	/*
@@ -3543,10 +3556,10 @@ xfs_bmap_btalloc_at_eof(
 
 	/*
 	 * Allocation failed, so turn return the allocation args to their
-	 * original non-aligned state so the caller can proceed on allocation
-	 * failure as if this function was never called.
+	 * original state so the caller can proceed on allocation failure as
+	 * if this function was never called.
 	 */
-	args->alignment = 1;
+	args->alignment = orig_alignment;
 	return 0;
 }
 
@@ -3694,7 +3707,6 @@ xfs_bmap_btalloc(
 		.wasdel		= ap->wasdel,
 		.resv		= XFS_AG_RESV_NONE,
 		.datatype	= ap->datatype,
-		.alignment	= 1,
 		.minalignslop	= 0,
 	};
 	xfs_fileoff_t		orig_offset;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 18c8f168b153..70fe873951f3 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -181,7 +181,9 @@ xfs_eof_alignment(
 		 * If mounted with the "-o swalloc" option the alignment is
 		 * increased from the strip unit size to the stripe width.
 		 */
-		if (mp->m_swidth && xfs_has_swalloc(mp))
+		if (xfs_inode_forcealign(ip))
+			align = xfs_get_extsz_hint(ip);
+		else if (mp->m_swidth && xfs_has_swalloc(mp))
 			align = mp->m_swidth;
 		else if (mp->m_dalign)
 			align = mp->m_dalign;
-- 
2.31.1



* [PATCH 14/21] fs: xfs: Enable file data forcealign feature
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (12 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 13/21] fs: xfs: Make file data allocations observe the 'forcealign' flag John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-09-29 10:27 ` [PATCH 15/21] fs: xfs: Support atomic write for statx John Garry
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

From: "Darrick J. Wong" <djwong@kernel.org>

Enable this feature.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index d718b73f48ca..afb843b14074 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -358,7 +358,8 @@ xfs_sb_has_compat_feature(
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
 		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
-		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+		 XFS_SB_FEAT_RO_COMPAT_FORCEALIGN)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
-- 
2.31.1



* [PATCH 15/21] fs: xfs: Support atomic write for statx
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (13 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 14/21] fs: xfs: Enable file data forcealign feature John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-10-03  3:32   ` Dave Chinner
  2023-09-29 10:27 ` [PATCH 16/21] fs: iomap: Atomic write support John Garry
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

Support providing info on the atomic write unit min and max for an inode.

For simplicity, we currently limit the minimum to the FS block size, but a
lower limit could be supported in the future.

The atomic write unit min and max are limited by the guaranteed extent
alignment for the inode.
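
For reference, a userspace query might look like the sketch below
(illustrative only; it assumes the new struct statx fields are exposed
as stx_atomic_write_unit_{min,max} and uses the STATX_WRITE_ATOMIC
request mask from this patch, so the final uapi names may differ):

  /* Sketch: report the per-file atomic write limits via statx(). */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <sys/stat.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
          struct statx stx = { 0 };

          if (argc < 2)
                  return 1;
          if (statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx) < 0)
                  return 1;
          if (stx.stx_mask & STATX_WRITE_ATOMIC)
                  printf("atomic write unit: min %u, max %u bytes\n",
                         stx.stx_atomic_write_unit_min,
                         stx.stx_atomic_write_unit_max);
          return 0;
  }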

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iops.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iops.h |  4 ++++
 2 files changed, 55 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 1c1e6171209d..5bff80748223 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -546,6 +546,46 @@ xfs_stat_blksize(
 	return PAGE_SIZE;
 }
 
+void xfs_ip_atomic_write_attr(struct xfs_inode *ip,
+			xfs_filblks_t *unit_min_fsb,
+			xfs_filblks_t *unit_max_fsb)
+{
+	xfs_extlen_t		extsz_hint = xfs_get_extsz_hint(ip);
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	struct block_device	*bdev = target->bt_bdev;
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_filblks_t		atomic_write_unit_min,
+				atomic_write_unit_max,
+				align;
+
+	atomic_write_unit_min = XFS_B_TO_FSB(mp,
+		queue_atomic_write_unit_min_bytes(bdev->bd_queue));
+	atomic_write_unit_max = XFS_B_TO_FSB(mp,
+		queue_atomic_write_unit_max_bytes(bdev->bd_queue));
+
+	/* for RT, unset extsize gives hint of 1 */
+	/* for !RT, unset extsize gives hint of 0 */
+	if (extsz_hint && (XFS_IS_REALTIME_INODE(ip) ||
+	    (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)))
+		align = extsz_hint;
+	else
+		align = 1;
+
+	if (atomic_write_unit_max == 0) {
+		*unit_min_fsb = 0;
+		*unit_max_fsb = 0;
+	} else if (atomic_write_unit_min == 0) {
+		*unit_min_fsb = 1;
+		*unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
+					align);
+	} else {
+		*unit_min_fsb = min_t(xfs_filblks_t, atomic_write_unit_min,
+					align);
+		*unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
+					align);
+	}
+}
+
 STATIC int
 xfs_vn_getattr(
 	struct mnt_idmap	*idmap,
@@ -614,6 +654,17 @@ xfs_vn_getattr(
 			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
 			stat->dio_offset_align = bdev_logical_block_size(bdev);
 		}
+		if (request_mask & STATX_WRITE_ATOMIC) {
+			xfs_filblks_t unit_min_fsb, unit_max_fsb;
+
+			xfs_ip_atomic_write_attr(ip, &unit_min_fsb,
+				&unit_max_fsb);
+			stat->atomic_write_unit_min = XFS_FSB_TO_B(mp, unit_min_fsb);
+			stat->atomic_write_unit_max = XFS_FSB_TO_B(mp, unit_max_fsb);
+			stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
+			stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
+			stat->result_mask |= STATX_WRITE_ATOMIC;
+		}
 		fallthrough;
 	default:
 		stat->blksize = xfs_stat_blksize(ip);
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 7f84a0843b24..b1e683b04301 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -19,4 +19,8 @@ int xfs_vn_setattr_size(struct mnt_idmap *idmap,
 int xfs_inode_init_security(struct inode *inode, struct inode *dir,
 		const struct qstr *qstr);
 
+void xfs_ip_atomic_write_attr(struct xfs_inode *ip,
+			xfs_filblks_t *unit_min_fsb,
+			xfs_filblks_t *unit_max_fsb);
+
 #endif /* __XFS_IOPS_H__ */
-- 
2.31.1



* [PATCH 16/21] fs: iomap: Atomic write support
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (14 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 15/21] fs: xfs: Support atomic write for statx John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-10-03  4:24   ` Dave Chinner
  2023-09-29 10:27 ` [PATCH 17/21] fs: xfs: iomap atomic " John Garry
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

Add flag IOMAP_ATOMIC_WRITE to indicate to the FS that an atomic write
bio is being created and that all the associated rules need to be followed.

It is the task of the FS iomap iter callbacks to ensure that the mapping
created adheres to those rules, such as the size being a power-of-2, the
offset being naturally aligned, etc.

In iomap_dio_bio_iter(), ensure that, for a non-dsync iocb, the mapping
is neither dirty nor unmapped.

An atomic write should only produce a single bio, so return an error when
it doesn't.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/direct-io.c  | 26 ++++++++++++++++++++++++--
 fs/iomap/trace.h      |  3 ++-
 include/linux/iomap.h |  1 +
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..6ef25e26f1a1 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -275,10 +275,11 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		struct iomap_dio *dio)
 {
+	bool atomic_write = iter->flags & IOMAP_ATOMIC_WRITE;
 	const struct iomap *iomap = &iter->iomap;
 	struct inode *inode = iter->inode;
 	unsigned int fs_block_size = i_blocksize(inode), pad;
-	loff_t length = iomap_length(iter);
+	const loff_t length = iomap_length(iter);
 	loff_t pos = iter->pos;
 	blk_opf_t bio_opf;
 	struct bio *bio;
@@ -292,6 +293,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
 		return -EINVAL;
 
+	if (atomic_write && !iocb_is_dsync(dio->iocb)) {
+		if (iomap->flags & IOMAP_F_DIRTY)
+			return -EIO;
+		if (iomap->type != IOMAP_MAPPED)
+			return -EIO;
+	}
+
 	if (iomap->type == IOMAP_UNWRITTEN) {
 		dio->flags |= IOMAP_DIO_UNWRITTEN;
 		need_zeroout = true;
@@ -381,6 +389,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 					  GFP_KERNEL);
 		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
 		bio->bi_ioprio = dio->iocb->ki_ioprio;
+		if (atomic_write)
+			bio->bi_opf |= REQ_ATOMIC;
+
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
 
@@ -397,6 +408,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		}
 
 		n = bio->bi_iter.bi_size;
+		if (atomic_write && n != length) {
+			/* This bio should have covered the complete length */
+			ret = -EINVAL;
+			bio_put(bio);
+			goto out;
+		}
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			task_io_account_write(n);
 		} else {
@@ -554,6 +571,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	struct blk_plug plug;
 	struct iomap_dio *dio;
 	loff_t ret = 0;
+	bool is_read = iov_iter_rw(iter) == READ;
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 
 	trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
 
@@ -579,7 +598,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iomi.flags |= IOMAP_NOWAIT;
 
-	if (iov_iter_rw(iter) == READ) {
+	if (is_read) {
 		/* reads can always complete inline */
 		dio->flags |= IOMAP_DIO_INLINE_COMP;
 
@@ -605,6 +624,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		if (iocb->ki_flags & IOCB_DIO_CALLER_COMP)
 			dio->flags |= IOMAP_DIO_CALLER_COMP;
 
+		if (atomic_write)
+			iomi.flags |= IOMAP_ATOMIC_WRITE;
+
 		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
 			ret = -EAGAIN;
 			if (iomi.pos >= dio->i_size ||
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index c16fd55f5595..f9932733c180 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
 	{ IOMAP_REPORT,		"REPORT" }, \
 	{ IOMAP_FAULT,		"FAULT" }, \
 	{ IOMAP_DIRECT,		"DIRECT" }, \
-	{ IOMAP_NOWAIT,		"NOWAIT" }
+	{ IOMAP_NOWAIT,		"NOWAIT" }, \
+	{ IOMAP_ATOMIC_WRITE,	"ATOMIC" }
 
 #define IOMAP_F_FLAGS_STRINGS \
 	{ IOMAP_F_NEW,		"NEW" }, \
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 96dd0acbba44..5138cede54fc 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -178,6 +178,7 @@ struct iomap_folio_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
+#define IOMAP_ATOMIC_WRITE	(1 << 9)
 
 struct iomap_ops {
 	/*
-- 
2.31.1



* [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (15 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 16/21] fs: iomap: Atomic write support John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-11-09 15:26   ` Christoph Hellwig
  2023-09-29 10:27 ` [PATCH 18/21] scsi: sd: Support reading atomic properties from block limits VPD John Garry
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

Ensure that when creating a mapping we adhere to all the atomic
write rules.

We check that the mapping covers the complete range of the write to ensure
that we'll be creating just a single mapping.

Currently the minimum granularity is the FS block size, but it should be
possible to support a smaller granularity in the future.
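
For reference, in byte terms the atomic write rules that the mapping must
support amount to something like the check below (sketch only, not part
of the patch; names are illustrative, and unit_min/unit_max are the
per-inode limits reported via statx in the earlier patch):

  #include <stdbool.h>
  #include <stdint.h>

  /*
   * Sketch: constraints on an atomic write of 'len' bytes at file
   * offset 'pos'.
   */
  static bool atomic_write_len_ok(uint64_t pos, uint64_t len,
                                  uint64_t unit_min, uint64_t unit_max)
  {
          if (len < unit_min || len > unit_max)
                  return false;
          if (len & (len - 1))            /* must be a power-of-2 */
                  return false;
          if (pos & (len - 1))            /* must be naturally aligned */
                  return false;
          return true;
  }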

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iomap.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 70fe873951f3..3424fcfc04f5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -783,6 +783,7 @@ xfs_direct_write_iomap_begin(
 {
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_sb		*m_sb = &mp->m_sb;
 	struct xfs_bmbt_irec	imap, cmap;
 	xfs_fileoff_t		offset_fsb = XFS_B_TO_FSBT(mp, offset);
 	xfs_fileoff_t		end_fsb = xfs_iomap_end_fsb(mp, offset, length);
@@ -814,6 +815,41 @@ xfs_direct_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
+	if (flags & IOMAP_ATOMIC_WRITE) {
+		xfs_filblks_t unit_min_fsb, unit_max_fsb;
+
+		xfs_ip_atomic_write_attr(ip, &unit_min_fsb, &unit_max_fsb);
+
+		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
+			error = -EIO;
+			goto out_unlock;
+		}
+
+		if (offset % m_sb->sb_blocksize ||
+		    length % m_sb->sb_blocksize) {
+			error = -EIO;
+			goto out_unlock;
+		}
+
+		if (imap.br_blockcount == unit_min_fsb ||
+		    imap.br_blockcount == unit_max_fsb) {
+			/* min and max must be a power-of-2 */
+		} else if (imap.br_blockcount < unit_min_fsb ||
+			   imap.br_blockcount > unit_max_fsb) {
+			error = -EIO;
+			goto out_unlock;
+		} else if (!is_power_of_2(imap.br_blockcount)) {
+			error = -EIO;
+			goto out_unlock;
+		}
+
+		if (imap.br_startoff &&
+		    imap.br_startoff % imap.br_blockcount) {
+			error =  -EIO;
+			goto out_unlock;
+		}
+	}
+
 	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
 		error = -EAGAIN;
 		if (flags & IOMAP_NOWAIT)
-- 
2.31.1



* [PATCH 18/21] scsi: sd: Support reading atomic properties from block limits VPD
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (16 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 17/21] fs: xfs: iomap atomic " John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-09-29 17:54   ` Bart Van Assche
  2023-09-29 10:27 ` [PATCH 19/21] scsi: sd: Add WRITE_ATOMIC_16 support John Garry
                   ` (3 subsequent siblings)
  21 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

Also update the block layer request queue sysfs properties.

See sbc4r22 section 6.6.4 - Block limits VPD page.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/sd.c | 37 ++++++++++++++++++++++++++++++++++++-
 drivers/scsi/sd.h |  7 +++++++
 2 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index c92a317ba547..7f6cadd1f8f3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -837,6 +837,33 @@ static blk_status_t sd_setup_unmap_cmnd(struct scsi_cmnd *cmd)
 	return scsi_alloc_sgtables(cmd);
 }
 
+static void sd_config_atomic(struct scsi_disk *sdkp)
+{
+	unsigned int logical_block_size = sdkp->device->sector_size;
+	struct request_queue *q = sdkp->disk->queue;
+
+	if (sdkp->max_atomic) {
+		unsigned int physical_block_size_sectors =
+			sdkp->physical_block_size / sdkp->device->sector_size;
+		unsigned int max_atomic = max_t(unsigned int,
+			rounddown_pow_of_two(sdkp->max_atomic),
+			rounddown_pow_of_two(sdkp->max_atomic_with_boundary));
+		unsigned int unit_min = sdkp->atomic_granularity ?
+			rounddown_pow_of_two(sdkp->atomic_granularity) :
+			physical_block_size_sectors;
+		unsigned int unit_max = max_atomic;
+
+		if (sdkp->max_atomic_boundary)
+			unit_max = min_t(unsigned int, unit_max,
+				rounddown_pow_of_two(sdkp->max_atomic_boundary));
+
+		blk_queue_atomic_write_max_bytes(q, max_atomic * logical_block_size);
+		blk_queue_atomic_write_unit_min_sectors(q, unit_min);
+		blk_queue_atomic_write_unit_max_sectors(q, unit_max);
+		blk_queue_atomic_write_boundary_bytes(q, 0);
+	}
+}
+
 static blk_status_t sd_setup_write_same16_cmnd(struct scsi_cmnd *cmd,
 		bool unmap)
 {
@@ -2982,7 +3009,7 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 		sdkp->max_ws_blocks = (u32)get_unaligned_be64(&vpd->data[36]);
 
 		if (!sdkp->lbpme)
-			goto out;
+			goto read_atomics;
 
 		lba_count = get_unaligned_be32(&vpd->data[20]);
 		desc_count = get_unaligned_be32(&vpd->data[24]);
@@ -3013,6 +3040,14 @@ static void sd_read_block_limits(struct scsi_disk *sdkp)
 			else
 				sd_config_discard(sdkp, SD_LBP_DISABLE);
 		}
+read_atomics:
+		sdkp->max_atomic = get_unaligned_be32(&vpd->data[44]);
+		sdkp->atomic_alignment  = get_unaligned_be32(&vpd->data[48]);
+		sdkp->atomic_granularity  = get_unaligned_be32(&vpd->data[52]);
+		sdkp->max_atomic_with_boundary  = get_unaligned_be32(&vpd->data[56]);
+		sdkp->max_atomic_boundary = get_unaligned_be32(&vpd->data[60]);
+
+		sd_config_atomic(sdkp);
 	}
 
  out:
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 5eea762f84d1..bca05fbd74df 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -121,6 +121,13 @@ struct scsi_disk {
 	u32		max_unmap_blocks;
 	u32		unmap_granularity;
 	u32		unmap_alignment;
+
+	u32		max_atomic;
+	u32		atomic_alignment;
+	u32		atomic_granularity;
+	u32		max_atomic_with_boundary;
+	u32		max_atomic_boundary;
+
 	u32		index;
 	unsigned int	physical_block_size;
 	unsigned int	max_medium_access_timeouts;
-- 
2.31.1



* [PATCH 19/21] scsi: sd: Add WRITE_ATOMIC_16 support
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (17 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 18/21] scsi: sd: Support reading atomic properties from block limits VPD John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-09-29 17:59   ` Bart Van Assche
  2023-09-29 10:27 ` [PATCH 20/21] scsi: scsi_debug: Atomic write support John Garry
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

Add function sd_setup_atomic_cmnd() to set up a WRITE_ATOMIC_16
CDB when the REQ_ATOMIC flag is set for the request.

Also add trace info.
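
For reference, the CDB built by sd_setup_atomic_cmnd() is laid out as
follows (per sbc4r22; the boundary, group number and control bytes are
left at zero here):

  byte 0:       opcode (WRITE ATOMIC(16), 0x9c)
  byte 1:       wrprotect/dpo/fua flags
  bytes 2-9:    logical block address (big-endian)
  bytes 10-11:  atomic boundary (unused, 0)
  bytes 12-13:  transfer length in logical blocks (big-endian)
  bytes 14-15:  group number / control (0)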

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_trace.c   | 22 ++++++++++++++++++++++
 drivers/scsi/sd.c           | 20 ++++++++++++++++++++
 include/scsi/scsi_proto.h   |  1 +
 include/trace/events/scsi.h |  1 +
 4 files changed, 44 insertions(+)

diff --git a/drivers/scsi/scsi_trace.c b/drivers/scsi/scsi_trace.c
index 41a950075913..3e47c4472a80 100644
--- a/drivers/scsi/scsi_trace.c
+++ b/drivers/scsi/scsi_trace.c
@@ -325,6 +325,26 @@ scsi_trace_zbc_out(struct trace_seq *p, unsigned char *cdb, int len)
 	return ret;
 }
 
+static const char *
+scsi_trace_atomic_write16_out(struct trace_seq *p, unsigned char *cdb, int len)
+{
+	const char *ret = trace_seq_buffer_ptr(p);
+	unsigned int boundary_size;
+	unsigned int nr_blocks;
+	sector_t lba;
+
+	lba = get_unaligned_be64(&cdb[2]);
+	boundary_size = get_unaligned_be16(&cdb[10]);
+	nr_blocks = get_unaligned_be16(&cdb[12]);
+
+	trace_seq_printf(p, "lba=%llu txlen=%u boundary_size=%u",
+			  lba, nr_blocks, boundary_size);
+
+	trace_seq_putc(p, 0);
+
+	return ret;
+}
+
 static const char *
 scsi_trace_varlen(struct trace_seq *p, unsigned char *cdb, int len)
 {
@@ -385,6 +405,8 @@ scsi_trace_parse_cdb(struct trace_seq *p, unsigned char *cdb, int len)
 		return scsi_trace_zbc_in(p, cdb, len);
 	case ZBC_OUT:
 		return scsi_trace_zbc_out(p, cdb, len);
+	case WRITE_ATOMIC_16:
+		return scsi_trace_atomic_write16_out(p, cdb, len);
 	default:
 		return scsi_trace_misc(p, cdb, len);
 	}
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 7f6cadd1f8f3..1a41656dac2d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1129,6 +1129,23 @@ static int sd_cdl_dld(struct scsi_disk *sdkp, struct scsi_cmnd *scmd)
 	return (hint - IOPRIO_HINT_DEV_DURATION_LIMIT_1) + 1;
 }
 
+static blk_status_t sd_setup_atomic_cmnd(struct scsi_cmnd *cmd,
+					sector_t lba, unsigned int nr_blocks,
+					unsigned char flags)
+{
+	cmd->cmd_len  = 16;
+	cmd->cmnd[0]  = WRITE_ATOMIC_16;
+	cmd->cmnd[1]  = flags;
+	put_unaligned_be64(lba, &cmd->cmnd[2]);
+	cmd->cmnd[10] = 0;
+	cmd->cmnd[11] = 0;
+	put_unaligned_be16(nr_blocks, &cmd->cmnd[12]);
+	cmd->cmnd[14] = 0;
+	cmd->cmnd[15] = 0;
+
+	return BLK_STS_OK;
+}
+
 static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 {
 	struct request *rq = scsi_cmd_to_rq(cmd);
@@ -1139,6 +1156,7 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	unsigned int nr_blocks = sectors_to_logical(sdp, blk_rq_sectors(rq));
 	unsigned int mask = logical_to_sectors(sdp, 1) - 1;
 	bool write = rq_data_dir(rq) == WRITE;
+	bool atomic_write = !!(rq->cmd_flags & REQ_ATOMIC) && write;
 	unsigned char protect, fua;
 	unsigned int dld;
 	blk_status_t ret;
@@ -1200,6 +1218,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	if (protect && sdkp->protection_type == T10_PI_TYPE2_PROTECTION) {
 		ret = sd_setup_rw32_cmnd(cmd, write, lba, nr_blocks,
 					 protect | fua, dld);
+	} else if (atomic_write) {
+		ret = sd_setup_atomic_cmnd(cmd, lba, nr_blocks, protect | fua);
 	} else if (sdp->use_16_for_rw || (nr_blocks > 0xffff)) {
 		ret = sd_setup_rw16_cmnd(cmd, write, lba, nr_blocks,
 					 protect | fua, dld);
diff --git a/include/scsi/scsi_proto.h b/include/scsi/scsi_proto.h
index 07d65c1f59db..833de67305b5 100644
--- a/include/scsi/scsi_proto.h
+++ b/include/scsi/scsi_proto.h
@@ -119,6 +119,7 @@
 #define WRITE_SAME_16	      0x93
 #define ZBC_OUT		      0x94
 #define ZBC_IN		      0x95
+#define WRITE_ATOMIC_16	0x9c
 #define SERVICE_ACTION_BIDIRECTIONAL 0x9d
 #define SERVICE_ACTION_IN_16  0x9e
 #define SERVICE_ACTION_OUT_16 0x9f
diff --git a/include/trace/events/scsi.h b/include/trace/events/scsi.h
index 8e2d9b1b0e77..05f1945ed204 100644
--- a/include/trace/events/scsi.h
+++ b/include/trace/events/scsi.h
@@ -102,6 +102,7 @@
 		scsi_opcode_name(WRITE_32),			\
 		scsi_opcode_name(WRITE_SAME_32),		\
 		scsi_opcode_name(ATA_16),			\
+		scsi_opcode_name(WRITE_ATOMIC_16),		\
 		scsi_opcode_name(ATA_12))
 
 #define scsi_hostbyte_name(result)	{ result, #result }
-- 
2.31.1



* [PATCH 20/21] scsi: scsi_debug: Atomic write support
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (18 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 19/21] scsi: sd: Add WRITE_ATOMIC_16 support John Garry
@ 2023-09-29 10:27 ` John Garry
  2023-09-29 10:27 ` [PATCH 21/21] nvme: Support atomic writes John Garry
  2023-09-29 14:58 ` [PATCH 00/21] block " Bart Van Assche
  21 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, John Garry

Add initial support for atomic writes.

As is the standard method, feed the device properties via module params,
those being:
- atomic_max_size_blks
- atomic_alignment_blks
- atomic_granularity_blks
- atomic_max_size_with_boundary_blks
- atomic_max_boundary_blks

These just match sbc4r22 section 6.6.4 - Block limits VPD page.

We just support WRITE_ATOMIC_16.

The major change in the driver is how we lock the device for RW accesses.

Currently the driver uses a per-device lock for accessing device metadata
and "media" data (calls to do_device_access()) atomically for the duration
of the whole read/write command.

This does not suit verifying atomic writes, the reason being that
currently all reads/writes are atomic, so using atomic writes does not
prove anything.

Change the device access model so that regular writes are only atomic on a
per-sector basis, while reads and atomic writes are fully atomic.

As mentioned, since accessing metadata and device media is atomic,
continue to treat regular writes involving metadata - like discard or PI -
as atomic. We can improve this later.

Currently we only support the model where overlapping ongoing reads or
writes wait for the current access to complete before an atomic write
commences. This is described in section 4.29.3.2 of the SBC. However, we
simplify things and wait for all accesses to complete (when issuing an
atomic write).

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/scsi/scsi_debug.c | 587 +++++++++++++++++++++++++++++---------
 1 file changed, 454 insertions(+), 133 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 9c0af50501f9..c27f9bfcd365 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -66,6 +66,8 @@ static const char *sdebug_version_date = "20210520";
 
 /* Additional Sense Code (ASC) */
 #define NO_ADDITIONAL_SENSE 0x0
+#define OVERLAP_ATOMIC_COMMAND_ASC 0x0
+#define OVERLAP_ATOMIC_COMMAND_ASCQ 0x23
 #define LOGICAL_UNIT_NOT_READY 0x4
 #define LOGICAL_UNIT_COMMUNICATION_FAILURE 0x8
 #define UNRECOVERED_READ_ERR 0x11
@@ -100,6 +102,7 @@ static const char *sdebug_version_date = "20210520";
 #define READ_BOUNDARY_ASCQ 0x7
 #define ATTEMPT_ACCESS_GAP 0x9
 #define INSUFF_ZONE_ASCQ 0xe
+/* see drivers/scsi/sense_codes.h */
 
 /* Additional Sense Code Qualifier (ASCQ) */
 #define ACK_NAK_TO 0x3
@@ -149,6 +152,12 @@ static const char *sdebug_version_date = "20210520";
 #define DEF_VIRTUAL_GB   0
 #define DEF_VPD_USE_HOSTNO 1
 #define DEF_WRITESAME_LENGTH 0xFFFF
+#define DEF_ATOMIC_WRITE 1
+#define DEF_ATOMIC_MAX_LENGTH 8192
+#define DEF_ATOMIC_ALIGNMENT 2
+#define DEF_ATOMIC_GRANULARITY 2
+#define DEF_ATOMIC_BOUNDARY_MAX_LENGTH (DEF_ATOMIC_MAX_LENGTH)
+#define DEF_ATOMIC_MAX_BOUNDARY 128
 #define DEF_STRICT 0
 #define DEF_STATISTICS false
 #define DEF_SUBMIT_QUEUES 1
@@ -322,7 +331,9 @@ struct sdebug_host_info {
 
 /* There is an xarray of pointers to this struct's objects, one per host */
 struct sdeb_store_info {
-	rwlock_t macc_lck;	/* for atomic media access on this store */
+	rwlock_t macc_data_lck;	/* for media data access on this store */
+	rwlock_t macc_meta_lck;	/* for atomic media meta access on this store */
+	rwlock_t macc_sector_lck;	/* per-sector media data access on this store */
 	u8 *storep;		/* user data storage (ram) */
 	struct t10_pi_tuple *dif_storep; /* protection info */
 	void *map_storep;	/* provisioning map */
@@ -346,12 +357,20 @@ struct sdebug_defer {
 	enum sdeb_defer_type defer_t;
 };
 
+struct sdebug_device_access_info {
+	bool atomic_write;
+	u64 lba;
+	u32 num;
+	struct scsi_cmnd *self;
+};
+
 struct sdebug_queued_cmd {
 	/* corresponding bit set in in_use_bm[] in owning struct sdebug_queue
 	 * instance indicates this slot is in use.
 	 */
 	struct sdebug_defer sd_dp;
 	struct scsi_cmnd *scmd;
+	struct sdebug_device_access_info *i;
 };
 
 struct sdebug_scsi_cmd {
@@ -411,7 +430,8 @@ enum sdeb_opcode_index {
 	SDEB_I_PRE_FETCH = 29,		/* 10, 16 */
 	SDEB_I_ZONE_OUT = 30,		/* 0x94+SA; includes no data xfer */
 	SDEB_I_ZONE_IN = 31,		/* 0x95+SA; all have data-in */
-	SDEB_I_LAST_ELEM_P1 = 32,	/* keep this last (previous + 1) */
+	SDEB_I_ATOMIC_WRITE_16 = 32,	/* keep this last (previous + 1) */
+	SDEB_I_LAST_ELEM_P1 = 33,	/* keep this last (previous + 1) */
 };
 
 
@@ -445,7 +465,8 @@ static const unsigned char opcode_ind_arr[256] = {
 	0, 0, 0, SDEB_I_VERIFY,
 	SDEB_I_PRE_FETCH, SDEB_I_SYNC_CACHE, 0, SDEB_I_WRITE_SAME,
 	SDEB_I_ZONE_OUT, SDEB_I_ZONE_IN, 0, 0,
-	0, 0, 0, 0, 0, 0, SDEB_I_SERV_ACT_IN_16, SDEB_I_SERV_ACT_OUT_16,
+	0, 0, 0, 0,
+	SDEB_I_ATOMIC_WRITE_16, 0, SDEB_I_SERV_ACT_IN_16, SDEB_I_SERV_ACT_OUT_16,
 /* 0xa0; 0xa0->0xbf: 12 byte cdbs */
 	SDEB_I_REPORT_LUNS, SDEB_I_ATA_PT, 0, SDEB_I_MAINT_IN,
 	     SDEB_I_MAINT_OUT, 0, 0, 0,
@@ -493,6 +514,7 @@ static int resp_write_buffer(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_sync_cache(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_pre_fetch(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_report_zones(struct scsi_cmnd *, struct sdebug_dev_info *);
+static int resp_atomic_write(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_open_zone(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_close_zone(struct scsi_cmnd *, struct sdebug_dev_info *);
 static int resp_finish_zone(struct scsi_cmnd *, struct sdebug_dev_info *);
@@ -731,6 +753,11 @@ static const struct opcode_info_t opcode_info_arr[SDEB_I_LAST_ELEM_P1 + 1] = {
 	    resp_report_zones, zone_in_iarr, /* ZONE_IN(16), REPORT ZONES) */
 		{16,  0x0 /* SA */, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
 		 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xbf, 0xc7} },
+/* 31 */
+	{0, 0x0, 0x0, F_D_OUT | FF_MEDIA_IO,
+	    resp_atomic_write, NULL, /* ATOMIC WRITE 16 */
+		{16,  0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+		 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff} },
 /* sentinel */
 	{0xff, 0, 0, 0, NULL, NULL,		/* terminating element */
 	    {0,  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0} },
@@ -778,6 +805,12 @@ static unsigned int sdebug_unmap_granularity = DEF_UNMAP_GRANULARITY;
 static unsigned int sdebug_unmap_max_blocks = DEF_UNMAP_MAX_BLOCKS;
 static unsigned int sdebug_unmap_max_desc = DEF_UNMAP_MAX_DESC;
 static unsigned int sdebug_write_same_length = DEF_WRITESAME_LENGTH;
+static unsigned int sdebug_atomic_write = DEF_ATOMIC_WRITE;
+static unsigned int sdebug_atomic_max_size_blks = DEF_ATOMIC_MAX_LENGTH;
+static unsigned int sdebug_atomic_alignment_blks = DEF_ATOMIC_ALIGNMENT;
+static unsigned int sdebug_atomic_granularity_blks = DEF_ATOMIC_GRANULARITY;
+static unsigned int sdebug_atomic_max_size_with_boundary_blks = DEF_ATOMIC_BOUNDARY_MAX_LENGTH;
+static unsigned int sdebug_atomic_max_boundary_blks = DEF_ATOMIC_MAX_BOUNDARY;
 static int sdebug_uuid_ctl = DEF_UUID_CTL;
 static bool sdebug_random = DEF_RANDOM;
 static bool sdebug_per_host_store = DEF_PER_HOST_STORE;
@@ -873,6 +906,11 @@ static inline bool scsi_debug_lbp(void)
 		(sdebug_lbpu || sdebug_lbpws || sdebug_lbpws10);
 }
 
+static inline bool scsi_debug_atomic_write(void)
+{
+	return 0 == sdebug_fake_rw && sdebug_atomic_write;
+}
+
 static void *lba2fake_store(struct sdeb_store_info *sip,
 			    unsigned long long lba)
 {
@@ -1500,6 +1538,14 @@ static int inquiry_vpd_b0(unsigned char *arr)
 	/* Maximum WRITE SAME Length */
 	put_unaligned_be64(sdebug_write_same_length, &arr[32]);
 
+	if (sdebug_atomic_write) {
+		put_unaligned_be32(sdebug_atomic_max_size_blks, &arr[40]);
+		put_unaligned_be32(sdebug_atomic_alignment_blks, &arr[44]);
+		put_unaligned_be32(sdebug_atomic_granularity_blks, &arr[48]);
+		put_unaligned_be32(sdebug_atomic_max_size_with_boundary_blks, &arr[52]);
+		put_unaligned_be32(sdebug_atomic_max_boundary_blks, &arr[56]);
+	}
+
 	return 0x3c; /* Mandatory page length for Logical Block Provisioning */
 }
 
@@ -3001,15 +3047,240 @@ static inline struct sdeb_store_info *devip2sip(struct sdebug_dev_info *devip,
 	return xa_load(per_store_ap, devip->sdbg_host->si_idx);
 }
 
+
+static inline void
+sdeb_read_lock(rwlock_t *lock)
+{
+	if (sdebug_no_rwlock)
+		__acquire(lock);
+	else
+		read_lock(lock);
+}
+
+static inline void
+sdeb_read_unlock(rwlock_t *lock)
+{
+	if (sdebug_no_rwlock)
+		__release(lock);
+	else
+		read_unlock(lock);
+}
+
+static inline void
+sdeb_write_lock(rwlock_t *lock)
+{
+	if (sdebug_no_rwlock)
+		__acquire(lock);
+	else
+		write_lock(lock);
+}
+
+static inline void
+sdeb_write_unlock(rwlock_t *lock)
+{
+	if (sdebug_no_rwlock)
+		__release(lock);
+	else
+		write_unlock(lock);
+}
+
+static inline void
+sdeb_data_read_lock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_read_lock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_read_unlock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_read_unlock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_write_lock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_write_lock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_write_unlock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_write_unlock(&sip->macc_data_lck);
+}
+
+static inline void
+sdeb_data_sector_read_lock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_read_lock(&sip->macc_sector_lck);
+}
+
+static inline void
+sdeb_data_sector_read_unlock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_read_unlock(&sip->macc_sector_lck);
+}
+
+static inline void
+sdeb_data_sector_write_lock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_write_lock(&sip->macc_sector_lck);
+}
+
+static inline void
+sdeb_data_sector_write_unlock(struct sdeb_store_info *sip)
+{
+	BUG_ON(!sip);
+
+	sdeb_write_unlock(&sip->macc_sector_lck);
+}
+
+/*
+ * Atomic locking:
+ * We simplify the atomic model to allow only 1x atomic
+ * write and many non-atomic reads or writes for all
+ * LBAs.
+ *
+ * An RW lock has similar behaviour:
+ * only 1x writer and many readers.
+ *
+ * So use an RW lock for per-device read and write locking:
+ * an atomic access grabs the lock as a writer and a
+ * non-atomic access grabs the lock as a reader.
+ */
+
+static inline void
+sdeb_data_lock(struct sdeb_store_info *sip, bool atomic_write)
+{
+	if (atomic_write)
+		sdeb_data_write_lock(sip);
+	else
+		sdeb_data_read_lock(sip);
+}
+
+static inline void
+sdeb_data_unlock(struct sdeb_store_info *sip, bool atomic_write)
+{
+	if (atomic_write)
+		sdeb_data_write_unlock(sip);
+	else
+		sdeb_data_read_unlock(sip);
+}
+
+/* Allow many reads but only 1x write per sector */
+static inline void
+sdeb_data_sector_lock(struct sdeb_store_info *sip, bool do_write)
+{
+	if (do_write)
+		sdeb_data_sector_write_lock(sip);
+	else
+		sdeb_data_sector_read_lock(sip);
+}
+
+static inline void
+sdeb_data_sector_unlock(struct sdeb_store_info *sip, bool do_write)
+{
+	if (do_write)
+		sdeb_data_sector_write_unlock(sip);
+	else
+		sdeb_data_sector_read_unlock(sip);
+}
+
+static inline void
+sdeb_meta_read_lock(struct sdeb_store_info *sip)
+{
+	if (sdebug_no_rwlock) {
+		if (sip)
+			__acquire(&sip->macc_meta_lck);
+		else
+			__acquire(&sdeb_fake_rw_lck);
+	} else {
+		if (sip)
+			read_lock(&sip->macc_meta_lck);
+		else
+			read_lock(&sdeb_fake_rw_lck);
+	}
+}
+
+static inline void
+sdeb_meta_read_unlock(struct sdeb_store_info *sip)
+{
+	if (sdebug_no_rwlock) {
+		if (sip)
+			__release(&sip->macc_meta_lck);
+		else
+			__release(&sdeb_fake_rw_lck);
+	} else {
+		if (sip)
+			read_unlock(&sip->macc_meta_lck);
+		else
+			read_unlock(&sdeb_fake_rw_lck);
+	}
+}
+
+static inline void
+sdeb_meta_write_lock(struct sdeb_store_info *sip)
+{
+	if (sdebug_no_rwlock) {
+		if (sip)
+			__acquire(&sip->macc_meta_lck);
+		else
+			__acquire(&sdeb_fake_rw_lck);
+	} else {
+		if (sip)
+			write_lock(&sip->macc_meta_lck);
+		else
+			write_lock(&sdeb_fake_rw_lck);
+	}
+}
+
+static inline void
+sdeb_meta_write_unlock(struct sdeb_store_info *sip)
+{
+	if (sdebug_no_rwlock) {
+		if (sip)
+			__release(&sip->macc_meta_lck);
+		else
+			__release(&sdeb_fake_rw_lck);
+	} else {
+		if (sip)
+			write_unlock(&sip->macc_meta_lck);
+		else
+			write_unlock(&sdeb_fake_rw_lck);
+	}
+}
+
 /* Returns number of bytes copied or -1 if error. */
 static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp,
-			    u32 sg_skip, u64 lba, u32 num, bool do_write)
+			    u32 sg_skip, u64 lba, u32 num, bool do_write,
+			    bool atomic_write)
 {
 	int ret;
-	u64 block, rest = 0;
+	u64 block;
 	enum dma_data_direction dir;
 	struct scsi_data_buffer *sdb = &scp->sdb;
 	u8 *fsp;
+	int i;
+
+	/*
+	 * Even though reads are inherently atomic (in this driver), we expect
+	 * the atomic flag only for writes.
+	 */
+	if (!do_write && atomic_write)
+		return -1;
 
 	if (do_write) {
 		dir = DMA_TO_DEVICE;
@@ -3025,21 +3296,26 @@ static int do_device_access(struct sdeb_store_info *sip, struct scsi_cmnd *scp,
 	fsp = sip->storep;
 
 	block = do_div(lba, sdebug_store_sectors);
-	if (block + num > sdebug_store_sectors)
-		rest = block + num - sdebug_store_sectors;
 
-	ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
+	/* Only allow 1x atomic write or multiple non-atomic writes at any given time */
+	sdeb_data_lock(sip, atomic_write);
+	for (i = 0; i < num; i++) {
+		/* We shouldn't need to lock for atomic writes, but do it anyway */
+		sdeb_data_sector_lock(sip, do_write);
+		ret = sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
 		   fsp + (block * sdebug_sector_size),
-		   (num - rest) * sdebug_sector_size, sg_skip, do_write);
-	if (ret != (num - rest) * sdebug_sector_size)
-		return ret;
-
-	if (rest) {
-		ret += sg_copy_buffer(sdb->table.sgl, sdb->table.nents,
-			    fsp, rest * sdebug_sector_size,
-			    sg_skip + ((num - rest) * sdebug_sector_size),
-			    do_write);
+		   sdebug_sector_size, sg_skip, do_write);
+		sdeb_data_sector_unlock(sip, do_write);
+		if (ret != sdebug_sector_size) {
+			ret += (i * sdebug_sector_size);
+			break;
+		}
+		sg_skip += sdebug_sector_size;
+		if (++block >= sdebug_store_sectors)
+			block = 0;
 	}
+	ret = num * sdebug_sector_size;
+	sdeb_data_unlock(sip, atomic_write);
 
 	return ret;
 }
@@ -3215,70 +3491,6 @@ static int prot_verify_read(struct scsi_cmnd *scp, sector_t start_sec,
 	return ret;
 }
 
-static inline void
-sdeb_read_lock(struct sdeb_store_info *sip)
-{
-	if (sdebug_no_rwlock) {
-		if (sip)
-			__acquire(&sip->macc_lck);
-		else
-			__acquire(&sdeb_fake_rw_lck);
-	} else {
-		if (sip)
-			read_lock(&sip->macc_lck);
-		else
-			read_lock(&sdeb_fake_rw_lck);
-	}
-}
-
-static inline void
-sdeb_read_unlock(struct sdeb_store_info *sip)
-{
-	if (sdebug_no_rwlock) {
-		if (sip)
-			__release(&sip->macc_lck);
-		else
-			__release(&sdeb_fake_rw_lck);
-	} else {
-		if (sip)
-			read_unlock(&sip->macc_lck);
-		else
-			read_unlock(&sdeb_fake_rw_lck);
-	}
-}
-
-static inline void
-sdeb_write_lock(struct sdeb_store_info *sip)
-{
-	if (sdebug_no_rwlock) {
-		if (sip)
-			__acquire(&sip->macc_lck);
-		else
-			__acquire(&sdeb_fake_rw_lck);
-	} else {
-		if (sip)
-			write_lock(&sip->macc_lck);
-		else
-			write_lock(&sdeb_fake_rw_lck);
-	}
-}
-
-static inline void
-sdeb_write_unlock(struct sdeb_store_info *sip)
-{
-	if (sdebug_no_rwlock) {
-		if (sip)
-			__release(&sip->macc_lck);
-		else
-			__release(&sdeb_fake_rw_lck);
-	} else {
-		if (sip)
-			write_unlock(&sip->macc_lck);
-		else
-			write_unlock(&sdeb_fake_rw_lck);
-	}
-}
-
 static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 {
 	bool check_prot;
@@ -3288,6 +3500,7 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 	u64 lba;
 	struct sdeb_store_info *sip = devip2sip(devip, true);
 	u8 *cmd = scp->cmnd;
+	bool meta_data_locked = false;
 
 	switch (cmd[0]) {
 	case READ_16:
@@ -3346,6 +3559,10 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		atomic_set(&sdeb_inject_pending, 0);
 	}
 
+	/*
+	 * When checking device access params, for reads we only check data
+	 * versus what is set at init time, so no need to lock.
+	 */
 	ret = check_device_access_params(scp, lba, num, false);
 	if (ret)
 		return ret;
@@ -3365,29 +3582,33 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		return check_condition_result;
 	}
 
-	sdeb_read_lock(sip);
+	if (sdebug_dev_is_zoned(devip) ||
+	    (sdebug_dix && scsi_prot_sg_count(scp)))  {
+		sdeb_meta_read_lock(sip);
+		meta_data_locked = true;
+	}
 
 	/* DIX + T10 DIF */
 	if (unlikely(sdebug_dix && scsi_prot_sg_count(scp))) {
 		switch (prot_verify_read(scp, lba, num, ei_lba)) {
 		case 1: /* Guard tag error */
 			if (cmd[1] >> 5 != 3) { /* RDPROTECT != 3 */
-				sdeb_read_unlock(sip);
+				sdeb_meta_read_unlock(sip);
 				mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 1);
 				return check_condition_result;
 			} else if (scp->prot_flags & SCSI_PROT_GUARD_CHECK) {
-				sdeb_read_unlock(sip);
+				sdeb_meta_read_unlock(sip);
 				mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 1);
 				return illegal_condition_result;
 			}
 			break;
 		case 3: /* Reference tag error */
 			if (cmd[1] >> 5 != 3) { /* RDPROTECT != 3 */
-				sdeb_read_unlock(sip);
+				sdeb_meta_read_unlock(sip);
 				mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 3);
 				return check_condition_result;
 			} else if (scp->prot_flags & SCSI_PROT_REF_CHECK) {
-				sdeb_read_unlock(sip);
+				sdeb_meta_read_unlock(sip);
 				mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 3);
 				return illegal_condition_result;
 			}
@@ -3395,8 +3616,9 @@ static int resp_read_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		}
 	}
 
-	ret = do_device_access(sip, scp, 0, lba, num, false);
-	sdeb_read_unlock(sip);
+	ret = do_device_access(sip, scp, 0, lba, num, false, false);
+	if (meta_data_locked)
+		sdeb_meta_read_unlock(sip);
 	if (unlikely(ret == -1))
 		return DID_ERROR << 16;
 
@@ -3585,6 +3807,7 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 	u64 lba;
 	struct sdeb_store_info *sip = devip2sip(devip, true);
 	u8 *cmd = scp->cmnd;
+	bool meta_data_locked = false;
 
 	switch (cmd[0]) {
 	case WRITE_16:
@@ -3638,10 +3861,17 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 				    "to DIF device\n");
 	}
 
-	sdeb_write_lock(sip);
+	if (sdebug_dev_is_zoned(devip) ||
+	    (sdebug_dix && scsi_prot_sg_count(scp)) ||
+	    scsi_debug_lbp())  {
+		sdeb_meta_write_lock(sip);
+		meta_data_locked = true;
+	}
+
 	ret = check_device_access_params(scp, lba, num, true);
 	if (ret) {
-		sdeb_write_unlock(sip);
+		if (meta_data_locked)
+			sdeb_meta_write_unlock(sip);
 		return ret;
 	}
 
@@ -3650,22 +3880,22 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		switch (prot_verify_write(scp, lba, num, ei_lba)) {
 		case 1: /* Guard tag error */
 			if (scp->prot_flags & SCSI_PROT_GUARD_CHECK) {
-				sdeb_write_unlock(sip);
+				sdeb_meta_write_unlock(sip);
 				mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 1);
 				return illegal_condition_result;
 			} else if (scp->cmnd[1] >> 5 != 3) { /* WRPROTECT != 3 */
-				sdeb_write_unlock(sip);
+				sdeb_meta_write_unlock(sip);
 				mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 1);
 				return check_condition_result;
 			}
 			break;
 		case 3: /* Reference tag error */
 			if (scp->prot_flags & SCSI_PROT_REF_CHECK) {
-				sdeb_write_unlock(sip);
+				sdeb_meta_write_unlock(sip);
 				mk_sense_buffer(scp, ILLEGAL_REQUEST, 0x10, 3);
 				return illegal_condition_result;
 			} else if (scp->cmnd[1] >> 5 != 3) { /* WRPROTECT != 3 */
-				sdeb_write_unlock(sip);
+				sdeb_meta_write_unlock(sip);
 				mk_sense_buffer(scp, ABORTED_COMMAND, 0x10, 3);
 				return check_condition_result;
 			}
@@ -3673,13 +3903,16 @@ static int resp_write_dt0(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		}
 	}
 
-	ret = do_device_access(sip, scp, 0, lba, num, true);
+	ret = do_device_access(sip, scp, 0, lba, num, true, false);
 	if (unlikely(scsi_debug_lbp()))
 		map_region(sip, lba, num);
+
 	/* If ZBC zone then bump its write pointer */
 	if (sdebug_dev_is_zoned(devip))
 		zbc_inc_wp(devip, lba, num);
-	sdeb_write_unlock(sip);
+	if (meta_data_locked)
+		sdeb_meta_write_unlock(sip);
+
 	if (unlikely(-1 == ret))
 		return DID_ERROR << 16;
 	else if (unlikely(sdebug_verbose &&
@@ -3786,7 +4019,8 @@ static int resp_write_scat(struct scsi_cmnd *scp,
 		goto err_out;
 	}
 
-	sdeb_write_lock(sip);
+	/* Just keep it simple and always lock for now */
+	sdeb_meta_write_lock(sip);
 	sg_off = lbdof_blen;
 	/* Spec says Buffer xfer Length field in number of LBs in dout */
 	cum_lb = 0;
@@ -3829,7 +4063,11 @@ static int resp_write_scat(struct scsi_cmnd *scp,
 			}
 		}
 
-		ret = do_device_access(sip, scp, sg_off, lba, num, true);
+		/*
+		 * Write ranges atomically to keep as close to pre-atomic
+		 * writes behaviour as possible.
+		 */
+		ret = do_device_access(sip, scp, sg_off, lba, num, true, true);
 		/* If ZBC zone then bump its write pointer */
 		if (sdebug_dev_is_zoned(devip))
 			zbc_inc_wp(devip, lba, num);
@@ -3868,7 +4106,7 @@ static int resp_write_scat(struct scsi_cmnd *scp,
 	}
 	ret = 0;
 err_out_unlock:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 err_out:
 	kfree(lrdp);
 	return ret;
@@ -3887,14 +4125,16 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
 						scp->device->hostdata, true);
 	u8 *fs1p;
 	u8 *fsp;
+	bool meta_data_locked = false;
 
-	sdeb_write_lock(sip);
+	if (sdebug_dev_is_zoned(devip) || scsi_debug_lbp()) {
+		sdeb_meta_write_lock(sip);
+		meta_data_locked = true;
+	}
 
 	ret = check_device_access_params(scp, lba, num, true);
-	if (ret) {
-		sdeb_write_unlock(sip);
-		return ret;
-	}
+	if (ret)
+		goto out;
 
 	if (unmap && scsi_debug_lbp()) {
 		unmap_region(sip, lba, num);
@@ -3905,6 +4145,7 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
 	/* if ndob then zero 1 logical block, else fetch 1 logical block */
 	fsp = sip->storep;
 	fs1p = fsp + (block * lb_size);
+	sdeb_data_write_lock(sip);
 	if (ndob) {
 		memset(fs1p, 0, lb_size);
 		ret = 0;
@@ -3912,8 +4153,8 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
 		ret = fetch_to_dev_buffer(scp, fs1p, lb_size);
 
 	if (-1 == ret) {
-		sdeb_write_unlock(sip);
-		return DID_ERROR << 16;
+		ret = DID_ERROR << 16;
+		goto out;
 	} else if (sdebug_verbose && !ndob && (ret < lb_size))
 		sdev_printk(KERN_INFO, scp->device,
 			    "%s: %s: lb size=%u, IO sent=%d bytes\n",
@@ -3930,10 +4171,12 @@ static int resp_write_same(struct scsi_cmnd *scp, u64 lba, u32 num,
 	/* If ZBC zone then bump its write pointer */
 	if (sdebug_dev_is_zoned(devip))
 		zbc_inc_wp(devip, lba, num);
+	sdeb_data_write_unlock(sip);
+	ret = 0;
 out:
-	sdeb_write_unlock(sip);
-
-	return 0;
+	if (meta_data_locked)
+		sdeb_meta_write_unlock(sip);
+	return ret;
 }
 
 static int resp_write_same_10(struct scsi_cmnd *scp,
@@ -4076,25 +4319,30 @@ static int resp_comp_write(struct scsi_cmnd *scp,
 		return check_condition_result;
 	}
 
-	sdeb_write_lock(sip);
-
 	ret = do_dout_fetch(scp, dnum, arr);
 	if (ret == -1) {
 		retval = DID_ERROR << 16;
-		goto cleanup;
+		goto cleanup_free;
 	} else if (sdebug_verbose && (ret < (dnum * lb_size)))
 		sdev_printk(KERN_INFO, scp->device, "%s: compare_write: cdb "
 			    "indicated=%u, IO sent=%d bytes\n", my_name,
 			    dnum * lb_size, ret);
+
+	sdeb_data_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 	if (!comp_write_worker(sip, lba, num, arr, false)) {
 		mk_sense_buffer(scp, MISCOMPARE, MISCOMPARE_VERIFY_ASC, 0);
 		retval = check_condition_result;
-		goto cleanup;
+		goto cleanup_unlock;
 	}
+
+	/* Cover sip->map_storep (which map_region() sets) with the data lock */
 	if (scsi_debug_lbp())
 		map_region(sip, lba, num);
-cleanup:
-	sdeb_write_unlock(sip);
+cleanup_unlock:
+	sdeb_meta_write_unlock(sip);
+	sdeb_data_write_unlock(sip);
+cleanup_free:
 	kfree(arr);
 	return retval;
 }
@@ -4138,7 +4386,7 @@ static int resp_unmap(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 
 	desc = (void *)&buf[8];
 
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	for (i = 0 ; i < descriptors ; i++) {
 		unsigned long long lba = get_unaligned_be64(&desc[i].lba);
@@ -4154,7 +4402,7 @@ static int resp_unmap(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 	ret = 0;
 
 out:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	kfree(buf);
 
 	return ret;
@@ -4267,12 +4515,13 @@ static int resp_pre_fetch(struct scsi_cmnd *scp,
 		rest = block + nblks - sdebug_store_sectors;
 
 	/* Try to bring the PRE-FETCH range into CPU's cache */
-	sdeb_read_lock(sip);
+	sdeb_data_read_lock(sip);
 	prefetch_range(fsp + (sdebug_sector_size * block),
 		       (nblks - rest) * sdebug_sector_size);
 	if (rest)
 		prefetch_range(fsp, rest * sdebug_sector_size);
-	sdeb_read_unlock(sip);
+
+	sdeb_data_read_unlock(sip);
 fini:
 	if (cmd[1] & 0x2)
 		res = SDEG_RES_IMMED_MASK;
@@ -4431,7 +4680,7 @@ static int resp_verify(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		return check_condition_result;
 	}
 	/* Not changing store, so only need read access */
-	sdeb_read_lock(sip);
+	sdeb_data_read_lock(sip);
 
 	ret = do_dout_fetch(scp, a_num, arr);
 	if (ret == -1) {
@@ -4453,7 +4702,7 @@ static int resp_verify(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		goto cleanup;
 	}
 cleanup:
-	sdeb_read_unlock(sip);
+	sdeb_data_read_unlock(sip);
 	kfree(arr);
 	return ret;
 }
@@ -4499,7 +4748,7 @@ static int resp_report_zones(struct scsi_cmnd *scp,
 		return check_condition_result;
 	}
 
-	sdeb_read_lock(sip);
+	sdeb_meta_read_lock(sip);
 
 	desc = arr + 64;
 	for (lba = zs_lba; lba < sdebug_capacity;
@@ -4597,11 +4846,68 @@ static int resp_report_zones(struct scsi_cmnd *scp,
 	ret = fill_from_dev_buffer(scp, arr, min_t(u32, alloc_len, rep_len));
 
 fini:
-	sdeb_read_unlock(sip);
+	sdeb_meta_read_unlock(sip);
 	kfree(arr);
 	return ret;
 }
 
+static int resp_atomic_write(struct scsi_cmnd *scp,
+			     struct sdebug_dev_info *devip)
+{
+	struct sdeb_store_info *sip;
+	u8 *cmd = scp->cmnd;
+	u16 boundary, len;
+	u64 lba;
+	int ret;
+
+	if (!scsi_debug_atomic_write()) {
+		mk_sense_invalid_opcode(scp);
+		return check_condition_result;
+	}
+
+	sip = devip2sip(devip, true);
+
+	lba = get_unaligned_be64(cmd + 2);
+	boundary = get_unaligned_be16(cmd + 10);
+	len = get_unaligned_be16(cmd + 12);
+
+	if (sdebug_atomic_alignment_blks && lba % sdebug_atomic_alignment_blks) {
+		/* Does not meet alignment requirement */
+		mk_sense_buffer(scp, ILLEGAL_REQUEST, INVALID_FIELD_IN_CDB, 0);
+		return check_condition_result;
+	}
+
+	if (sdebug_atomic_granularity_blks && len % sdebug_atomic_granularity_blks) {
+		/* Does not meet granularity requirement */
+		mk_sense_buffer(scp, ILLEGAL_REQUEST, INVALID_FIELD_IN_CDB, 0);
+		return check_condition_result;
+	}
+
+	if (boundary > 0) {
+		if (boundary > sdebug_atomic_max_boundary_blks) {
+			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1);
+			return check_condition_result;
+		}
+
+		if (len > sdebug_atomic_max_size_with_boundary_blks) {
+			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1);
+			return check_condition_result;
+		}
+	} else {
+		if (len > sdebug_atomic_max_size_blks) {
+			mk_sense_invalid_fld(scp, SDEB_IN_CDB, 12, -1);
+			return check_condition_result;
+		}
+	}
+
+	ret = do_device_access(sip, scp, 0, lba, len, true, true);
+	if (unlikely(ret == -1))
+		return DID_ERROR << 16;
+	if (unlikely(ret != len * sdebug_sector_size))
+		return DID_ERROR << 16;
+	return 0;
+}
+
 /* Logic transplanted from tcmu-runner, file_zbc.c */
 static void zbc_open_all(struct sdebug_dev_info *devip)
 {
@@ -4628,8 +4934,7 @@ static int resp_open_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		mk_sense_invalid_opcode(scp);
 		return check_condition_result;
 	}
-
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	if (all) {
 		/* Check if all closed zones can be open */
@@ -4678,7 +4983,7 @@ static int resp_open_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 
 	zbc_open_zone(devip, zsp, true);
 fini:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	return res;
 }
 
@@ -4705,7 +5010,7 @@ static int resp_close_zone(struct scsi_cmnd *scp,
 		return check_condition_result;
 	}
 
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	if (all) {
 		zbc_close_all(devip);
@@ -4734,7 +5039,7 @@ static int resp_close_zone(struct scsi_cmnd *scp,
 
 	zbc_close_zone(devip, zsp);
 fini:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	return res;
 }
 
@@ -4777,7 +5082,7 @@ static int resp_finish_zone(struct scsi_cmnd *scp,
 		return check_condition_result;
 	}
 
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	if (all) {
 		zbc_finish_all(devip);
@@ -4806,7 +5111,7 @@ static int resp_finish_zone(struct scsi_cmnd *scp,
 
 	zbc_finish_zone(devip, zsp, true);
 fini:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	return res;
 }
 
@@ -4857,7 +5162,7 @@ static int resp_rwp_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 		return check_condition_result;
 	}
 
-	sdeb_write_lock(sip);
+	sdeb_meta_write_lock(sip);
 
 	if (all) {
 		zbc_rwp_all(devip);
@@ -4885,7 +5190,7 @@ static int resp_rwp_zone(struct scsi_cmnd *scp, struct sdebug_dev_info *devip)
 
 	zbc_rwp_zone(devip, zsp);
 fini:
-	sdeb_write_unlock(sip);
+	sdeb_meta_write_unlock(sip);
 	return res;
 }
 
@@ -4912,6 +5217,7 @@ static void sdebug_q_cmd_complete(struct sdebug_defer *sd_dp)
 	if (!scp) {
 		pr_err("scmd=NULL\n");
 		goto out;
+
 	}
 
 	sdsc = scsi_cmd_priv(scp);
@@ -5726,6 +6032,7 @@ module_param_named(lbprz, sdebug_lbprz, int, S_IRUGO);
 module_param_named(lbpu, sdebug_lbpu, int, S_IRUGO);
 module_param_named(lbpws, sdebug_lbpws, int, S_IRUGO);
 module_param_named(lbpws10, sdebug_lbpws10, int, S_IRUGO);
+module_param_named(atomic_write, sdebug_atomic_write, int, S_IRUGO);
 module_param_named(lowest_aligned, sdebug_lowest_aligned, int, S_IRUGO);
 module_param_named(lun_format, sdebug_lun_am_i, int, S_IRUGO | S_IWUSR);
 module_param_named(max_luns, sdebug_max_luns, int, S_IRUGO | S_IWUSR);
@@ -5760,6 +6067,11 @@ module_param_named(unmap_alignment, sdebug_unmap_alignment, int, S_IRUGO);
 module_param_named(unmap_granularity, sdebug_unmap_granularity, int, S_IRUGO);
 module_param_named(unmap_max_blocks, sdebug_unmap_max_blocks, int, S_IRUGO);
 module_param_named(unmap_max_desc, sdebug_unmap_max_desc, int, S_IRUGO);
+module_param_named(atomic_max_size_blks, sdebug_atomic_max_size_blks, int, S_IRUGO);
+module_param_named(atomic_alignment_blks, sdebug_atomic_alignment_blks, int, S_IRUGO);
+module_param_named(atomic_granularity_blks, sdebug_atomic_granularity_blks, int, S_IRUGO);
+module_param_named(atomic_max_size_with_boundary_blks, sdebug_atomic_max_size_with_boundary_blks, int, S_IRUGO);
+module_param_named(atomic_max_boundary_blks, sdebug_atomic_max_boundary_blks, int, S_IRUGO);
 module_param_named(uuid_ctl, sdebug_uuid_ctl, int, S_IRUGO);
 module_param_named(virtual_gb, sdebug_virtual_gb, int, S_IRUGO | S_IWUSR);
 module_param_named(vpd_use_hostno, sdebug_vpd_use_hostno, int,
@@ -5802,6 +6114,7 @@ MODULE_PARM_DESC(lbprz,
 MODULE_PARM_DESC(lbpu, "enable LBP, support UNMAP command (def=0)");
 MODULE_PARM_DESC(lbpws, "enable LBP, support WRITE SAME(16) with UNMAP bit (def=0)");
 MODULE_PARM_DESC(lbpws10, "enable LBP, support WRITE SAME(10) with UNMAP bit (def=0)");
+MODULE_PARM_DESC(atomic_write, "enable ATOMIC WRITE support, support WRITE ATOMIC(16) (def=1)");
 MODULE_PARM_DESC(lowest_aligned, "lowest aligned lba (def=0)");
 MODULE_PARM_DESC(lun_format, "LUN format: 0->peripheral (def); 1 --> flat address method");
 MODULE_PARM_DESC(max_luns, "number of LUNs per target to simulate(def=1)");
@@ -5833,6 +6146,11 @@ MODULE_PARM_DESC(unmap_alignment, "lowest aligned thin provisioning lba (def=0)"
 MODULE_PARM_DESC(unmap_granularity, "thin provisioning granularity in blocks (def=1)");
 MODULE_PARM_DESC(unmap_max_blocks, "max # of blocks can be unmapped in one cmd (def=0xffffffff)");
 MODULE_PARM_DESC(unmap_max_desc, "max # of ranges that can be unmapped in one cmd (def=256)");
+MODULE_PARM_DESC(atomic_max_size_blks, "max # of blocks can be atomically written in one cmd (def=0xff)");
+MODULE_PARM_DESC(atomic_alignment_blks, "minimum alignment of atomic write in blocks (def=2)");
+MODULE_PARM_DESC(atomic_granularity_blks, "minimum granularity of atomic write in blocks (def=2)");
+MODULE_PARM_DESC(atomic_max_size_with_boundary_blks, "max # of blocks can be atomically written in one cmd with boundary set (def=0xff)");
+MODULE_PARM_DESC(atomic_max_boundary_blks, "max # boundaries per atomic write (def=0)");
 MODULE_PARM_DESC(uuid_ctl,
 		 "1->use uuid for lu name, 0->don't, 2->all use same (def=0)");
 MODULE_PARM_DESC(virtual_gb, "virtual gigabyte (GiB) size (def=0 -> use dev_size_mb)");
@@ -6978,6 +7296,7 @@ static int __init scsi_debug_init(void)
 			return -EINVAL;
 		}
 	}
+
 	xa_init_flags(per_store_ap, XA_FLAGS_ALLOC | XA_FLAGS_LOCK_IRQ);
 	if (want_store) {
 		idx = sdebug_add_store();
@@ -7180,7 +7499,9 @@ static int sdebug_add_store(void)
 			map_region(sip, 0, 2);
 	}
 
-	rwlock_init(&sip->macc_lck);
+	rwlock_init(&sip->macc_data_lck);
+	rwlock_init(&sip->macc_meta_lck);
+	rwlock_init(&sip->macc_sector_lck);
 	return (int)n_idx;
 err:
 	sdebug_erase_store((int)n_idx, sip);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 21/21] nvme: Support atomic writes
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (19 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 20/21] scsi: scsi_debug: Atomic write support John Garry
@ 2023-09-29 10:27 ` John Garry
       [not found]   ` <CGME20231004113943eucas1p23a51ce5ef06c36459f826101bb7b85fc@eucas1p2.samsung.com>
  2023-11-09 15:36   ` Christoph Hellwig
  2023-09-29 14:58 ` [PATCH 00/21] block " Bart Van Assche
  21 siblings, 2 replies; 124+ messages in thread
From: John Garry @ 2023-09-29 10:27 UTC (permalink / raw)
  To: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, Alan Adamson, John Garry

From: Alan Adamson <alan.adamson@oracle.com>

Support reading atomic write registers to fill in request_queue
properties.

Use the following method to calculate the limits:
atomic_write_max_bytes = flp2(NAWUPF ?: AWUPF)
atomic_write_unit_min = logical_block_size
atomic_write_unit_max = flp2(NAWUPF ?: AWUPF)
atomic_write_boundary = NABSPF
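
As a worked example (made-up values): with 512 B logical blocks, a
namespace reporting NAWUPF = 63 (a 0's based value, i.e. 64 blocks)
gives atomic_write_unit_max = flp2(64 blocks) = 32 KiB,
atomic_write_unit_min = 512 B, and NABSPF = 0 means no boundary is
reported.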

Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 drivers/nvme/host/core.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 21783aa2ee8e..aa0daacf4d7c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1926,6 +1926,35 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	blk_queue_io_min(disk->queue, phys_bs);
 	blk_queue_io_opt(disk->queue, io_opt);
 
+	atomic_bs = rounddown_pow_of_two(atomic_bs);
+	if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf) {
+		if (id->nabo) {
+			dev_err(ns->ctrl->device, "Support atomic NABO=%x\n",
+				id->nabo);
+		} else {
+			u32 boundary = 0;
+
+			if (le16_to_cpu(id->nabspf))
+				boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
+
+			if (is_power_of_2(boundary) || !boundary) {
+				blk_queue_atomic_write_max_bytes(disk->queue, atomic_bs);
+				blk_queue_atomic_write_unit_min_sectors(disk->queue, 1);
+				blk_queue_atomic_write_unit_max_sectors(disk->queue,
+									atomic_bs / bs);
+				blk_queue_atomic_write_boundary_bytes(disk->queue, boundary);
+			} else {
+				dev_err(ns->ctrl->device, "Unsupported atomic boundary=0x%x\n",
+					boundary);
+			}
+		}
+	} else if (ns->ctrl->subsys->awupf) {
+		blk_queue_atomic_write_max_bytes(disk->queue, atomic_bs);
+		blk_queue_atomic_write_unit_min_sectors(disk->queue, 1);
+		blk_queue_atomic_write_unit_max_sectors(disk->queue, atomic_bs / bs);
+		blk_queue_atomic_write_boundary_bytes(disk->queue, 0);
+	}
+
 	/*
 	 * Register a metadata profile for PI, or the plain non-integrity NVMe
 	 * metadata masquerading as Type 0 if supported, otherwise reject block
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [PATCH 00/21] block atomic writes
  2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
                   ` (20 preceding siblings ...)
  2023-09-29 10:27 ` [PATCH 21/21] nvme: Support atomic writes John Garry
@ 2023-09-29 14:58 ` Bart Van Assche
  21 siblings, 0 replies; 124+ messages in thread
From: Bart Van Assche @ 2023-09-29 14:58 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 9/29/23 03:27, John Garry wrote:
> The atomic writes feature requires dedicated HW support, like
> SCSI WRITE_ATOMIC_16 command.

This is not correct. Log-structured filesystems can implement atomic
writes without support for atomic writes in the block device(s) used
by the filesystem. See also the F2FS_IOC_*_ATOMIC_WRITE ioctls. This
being said, I hope that atomic write support will be added in the
block layer and also that a single interface will be supported by all
filesystems.
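
(For reference, the f2fs userspace flow is roughly the following -
sketch only, error handling omitted:

	ioctl(fd, F2FS_IOC_START_ATOMIC_WRITE);
	/* ... buffered writes to fd ... */
	ioctl(fd, F2FS_IOC_COMMIT_ATOMIC_WRITE);	/* all-or-nothing */

so the atomicity there is provided by the filesystem log rather than by
the device.)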

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-09-29 10:27 ` [PATCH 10/21] block: Add fops atomic write support John Garry
@ 2023-09-29 17:51   ` Bart Van Assche
  2023-10-02 10:10     ` John Garry
  2023-12-04  2:30   ` Ming Lei
  1 sibling, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-09-29 17:51 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 9/29/23 03:27, John Garry wrote:
> +	if (pos % atomic_write_unit_min_bytes)
> +		return false;
> +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> +		return false;
> +	if (!is_power_of_2(iov_iter_count(iter)))
> +		return false;
[ ... ]
> +	if (pos % iov_iter_count(iter))
> +		return false;

Where do these rules come from? Is there any standard that requires
any of the above?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 18/21] scsi: sd: Support reading atomic properties from block limits VPD
  2023-09-29 10:27 ` [PATCH 18/21] scsi: sd: Support reading atomic properties from block limits VPD John Garry
@ 2023-09-29 17:54   ` Bart Van Assche
  2023-10-02 11:27     ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-09-29 17:54 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 9/29/23 03:27, John Garry wrote:
> +static void sd_config_atomic(struct scsi_disk *sdkp)
> +{
> +	unsigned int logical_block_size = sdkp->device->sector_size;
> +	struct request_queue *q = sdkp->disk->queue;
> +
> +	if (sdkp->max_atomic) {

Please use the "return early" style here to keep the indentation
level in this function low.

> +		unsigned int max_atomic = max_t(unsigned int,
> +			rounddown_pow_of_two(sdkp->max_atomic),
> +			rounddown_pow_of_two(sdkp->max_atomic_with_boundary));
> +		unsigned int unit_min = sdkp->atomic_granularity ?
> +			rounddown_pow_of_two(sdkp->atomic_granularity) :
> +			physical_block_size_sectors;
> +		unsigned int unit_max = max_atomic;
> +
> +		if (sdkp->max_atomic_boundary)
> +			unit_max = min_t(unsigned int, unit_max,
> +				rounddown_pow_of_two(sdkp->max_atomic_boundary));

Why does "rounddown_pow_of_two()" occur in the above code?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 19/21] scsi: sd: Add WRITE_ATOMIC_16 support
  2023-09-29 10:27 ` [PATCH 19/21] scsi: sd: Add WRITE_ATOMIC_16 support John Garry
@ 2023-09-29 17:59   ` Bart Van Assche
  2023-10-02 11:36     ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-09-29 17:59 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 9/29/23 03:27, John Garry wrote:
> +static blk_status_t sd_setup_atomic_cmnd(struct scsi_cmnd *cmd,
> +					sector_t lba, unsigned int nr_blocks,
> +					unsigned char flags)
> +{
> +	cmd->cmd_len  = 16;
> +	cmd->cmnd[0]  = WRITE_ATOMIC_16;
> +	cmd->cmnd[1]  = flags;
> +	put_unaligned_be64(lba, &cmd->cmnd[2]);
> +	cmd->cmnd[10] = 0;
> +	cmd->cmnd[11] = 0;
> +	put_unaligned_be16(nr_blocks, &cmd->cmnd[12]);
> +	cmd->cmnd[14] = 0;
> +	cmd->cmnd[15] = 0;
> +
> +	return BLK_STS_OK;
> +}

Please store the 'dld' value in the GROUP NUMBER field. See e.g.
sd_setup_rw16_cmnd().

> @@ -1139,6 +1156,7 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>   	unsigned int nr_blocks = sectors_to_logical(sdp, blk_rq_sectors(rq));
>   	unsigned int mask = logical_to_sectors(sdp, 1) - 1;
>   	bool write = rq_data_dir(rq) == WRITE;
> +	bool atomic_write = !!(rq->cmd_flags & REQ_ATOMIC) && write;

Please leave out the superfluous "!!".

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 03/21] fs/bdev: Add atomic write support info to statx
  2023-09-29 10:27 ` [PATCH 03/21] fs/bdev: Add atomic write support info to statx John Garry
@ 2023-09-29 22:49   ` Eric Biggers
  2023-10-01 13:23     ` Bart Van Assche
  0 siblings, 1 reply; 124+ messages in thread
From: Eric Biggers @ 2023-09-29 22:49 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Prasad Singamsetty

On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote:
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 7cab2c65d3d7..c99d7cac2aa6 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -127,7 +127,10 @@ struct statx {
>  	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
>  	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
>  	/* 0xa0 */
> -	__u64	__spare3[12];	/* Spare space for future expansion */
> +	__u32	stx_atomic_write_unit_max;
> +	__u32	stx_atomic_write_unit_min;

Maybe min first and then max?  That seems a bit more natural, and a lot of the
code you've written handles them in that order.

> +#define STATX_ATTR_WRITE_ATOMIC		0x00400000 /* File supports atomic write operations */

How would this differ from stx_atomic_write_unit_min != 0?

- Eric

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 09/21] block: Add checks to merging of atomic writes
  2023-09-29 10:27 ` [PATCH 09/21] block: Add checks to merging of atomic writes John Garry
@ 2023-09-30 13:40   ` kernel test robot
  2023-10-02 22:50     ` Nathan Chancellor
  0 siblings, 1 reply; 124+ messages in thread
From: kernel test robot @ 2023-09-30 13:40 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: llvm, oe-kbuild-all, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-api, John Garry

Hi John,

kernel test robot noticed the following build errors:

[auto build test ERROR on xfs-linux/for-next]
[also build test ERROR on axboe-block/for-next mkp-scsi/for-next jejb-scsi/for-next linus/master v6.6-rc3 next-20230929]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/John-Garry/block-Add-atomic-write-operations-to-request_queue-limits/20230929-184542
base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
patch link:    https://lore.kernel.org/r/20230929102726.2985188-10-john.g.garry%40oracle.com
patch subject: [PATCH 09/21] block: Add checks to merging of atomic writes
config: mips-mtx1_defconfig (https://download.01.org/0day-ci/archive/20230930/202309302100.L6ynQWub-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230930/202309302100.L6ynQWub-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309302100.L6ynQWub-lkp@intel.com/

All errors (new ones prefixed by >>):

>> ld.lld: error: undefined symbol: __moddi3
   >>> referenced by blk-merge.c
   >>>               block/blk-merge.o:(ll_back_merge_fn) in archive vmlinux.a
   >>> referenced by blk-merge.c
   >>>               block/blk-merge.o:(ll_back_merge_fn) in archive vmlinux.a
   >>> referenced by blk-merge.c
   >>>               block/blk-merge.o:(bio_attempt_front_merge) in archive vmlinux.a
   >>> referenced 3 more times

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 03/21] fs/bdev: Add atomic write support info to statx
  2023-09-29 22:49   ` Eric Biggers
@ 2023-10-01 13:23     ` Bart Van Assche
  2023-10-02  9:51       ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-01 13:23 UTC (permalink / raw)
  To: Eric Biggers, John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Prasad Singamsetty

On 9/29/23 15:49, Eric Biggers wrote:
> On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote:
>> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
>> index 7cab2c65d3d7..c99d7cac2aa6 100644
>> --- a/include/uapi/linux/stat.h
>> +++ b/include/uapi/linux/stat.h
>> @@ -127,7 +127,10 @@ struct statx {
>>   	__u32	stx_dio_mem_align;	/* Memory buffer alignment for direct I/O */
>>   	__u32	stx_dio_offset_align;	/* File offset alignment for direct I/O */
>>   	/* 0xa0 */
>> -	__u64	__spare3[12];	/* Spare space for future expansion */
>> +	__u32	stx_atomic_write_unit_max;
>> +	__u32	stx_atomic_write_unit_min;
> 
> Maybe min first and then max?  That seems a bit more natural, and a lot of the
> code you've written handle them in that order.
> 
>> +#define STATX_ATTR_WRITE_ATOMIC		0x00400000 /* File supports atomic write operations */
> 
> How would this differ from stx_atomic_write_unit_min != 0?

Is it even possible that stx_atomic_write_unit_min == 0? My understanding
is that all Linux filesystems rely on the assumption that writing a single
logical block either succeeds or does not happen, even if a power failure
occurs between writing and reading a logical block.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 03/21] fs/bdev: Add atomic write support info to statx
  2023-10-01 13:23     ` Bart Van Assche
@ 2023-10-02  9:51       ` John Garry
  2023-10-02 18:39         ` Bart Van Assche
  2023-10-03  1:51         ` Dave Chinner
  0 siblings, 2 replies; 124+ messages in thread
From: John Garry @ 2023-10-02  9:51 UTC (permalink / raw)
  To: Bart Van Assche, Eric Biggers
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Prasad Singamsetty

On 01/10/2023 14:23, Bart Van Assche wrote:
> On 9/29/23 15:49, Eric Biggers wrote:
>> On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote:
>>> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
>>> index 7cab2c65d3d7..c99d7cac2aa6 100644
>>> --- a/include/uapi/linux/stat.h
>>> +++ b/include/uapi/linux/stat.h
>>> @@ -127,7 +127,10 @@ struct statx {
>>>       __u32    stx_dio_mem_align;    /* Memory buffer alignment for 
>>> direct I/O */
>>>       __u32    stx_dio_offset_align;    /* File offset alignment for 
>>> direct I/O */
>>>       /* 0xa0 */
>>> -    __u64    __spare3[12];    /* Spare space for future expansion */
>>> +    __u32    stx_atomic_write_unit_max;
>>> +    __u32    stx_atomic_write_unit_min;
>>
>> Maybe min first and then max?  That seems a bit more natural, and a 
>> lot of the
>> code you've written handle them in that order.

ok, I think it's fine to reorder

>>
>>> +#define STATX_ATTR_WRITE_ATOMIC        0x00400000 /* File supports 
>>> atomic write operations */
>>
>> How would this differ from stx_atomic_write_unit_min != 0?

Yeah, I suppose that we can just not set this for the case of 
stx_atomic_write_unit_min == 0.

> 
> Is it even possible that stx_atomic_write_unit_min == 0? My understanding
> is that all Linux filesystems rely on the assumption that writing a single
> logical block either succeeds or does not happen, even if a power failure
> occurs between writing and reading a logical block.
> 

Maybe they do rely on this, but is it particularly interesting?

BTW, I would not like to provide assurances that every storage medium
produced writes logical blocks atomically.

Thanks,
John


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-09-29 17:51   ` Bart Van Assche
@ 2023-10-02 10:10     ` John Garry
  2023-10-02 19:12       ` Bart Van Assche
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-10-02 10:10 UTC (permalink / raw)
  To: Bart Van Assche, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 29/09/2023 18:51, Bart Van Assche wrote:
> On 9/29/23 03:27, John Garry wrote:
>> +    if (pos % atomic_write_unit_min_bytes)
>> +        return false;
>> +    if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
>> +        return false;
>> +    if (!is_power_of_2(iov_iter_count(iter)))
>> +        return false;
> [ ... ]
>> +    if (pos % iov_iter_count(iter))
>> +        return false;
> 
> Where do these rules come from? Is there any standard that requires
> any of the above?

SCSI and NVMe have slightly different atomic write semantics, and the
rules are created to work for both.

In addition, the rules are related to FS extent alignment.

Note that for simplicity and consistency we use the same rules for 
regular files as for bdev's.

This is the code for the rules and where they come from:

 > +	if (!atomic_write_unit_min_bytes)
 > +		return false;

If atomic_write_unit_min_bytes == 0, then we just don't support atomic 
writes.

 > +	if (pos % atomic_write_unit_min_bytes)
 > +		return false;

See later rules.

 > +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
 > +		return false;

For SCSI, there is an atomic write granularity, which dictates 
atomic_write_unit_min_bytes. So here we need to ensure that the length 
is a multiple of this value.

 > +	if (!is_power_of_2(iov_iter_count(iter)))
 > +		return false;

This rule comes from FS block alignment and NVMe atomic boundary.

FSes (XFS) have discontiguous extents. We need to ensure that an atomic 
write does not cross discontiguous extents. To do this we ensure extent 
length and alignment and limit atomic_write_unit_max_bytes to that.

For NVMe, an atomic write boundary is a boundary in LBA space which an 
atomic write should not cross. We limit atomic_write_unit_max_bytes such 
that it is evenly divisible into this atomic write boundary.

To ensure that the write does not cross these alignment boundaries we 
say that it must be naturally aligned and a power-of-2 in length.

We may be able to relax this rule but I am not sure it buys us anything 
- typically we want to be writing a 64KB block aligned to 64KB, for example.

 > +	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
 > +		return false;

We just can't exceed this length.

 > +	if (pos % iov_iter_count(iter))
 > +		return false;

As above, ensure naturally aligned.
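
Pulling those together, the checks amount to roughly the following
(illustrative sketch only, not the exact patch code; the helper name is
invented for this mail):

	static bool atomic_write_valid(loff_t pos, struct iov_iter *iter,
				       unsigned int unit_min_bytes,
				       unsigned int unit_max_bytes)
	{
		size_t len = iov_iter_count(iter);

		if (!unit_min_bytes)		/* no atomic write support */
			return false;
		if (pos % unit_min_bytes || len % unit_min_bytes)
			return false;
		if (!is_power_of_2(len))	/* don't cross extents/boundaries */
			return false;
		if (len > unit_max_bytes)
			return false;
		if (pos % len)			/* naturally aligned */
			return false;
		return true;
	}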

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 18/21] scsi: sd: Support reading atomic properties from block limits VPD
  2023-09-29 17:54   ` Bart Van Assche
@ 2023-10-02 11:27     ` John Garry
  2023-10-06 17:52       ` Bart Van Assche
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-10-02 11:27 UTC (permalink / raw)
  To: Bart Van Assche, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 29/09/2023 18:54, Bart Van Assche wrote:
> On 9/29/23 03:27, John Garry wrote:
>> +static void sd_config_atomic(struct scsi_disk *sdkp)
>> +{
>> +    unsigned int logical_block_size = sdkp->device->sector_size;
>> +    struct request_queue *q = sdkp->disk->queue;
>> +
>> +    if (sdkp->max_atomic) {
> 
> Please use the "return early" style here to keep the indentation
> level in this function low.

ok, fine.

> 
>> +        unsigned int max_atomic = max_t(unsigned int,
>> +            rounddown_pow_of_two(sdkp->max_atomic),
>> +            rounddown_pow_of_two(sdkp->max_atomic_with_boundary));
>> +        unsigned int unit_min = sdkp->atomic_granularity ?
>> +            rounddown_pow_of_two(sdkp->atomic_granularity) :
>> +            physical_block_size_sectors;
>> +        unsigned int unit_max = max_atomic;
>> +
>> +        if (sdkp->max_atomic_boundary)
>> +            unit_max = min_t(unsigned int, unit_max,
>> +                rounddown_pow_of_two(sdkp->max_atomic_boundary));
> 
> Why does "rounddown_pow_of_two()" occur in the above code?

I assume that you are talking about all of the code above that calculates
the atomic write values for the device.

The reason is that atomic write unit min and max are always a power-of-2
- see the rules described earlier - and that is why we round down to a
power-of-2.

Thanks,
John


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 19/21] scsi: sd: Add WRITE_ATOMIC_16 support
  2023-09-29 17:59   ` Bart Van Assche
@ 2023-10-02 11:36     ` John Garry
  2023-10-02 19:21       ` Bart Van Assche
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-10-02 11:36 UTC (permalink / raw)
  To: Bart Van Assche, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 29/09/2023 18:59, Bart Van Assche wrote:
> On 9/29/23 03:27, John Garry wrote:
>> +static blk_status_t sd_setup_atomic_cmnd(struct scsi_cmnd *cmd,
>> +                    sector_t lba, unsigned int nr_blocks,
>> +                    unsigned char flags)
>> +{
>> +    cmd->cmd_len  = 16;
>> +    cmd->cmnd[0]  = WRITE_ATOMIC_16;
>> +    cmd->cmnd[1]  = flags;
>> +    put_unaligned_be64(lba, &cmd->cmnd[2]);
>> +    cmd->cmnd[10] = 0;
>> +    cmd->cmnd[11] = 0;
>> +    put_unaligned_be16(nr_blocks, &cmd->cmnd[12]);
>> +    cmd->cmnd[14] = 0;
>> +    cmd->cmnd[15] = 0;
>> +
>> +    return BLK_STS_OK;
>> +}
> 
> Please store the 'dld' value in the GROUP NUMBER field. See e.g.
> sd_setup_rw16_cmnd().

Are you sure that WRITE ATOMIC (16) supports dld?

> 
>> @@ -1139,6 +1156,7 @@ static blk_status_t 
>> sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>>       unsigned int nr_blocks = sectors_to_logical(sdp, 
>> blk_rq_sectors(rq));
>>       unsigned int mask = logical_to_sectors(sdp, 1) - 1;
>>       bool write = rq_data_dir(rq) == WRITE;
>> +    bool atomic_write = !!(rq->cmd_flags & REQ_ATOMIC) && write;
> 
> Please leave out the superfluous "!!".

ok, fine.

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 03/21] fs/bdev: Add atomic write support info to statx
  2023-10-02  9:51       ` John Garry
@ 2023-10-02 18:39         ` Bart Van Assche
  2023-10-03  0:28           ` Martin K. Petersen
  2023-10-03  1:51         ` Dave Chinner
  1 sibling, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-02 18:39 UTC (permalink / raw)
  To: John Garry, Eric Biggers
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Prasad Singamsetty

On 10/2/23 02:51, John Garry wrote:
> On 01/10/2023 14:23, Bart Van Assche wrote:
>> Is it even possible that stx_atomic_write_unit_min == 0? My 
>> understanding is that all Linux filesystems rely on the assumption 
>> that writing a single logical block either succeeds or does not 
>> happen, even if a power failure occurs between writing and reading 
>> a logical block.
> 
> Maybe they do rely on this, but is it particularly interesting?
> 
> BTW, I would not like to provide assurances that every storage media 
> produced writes logical blocks atomically.

Neither the SCSI SBC standard nor the NVMe standard defines a "minimum
atomic write unit". So why to introduce something in the Linux kernel
that is not defined in common storage standards?

I propose to leave out stx_atomic_write_unit_min from
struct statx and also to leave out atomic_write_unit_min_sectors from
struct queue_limits. My opinion is that we should not support block
devices in the Linux kernel that do not write logical blocks atomically.
Block devices that do not write logical blocks atomically are not
compatible with Linux kernel journaling filesystems. Additionally, I'm
not sure it's even possible to write a journaling filesystem for such 
block devices.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-02 10:10     ` John Garry
@ 2023-10-02 19:12       ` Bart Van Assche
  2023-10-03  0:48         ` Martin K. Petersen
  2023-10-03  8:37         ` John Garry
  0 siblings, 2 replies; 124+ messages in thread
From: Bart Van Assche @ 2023-10-02 19:12 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 10/2/23 03:10, John Garry wrote:
> On 29/09/2023 18:51, Bart Van Assche wrote:
>> On 9/29/23 03:27, John Garry wrote:
>  > +    if (pos % atomic_write_unit_min_bytes)
>  > +        return false;
> 
> See later rules.

Is atomic_write_unit_min_bytes always equal to the logical block size?
If so, can the above test be left out?

>  > +    if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
>  > +        return false;
> 
> For SCSI, there is an atomic write granularity, which dictates 
> atomic_write_unit_min_bytes. So here we need to ensure that the length 
> is a multiple of this value.

Are there any SCSI devices that we care about that report an ATOMIC 
TRANSFER LENGTH GRANULARITY that is larger than a single logical block?
I'm wondering whether we really have to support such devices.

>  > +    if (!is_power_of_2(iov_iter_count(iter)))
>  > +        return false;
> 
> This rule comes from FS block alignment and NVMe atomic boundary.
> 
> FSes (XFS) have discontiguous extents. We need to ensure that an atomic 
> write does not cross discontiguous extents. To do this we ensure extent 
> length and alignment and limit atomic_write_unit_max_bytes to that.
> 
> For NVMe, an atomic write boundary is a boundary in LBA space which an 
> atomic write should not cross. We limit atomic_write_unit_max_bytes such 
> that it is evenly divisible into this atomic write boundary.
> 
> To ensure that the write does not cross these alignment boundaries we 
> say that it must be naturally aligned and a power-of-2 in length.
> 
> We may be able to relax this rule but I am not sure it buys us anything 
> - typically we want to be writing a 64KB block aligned to 64KB, for 
> example.

It seems to me that the requirement is_power_of_2(iov_iter_count(iter))
is necessary for some filesystems but not for all filesystems. 
Restrictions that are specific to a single filesystem (XFS) should not 
occur in code that is intended to be used by all filesystems 
(blkdev_atomic_write_valid()).

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 19/21] scsi: sd: Add WRITE_ATOMIC_16 support
  2023-10-02 11:36     ` John Garry
@ 2023-10-02 19:21       ` Bart Van Assche
  0 siblings, 0 replies; 124+ messages in thread
From: Bart Van Assche @ 2023-10-02 19:21 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 10/2/23 04:36, John Garry wrote:
> On 29/09/2023 18:59, Bart Van Assche wrote:
>> On 9/29/23 03:27, John Garry wrote:
>>> +static blk_status_t sd_setup_atomic_cmnd(struct scsi_cmnd *cmd,
>>> +                    sector_t lba, unsigned int nr_blocks,
>>> +                    unsigned char flags)
>>> +{
>>> +    cmd->cmd_len  = 16;
>>> +    cmd->cmnd[0]  = WRITE_ATOMIC_16;
>>> +    cmd->cmnd[1]  = flags;
>>> +    put_unaligned_be64(lba, &cmd->cmnd[2]);
>>> +    cmd->cmnd[10] = 0;
>>> +    cmd->cmnd[11] = 0;
>>> +    put_unaligned_be16(nr_blocks, &cmd->cmnd[12]);
>>> +    cmd->cmnd[14] = 0;
>>> +    cmd->cmnd[15] = 0;
>>> +
>>> +    return BLK_STS_OK;
>>> +}
>>
>> Please store the 'dld' value in the GROUP NUMBER field. See e.g.
>> sd_setup_rw16_cmnd().
> 
> Are you sure that WRITE ATOMIC (16) supports dld?

Hi John,

I was assuming that DLD would be supported by the WRITE ATOMIC(16) 
command. After having taken another look at the latest SBC-5 draft
I see that the DLD2/DLD1/DLD0 bits are not present in the WRITE 
ATOMIC(16) command. So please ignore my comment above.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 09/21] block: Add checks to merging of atomic writes
  2023-09-30 13:40   ` kernel test robot
@ 2023-10-02 22:50     ` Nathan Chancellor
  2023-10-04 11:40       ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Nathan Chancellor @ 2023-10-02 22:50 UTC (permalink / raw)
  To: kernel test robot
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, llvm,
	oe-kbuild-all, linux-block, linux-kernel, linux-nvme, linux-xfs,
	linux-fsdevel, tytso, jbongio, linux-api

On Sat, Sep 30, 2023 at 09:40:30PM +0800, kernel test robot wrote:
> Hi John,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on xfs-linux/for-next]
> [also build test ERROR on axboe-block/for-next mkp-scsi/for-next jejb-scsi/for-next linus/master v6.6-rc3 next-20230929]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/John-Garry/block-Add-atomic-write-operations-to-request_queue-limits/20230929-184542
> base:   https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git for-next
> patch link:    https://lore.kernel.org/r/20230929102726.2985188-10-john.g.garry%40oracle.com
> patch subject: [PATCH 09/21] block: Add checks to merging of atomic writes
> config: mips-mtx1_defconfig (https://download.01.org/0day-ci/archive/20230930/202309302100.L6ynQWub-lkp@intel.com/config)
> compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230930/202309302100.L6ynQWub-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202309302100.L6ynQWub-lkp@intel.com/
> 
> All errors (new ones prefixed by >>):
> 
> >> ld.lld: error: undefined symbol: __moddi3
>    >>> referenced by blk-merge.c
>    >>>               block/blk-merge.o:(ll_back_merge_fn) in archive vmlinux.a
>    >>> referenced by blk-merge.c
>    >>>               block/blk-merge.o:(ll_back_merge_fn) in archive vmlinux.a
>    >>> referenced by blk-merge.c
>    >>>               block/blk-merge.o:(bio_attempt_front_merge) in archive vmlinux.a
>    >>> referenced 3 more times

This does not appear to be clang specific, I can reproduce it with GCC
12.3.0 and the same configuration target.
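
FWIW, __moddi3 is the libgcc helper for a 64-bit '%' on 32-bit targets,
so this is presumably one of the new "offset % length" checks operating
on 64-bit values. Since the atomic write lengths are powers of two, a
mask would avoid the helper, e.g. (sketch, names are placeholders):

	if (!IS_ALIGNED(offset, len))	/* instead of "if (offset % len)" */
		return false;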

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 03/21] fs/bdev: Add atomic write support info to statx
  2023-10-02 18:39         ` Bart Van Assche
@ 2023-10-03  0:28           ` Martin K. Petersen
  2023-11-09 15:15             ` Christoph Hellwig
  0 siblings, 1 reply; 124+ messages in thread
From: Martin K. Petersen @ 2023-10-03  0:28 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: John Garry, Eric Biggers, axboe, kbusch, hch, sagi, jejb,
	martin.petersen, djwong, viro, brauner, chandan.babu, dchinner,
	linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, Prasad Singamsetty


Bart,

> Neither the SCSI SBC standard nor the NVMe standard defines a "minimum
> atomic write unit". So why to introduce something in the Linux kernel
> that is not defined in common storage standards?

From SBC-5:

"The ATOMIC TRANSFER LENGTH GRANULARITY field indicates the minimum
transfer length for an atomic write command."

> I propose to leave out stx_atomic_write_unit_min from
> struct statx and also to leave out atomic_write_unit_min_sectors from
> struct queue_limits. My opinion is that we should not support block
> devices in the Linux kernel that do not write logical blocks atomically.

The statx values exist to describe the limits for I/Os sent using
RWF_ATOMIC and IOCB_ATOMIC. These limits may be different from other
reported values such as the filesystem block size and the logical block
size of the underlying device.
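
For example, a caller that has checked that 16 KiB fits within the
advertised unit min/max limits might issue (sketch only, values made up):

	struct iovec iov = {
		.iov_base = buf,	/* 16 KiB application block */
		.iov_len  = 16 * 1024,
	};

	/* offset must be naturally aligned to the (power-of-2) length */
	ret = pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);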

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-02 19:12       ` Bart Van Assche
@ 2023-10-03  0:48         ` Martin K. Petersen
  2023-10-03 16:55           ` Bart Van Assche
  2023-10-03  8:37         ` John Garry
  1 sibling, 1 reply; 124+ messages in thread
From: Martin K. Petersen @ 2023-10-03  0:48 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api


Bart,

> Are there any SCSI devices that we care about that report an ATOMIC
> TRANSFER LENGTH GRANULARITY that is larger than a single logical
> block?

Yes.

Note that the code path used inside a storage device to guarantee atomicity
of an entire I/O may be substantially different from the code path which
only offers an incremental guarantee at a single logical or physical
block level (to the extent that those guarantees are offered at all but
that's a different kettle of fish).

> I'm wondering whether we really have to support such devices.

Yes.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 11/21] fs: xfs: Don't use low-space allocator for alignment > 1
  2023-09-29 10:27 ` [PATCH 11/21] fs: xfs: Don't use low-space allocator for alignment > 1 John Garry
@ 2023-10-03  1:16   ` Dave Chinner
  2023-10-03  3:00     ` Darrick J. Wong
  0 siblings, 1 reply; 124+ messages in thread
From: Dave Chinner @ 2023-10-03  1:16 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Fri, Sep 29, 2023 at 10:27:16AM +0000, John Garry wrote:
> The low-space allocator doesn't honour the alignment requirement, so don't
> attempt to even use it (when we have an alignment requirement).
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 30c931b38853..328134c22104 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3569,6 +3569,10 @@ xfs_bmap_btalloc_low_space(
>  {
>  	int			error;
>  
> +	/* The allocator doesn't honour args->alignment */
> +	if (args->alignment > 1)
> +		return 0;
> +

How does this happen?

The earlier failing aligned allocations will clear alignment before
we get here....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 13/21] fs: xfs: Make file data allocations observe the 'forcealign' flag
  2023-09-29 10:27 ` [PATCH 13/21] fs: xfs: Make file data allocations observe the 'forcealign' flag John Garry
@ 2023-10-03  1:42   ` Dave Chinner
  2023-10-03 10:13     ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Dave Chinner @ 2023-10-03  1:42 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Fri, Sep 29, 2023 at 10:27:18AM +0000, John Garry wrote:
> From: "Darrick J. Wong" <djwong@kernel.org>
> 
> The existing extsize hint code already did the work of expanding file
> range mapping requests so that the range is aligned to the hint value.
> Now add the code we need to guarantee that the space allocations are
> also always aligned.
> 
> XXX: still need to check all this with reflink
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Co-developed-by: John Garry <john.g.garry@oracle.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
>  fs/xfs/xfs_iomap.c       |  4 +++-
>  2 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 328134c22104..6c864dc0a6ff 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3328,6 +3328,19 @@ xfs_bmap_compute_alignments(
>  		align = xfs_get_cowextsz_hint(ap->ip);
>  	else if (ap->datatype & XFS_ALLOC_USERDATA)
>  		align = xfs_get_extsz_hint(ap->ip);
> +
> +	/*
> +	 * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
> +	 * set as forcealign and cowextsz_hint are mutually exclusive
> +	 */
> +	if (xfs_inode_forcealign(ap->ip) && align) {
> +		args->alignment = align;
> +		if (stripe_align % align)
> +			stripe_align = align;
> +	} else {
> +		args->alignment = 1;
> +	}

This smells wrong.

If a filesystem has a stripe unit set (hence stripe_align is
non-zero) then any IO that crosses stripe unit boundaries will not
be atomic - they will require multiple IOs to different devices.

Hence if the filesystem has a stripe unit set, then all forced
alignment hints for atomic IO *must* be an exact integer divisor
of the stripe unit. Hence, when an atomic IO bundle is aligned, the
atomic boundaries within the bundle always fall on a stripe unit
boundary and never cross devices.

IOWs, for a striped filesystem, the maximum size/alignment for a
single atomic IO unit is the stripe unit.

This should be enforced when the forced align flag is set on the
inode (i.e. from the ioctl).
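
Something like this at that point (sketch only):

	/*
	 * Atomic writes via forcealign require the extent size hint to
	 * divide evenly into the stripe unit.
	 */
	if (mp->m_dalign && extsz_hint && (mp->m_dalign % extsz_hint))
		return -EINVAL;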


> +
>  	if (align) {
>  		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>  					ap->eof, 0, ap->conv, &ap->offset,
> @@ -3423,7 +3436,6 @@ xfs_bmap_exact_minlen_extent_alloc(
>  	args.minlen = args.maxlen = ap->minlen;
>  	args.total = ap->total;
>  
> -	args.alignment = 1;
>  	args.minalignslop = 0;
>  
>  	args.minleft = ap->minleft;
> @@ -3469,6 +3481,7 @@ xfs_bmap_btalloc_at_eof(
>  {
>  	struct xfs_mount	*mp = args->mp;
>  	struct xfs_perag	*caller_pag = args->pag;
> +	int			orig_alignment = args->alignment;
>  	int			error;
>  
>  	/*
> @@ -3543,10 +3556,10 @@ xfs_bmap_btalloc_at_eof(
>  
>  	/*
>  	 * Allocation failed, so turn return the allocation args to their
> -	 * original non-aligned state so the caller can proceed on allocation
> -	 * failure as if this function was never called.
> +	 * original state so the caller can proceed on allocation failure as
> +	 * if this function was never called.
>  	 */
> -	args->alignment = 1;
> +	args->alignment = orig_alignment;
>  	return 0;
>  }

Urk. Not sure that is right, it's certainly a change of behaviour.

> @@ -3694,7 +3707,6 @@ xfs_bmap_btalloc(
>  		.wasdel		= ap->wasdel,
>  		.resv		= XFS_AG_RESV_NONE,
>  		.datatype	= ap->datatype,
> -		.alignment	= 1,
>  		.minalignslop	= 0,
>  	};
>  	xfs_fileoff_t		orig_offset;
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 18c8f168b153..70fe873951f3 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -181,7 +181,9 @@ xfs_eof_alignment(
>  		 * If mounted with the "-o swalloc" option the alignment is
>  		 * increased from the strip unit size to the stripe width.
>  		 */
> -		if (mp->m_swidth && xfs_has_swalloc(mp))
> +		if (xfs_inode_forcealign(ip))
> +			align = xfs_get_extsz_hint(ip);
> +		else if (mp->m_swidth && xfs_has_swalloc(mp))
>  			align = mp->m_swidth;
>  		else if (mp->m_dalign)
>  			align = mp->m_dalign;

Ah. Now I see. This abuses the stripe alignment code to try to
implement this new inode allocation alignment restriction, rather
than just making the extent size hint alignment mandatory....

Yeah, this can be done better... :)

As it is, I have been working on a series that reworks all this
allocator code to separate out the aligned IO from the exact EOF
allocation case to help clean this up for better perag selection
during allocation. I think that needs to be done first before we go
making the alignment code more intricate like this....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 03/21] fs/bdev: Add atomic write support info to statx
  2023-10-02  9:51       ` John Garry
  2023-10-02 18:39         ` Bart Van Assche
@ 2023-10-03  1:51         ` Dave Chinner
  2023-10-03  2:57           ` Darrick J. Wong
  1 sibling, 1 reply; 124+ messages in thread
From: Dave Chinner @ 2023-10-03  1:51 UTC (permalink / raw)
  To: John Garry
  Cc: Bart Van Assche, Eric Biggers, axboe, kbusch, hch, sagi, jejb,
	martin.petersen, djwong, viro, brauner, chandan.babu, dchinner,
	linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, Prasad Singamsetty

On Mon, Oct 02, 2023 at 10:51:36AM +0100, John Garry wrote:
> On 01/10/2023 14:23, Bart Van Assche wrote:
> > On 9/29/23 15:49, Eric Biggers wrote:
> > > On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote:
> > > > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > > > index 7cab2c65d3d7..c99d7cac2aa6 100644
> > > > --- a/include/uapi/linux/stat.h
> > > > +++ b/include/uapi/linux/stat.h
> > > > @@ -127,7 +127,10 @@ struct statx {
> > > >       __u32    stx_dio_mem_align;    /* Memory buffer alignment
> > > > for direct I/O */
> > > >       __u32    stx_dio_offset_align;    /* File offset alignment
> > > > for direct I/O */
> > > >       /* 0xa0 */
> > > > -    __u64    __spare3[12];    /* Spare space for future expansion */
> > > > +    __u32    stx_atomic_write_unit_max;
> > > > +    __u32    stx_atomic_write_unit_min;
> > > 
> > > Maybe min first and then max?  That seems a bit more natural, and a
> > > lot of the
> > > code you've written handle them in that order.
> 
> ok, I think it's fine to reorder
> 
> > > 
> > > > +#define STATX_ATTR_WRITE_ATOMIC        0x00400000 /* File
> > > > supports atomic write operations */
> > > 
> > > How would this differ from stx_atomic_write_unit_min != 0?
> 
> Yeah, I suppose that we can just not set this for the case of
> stx_atomic_write_unit_min == 0.

Please use the STATX_ATTR_WRITE_ATOMIC flag to indicate that the
filesystem, file and underlying device support atomic writes when
the values are non-zero. The whole point of the attribute mask is
that the caller can check the mask for supported functionality
without having to read every field in the statx structure to
determine if the functionality it wants is present.
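
i.e. from the caller's side, something like (sketch only; the request
mask bit name is my assumption of what this series calls it):

	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_WRITE_ATOMIC, &stx) < 0)
		err(1, "statx");

	if (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC) {
		/* stx_atomic_write_unit_{min,max} are meaningful here */
	}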

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 03/21] fs/bdev: Add atomic write support info to statx
  2023-10-03  1:51         ` Dave Chinner
@ 2023-10-03  2:57           ` Darrick J. Wong
  2023-10-03  7:23             ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Darrick J. Wong @ 2023-10-03  2:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, Bart Van Assche, Eric Biggers, axboe, kbusch, hch,
	sagi, jejb, martin.petersen, viro, brauner, chandan.babu,
	dchinner, linux-block, linux-kernel, linux-nvme, linux-xfs,
	linux-fsdevel, tytso, jbongio, linux-api, Prasad Singamsetty

On Tue, Oct 03, 2023 at 12:51:49PM +1100, Dave Chinner wrote:
> On Mon, Oct 02, 2023 at 10:51:36AM +0100, John Garry wrote:
> > On 01/10/2023 14:23, Bart Van Assche wrote:
> > > On 9/29/23 15:49, Eric Biggers wrote:
> > > > On Fri, Sep 29, 2023 at 10:27:08AM +0000, John Garry wrote:
> > > > > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > > > > index 7cab2c65d3d7..c99d7cac2aa6 100644
> > > > > --- a/include/uapi/linux/stat.h
> > > > > +++ b/include/uapi/linux/stat.h
> > > > > @@ -127,7 +127,10 @@ struct statx {
> > > > >       __u32    stx_dio_mem_align;    /* Memory buffer alignment
> > > > > for direct I/O */
> > > > >       __u32    stx_dio_offset_align;    /* File offset alignment
> > > > > for direct I/O */
> > > > >       /* 0xa0 */
> > > > > -    __u64    __spare3[12];    /* Spare space for future expansion */
> > > > > +    __u32    stx_atomic_write_unit_max;
> > > > > +    __u32    stx_atomic_write_unit_min;
> > > > 
> > > > Maybe min first and then max?  That seems a bit more natural, and a
> > > > lot of the
> > > > code you've written handle them in that order.
> > 
> > ok, I think it's fine to reorder
> > 
> > > > 
> > > > > +#define STATX_ATTR_WRITE_ATOMIC        0x00400000 /* File
> > > > > supports atomic write operations */
> > > > 
> > > > How would this differ from stx_atomic_write_unit_min != 0?
> > 
> > Yeah, I suppose that we can just not set this for the case of
> > stx_atomic_write_unit_min == 0.
> 
> Please use the STATX_ATTR_WRITE_ATOMIC flag to indicate that the
> filesystem, file and underlying device support atomic writes when
> the values are non-zero. The whole point of the attribute mask is
> that the caller can check the mask for supported functionality
> without having to read every field in the statx structure to
> determine if the functionality it wants is present.

^^ Seconding what Dave said.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 11/21] fs: xfs: Don't use low-space allocator for alignment > 1
  2023-10-03  1:16   ` Dave Chinner
@ 2023-10-03  3:00     ` Darrick J. Wong
  2023-10-03  4:34       ` Dave Chinner
  2023-10-03 10:22       ` John Garry
  0 siblings, 2 replies; 124+ messages in thread
From: Darrick J. Wong @ 2023-10-03  3:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	viro, brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Tue, Oct 03, 2023 at 12:16:26PM +1100, Dave Chinner wrote:
> On Fri, Sep 29, 2023 at 10:27:16AM +0000, John Garry wrote:
> > The low-space allocator doesn't honour the alignment requirement, so don't
> > attempt to even use it (when we have an alignment requirement).
> > 
> > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 30c931b38853..328134c22104 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -3569,6 +3569,10 @@ xfs_bmap_btalloc_low_space(
> >  {
> >  	int			error;
> >  
> > +	/* The allocator doesn't honour args->alignment */
> > +	if (args->alignment > 1)
> > +		return 0;
> > +
> 
> How does this happen?
> 
> The earlier failing aligned allocations will clear alignment before
> we get here....

I was thinking the predicate should be xfs_inode_force_align(ip) to save
me/us from thinking about all the other weird ways args->alignment could
end up 1.

	/* forced-alignment means we don't use low mode */
	if (xfs_inode_force_align(ip))
		return -ENOSPC;

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 15/21] fs: xfs: Support atomic write for statx
  2023-09-29 10:27 ` [PATCH 15/21] fs: xfs: Support atomic write for statx John Garry
@ 2023-10-03  3:32   ` Dave Chinner
  2023-10-03 10:56     ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Dave Chinner @ 2023-10-03  3:32 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Fri, Sep 29, 2023 at 10:27:20AM +0000, John Garry wrote:
> Support providing info on atomic write unit min and max for an inode.
> 
> For simplicity, currently we limit the min at the FS block size, but a
> lower limit could be supported in future.
> 
> The atomic write unit min and max is limited by the guaranteed extent
> alignment for the inode.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_iops.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_iops.h |  4 ++++
>  2 files changed, 55 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 1c1e6171209d..5bff80748223 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -546,6 +546,46 @@ xfs_stat_blksize(
>  	return PAGE_SIZE;
>  }
>  
> +void xfs_ip_atomic_write_attr(struct xfs_inode *ip,
> +			xfs_filblks_t *unit_min_fsb,
> +			xfs_filblks_t *unit_max_fsb)

Formatting.

Also, we don't use variable name shorthand for function names -
xfs_get_atomic_write_hint(ip) to match xfs_get_extsz_hint(ip)
would be appropriate, right?



> +{
> +	xfs_extlen_t		extsz_hint = xfs_get_extsz_hint(ip);
> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> +	struct block_device	*bdev = target->bt_bdev;
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_filblks_t		atomic_write_unit_min,
> +				atomic_write_unit_max,
> +				align;
> +
> +	atomic_write_unit_min = XFS_B_TO_FSB(mp,
> +		queue_atomic_write_unit_min_bytes(bdev->bd_queue));
> +	atomic_write_unit_max = XFS_B_TO_FSB(mp,
> +		queue_atomic_write_unit_max_bytes(bdev->bd_queue));

These should be set in the buftarg at mount time, like we do with
sector size masks. Then we don't need to convert them to fsbs on
every single lookup.
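
e.g. something like this at buftarg setup time (rough sketch; the bt_
field names are assumed):

	/* cache the bdev atomic write limits in the buftarg */
	btp->bt_atomic_write_min =
		queue_atomic_write_unit_min_bytes(bdev->bd_queue);
	btp->bt_atomic_write_max =
		queue_atomic_write_unit_max_bytes(bdev->bd_queue);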

> +	/* for RT, unset extsize gives hint of 1 */
> +	/* for !RT, unset extsize gives hint of 0 */
> +	if (extsz_hint && (XFS_IS_REALTIME_INODE(ip) ||
> +	    (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)))

Logic is non-obvious. The compound is (rt || force), not
(extsz && rt), so it took me a while to actually realise I read this
incorrectly.

	if (extsz_hint &&
	    (XFS_IS_REALTIME_INODE(ip) ||
	     (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN))) {

> +		align = extsz_hint;
> +	else
> +		align = 1;

And now the logic looks wrong to me. We don't want to use extsz hint
for RT inodes if force align is not set, this will always use it
regardless of the fact it has nothing to do with force alignment.

Indeed, if XFS_DIFLAG2_FORCEALIGN is not set, then shouldn't this
always return min/max = 0 because atomic alignments are not in use on
this inode?

i.e. the first thing this code should do is:

	*unit_min_fsb = 0;
	*unit_max_fsb = 0;
	if (!(ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN))
		return;

Then we can check device support:

	if (!buftarg->bt_atomic_write_max)
		return;

Then we can check for extent size hints. If that's not set:

	align = xfs_get_extsz_hint(ip);
	if (align <= 1) {
		unit_min_fsb = 1;
		unit_max_fsb = 1;
		return;
	}

And finally, if there is an extent size hint, we can return that.

> +	if (atomic_write_unit_max == 0) {
> +		*unit_min_fsb = 0;
> +		*unit_max_fsb = 0;
> +	} else if (atomic_write_unit_min == 0) {
> +		*unit_min_fsb = 1;
> +		*unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
> +					align);

Why is it valid for a device to have a zero minimum size? If it can
set a maximum, it should -always- set a minimum size as logical
sector size is a valid lower bound, yes?

> +	} else {
> +		*unit_min_fsb = min_t(xfs_filblks_t, atomic_write_unit_min,
> +					align);
> +		*unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
> +					align);
> +	}

Nothing here guarantees the power-of-2 sizes that the RWF_ATOMIC
user interface requires....

It also doesn't check that the extent size hint is aligned with
atomic write units.

It also doesn't check either against stripe unit alignment....

> +}
> +
>  STATIC int
>  xfs_vn_getattr(
>  	struct mnt_idmap	*idmap,
> @@ -614,6 +654,17 @@ xfs_vn_getattr(
>  			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
>  			stat->dio_offset_align = bdev_logical_block_size(bdev);
>  		}
> +		if (request_mask & STATX_WRITE_ATOMIC) {
> +			xfs_filblks_t unit_min_fsb, unit_max_fsb;
> +
> +			xfs_ip_atomic_write_attr(ip, &unit_min_fsb,
> +				&unit_max_fsb);
> +			stat->atomic_write_unit_min = XFS_FSB_TO_B(mp, unit_min_fsb);
> +			stat->atomic_write_unit_max = XFS_FSB_TO_B(mp, unit_max_fsb);

That's just nasty. We pull byte units from the bdev, convert them to
fsb to round them, then convert them back to byte counts. We should
be doing all the work in one set of units....

> +			stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
> +			stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
> +			stat->result_mask |= STATX_WRITE_ATOMIC;

If the min/max are zero, then atomic writes are not supported on
this inode, right? Why would we set any of the attributes or result
mask to say it is supported on this file?


-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 16/21] fs: iomap: Atomic write support
  2023-09-29 10:27 ` [PATCH 16/21] fs: iomap: Atomic write support John Garry
@ 2023-10-03  4:24   ` Dave Chinner
  2023-10-03 12:55     ` John Garry
                       ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: Dave Chinner @ 2023-10-03  4:24 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Fri, Sep 29, 2023 at 10:27:21AM +0000, John Garry wrote:
> Add flag IOMAP_ATOMIC_WRITE to indicate to the FS that an atomic write
> bio is being created and all the rules there need to be followed.
> 
> It is the task of the FS iomap iter callbacks to ensure that the mapping
> created adheres to those rules, like size is power-of-2, is at a
> naturally-aligned offset, etc.

The mapping being returned by the filesystem can span a much greater
range than the actual IO needs - the iomap itself is not guaranteed
to be aligned to anything in particular, but the IO location within
that map can still conform to atomic IO constraints. See how
iomap_sector() calculates the actual LBA address of the IO from
the iomap and the current file position the IO is being done at.

Hence I think saying "the filesystem should make sure all IO
alignment adheres to atomic IO rules" is probably wrong. The iomap
layer doesn't care what the filesystem does, all it cares about is
whether the IO can be done given the extent map that was returned to
it.

Indeed, iomap_dio_bio_iter() is doing all these alignment checks for
normal DIO reads and writes which must be logical block sized
aligned. i.e. this check:

        if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
            !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
                return -EINVAL;

Hence I think that atomic IO units, which are similarly defined by
the bdev, should be checked at the iomap layer too, e.g. by
following up with:

	if ((dio->iocb->ki_flags & IOCB_ATOMIC) &&
	    ((pos | length) & (bdev_atomic_unit_min(iomap->bdev) - 1) ||
	     !bdev_iter_is_atomic_aligned(iomap->bdev, dio->submit.iter))
		return -EINVAL;

At this point, filesystems don't really need to know anything about
atomic IO - if they've allocated a large contiguous extent (e.g. via
fallocate()), then RWF_ATOMIC will just work for the cases where the
block device supports it...

This then means that stuff like XFS extent size hints only need to
check when the hint is set that it is aligned to the underlying
device atomic IO constraints. Then when it sees the IOMAP_ATOMIC
modifier, it can fail allocation if it can't get extent size hint
aligned allocation.

IOWs, I'm starting to think this doesn't need any change to the
on-disk format for XFS - it can be driven entirely through two
dynamic mechanisms:

1. (IOMAP_WRITE | IOMAP_ATOMIC) requests from the direct IO layer
which causes mapping/allocation to fail if it can't allocate (or
map) atomic IO compatible extents for the IO.

2. FALLOC_FL_ATOMIC preallocation flag modifier to tell fallocate()
to force alignment of all preallocated extents to atomic IO
constraints.

This doesn't require extent size hints at all. The filesystem can
query the bdev at mount time, store the min/max atomic write sizes,
and then use them for all requests that have _ATOMIC modifiers set
on them.

With iomap doing the same "get the atomic constraints from the bdev"
style lookups for per-IO file offset and size checking, I don't
think we actually need extent size hints or an on-disk flag to force
extent size hint alignment.

That doesn't mean extent size hints can't be used - it just means
that extent size hints have to be constrained to being aligned to
atomic IOs (e.g. extent size hint must be an integer multiple of the
max atomic IO size). This then acts as a modifier for _ATOMIC
context allocations, much like it is a modifier for normal
allocations now.
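
i.e. when the hint is set, something like (rough sketch; the variable
names are assumed, both values in the same units):

	/* extent size hint must be an exact multiple of the max atomic IO size */
	if (extsize_hint % atomic_write_unit_max)
		return -EINVAL;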

> In iomap_dio_bio_iter(), ensure that for a non-dsync iocb that the mapping
> is not dirty nor unmapped.
>
> A write should only produce a single bio, so error when it doesn't.

I comment on both these things below.

> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/iomap/direct-io.c  | 26 ++++++++++++++++++++++++--
>  fs/iomap/trace.h      |  3 ++-
>  include/linux/iomap.h |  1 +
>  3 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..6ef25e26f1a1 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -275,10 +275,11 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		struct iomap_dio *dio)
>  {
> +	bool atomic_write = iter->flags & IOMAP_ATOMIC_WRITE;
>  	const struct iomap *iomap = &iter->iomap;
>  	struct inode *inode = iter->inode;
>  	unsigned int fs_block_size = i_blocksize(inode), pad;
> -	loff_t length = iomap_length(iter);
> +	const loff_t length = iomap_length(iter);
>  	loff_t pos = iter->pos;
>  	blk_opf_t bio_opf;
>  	struct bio *bio;
> @@ -292,6 +293,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>  		return -EINVAL;
>  
> +	if (atomic_write && !iocb_is_dsync(dio->iocb)) {
> +		if (iomap->flags & IOMAP_F_DIRTY)
> +			return -EIO;
> +		if (iomap->type != IOMAP_MAPPED)
> +			return -EIO;
> +	}

How do we get here without space having been allocated for the
write?

Perhaps what this is trying to do is make RWF_ATOMIC only be valid
into written space? I mean, this will fail with preallocated space
(IOMAP_UNWRITTEN) even though we still have exactly the RWF_ATOMIC
all-or-nothing behaviour guaranteed after a crash because of journal
recovery behaviour. i.e. if the unwritten conversion gets written to
the journal, the data will be there. If it isn't written to the
journal, then the space remains unwritten and there's no data across
that entire range....

So I'm not really sure that either of these checks are valid or why
they are actually needed....

> +
>  	if (iomap->type == IOMAP_UNWRITTEN) {
>  		dio->flags |= IOMAP_DIO_UNWRITTEN;
>  		need_zeroout = true;
> @@ -381,6 +389,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  					  GFP_KERNEL);
>  		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>  		bio->bi_ioprio = dio->iocb->ki_ioprio;
> +		if (atomic_write)
> +			bio->bi_opf |= REQ_ATOMIC;
> +
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>  
> @@ -397,6 +408,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		}
>  
>  		n = bio->bi_iter.bi_size;
> +		if (atomic_write && n != length) {
> +			/* This bio should have covered the complete length */
> +			ret = -EINVAL;
> +			bio_put(bio);
> +			goto out;

Why? The actual bio can be any length that meets the aligned
criteria between min and max, yes? So it's valid to split a
RWF_ATOMIC write request up into multiple min unit sized bios, is it
not? I mean, that's the whole point of the min/max unit setup, isn't
it? That the max sized write only guarantees that it will tear at
min unit boundaries, not within those min unit boundaries? If
I've understood this correctly, then why does this "single bio for
large atomic write" constraint need to exist?


> +		}
>  		if (dio->flags & IOMAP_DIO_WRITE) {
>  			task_io_account_write(n);
>  		} else {
> @@ -554,6 +571,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	struct blk_plug plug;
>  	struct iomap_dio *dio;
>  	loff_t ret = 0;
> +	bool is_read = iov_iter_rw(iter) == READ;
> +	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;

This does not need to be done here, because....

>  
>  	trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
>  
> @@ -579,7 +598,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		iomi.flags |= IOMAP_NOWAIT;
>  
> -	if (iov_iter_rw(iter) == READ) {
> +	if (is_read) {
>  		/* reads can always complete inline */
>  		dio->flags |= IOMAP_DIO_INLINE_COMP;
>  
> @@ -605,6 +624,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		if (iocb->ki_flags & IOCB_DIO_CALLER_COMP)
>  			dio->flags |= IOMAP_DIO_CALLER_COMP;
>  
> +		if (atomic_write)
> +			iomi.flags |= IOMAP_ATOMIC_WRITE;

.... it is only checked once in the write path, so

		if (iocb->ki_flags & IOCB_ATOMIC)
			iomi.flags |= IOMAP_ATOMIC;

> +
>  		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
>  			ret = -EAGAIN;
>  			if (iomi.pos >= dio->i_size ||
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index c16fd55f5595..f9932733c180 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
>  	{ IOMAP_REPORT,		"REPORT" }, \
>  	{ IOMAP_FAULT,		"FAULT" }, \
>  	{ IOMAP_DIRECT,		"DIRECT" }, \
> -	{ IOMAP_NOWAIT,		"NOWAIT" }
> +	{ IOMAP_NOWAIT,		"NOWAIT" }, \
> +	{ IOMAP_ATOMIC_WRITE,	"ATOMIC" }

We already have an IOMAP_WRITE flag, so IOMAP_ATOMIC is the modifier
for the write IO behaviour (like NOWAIT), not a replacement write
flag.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 11/21] fs: xfs: Don't use low-space allocator for alignment > 1
  2023-10-03  3:00     ` Darrick J. Wong
@ 2023-10-03  4:34       ` Dave Chinner
  2023-10-03 10:22       ` John Garry
  1 sibling, 0 replies; 124+ messages in thread
From: Dave Chinner @ 2023-10-03  4:34 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	viro, brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Mon, Oct 02, 2023 at 08:00:10PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 03, 2023 at 12:16:26PM +1100, Dave Chinner wrote:
> > On Fri, Sep 29, 2023 at 10:27:16AM +0000, John Garry wrote:
> > > The low-space allocator doesn't honour the alignment requirement, so don't
> > > attempt to even use it (when we have an alignment requirement).
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 30c931b38853..328134c22104 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -3569,6 +3569,10 @@ xfs_bmap_btalloc_low_space(
> > >  {
> > >  	int			error;
> > >  
> > > +	/* The allocator doesn't honour args->alignment */
> > > +	if (args->alignment > 1)
> > > +		return 0;
> > > +
> > 
> > How does this happen?
> > 
> > The earlier failing aligned allocations will clear alignment before
> > we get here....
> 
> I was thinking the predicate should be xfs_inode_force_align(ip) to save
> me/us from thinking about all the other weird ways args->alignment could
> end up 1.
> 
> 	/* forced-alignment means we don't use low mode */
> 	if (xfs_inode_force_align(ip))
> 		return -ENOSPC;

See the email I just wrote about not needing per-inode on-disk state
or even extent size hints for doing allocation for atomic IO. Atomic
write unit alignment is a device parameter (similar to stripe unit)
that applies to context specific allocation requests - it's not an
inode property as such....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 03/21] fs/bdev: Add atomic write support info to statx
  2023-10-03  2:57           ` Darrick J. Wong
@ 2023-10-03  7:23             ` John Garry
  2023-10-03 15:46               ` Darrick J. Wong
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-10-03  7:23 UTC (permalink / raw)
  To: Darrick J. Wong, Dave Chinner
  Cc: Bart Van Assche, Eric Biggers, axboe, kbusch, hch, sagi, jejb,
	martin.petersen, viro, brauner, chandan.babu, dchinner,
	linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, Prasad Singamsetty

On 03/10/2023 03:57, Darrick J. Wong wrote:
>>>>> +#define STATX_ATTR_WRITE_ATOMIC        0x00400000 /* File
>>>>> supports atomic write operations */
>>>> How would this differ from stx_atomic_write_unit_min != 0?
>> Yeah, I suppose that we can just not set this for the case of
>> stx_atomic_write_unit_min == 0.
> Please use the STATX_ATTR_WRITE_ATOMIC flag to indicate that the
> filesystem, file and underlying device support atomic writes when
> the values are non-zero. The whole point of the attribute mask is
> that the caller can check the mask for supported functionality
> without having to read every field in the statx structure to
> determine if the functionality it wants is present.

Sure, but again that would be just checking atomic_write_unit_min_bytes 
or another atomic write block setting as that is the only way to tell 
from the block layer (if atomic writes are supported), so it will be 
something like:

if (request_mask & STATX_WRITE_ATOMIC && 
queue_atomic_write_unit_min_bytes(bdev->bd_queue)) {
     stat->atomic_write_unit_min =
       queue_atomic_write_unit_min_bytes(bdev->bd_queue);
     stat->atomic_write_unit_max =
       queue_atomic_write_unit_max_bytes(bdev->bd_queue);
     stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
     stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
     stat->result_mask |= STATX_WRITE_ATOMIC;
}

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-02 19:12       ` Bart Van Assche
  2023-10-03  0:48         ` Martin K. Petersen
@ 2023-10-03  8:37         ` John Garry
  2023-10-03 16:45           ` Bart Van Assche
  1 sibling, 1 reply; 124+ messages in thread
From: John Garry @ 2023-10-03  8:37 UTC (permalink / raw)
  To: Bart Van Assche, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 02/10/2023 20:12, Bart Van Assche wrote:
>>  > +    if (!is_power_of_2(iov_iter_count(iter)))
>>  > +        return false;
>>
>> This rule comes from FS block alignment and NVMe atomic boundary.
>>
>> FSes (XFS) have discontiguous extents. We need to ensure that an 
>> atomic write does not cross discontiguous extents. To do this we 
>> ensure extent length and alignment and limit 
>> atomic_write_unit_max_bytes to that.
>>
>> For NVMe, an atomic write boundary is a boundary in LBA space which an 
>> atomic write should not cross. We limit atomic_write_unit_max_bytes 
>> such that it is evenly divisible into this atomic write boundary.
>>
>> To ensure that the write does not cross these alignment boundaries we 
>> say that it must be naturally aligned and a power-of-2 in length.
>>
>> We may be able to relax this rule but I am not sure it buys us 
>> anything - typically we want to be writing a 64KB block aligned to 
>> 64KB, for example.
> 
> It seems to me that the requirement is_power_of_2(iov_iter_count(iter))
> is necessary for some filesystems but not for all filesystems. 
> Restrictions that are specific to a single filesystem (XFS) should not 
> occur in code that is intended to be used by all filesystems 
> (blkdev_atomic_write_valid()).

I don't think that is_power_of_2(write length) is specific to XFS. It is 
just a simple mathematical method to ensure we always obey the length and
alignment requirements.

Furthermore, if ext4 wants to support atomic writes, for example, then 
it will probably base that on bigalloc. And bigalloc is power-of-2 based.

As for the rules, current proposal is:
- atomic_write_unit_min and atomic_write_unit_max are power-of-2
- write needs to be at a naturally aligned file offset
- write length needs to be a power-of-2 between atomic_write_unit_min 
and atomic_write_unit_max, inclusive

Those could be relaxed to:
- atomic_write_unit_min and atomic_write_unit_max are power-of-2
- write length needs to be a multiple of atomic_write_unit_min and at most
atomic_write_unit_max
- write needs to be at an offset aligned to atomic_write_unit_min
- write cannot cross atomic_write_unit_max boundary within the file

Are the relaxed rules better? I don't think so, and I don't like "write 
cannot cross atomic_write_unit_max boundary" in terms of wording.
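
For reference, the current rules boil down to a check like this (rough
sketch; pos/len are the write offset and length, and the unit limits are
assumed to be in the same units):

	if (!is_power_of_2(len))
		return false;
	if (len < atomic_write_unit_min || len > atomic_write_unit_max)
		return false;
	/* naturally aligned: offset is a multiple of the (power-of-2) length */
	if (pos & (len - 1))
		return false;
	return true;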

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 13/21] fs: xfs: Make file data allocations observe the 'forcealign' flag
  2023-10-03  1:42   ` Dave Chinner
@ 2023-10-03 10:13     ` John Garry
  0 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-10-03 10:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 03/10/2023 02:42, Dave Chinner wrote:
> On Fri, Sep 29, 2023 at 10:27:18AM +0000, John Garry wrote:
>> From: "Darrick J. Wong" <djwong@kernel.org>
>>
>> The existing extsize hint code already did the work of expanding file
>> range mapping requests so that the range is aligned to the hint value.
>> Now add the code we need to guarantee that the space allocations are
>> also always aligned.
>>
>> XXX: still need to check all this with reflink
>>
>> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
>> Co-developed-by: John Garry <john.g.garry@oracle.com>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
>>   fs/xfs/xfs_iomap.c       |  4 +++-
>>   2 files changed, 20 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
>> index 328134c22104..6c864dc0a6ff 100644
>> --- a/fs/xfs/libxfs/xfs_bmap.c
>> +++ b/fs/xfs/libxfs/xfs_bmap.c
>> @@ -3328,6 +3328,19 @@ xfs_bmap_compute_alignments(
>>   		align = xfs_get_cowextsz_hint(ap->ip);
>>   	else if (ap->datatype & XFS_ALLOC_USERDATA)
>>   		align = xfs_get_extsz_hint(ap->ip);
>> +
>> +	/*
>> +	 * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
>> +	 * set as forcealign and cowextsz_hint are mutually exclusive
>> +	 */
>> +	if (xfs_inode_forcealign(ap->ip) && align) {
>> +		args->alignment = align;
>> +		if (stripe_align % align)
>> +			stripe_align = align;
>> +	} else {
>> +		args->alignment = 1;
>> +	}
> 
> This smells wrong.
> 
> If a filesystem has a stripe unit set (hence stripe_align is
> non-zero) then any IO that crosses stripe unit boundaries will not
> be atomic - they will require multiple IOs to different devices.
> 
> Hence if the filesystem has a stripe unit set, then all forced
> alignment hints for atomic IO *must* be an exact integer divider
> of the stripe unit. hence when an atomic IO bundle is aligned, the
> atomic boundaries within the bundle always fall on a stripe unit
> boundary and never cross devices.
> 
> IOWs, for a striped filesystem, the maximum size/alignment for a
> single atomic IO unit is the stripe unit.
> 

ok, when I added this I was trying to be robust against wacky scenarios
where that is not true, like forcealign = stripe alignment * 2.

Please note that this forcealign feature is being added with the view 
that it can be useful for other scenarios, and not just atomic writes.

> This should be enforced when the forced align flag is set on the
> inode (i.e. from the ioctl)

ok, fine.

> 
> 
>> +
>>   	if (align) {
>>   		if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
>>   					ap->eof, 0, ap->conv, &ap->offset,
>> @@ -3423,7 +3436,6 @@ xfs_bmap_exact_minlen_extent_alloc(
>>   	args.minlen = args.maxlen = ap->minlen;
>>   	args.total = ap->total;
>>   
>> -	args.alignment = 1;
>>   	args.minalignslop = 0;
>>   
>>   	args.minleft = ap->minleft;
>> @@ -3469,6 +3481,7 @@ xfs_bmap_btalloc_at_eof(
>>   {
>>   	struct xfs_mount	*mp = args->mp;
>>   	struct xfs_perag	*caller_pag = args->pag;
>> +	int			orig_alignment = args->alignment;
>>   	int			error;
>>   
>>   	/*
>> @@ -3543,10 +3556,10 @@ xfs_bmap_btalloc_at_eof(
>>   
>>   	/*
>>   	 * Allocation failed, so turn return the allocation args to their
>> -	 * original non-aligned state so the caller can proceed on allocation
>> -	 * failure as if this function was never called.
>> +	 * original state so the caller can proceed on allocation failure as
>> +	 * if this function was never called.
>>   	 */
>> -	args->alignment = 1;
>> +	args->alignment = orig_alignment;
>>   	return 0;
>>   }
> 
> Urk. Not sure that is right, it's certainly a change of behaviour.

Is it really a change in behaviour? We just restore the args->alignment 
value, which was originally always 1.

As described in the comment, above, args->alignment is temporarily set 
to the stripe align to try to align a new alloc on a stripe boundary.

> 
>> @@ -3694,7 +3707,6 @@ xfs_bmap_btalloc(
>>   		.wasdel		= ap->wasdel,
>>   		.resv		= XFS_AG_RESV_NONE,
>>   		.datatype	= ap->datatype,
>> -		.alignment	= 1,
>>   		.minalignslop	= 0,
>>   	};
>>   	xfs_fileoff_t		orig_offset;
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 18c8f168b153..70fe873951f3 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -181,7 +181,9 @@ xfs_eof_alignment(
>>   		 * If mounted with the "-o swalloc" option the alignment is
>>   		 * increased from the strip unit size to the stripe width.
>>   		 */
>> -		if (mp->m_swidth && xfs_has_swalloc(mp))
>> +		if (xfs_inode_forcealign(ip))
>> +			align = xfs_get_extsz_hint(ip);
>> +		else if (mp->m_swidth && xfs_has_swalloc(mp))
>>   			align = mp->m_swidth;
>>   		else if (mp->m_dalign)
>>   			align = mp->m_dalign;
> 
> Ah. Now I see. This abuses the stripe alignment code to try to
> implement this new inode allocation alignment restriction, rather
> than just making the extent size hint alignment mandatory....
> 
> Yeah, this can be done better... :)
> 
> As it is, I have been working on a series that reworks all this
> allocator code to separate out the aligned IO from the exact EOF
> allocation case to help clean this up for better perag selection
> during allocation. I think that needs to be done first before we go
> making the alignment code more intricate like this....
> 
> -Dave.

ok, fine. I think that we'll just keep this code as is until that code 
you mention appears, apart from enforcing stripe alignment % forcealign 
== 0.
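
i.e. enforce it at the point the forcealign flag and extent size hint
are set, something like (rough sketch; extsize_hint is assumed to be in
the same units as mp->m_dalign):

	/* forcealign hint must exactly divide the stripe unit */
	if (mp->m_dalign && (mp->m_dalign % extsize_hint))
		return -EINVAL;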

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 11/21] fs: xfs: Don't use low-space allocator for alignment > 1
  2023-10-03  3:00     ` Darrick J. Wong
  2023-10-03  4:34       ` Dave Chinner
@ 2023-10-03 10:22       ` John Garry
  1 sibling, 0 replies; 124+ messages in thread
From: John Garry @ 2023-10-03 10:22 UTC (permalink / raw)
  To: Darrick J. Wong, Dave Chinner
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, viro, brauner,
	chandan.babu, dchinner, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 03/10/2023 04:00, Darrick J. Wong wrote:
>> How does this happen?
>>
>> The earlier failing aligned allocations will clear alignment before
>> we get here....
> I was thinking the predicate should be xfs_inode_force_align(ip) to save
> me/us from thinking about all the other weird ways args->alignment could
> end up 1.
> 
> 	/* forced-alignment means we don't use low mode */
> 	if (xfs_inode_force_align(ip))

My idea was that if we add another feature which requires
args->alignment > 1 to be honoured, then we would need to change this
code to cover both features, so it is better to just check
args->alignment > 1.

> 		return -ENOSPC;
Thanks,
John


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 15/21] fs: xfs: Support atomic write for statx
  2023-10-03  3:32   ` Dave Chinner
@ 2023-10-03 10:56     ` John Garry
  2023-10-03 16:10       ` Darrick J. Wong
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-10-03 10:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 03/10/2023 04:32, Dave Chinner wrote:
> On Fri, Sep 29, 2023 at 10:27:20AM +0000, John Garry wrote:
>> Support providing info on atomic write unit min and max for an inode.
>>
>> For simplicity, currently we limit the min at the FS block size, but a
>> lower limit could be supported in future.
>>
>> The atomic write unit min and max is limited by the guaranteed extent
>> alignment for the inode.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_iops.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++
>>   fs/xfs/xfs_iops.h |  4 ++++
>>   2 files changed, 55 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>> index 1c1e6171209d..5bff80748223 100644
>> --- a/fs/xfs/xfs_iops.c
>> +++ b/fs/xfs/xfs_iops.c
>> @@ -546,6 +546,46 @@ xfs_stat_blksize(
>>   	return PAGE_SIZE;
>>   }
>>   
>> +void xfs_ip_atomic_write_attr(struct xfs_inode *ip,
>> +			xfs_filblks_t *unit_min_fsb,
>> +			xfs_filblks_t *unit_max_fsb)
> 
> Formatting.

Change args to 1x tab indent, right?

> 
> Also, we don't use variable name shorthand for function names -
> xfs_get_atomic_write_hint(ip) to match xfs_get_extsz_hint(ip)
> would be appropriate, right?

Changing the name format would be ok. However we are not returning a 
hint, but rather the inode atomic write unit min and max values in FS 
blocks. Anyway, I'll look to rework the name.

> 
> 
> 
>> +{
>> +	xfs_extlen_t		extsz_hint = xfs_get_extsz_hint(ip);
>> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
>> +	struct block_device	*bdev = target->bt_bdev;
>> +	struct xfs_mount	*mp = ip->i_mount;
>> +	xfs_filblks_t		atomic_write_unit_min,
>> +				atomic_write_unit_max,
>> +				align;
>> +
>> +	atomic_write_unit_min = XFS_B_TO_FSB(mp,
>> +		queue_atomic_write_unit_min_bytes(bdev->bd_queue));
>> +	atomic_write_unit_max = XFS_B_TO_FSB(mp,
>> +		queue_atomic_write_unit_max_bytes(bdev->bd_queue));
> 
> These should be set in the buftarg at mount time, like we do with
> sector size masks. Then we don't need to convert them to fsbs on
> every single lookup.

ok, fine. However I still have a doubt about whether these values should
be changeable - please see (small) comment about 
atomic_write_max_sectors in patch 7/21

> 
>> +	/* for RT, unset extsize gives hint of 1 */
>> +	/* for !RT, unset extsize gives hint of 0 */
>> +	if (extsz_hint && (XFS_IS_REALTIME_INODE(ip) ||
>> +	    (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)))
> 
> Logic is non-obvious. The compound is (rt || force), not
> (extsz && rt), so it took me a while to actually realise I read this
> incorrectly.
> 
> 	if (extsz_hint &&
> 	    (XFS_IS_REALTIME_INODE(ip) ||
> 	     (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN))) {
> 
>> +		align = extsz_hint;
>> +	else
>> +		align = 1;
> 
> And now the logic looks wrong to me. We don't want to use extsz hint
> for RT inodes if force align is not set, this will always use it
> regardless of the fact it has nothing to do with force alignment.

extsz_hint comes from xfs_get_extsz_hint(), which gives us the SB 
extsize for the RT inode and this alignment is guaranteed, no?

> 
> Indeed, if XFS_DIFLAG2_FORCEALIGN is not set, then shouldn't this
> always return min/max = 0 because atomic alignments are not in use on
> this inode?

As above, for RT I thought that extsize alignment was guaranteed and we 
don't need to bother with XFS_DIFLAG2_FORCEALIGN there.

> 
> i.e. the first thing this code should do is:
> 
> 	*unit_min_fsb = 0;
> 	*unit_max_fsb = 0;
> 	if (!(ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN))
> 		return;
> 
> Then we can check device support:
> 
> 	if (!buftarg->bt_atomic_write_max)
> 		return;
> 
> Then we can check for extent size hints. If that's not set:
> 
> 	align = xfs_get_extsz_hint(ip);
> 	if (align <= 1) {
> 		unit_min_fsb = 1;
> 		unit_max_fsb = 1;
> 		return;
> 	}
> 
> And finally, if there is an extent size hint, we can return that.
> 
>> +	if (atomic_write_unit_max == 0) {
>> +		*unit_min_fsb = 0;
>> +		*unit_max_fsb = 0;
>> +	} else if (atomic_write_unit_min == 0) {
>> +		*unit_min_fsb = 1;
>> +		*unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
>> +					align);
> 
> Why is it valid for a device to have a zero minimum size?

It's not valid. The unit of the local variables atomic_write_unit_max and
atomic_write_unit_min here is FS blocks - maybe I should change the names.

The idea is that for simplicity we won't support atomic writes for XFS 
of size less than 1x FS block initially. So if the bdev has - for 
example - queue_atomic_write_unit_min_bytes() == 2K and 
queue_atomic_write_unit_max_bytes() == 64K, then (ignoring alignment) we 
say that unit_min_fsb = 1 and unit_max_fsb = 16 (for 4K FS blocks).

> If it can
> set a maximum, it should -always- set a minimum size as logical
> sector size is a valid lower bound, yes?
> 
>> +	} else {
>> +		*unit_min_fsb = min_t(xfs_filblks_t, atomic_write_unit_min,
>> +					align);
>> +		*unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
>> +					align);
>> +	}
> 
> Nothing here guarantees the power-of-2 sizes that the RWF_ATOMIC
> user interface requires....

atomic_write_unit_min and atomic_write_unit_max will be powers-of-2 (or 0).

But, you are right, we don't check align is a power-of-2 - that can be 
added.

> 
> It also doesn't check that the extent size hint is aligned with
> atomic write units.

If we add a check for align being a power-of-2 and atomic_write_unit_min 
and atomic_write_unit_max are already powers-of-2, then this can be 
relied on, right?
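
i.e. just something like this, on top of the zero-init of the out
parameters (rough sketch):

	if (!is_power_of_2(align))
		return;		/* leave unit_{min,max}_fsb at 0 */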

> 
> It also doesn't check either against stripe unit alignment....

As mentioned in earlier response, this could be enforced.

> 
>> +}
>> +
>>   STATIC int
>>   xfs_vn_getattr(
>>   	struct mnt_idmap	*idmap,
>> @@ -614,6 +654,17 @@ xfs_vn_getattr(
>>   			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
>>   			stat->dio_offset_align = bdev_logical_block_size(bdev);
>>   		}
>> +		if (request_mask & STATX_WRITE_ATOMIC) {
>> +			xfs_filblks_t unit_min_fsb, unit_max_fsb;
>> +
>> +			xfs_ip_atomic_write_attr(ip, &unit_min_fsb,
>> +				&unit_max_fsb);
>> +			stat->atomic_write_unit_min = XFS_FSB_TO_B(mp, unit_min_fsb);
>> +			stat->atomic_write_unit_max = XFS_FSB_TO_B(mp, unit_max_fsb);
> 
> That's just nasty. We pull byte units from the bdev, convert them to
> fsb to round them, then convert them back to byte counts. We should
> be doing all the work in one set of units....

ok, agreed. bytes is probably best.

> 
>> +			stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
>> +			stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
>> +			stat->result_mask |= STATX_WRITE_ATOMIC;
> 
> If the min/max are zero, then atomic writes are not supported on
> this inode, right? Why would we set any of the attributes or result
> mask to say it is supported on this file?

ok, we won't set STATX_ATTR_WRITE_ATOMIC for min/max are zero

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 16/21] fs: iomap: Atomic write support
  2023-10-03  4:24   ` Dave Chinner
@ 2023-10-03 12:55     ` John Garry
  2023-10-03 16:47     ` Darrick J. Wong
  2023-10-24 12:59     ` John Garry
  2 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-10-03 12:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 03/10/2023 05:24, Dave Chinner wrote:
> On Fri, Sep 29, 2023 at 10:27:21AM +0000, John Garry wrote:
>> Add flag IOMAP_ATOMIC_WRITE to indicate to the FS that an atomic write
>> bio is being created and all the rules there need to be followed.
>>
>> It is the task of the FS iomap iter callbacks to ensure that the mapping
>> created adheres to those rules, like size is power-of-2, is at a
>> naturally-aligned offset, etc.
> 
> The mapping being returned by the filesystem can span a much greater
> range than the actual IO needs - the iomap itself is not guaranteed
> to be aligned to anything in particular, but the IO location within
> that map can still conform to atomic IO constraints. See how
> iomap_sector() calculates the actual LBA address of the IO from
> the iomap and the current file position the IO is being done at.

I see, but I was working on the basis that the filesystem produces an 
iomap which itself conforms to all the rules. And that is because the 
atomic write unit min and max for the file depend on the extent 
alignment, which only the filesystem is aware of.

> 
> Hence I think saying "the filesystem should make sure all IO
> alignment adheres to atomic IO rules" is probably wrong. The iomap
> layer doesn't care what the filesystem does, all it cares about is
> whether the IO can be done given the extent map that was returned to
> it.
> 
> Indeed, iomap_dio_bio_iter() is doing all these alignment checks for
> normal DIO reads and writes which must be logical block sized
> aligned. i.e. this check:
> 
>          if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>              !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>                  return -EINVAL;
> 
> Hence I think that atomic IO units, which are similarly defined by
> the bdev, should be checked at the iomap layer too, e.g. by
> following up with:
> 
> 	if ((dio->iocb->ki_flags & IOCB_ATOMIC) &&
> 	    ((pos | length) & (bdev_atomic_unit_min(iomap->bdev) - 1) ||
> 	     !bdev_iter_is_atomic_aligned(iomap->bdev, dio->submit.iter))
> 		return -EINVAL;

Seems ok for at least enforcing alignment for the bdev. Again, 
filesystem extent alignment is my concern.

> 
> At this point, filesystems don't really need to know anything about
> atomic IO - if they've allocated a large contiguous extent (e.g. via
> fallocate()), then RWF_ATOMIC will just work for the cases where the
> block device supports it...
> 
> This then means that stuff like XFS extent size hints only need to
> check when the hint is set that it is aligned to the underlying
> device atomic IO constraints. Then when it sees the IOMAP_ATOMIC
> modifier, it can fail allocation if it can't get extent size hint
> aligned allocation.

I am not sure what you mean by allocation in this context. I assume that 
fallocate allocates the extents, but they remain unwritten. So if we 
then dd into that file to zero it or init it any other way, they become 
written and the extent size hint or bdev atomic write constraints would 
be just ignored then.

BTW, if you remember, we did propose an XFS fallocate extension for 
extent alignment in the initial RFC, but decided to drop it.

> 
> IOWs, I'm starting to think this doesn't need any change to the
> on-disk format for XFS - it can be driven entirely through two
> dynamic mechanisms:
> 
> 1. (IOMAP_WRITE | IOMAP_ATOMIC) requests from the direct IO layer
> which causes mapping/allocation to fail if it can't allocate (or
> map) atomic IO compatible extents for the IO.
> 
> 2. FALLOC_FL_ATOMIC preallocation flag modifier to tell fallocate()
> to force alignment of all preallocated extents to atomic IO
> constraints.

Would that be a sticky flag? What stops the extents mutating before the 
atomic write?

> 
> This doesn't require extent size hints at all. The filesystem can
> query the bdev at mount time, store the min/max atomic write sizes,
> and then use them for all requests that have _ATOMIC modifiers set
> on them.

A drawback is that the storage device may support an atomic write unit
max much bigger than the user requires, which causes inefficient
alignment, e.g. bdev atomic write unit max = 1M when we only ever want
8KB atomic writes. But you mention below that extent size hints can
still be taken into account.

> 
> With iomap doing the same "get the atomic constraints from the bdev"
> style lookups for per-IO file offset and size checking, I don't
> think we actually need extent size hints or an on-disk flag to force
> extent size hint alignment.
> 
> That doesn't mean extent size hints can't be used - it just means
> that extent size hints have to be constrained to being aligned to
> atomic IOs (e.g. extent size hint must be an integer multiple of the
> max atomic IO size). 

Yeah, well I think that we already agreed something like this.

> This then acts as a modifier for _ATOMIC
> context allocations, much like it is a modifier for normal
> allocations now.
> 
>> In iomap_dio_bio_iter(), ensure that for a non-dsync iocb that the mapping
>> is not dirty nor unmapped.
>>
>> A write should only produce a single bio, so error when it doesn't.
> 
> I comment on both these things below.
> 
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/iomap/direct-io.c  | 26 ++++++++++++++++++++++++--
>>   fs/iomap/trace.h      |  3 ++-
>>   include/linux/iomap.h |  1 +
>>   3 files changed, 27 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index bcd3f8cf5ea4..6ef25e26f1a1 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -275,10 +275,11 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>>   static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   		struct iomap_dio *dio)
>>   {
>> +	bool atomic_write = iter->flags & IOMAP_ATOMIC_WRITE;
>>   	const struct iomap *iomap = &iter->iomap;
>>   	struct inode *inode = iter->inode;
>>   	unsigned int fs_block_size = i_blocksize(inode), pad;
>> -	loff_t length = iomap_length(iter);
>> +	const loff_t length = iomap_length(iter);
>>   	loff_t pos = iter->pos;
>>   	blk_opf_t bio_opf;
>>   	struct bio *bio;
>> @@ -292,6 +293,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>>   		return -EINVAL;
>>   
>> +	if (atomic_write && !iocb_is_dsync(dio->iocb)) {
>> +		if (iomap->flags & IOMAP_F_DIRTY)
>> +			return -EIO;
>> +		if (iomap->type != IOMAP_MAPPED)
>> +			return -EIO;
>> +	}
> 
> How do we get here without space having been allocated for the
> write?

I don't think that we can, but we are checking that the space is also 
written.

> 
> Perhaps what this is trying to do is make RWF_ATOMIC only be valid
> into written space? 

Yes, and we now detail this in the man pages.

> I mean, this will fail with preallocated space
> (IOMAP_UNWRITTEN) even though we still have exactly the RWF_ATOMIC
> all-or-nothing behaviour guaranteed after a crash because of journal
> recovery behaviour. i.e. if the unwritten conversion gets written to
> the journal, the data will be there. If it isn't written to the
> journal, then the space remains unwritten and there's no data across
> that entire range....
> 
> So I'm not really sure that either of these checks are valid or why
> they are actually needed....

I think that the idea is that the space is already written and the 
metadata for the space is persisted or going to be. Darrick guided me on 
this, so hopefully he can comment more.

> 
>> +
>>   	if (iomap->type == IOMAP_UNWRITTEN) {
>>   		dio->flags |= IOMAP_DIO_UNWRITTEN;
>>   		need_zeroout = true;
>> @@ -381,6 +389,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   					  GFP_KERNEL);
>>   		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>>   		bio->bi_ioprio = dio->iocb->ki_ioprio;
>> +		if (atomic_write)
>> +			bio->bi_opf |= REQ_ATOMIC;
>> +
>>   		bio->bi_private = dio;
>>   		bio->bi_end_io = iomap_dio_bio_end_io;
>>   
>> @@ -397,6 +408,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   		}
>>   
>>   		n = bio->bi_iter.bi_size;
>> +		if (atomic_write && n != length) {
>> +			/* This bio should have covered the complete length */
>> +			ret = -EINVAL;
>> +			bio_put(bio);
>> +			goto out;
> 
> Why? The actual bio can be any length that meets the aligned
> criteria between min and max, yes?

The write also needs to be a power-of-2 in length. atomic write min and 
max will always be a power-of-2.

> So it's valid to split a
> RWF_ATOMIC write request up into multiple min unit sized bios, is it
> not?

It is not. In the RFC we sent in May there was a scheme to break up the 
atomic write into multiple userspace block-sized bios, but that is no 
longer supported.

Now an atomic write only produces a single bio. So userspace may do a 
16KB atomic write, for example, and we only ever issue that as a single 
16KB operation to the storage device.
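
In userspace terms that 16KB example would look something like this
(rough sketch; assumes fd was opened with O_DIRECT and buf satisfies
the usual DIO alignment rules):

	struct iovec iov = {
		.iov_base = buf,	/* 16KB buffer */
		.iov_len = 16 * 1024,
	};

	/* file offset naturally aligned to the 16KB write length */
	ret = pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);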

> I mean, that's the whole point of the min/max unit setup, isn't
> it?

The point of min/max is to ensure that userspace executes an atomic 
write which is guaranteed to be only ever issued as a single write to 
the storage device. In addition, the length and position for that write 
conforms to the storage device atomic write constraints.

> That the max sized write only guarantees that it will tear at
> min unit boundaries, not within those min unit boundaries?

There is no tearing. As mentioned, the RFC in May did support some 
splitting but we decided to drop it.

> If
> I've understood this correctly, then why does this "single bio for
> large atomic write" constraint need to exist?

An atomic write means that a write will never be torn.

> 
> 
>> +		}
>>   		if (dio->flags & IOMAP_DIO_WRITE) {
>>   			task_io_account_write(n);
>>   		} else {
>> @@ -554,6 +571,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>>   	struct blk_plug plug;
>>   	struct iomap_dio *dio;
>>   	loff_t ret = 0;
>> +	bool is_read = iov_iter_rw(iter) == READ;
>> +	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
> 
> This does not need to be done here, because....
> 
>>   
>>   	trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
>>   
>> @@ -579,7 +598,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>>   	if (iocb->ki_flags & IOCB_NOWAIT)
>>   		iomi.flags |= IOMAP_NOWAIT;
>>   
>> -	if (iov_iter_rw(iter) == READ) {
>> +	if (is_read) {
>>   		/* reads can always complete inline */
>>   		dio->flags |= IOMAP_DIO_INLINE_COMP;
>>   
>> @@ -605,6 +624,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>>   		if (iocb->ki_flags & IOCB_DIO_CALLER_COMP)
>>   			dio->flags |= IOMAP_DIO_CALLER_COMP;
>>   
>> +		if (atomic_write)
>> +			iomi.flags |= IOMAP_ATOMIC_WRITE;
> 
> .... it is only checked once in the write path, so

ok

> 
> 		if (iocb->ki_flags & IOCB_ATOMIC)
> 			iomi.flags |= IOMAP_ATOMIC;
> 
>> +
>>   		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
>>   			ret = -EAGAIN;
>>   			if (iomi.pos >= dio->i_size ||
>> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
>> index c16fd55f5595..f9932733c180 100644
>> --- a/fs/iomap/trace.h
>> +++ b/fs/iomap/trace.h
>> @@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
>>   	{ IOMAP_REPORT,		"REPORT" }, \
>>   	{ IOMAP_FAULT,		"FAULT" }, \
>>   	{ IOMAP_DIRECT,		"DIRECT" }, \
>> -	{ IOMAP_NOWAIT,		"NOWAIT" }
>> +	{ IOMAP_NOWAIT,		"NOWAIT" }, \
>> +	{ IOMAP_ATOMIC_WRITE,	"ATOMIC" }
> 
> We already have an IOMAP_WRITE flag, so IOMAP_ATOMIC is the modifier
> for the write IO behaviour (like NOWAIT), not a replacement write
> flag.

The name IOMAP_ATOMIC_WRITE is the issue then. The iomap trace still 
just has "ATOMIC" as the trace modifier.

Thanks,
John


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 03/21] fs/bdev: Add atomic write support info to statx
  2023-10-03  7:23             ` John Garry
@ 2023-10-03 15:46               ` Darrick J. Wong
  2023-10-04 14:19                 ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Darrick J. Wong @ 2023-10-03 15:46 UTC (permalink / raw)
  To: John Garry
  Cc: Dave Chinner, Bart Van Assche, Eric Biggers, axboe, kbusch, hch,
	sagi, jejb, martin.petersen, viro, brauner, chandan.babu,
	dchinner, linux-block, linux-kernel, linux-nvme, linux-xfs,
	linux-fsdevel, tytso, jbongio, linux-api, Prasad Singamsetty

On Tue, Oct 03, 2023 at 08:23:26AM +0100, John Garry wrote:
> On 03/10/2023 03:57, Darrick J. Wong wrote:
> > > > > > +#define STATX_ATTR_WRITE_ATOMIC        0x00400000 /* File
> > > > > > supports atomic write operations */
> > > > > How would this differ from stx_atomic_write_unit_min != 0?
> > > Yeah, I suppose that we can just not set this for the case of
> > > stx_atomic_write_unit_min == 0.
> > Please use the STATX_ATTR_WRITE_ATOMIC flag to indicate that the
> > filesystem, file and underlying device support atomic writes when
> > the values are non-zero. The whole point of the attribute mask is
> > that the caller can check the mask for supported functionality
> > without having to read every field in the statx structure to
> > determine if the functionality it wants is present.
> 
> Sure, but again that would be just checking atomic_write_unit_min_bytes or
> another atomic write block setting as that is the only way to tell from the
> block layer (if atomic writes are supported), so it will be something like:
> 
> if (request_mask & STATX_WRITE_ATOMIC &&
> queue_atomic_write_unit_min_bytes(bdev->bd_queue)) {
>     stat->atomic_write_unit_min =
>       queue_atomic_write_unit_min_bytes(bdev->bd_queue);
>     stat->atomic_write_unit_max =
>       queue_atomic_write_unit_max_bytes(bdev->bd_queue);
>     stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
>     stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
>     stat->result_mask |= STATX_WRITE_ATOMIC;

The result_mask (which becomes the statx stx_mask) needs to have
STATX_WRITE_ATOMIC set any time a filesystem responds to
STATX_WRITE_ATOMIC being set in the request_mask, even if the response
is "not supported".

The attributes_mask also needs to have STATX_ATTR_WRITE_ATOMIC set if
the filesystem+file can support the flag, even if it's not currently set
for that file.  This should get turned into a generic vfs helper for the
next fs that wants to support atomic write units:

static void generic_fill_statx_atomic_writes(struct kstat *stat,
		struct block_device *bdev)
{
	u64 min_bytes;

	/* Confirm that the fs driver knows about this statx request */
	stat->result_mask |= STATX_WRITE_ATOMIC;

	/* Confirm that the file attribute is known to the fs. */
	stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;

	/* Fill out the rest of the atomic write fields if supported */
	min_bytes = queue_atomic_write_unit_min_bytes(bdev->bd_queue);
	if (min_bytes == 0)
		return;

	stat->atomic_write_unit_min = min_bytes;
	stat->atomic_write_unit_max =
			queue_atomic_write_unit_max_bytes(bdev->bd_queue);

	/* Atomic writes actually supported on this file. */
	stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
}

and then:

	if (request_mask & STATX_WRITE_ATOMIC)
		generic_fill_statx_atomic_writes(stat, bdev);
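
and a userspace caller checks something like (rough sketch):

	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_WRITE_ATOMIC, &stx) == 0 &&
	    (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)) {
		/* stx_atomic_write_unit_{min,max} are valid here */
	}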


> }
> 
> Thanks,
> John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 15/21] fs: xfs: Support atomic write for statx
  2023-10-03 10:56     ` John Garry
@ 2023-10-03 16:10       ` Darrick J. Wong
  0 siblings, 0 replies; 124+ messages in thread
From: Darrick J. Wong @ 2023-10-03 16:10 UTC (permalink / raw)
  To: John Garry
  Cc: Dave Chinner, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	viro, brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Tue, Oct 03, 2023 at 11:56:52AM +0100, John Garry wrote:
> On 03/10/2023 04:32, Dave Chinner wrote:
> > On Fri, Sep 29, 2023 at 10:27:20AM +0000, John Garry wrote:
> > > Support providing info on atomic write unit min and max for an inode.
> > > 
> > > For simplicity, currently we limit the min at the FS block size, but a
> > > lower limit could be supported in future.
> > > 
> > > The atomic write unit min and max is limited by the guaranteed extent
> > > alignment for the inode.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/xfs_iops.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++
> > >   fs/xfs/xfs_iops.h |  4 ++++
> > >   2 files changed, 55 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index 1c1e6171209d..5bff80748223 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -546,6 +546,46 @@ xfs_stat_blksize(
> > >   	return PAGE_SIZE;
> > >   }
> > > +void xfs_ip_atomic_write_attr(struct xfs_inode *ip,
> > > +			xfs_filblks_t *unit_min_fsb,
> > > +			xfs_filblks_t *unit_max_fsb)
> > 
> > Formatting.
> 
> Change args to 1x tab indent, right?
> 
> > 
> > Also, we don't use variable name shorthand for function names -
> > xfs_get_atomic_write_hint(ip) to match xfs_get_extsz_hint(ip)
> > would be appropriate, right?
> 
> Changing the name format would be ok. However we are not returning a hint,
> but rather the inode atomic write unit min and max values in FS blocks.
> Anyway, I'll look to rework the name.
> 
> > 
> > 
> > 
> > > +{
> > > +	xfs_extlen_t		extsz_hint = xfs_get_extsz_hint(ip);
> > > +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> > > +	struct block_device	*bdev = target->bt_bdev;
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	xfs_filblks_t		atomic_write_unit_min,
> > > +				atomic_write_unit_max,
> > > +				align;
> > > +
> > > +	atomic_write_unit_min = XFS_B_TO_FSB(mp,
> > > +		queue_atomic_write_unit_min_bytes(bdev->bd_queue));
> > > +	atomic_write_unit_max = XFS_B_TO_FSB(mp,
> > > +		queue_atomic_write_unit_max_bytes(bdev->bd_queue));
> > 
> > These should be set in the buftarg at mount time, like we do with
> > sector size masks. Then we don't need to convert them to fsbs on
> > every single lookup.
> 
> ok, fine. However I still have a doubt about whether these values should be
> changeable - please see (small) comment about atomic_write_max_sectors in
> patch 7/21

No, this /does/ have to be looked up every time, because the geometry of
the device can change underneath the fs without us knowing about it.  If
someone snapshots an LV with different (or no) atomic write abilities
then we'll be doing the wrong checks.

And yes, it's true that this is a benign problem because we don't lock
anything in the bdev here and the block device driver will eventually
have to catch that anyway.

> > 
> > > +	/* for RT, unset extsize gives hint of 1 */
> > > +	/* for !RT, unset extsize gives hint of 0 */
> > > +	if (extsz_hint && (XFS_IS_REALTIME_INODE(ip) ||
> > > +	    (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN)))
> > 
> > Logic is non-obvious. The compound is (rt || force), not
> > (extsz && rt), so it took me a while to actually realise I read this
> > incorrectly.
> > 
> > 	if (extsz_hint &&
> > 	    (XFS_IS_REALTIME_INODE(ip) ||
> > 	     (ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN))) {
> > 
> > > +		align = extsz_hint;
> > > +	else
> > > +		align = 1;
> > 
> > And now the logic looks wrong to me. We don't want to use extsz hint
> > for RT inodes if force align is not set, this will always use it
> > regardless of the fact it has nothing to do with force alignment.
> 
> extsz_hint comes from xfs_get_extsz_hint(), which gives us the SB extsize
> for the RT inode and this alignment is guaranteed, no?

One can also set an extent size hint on realtime files that is a
multiple of the realtime extent size.  IOWs, I can decide that space on
the rt volume should be given out in 32k chunks, and then later decide
that a specific rt file should actually try for 64k chunks.

> > 
> > Indeed, if XFS_DIFLAG2_FORCEALIGN is not set, then shouldn't this
> > always return min/max = 0 because atomic alignments are not in use on
> > this inode?
> 
> As above, for RT I thought that extsize alignment was guaranteed and we
> don't need to bother with XFS_DIFLAG2_FORCEALIGN there.
> 
> > 
> > i.e. the first thing this code should do is:
> > 
> > 	*unit_min_fsb = 0;
> > 	*unit_max_fsb = 0;
> > 	if (!(ip->i_diflags2 & XFS_DIFLAG2_FORCEALIGN))
> > 		return;
> > 
> > Then we can check device support:
> > 
> > 	if (!buftarg->bt_atomic_write_max)
> > 		return;
> > 
> > Then we can check for extent size hints. If that's not set:
> > 
> > 	align = xfs_get_extsz_hint(ip);
> > 	if (align <= 1) {
> > 		unit_min_fsb = 1;
> > 		unit_max_fsb = 1;
> > 		return;
> > 	}
> > 
> > And finally, if there is an extent size hint, we can return that.
> > 
> > > +	if (atomic_write_unit_max == 0) {
> > > +		*unit_min_fsb = 0;
> > > +		*unit_max_fsb = 0;
> > > +	} else if (atomic_write_unit_min == 0) {
> > > +		*unit_min_fsb = 1;
> > > +		*unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
> > > +					align);
> > 
> > Why is it valid for a device to have a zero minimum size?
> 
> It's not valid. The local variables atomic_write_unit_max and
> atomic_write_unit_min are in units of FS blocks here - maybe I should
> change the names.

Yes, please, the variable names throughout are long enough to make for
ugly code.

	/* "awu" = atomic write unit */
	xfs_filblks_t	awu_min_fsb, align;
	u64		awu_min_bytes;

	awu_min_bytes = queue_atomic_write_unit_min_bytes(bdev->bd_queue);
	if (!awu_min_bytes) {
		/* Not supported at all. */
		*unit_min_fsb = 0;
		return;
	}

	awu_min_fsb = XFS_B_TO_FSBT(mp, awu_min_bytes);
	if (awu_min_fsb < 1) {
		/* Don't allow smaller than fsb atomic writes */
		*unit_min_fsb = 1;
		return;
	}

	*unit_min_fsb = min(awu_min_fsb, align);

--D

> The idea is that for simplicity we won't support atomic writes for XFS of
> size less than 1x FS block initially. So if the bdev has - for example -
> queue_atomic_write_unit_min_bytes() == 2K and
> queue_atomic_write_unit_max_bytes() == 64K, then (ignoring alignment) we say
> that unit_min_fsb = 1 and unit_max_fsb = 16 (for 4K FS blocks).
> 
> > If it can
> > set a maximum, it should -always- set a minimum size as logical
> > sector size is a valid lower bound, yes?
> > 
> > > +	} else {
> > > +		*unit_min_fsb = min_t(xfs_filblks_t, atomic_write_unit_min,
> > > +					align);
> > > +		*unit_max_fsb = min_t(xfs_filblks_t, atomic_write_unit_max,
> > > +					align);
> > > +	}
> > 
> > Nothing here guarantees the power-of-2 sizes that the RWF_ATOMIC
> > user interface requires....
> 
> atomic_write_unit_min and atomic_write_unit_max will be powers-of-2 (or 0).
> 
> But, you are right, we don't check align is a power-of-2 - that can be
> added.
> 
> > 
> > It also doesn't check that the extent size hint is aligned with
> > atomic write units.
> 
> If we add a check for align being a power-of-2 and atomic_write_unit_min and
> atomic_write_unit_max are already powers-of-2, then this can be relied on,
> right?
> 
> > 
> > It also doesn't check either against stripe unit alignment....
> 
> As mentioned in earlier response, this could be enforced.
> 
> > 
> > > +}
> > > +
> > >   STATIC int
> > >   xfs_vn_getattr(
> > >   	struct mnt_idmap	*idmap,
> > > @@ -614,6 +654,17 @@ xfs_vn_getattr(
> > >   			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
> > >   			stat->dio_offset_align = bdev_logical_block_size(bdev);
> > >   		}
> > > +		if (request_mask & STATX_WRITE_ATOMIC) {
> > > +			xfs_filblks_t unit_min_fsb, unit_max_fsb;
> > > +
> > > +			xfs_ip_atomic_write_attr(ip, &unit_min_fsb,
> > > +				&unit_max_fsb);
> > > +			stat->atomic_write_unit_min = XFS_FSB_TO_B(mp, unit_min_fsb);
> > > +			stat->atomic_write_unit_max = XFS_FSB_TO_B(mp, unit_max_fsb);
> > 
> > That's just nasty. We pull byte units from the bdev, convert them to
> > fsb to round them, then convert them back to byte counts. We should
> > be doing all the work in one set of units....
> 
> ok, agreed. bytes is probably best.
> 
> > 
> > > +			stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
> > > +			stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
> > > +			stat->result_mask |= STATX_WRITE_ATOMIC;
> > 
> > If the min/max are zero, then atomic writes are not supported on
> > this inode, right? Why would we set any of the attributes or result
> > mask to say it is supported on this file?
> 
> ok, we won't set STATX_ATTR_WRITE_ATOMIC when min/max are zero
> 
> Thanks,
> John
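
The unit arithmetic discussed above (a 2 KiB device minimum and a 64 KiB
device maximum on a 4 KiB-block filesystem giving unit_min_fsb = 1 and
unit_max_fsb = 16) can be sketched in standalone form as below. The
helper name is made up and the code only illustrates the clamping being
discussed; it is not code from the series.

	#include <inttypes.h>
	#include <stdio.h>

	/* Illustration only: clamp bdev atomic write limits to fs-block units. */
	static void awu_to_fsb(uint64_t min_bytes, uint64_t max_bytes,
			       uint64_t blksz, uint64_t align_fsb,
			       uint64_t *min_fsb, uint64_t *max_fsb)
	{
		*min_fsb = 0;
		*max_fsb = 0;
		if (!max_bytes)
			return;		/* no atomic write support at all */

		/* never advertise an atomic unit smaller than one fs block */
		*min_fsb = min_bytes / blksz ? min_bytes / blksz : 1;

		/* clamp the maximum to the guaranteed extent alignment */
		*max_fsb = max_bytes / blksz;
		if (*max_fsb > align_fsb)
			*max_fsb = align_fsb;
	}

	int main(void)
	{
		uint64_t min_fsb, max_fsb;

		/* 2 KiB min, 64 KiB max, 4 KiB blocks, 16-block alignment */
		awu_to_fsb(2048, 65536, 4096, 16, &min_fsb, &max_fsb);
		printf("unit_min_fsb=%" PRIu64 " unit_max_fsb=%" PRIu64 "\n",
		       min_fsb, max_fsb);
		return 0;
	}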

* Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits
  2023-09-29 10:27 ` [PATCH 01/21] block: Add atomic write operations to request_queue limits John Garry
@ 2023-10-03 16:40   ` Bart Van Assche
  2023-10-04  3:00     ` Martin K. Petersen
  2023-11-09 15:10   ` Christoph Hellwig
  1 sibling, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-03 16:40 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, Himanshu Madhani

On 9/29/23 03:27, John Garry wrote:
> +What:		/sys/block/<disk>/atomic_write_unit_min_bytes
> +Date:		May 2023
> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> +Description:
> +		[RO] This parameter specifies the smallest block which can
> +		be written atomically with an atomic write operation. All
> +		atomic write operations must begin at an
> +		atomic_write_unit_min boundary and must be multiples of
> +		atomic_write_unit_min. This value must be a power-of-two.

I have two comments about these descriptions:
- Referring to "atomic writes" only is not sufficient. It should be
   explained that in this context "atomic" means "indivisible" only and
   also that there are no guarantees that the data written by an atomic
   write will survive a power failure. See also the difference between
   the NVMe parameters AWUN and AWUPF.
- atomic_write_unit_min_bytes will always be the logical block size so I
   don't think it is useful to make the block layer track this value nor
   to export this value through sysfs.

Thanks,

Bart.

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-03  8:37         ` John Garry
@ 2023-10-03 16:45           ` Bart Van Assche
  2023-10-04  9:14             ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-03 16:45 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 10/3/23 01:37, John Garry wrote:
> I don't think that is_power_of_2(write length) is specific to XFS.

I think this is specific to XFS. Can you show me the F2FS code that 
restricts the length of an atomic write to a power of two? I haven't 
found it. The only power-of-two check that I found in F2FS is the 
following (maybe I overlooked something):

$ git grep -nH is_power fs/f2fs
fs/f2fs/super.c:3914:	if (!is_power_of_2(zone_sectors)) {

Thanks,

Bart.

* Re: [PATCH 16/21] fs: iomap: Atomic write support
  2023-10-03  4:24   ` Dave Chinner
  2023-10-03 12:55     ` John Garry
@ 2023-10-03 16:47     ` Darrick J. Wong
  2023-10-04  1:16       ` Dave Chinner
  2023-10-24 12:59     ` John Garry
  2 siblings, 1 reply; 124+ messages in thread
From: Darrick J. Wong @ 2023-10-03 16:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	viro, brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Tue, Oct 03, 2023 at 03:24:23PM +1100, Dave Chinner wrote:
> On Fri, Sep 29, 2023 at 10:27:21AM +0000, John Garry wrote:
> > Add flag IOMAP_ATOMIC_WRITE to indicate to the FS that an atomic write
> > bio is being created and all the rules there need to be followed.
> > 
> > It is the task of the FS iomap iter callbacks to ensure that the mapping
> > created adheres to those rules, like size is power-of-2, is at a
> > naturally-aligned offset, etc.
> 
> The mapping being returned by the filesystem can span a much greater
> range than the actual IO needs - the iomap itself is not guaranteed
> to be aligned to anything in particular, but the IO location within
> that map can still conform to atomic IO constraints. See how
> iomap_sector() calculates the actual LBA address of the IO from
> the iomap and the current file position the IO is being done at.
> 
> Hence I think saying "the filesystem should make sure all IO
> alignment adheres to atomic IO rules" is probably wrong. The iomap
> layer doesn't care what the filesystem does, all it cares about is
> whether the IO can be done given the extent map that was returned to
> it.
> 
> Indeed, iomap_dio_bio_iter() is doing all these alignment checks for
> normal DIO reads and writes which must be logical block sized
> aligned. i.e. this check:
> 
>         if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
>             !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
>                 return -EINVAL;
> 
> Hence I think that atomic IO units, which are similarly defined by
> the bdev, should be checked at the iomap layer, too. e.g, by
> following up with:
> 
> 	if ((dio->iocb->ki_flags & IOCB_ATOMIC) &&
> 	    (((pos | length) & (bdev_atomic_unit_min(iomap->bdev) - 1)) ||
> 	     !bdev_iter_is_atomic_aligned(iomap->bdev, dio->submit.iter)))
> 		return -EINVAL;
> 
> At this point, filesystems don't really need to know anything about
> atomic IO - if they've allocated a large contiguous extent (e.g. via
> fallocate()), then RWF_ATOMIC will just work for the cases where the
> block device supports it...
> 
> This then means that stuff like XFS extent size hints only need to
> check when the hint is set that it is aligned to the underlying
> device atomic IO constraints. Then when it sees the IOMAP_ATOMIC
> modifier, it can fail allocation if it can't get extent size hint
> aligned allocation.
> 
> IOWs, I'm starting to think this doesn't need any change to the
> on-disk format for XFS - it can be driven entirely through two
> dynamic mechanisms:
> 
> 1. (IOMAP_WRITE | IOMAP_ATOMIC) requests from the direct IO layer
> which causes mapping/allocation to fail if it can't allocate (or
> map) atomic IO compatible extents for the IO.
> 
> 2. FALLOC_FL_ATOMIC preallocation flag modifier to tell fallocate()
> to force alignment of all preallocated extents to atomic IO
> constraints.

Ugh, let's not relitigate problems that you (Dave) and I have already
solved.

Back in 2018, our internal proto-users of pmem asked for aligned
allocations so they could use PMD mappings to reduce TLB pressure.  At
the time, you and I talked on IRC about whether that should be done via
fallocate flag or setting extszinherit+sunit at mkfs time.  We decided
against adding fallocate flags because linux-api bikeshed hell.

Ever since, we've been shipping UEK with a mkfs.xmem script that
automates computing the mkfs.xfs geometry CLI options.  It works,
mostly, except for the unaligned allocations that one gets when the free
space gets fragmented.  The xfsprogs side of the forcealign patchset
moves most of the mkfs.xmem cli option setting logic into mkfs itself,
and the kernel side shuts off the lowspace allocator to fix the
fragmentation problem.

I'd rather fix the remaining quirks and not reinvent solved solutions,
as popular as that is in programming circles.

Why is mandatory allocation alignment for atomic writes different?
Forcealign solves the problem for NVME/SCSI AWU and pmem PMD in the same
way with the same control knobs for sysadmins.  I don't want to have
totally separate playbooks for accomplishing nearly the same things.

I don't like encoding hardware details in the fallocate uapi either.
That implies adding FALLOC_FL_HUGEPAGE for pmem, and possibly
FALLOC_FL_{SUNIT,SWIDTH} for users with RAIDs.

> This doesn't require extent size hints at all. The filesystem can
> query the bdev at mount time, store the min/max atomic write sizes,
> and then use them for all requests that have _ATOMIC modifiers set
> on them.
> 
> With iomap doing the same "get the atomic constraints from the bdev"
> style lookups for per-IO file offset and size checking, I don't
> think we actually need extent size hints or an on-disk flag to force
> extent size hint alignment.
> 
> That doesn't mean extent size hints can't be used - it just means
> that extent size hints have to be constrained to being aligned to
> atomic IOs (e.g. extent size hint must be an integer multiple of the
> max atomic IO size). This then acts as a modifier for _ATOMIC
> context allocations, much like it is a modifier for normal
> allocations now.

(One behavior change that comes with FORCEALIGN is that without it,
extent size hints affect only the alignment of the file range mappings.
With FORCEALIGN, the space allocation itself *and* the mapping are
aligned.)

The one big downside of FORCEALIGN is that the extent size hint can
become misaligned with the AWU (or pagetable) geometry if the fs is
moved to a different computing environment.  I prefer not to couple the
interface to the hardware because that leaves open the possibility for
users to discover more use cases.

> 
> > In iomap_dio_bio_iter(), ensure that for a non-dsync iocb that the mapping
> > is not dirty nor unmapped.
> >
> > A write should only produce a single bio, so error when it doesn't.
> 
> I comment on both these things below.
> 
> > 
> > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > ---
> >  fs/iomap/direct-io.c  | 26 ++++++++++++++++++++++++--
> >  fs/iomap/trace.h      |  3 ++-
> >  include/linux/iomap.h |  1 +
> >  3 files changed, 27 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index bcd3f8cf5ea4..6ef25e26f1a1 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -275,10 +275,11 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> >  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> >  		struct iomap_dio *dio)
> >  {
> > +	bool atomic_write = iter->flags & IOMAP_ATOMIC_WRITE;
> >  	const struct iomap *iomap = &iter->iomap;
> >  	struct inode *inode = iter->inode;
> >  	unsigned int fs_block_size = i_blocksize(inode), pad;
> > -	loff_t length = iomap_length(iter);
> > +	const loff_t length = iomap_length(iter);
> >  	loff_t pos = iter->pos;
> >  	blk_opf_t bio_opf;
> >  	struct bio *bio;
> > @@ -292,6 +293,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> >  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> >  		return -EINVAL;
> >  
> > +	if (atomic_write && !iocb_is_dsync(dio->iocb)) {
> > +		if (iomap->flags & IOMAP_F_DIRTY)
> > +			return -EIO;
> > +		if (iomap->type != IOMAP_MAPPED)
> > +			return -EIO;
> > +	}
> 
> How do we get here without space having been allocated for the
> write?
> 
> Perhaps what this is trying to do is make RWF_ATOMIC only be valid
> into written space? I mean, this will fail with preallocated space
> (IOMAP_UNWRITTEN) even though we still have exactly the RWF_ATOMIC
> all-or-nothing behaviour guaranteed after a crash because of journal
> recovery behaviour. i.e. if the unwritten conversion gets written to
> the journal, the data will be there. If it isn't written to the
> journal, then the space remains unwritten and there's no data across
> that entire range....
> 
> So I'm not really sure that either of these checks are valid or why
> they are actually needed....

This requires O_DSYNC (or RWF_DSYNC) for atomic writes to unwritten or
COW space.  We want failures in forcing the log transactions for the
endio processing to be reported to the pwrite caller as EIO, right?

--D

> > +
> >  	if (iomap->type == IOMAP_UNWRITTEN) {
> >  		dio->flags |= IOMAP_DIO_UNWRITTEN;
> >  		need_zeroout = true;
> > @@ -381,6 +389,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> >  					  GFP_KERNEL);
> >  		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
> >  		bio->bi_ioprio = dio->iocb->ki_ioprio;
> > +		if (atomic_write)
> > +			bio->bi_opf |= REQ_ATOMIC;
> > +
> >  		bio->bi_private = dio;
> >  		bio->bi_end_io = iomap_dio_bio_end_io;
> >  
> > @@ -397,6 +408,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> >  		}
> >  
> >  		n = bio->bi_iter.bi_size;
> > +		if (atomic_write && n != length) {
> > +			/* This bio should have covered the complete length */
> > +			ret = -EINVAL;
> > +			bio_put(bio);
> > +			goto out;
> 
> > Why? The actual bio can be any length that meets the alignment
> criteria between min and max, yes? So it's valid to split a
> RWF_ATOMIC write request up into multiple min unit sized bios, is it
> not? I mean, that's the whole point of the min/max unit setup, isn't
> it? That the max sized write only guarantees that it will tear at
> min unit boundaries, not within those min unit boundaries? If
> I've understood this correctly, then why does this "single bio for
> large atomic write" constraint need to exist?
> 
> 
> > +		}
> >  		if (dio->flags & IOMAP_DIO_WRITE) {
> >  			task_io_account_write(n);
> >  		} else {
> > @@ -554,6 +571,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  	struct blk_plug plug;
> >  	struct iomap_dio *dio;
> >  	loff_t ret = 0;
> > +	bool is_read = iov_iter_rw(iter) == READ;
> > +	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
> 
> This does not need to be done here, because....
> 
> >  
> >  	trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
> >  
> > @@ -579,7 +598,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  	if (iocb->ki_flags & IOCB_NOWAIT)
> >  		iomi.flags |= IOMAP_NOWAIT;
> >  
> > -	if (iov_iter_rw(iter) == READ) {
> > +	if (is_read) {
> >  		/* reads can always complete inline */
> >  		dio->flags |= IOMAP_DIO_INLINE_COMP;
> >  
> > @@ -605,6 +624,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> >  		if (iocb->ki_flags & IOCB_DIO_CALLER_COMP)
> >  			dio->flags |= IOMAP_DIO_CALLER_COMP;
> >  
> > +		if (atomic_write)
> > +			iomi.flags |= IOMAP_ATOMIC_WRITE;
> 
> .... it is only checked once in the write path, so
> 
> 		if (iocb->ki_flags & IOCB_ATOMIC)
> 			iomi.flags |= IOMAP_ATOMIC;
> 
> > +
> >  		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
> >  			ret = -EAGAIN;
> >  			if (iomi.pos >= dio->i_size ||
> > diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> > index c16fd55f5595..f9932733c180 100644
> > --- a/fs/iomap/trace.h
> > +++ b/fs/iomap/trace.h
> > @@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
> >  	{ IOMAP_REPORT,		"REPORT" }, \
> >  	{ IOMAP_FAULT,		"FAULT" }, \
> >  	{ IOMAP_DIRECT,		"DIRECT" }, \
> > -	{ IOMAP_NOWAIT,		"NOWAIT" }
> > +	{ IOMAP_NOWAIT,		"NOWAIT" }, \
> > +	{ IOMAP_ATOMIC_WRITE,	"ATOMIC" }
> 
> We already have an IOMAP_WRITE flag, so IOMAP_ATOMIC is the modifier
> for the write IO behaviour (like NOWAIT), not a replacement write
> flag.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
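
As a rough illustration of the knob being debated here, an application
or admin tool could set the extent size hint plus a forcealign flag
through the existing fsxattr interface along the lines below. The
userspace flag name and value are placeholders invented for this sketch;
only the on-disk XFS_DIFLAG2_FORCEALIGN name appears in the series.

	#include <fcntl.h>
	#include <linux/fs.h>
	#include <sys/ioctl.h>

	/* Placeholder only: no userspace forcealign flag is defined yet. */
	#ifndef FS_XFLAG_FORCEALIGN
	#define FS_XFLAG_FORCEALIGN	0x00020000
	#endif

	int main(void)
	{
		struct fsxattr fsx;
		int fd = open("testfile", O_RDWR | O_CREAT, 0644);

		if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx))
			return 1;

		/* extent size hint = the application block size (16 KiB here) */
		fsx.fsx_extsize = 16 * 1024;
		fsx.fsx_xflags |= FS_XFLAG_EXTSIZE | FS_XFLAG_FORCEALIGN;

		return ioctl(fd, FS_IOC_FSSETXATTR, &fsx) ? 1 : 0;
	}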

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-03  0:48         ` Martin K. Petersen
@ 2023-10-03 16:55           ` Bart Van Assche
  2023-10-04  2:53             ` Martin K. Petersen
  0 siblings, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-03 16:55 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 10/2/23 17:48, Martin K. Petersen wrote:
> 
> Bart,
> 
>> Are there any SCSI devices that we care about that report an ATOMIC
>> TRANSFER LENGTH GRANULARITY that is larger than a single logical
>> block?
> 
> Yes.
> 
> Note that the code path used inside a storage device to guarantee atomicity
> of an entire I/O may be substantially different from the code path which
> only offers an incremental guarantee at a single logical or physical
> block level (to the extent that those guarantees are offered at all, but
> that's a different kettle of fish).
> 
>> I'm wondering whether we really have to support such devices.
> 
> Yes.

Hi Martin,

I'm still wondering whether we really should support storage devices
that report an ATOMIC TRANSFER LENGTH GRANULARITY that is larger than
the logical block size. Is my understanding correct that the NVMe
specification makes it mandatory to support single logical block atomic
writes since the smallest value that can be reported as the AWUN 
parameter is one logical block because this parameter is a 0's based
value? Is my understanding correct that SCSI devices that report an 
ATOMIC TRANSFER LENGTH GRANULARITY that is larger than the logical block
size are not able to support the NVMe protocol?

 From the NVMe specification section about the identify controller 
response: "Atomic Write Unit Normal (AWUN): This field indicates the 
size of the write operation guaranteed to be written atomically to the 
NVM across all namespaces with any supported namespace format during 
normal operation. This field is specified in logical blocks and is a 0’s 
based value."

Thanks,

Bart.



* Re: [PATCH 16/21] fs: iomap: Atomic write support
  2023-10-03 16:47     ` Darrick J. Wong
@ 2023-10-04  1:16       ` Dave Chinner
  0 siblings, 0 replies; 124+ messages in thread
From: Dave Chinner @ 2023-10-04  1:16 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	viro, brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Tue, Oct 03, 2023 at 09:47:49AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 03, 2023 at 03:24:23PM +1100, Dave Chinner wrote:
> > On Fri, Sep 29, 2023 at 10:27:21AM +0000, John Garry wrote:
> > > Add flag IOMAP_ATOMIC_WRITE to indicate to the FS that an atomic write
> > > bio is being created and all the rules there need to be followed.
> > > 
> > > It is the task of the FS iomap iter callbacks to ensure that the mapping
> > > created adheres to those rules, like size is power-of-2, is at a
> > > naturally-aligned offset, etc.
> > 
> > The mapping being returned by the filesystem can span a much greater
> > range than the actual IO needs - the iomap itself is not guaranteed
> > to be aligned to anything in particular, but the IO location within
> > that map can still conform to atomic IO constraints. See how
> > iomap_sector() calculates the actual LBA address of the IO from
> > the iomap and the current file position the IO is being done at.
> > 
> > Hence I think saying "the filesystem should make sure all IO
> > alignment adheres to atomic IO rules" is probably wrong. The iomap
> > layer doesn't care what the filesystem does, all it cares about is
> > whether the IO can be done given the extent map that was returned to
> > it.
> > 
> > Indeed, iomap_dio_bio_iter() is doing all these alignment checks for
> > normal DIO reads and writes which must be logical block sized
> > aligned. i.e. this check:
> > 
> >         if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> >             !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> >                 return -EINVAL;
> > 
> > Hence I think that atomic IO units, which are similarly defined by
> > the bdev, should be checked at the iomap layer, too. e.g, by
> > following up with:
> > 
> > 	if ((dio->iocb->ki_flags & IOCB_ATOMIC) &&
> > 	    (((pos | length) & (bdev_atomic_unit_min(iomap->bdev) - 1)) ||
> > 	     !bdev_iter_is_atomic_aligned(iomap->bdev, dio->submit.iter)))
> > 		return -EINVAL;
> > 
> > At this point, filesystems don't really need to know anything about
> > atomic IO - if they've allocated a large contiguous extent (e.g. via
> > fallocate()), then RWF_ATOMIC will just work for the cases where the
> > block device supports it...
> > 
> > This then means that stuff like XFS extent size hints only need to
> > check when the hint is set that it is aligned to the underlying
> > device atomic IO constraints. Then when it sees the IOMAP_ATOMIC
> > modifier, it can fail allocation if it can't get extent size hint
> > aligned allocation.
> > 
> > IOWs, I'm starting to think this doesn't need any change to the
> > on-disk format for XFS - it can be driven entirely through two
> > dynamic mechanisms:
> > 
> > 1. (IOMAP_WRITE | IOMAP_ATOMIC) requests from the direct IO layer
> > which causes mapping/allocation to fail if it can't allocate (or
> > map) atomic IO compatible extents for the IO.
> > 
> > 2. FALLOC_FL_ATOMIC preallocation flag modifier to tell fallocate()
> > to force alignment of all preallocated extents to atomic IO
> > constraints.
> 
> Ugh, let's not relitigate problems that you (Dave) and I have already
> solved.
> 
> Back in 2018, our internal proto-users of pmem asked for aligned
> allocations so they could use PMD mappings to reduce TLB pressure.  At
> the time, you and I talked on IRC about whether that should be done via
> fallocate flag or setting extszinherit+sunit at mkfs time.  We decided
> against adding fallocate flags because linux-api bikeshed hell.

Ok, but I don't see how I'm supposed to correlate a discussion from
5 years ago on a different topic with this one. I can only comment
on what I see in front of me. And what is in front of me is
something that doesn't need on-disk changes to implement....

> Ever since, we've been shipping UEK with a mkfs.xmem script that
> automates computing the mkfs.xfs geometry CLI options.  It works,
> mostly, except for the unaligned allocations that one gets when the free
> space gets fragmented.  The xfsprogs side of the forcealign patchset
> moves most of the mkfs.xmem cli option setting logic into mkfs itself,
> and the kernel side shuts off the lowspace allocator to fix the
> fragmentation problem.
> 
> I'd rather fix the remaining quirks and not reinvent solved solutions,
> as popular as that is in programming circles.
> 
> Why is mandatory allocation alignment for atomic writes different?
> Forcealign solves the problem for NVME/SCSI AWU and pmem PMD in the same
> way with the same control knobs for sysadmins.  I don't want to have
> totally separate playbooks for accomplishing nearly the same things.

Which is fair enough, but that's not the context under which this
has been presented.

Can we please get the forced-align stuff separated from atomic write
support - the atomic write requirements completely overwhelm the small
amount of change needed to support physical file offset
alignment....

> I don't like encoding hardware details in the fallocate uapi either.
> That implies adding FALLOC_FL_HUGEPAGE for pmem, and possibly
> FALLOC_FL_{SUNIT,SWIDTH} for users with RAIDs.

No, that's reading way too much into it. FALLOC_FL_ATOMIC would mean
"ensure preallocation is valid for RWF_ATOMIC based IO contrainsts",
nothing more, nothing less. This isn't -hardware specific-, it's
simply a flag to tell the filesystem to align file offsets to
physical storage constraints so the allocated space works
appropriately for a specific IO API.

IOWs, it is little different from the FALLOC_FL_NOHIDE_STALE flag
for modifying fallocate() behaviour...

> > This doesn't require extent size hints at all. The filesystem can
> > query the bdev at mount time, store the min/max atomic write sizes,
> > and then use them for all requests that have _ATOMIC modifiers set
> > on them.
> > 
> > With iomap doing the same "get the atomic constraints from the bdev"
> > style lookups for per-IO file offset and size checking, I don't
> > think we actually need extent size hints or an on-disk flag to force
> > extent size hint alignment.
> > 
> > That doesn't mean extent size hints can't be used - it just means
> > that extent size hints have to be constrained to being aligned to
> > atomic IOs (e.g. extent size hint must be an integer multiple of the
> > max atomic IO size). This then acts as a modifier for _ATOMIC
> > context allocations, much like it is a modifier for normal
> > allocations now.
> 
> (One behavior change that comes with FORCEALIGN is that without it,
> extent size hints affect only the alignment of the file range mappings.
> With FORCEALIGN, the space allocation itself *and* the mapping are
> aligned.)
> 
> The one big downside of FORCEALIGN is that the extent size hint can
> become misaligned with the AWU (or pagetable) geometry if the fs is
> moved to a different computing environment.  I prefer not to couple the
> interface to the hardware because that leaves open the possibility for
> users to discover more use cases.

Sure, but this isn't really a "forced" alignment. This is a feature
that is providing "file offset is physically aligned to an
underlying hardware address space" instead of doing the normal thing
of abstracting file data away from the physical layout of the
storage.

If we can have user APIs that say "file data should be physically
aligned to storage" then we don't need on-disk flags to implement
this. Extent size hints could still be used to indicate the required
alignment, but we could also pull it straight from the hardware if
those aren't set. AFAICT only fallocate() and pwritev2() need these
flags for IO, but we could add a fadvise() command to set it on a
struct file; if mmap()/madvise() is told to use hugepages we can use
PMD alignment rather than storage hardware alignment, etc.

IOWs actually having APIs that simply say "use physical offset
alignment" without actually saying exactly which hardware alignment
they want allows the filesystem to dynamically select the optimal
alignment for the given application use case rather than requiring
the admin to set up specific configuration at mkfs time....



> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >  fs/iomap/direct-io.c  | 26 ++++++++++++++++++++++++--
> > >  fs/iomap/trace.h      |  3 ++-
> > >  include/linux/iomap.h |  1 +
> > >  3 files changed, 27 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index bcd3f8cf5ea4..6ef25e26f1a1 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -275,10 +275,11 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> > >  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > >  		struct iomap_dio *dio)
> > >  {
> > > +	bool atomic_write = iter->flags & IOMAP_ATOMIC_WRITE;
> > >  	const struct iomap *iomap = &iter->iomap;
> > >  	struct inode *inode = iter->inode;
> > >  	unsigned int fs_block_size = i_blocksize(inode), pad;
> > > -	loff_t length = iomap_length(iter);
> > > +	const loff_t length = iomap_length(iter);
> > >  	loff_t pos = iter->pos;
> > >  	blk_opf_t bio_opf;
> > >  	struct bio *bio;
> > > @@ -292,6 +293,13 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > >  	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > >  		return -EINVAL;
> > >  
> > > +	if (atomic_write && !iocb_is_dsync(dio->iocb)) {
> > > +		if (iomap->flags & IOMAP_F_DIRTY)
> > > +			return -EIO;
> > > +		if (iomap->type != IOMAP_MAPPED)
> > > +			return -EIO;
> > > +	}
> > 
> > How do we get here without space having been allocated for the
> > write?
> > 
> > Perhaps what this is trying to do is make RWF_ATOMIC only be valid
> > into written space? I mean, this will fail with preallocated space
> > (IOMAP_UNWRITTEN) even though we still have exactly the RWF_ATOMIC
> > all-or-nothing behaviour guaranteed after a crash because of journal
> > recovery behaviour. i.e. if the unwritten conversion gets written to
> > the journal, the data will be there. If it isn't written to the
> > journal, then the space remains unwritten and there's no data across
> > that entire range....
> > 
> > So I'm not really sure that either of these checks are valid or why
> > they are actually needed....
> 
> This requires O_DSYNC (or RWF_DSYNC) for atomic writes to unwritten or
> COW space.

COW, maybe - I haven't thought that far through it. 

However, for unwritten extents we just don't need O_DSYNC to
guarantee all or nothing writes. The application still has to use
fdatasync() to determine if the IO succeeded, but the actual IO and
unwritten conversion transaction ordering guarantee the
"all-or-nothing" behaviour of a RWF_ATOMIC write that is not using
O_DSYNC.

i.e.  It just doesn't matter when the conversion transaction hits
the journal. If it doesn't hit the journal before the crash, the
write never happened. If it does hit the journal, then the cache
flush before the journal write ensures all the data from the
RWF_ATOMIC write is present on disk before the unwritten conversion
hits the journal.

> We want failures in forcing the log transactions for the
> endio processing to be reported to the pwrite caller as EIO, right?

A failure to force the log will result in a filesystem shutdown. It
doesn't matter if that happens during IO completion or sometime
before or during the fdatasync() call the application would still
need to use to guarantee data integrity.

RWF_ATOMIC implies FUA semantics, right? i.e. if the RWF_ATOMIC
write is a pure overwrite, there are no journal or cache flushes
needed to complete the write. If so, batching up all the metadata
updates between data integrity checkpoints can still make
performance much better.  If the filesystem flushes the journal
itself, it's no different from an application crash recovery
perspective to using RWF_DSYNC|RWF_ATOMIC and failing in the middle
of a multi-IO update....

Hence I just don't see why RWF_ATOMIC requires O_DSYNC semantics at
all; all RWF_ATOMIC provides is larger "non-tearing" IO granularity
and this doesn't change filesystem data integrity semantics at all.

-Dave.
-- 
Dave Chinner
david@fromorbit.com
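
To make the userspace side of this argument concrete, an RWF_ATOMIC
overwrite followed by an explicit fdatasync() would look roughly like
the sketch below. The RWF_ATOMIC value is assumed here since the flag
is still being proposed, and the series requires O_DIRECT.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/uio.h>
	#include <unistd.h>

	#ifndef RWF_ATOMIC
	#define RWF_ATOMIC	0x00000040	/* assumed value for the proposed flag */
	#endif

	int main(void)
	{
		struct iovec iov;
		void *buf;
		int fd = open("testfile", O_RDWR | O_DIRECT);

		if (fd < 0 || posix_memalign(&buf, 4096, 16384))
			return 1;
		memset(buf, 0xab, 16384);
		iov.iov_base = buf;
		iov.iov_len = 16384;	/* one 16 KiB application block */

		/* tear-proof write; offset is naturally aligned to the length */
		if (pwritev2(fd, &iov, 1, 0, RWF_ATOMIC) != 16384)
			return 1;

		/*
		 * Durability is a separate question from tearing: either pass
		 * RWF_DSYNC as well, or fdatasync() before relying on the data.
		 */
		return fdatasync(fd) ? 1 : 0;
	}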

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-03 16:55           ` Bart Van Assche
@ 2023-10-04  2:53             ` Martin K. Petersen
  2023-10-04 17:22               ` Bart Van Assche
  0 siblings, 1 reply; 124+ messages in thread
From: Martin K. Petersen @ 2023-10-04  2:53 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Martin K. Petersen, John Garry, axboe, kbusch, hch, sagi, jejb,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api


Bart,

> I'm still wondering whether we really should support storage devices
> that report an ATOMIC TRANSFER LENGTH GRANULARITY that is larger than
> the logical block size.

We should. The common case is that the device reports an ATOMIC TRANSFER
LENGTH GRANULARITY matching the reported physical block size. I.e. a
logical block size of 512 bytes and a physical block size of 4KB. In
that scenario a write of a single logical block would require
read-modify-write of a physical block.

> Is my understanding correct that the NVMe specification makes it
> mandatory to support single logical block atomic writes since the
> smallest value that can be reported as the AWUN parameter is one
> logical block because this parameter is a 0's based value? Is my
> understanding correct that SCSI devices that report an ATOMIC TRANSFER
> LENGTH GRANULARITY that is larger than the logical block size are not
> able to support the NVMe protocol?

That's correct. There are obviously things you can express in SCSI that
you can't in NVMe. And the other way around. Our intent is to support
both protocols.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits
  2023-10-03 16:40   ` Bart Van Assche
@ 2023-10-04  3:00     ` Martin K. Petersen
  2023-10-04 17:28       ` Bart Van Assche
  2023-10-04 21:00       ` Bart Van Assche
  0 siblings, 2 replies; 124+ messages in thread
From: Martin K. Petersen @ 2023-10-04  3:00 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api, Himanshu Madhani


Bart,

>   also that there are no guarantees that the data written by an atomic
>   write will survive a power failure. See also the difference between
>   the NVMe parameters AWUN and AWUPF.

We only care about *PF. The *N variants were cut from the same cloth as
TRIM and UNMAP.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-03 16:45           ` Bart Van Assche
@ 2023-10-04  9:14             ` John Garry
  2023-10-04 17:34               ` Bart Van Assche
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-10-04  9:14 UTC (permalink / raw)
  To: Bart Van Assche, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 03/10/2023 17:45, Bart Van Assche wrote:
> On 10/3/23 01:37, John Garry wrote:
>> I don't think that is_power_of_2(write length) is specific to XFS.
> 
> I think this is specific to XFS. Can you show me the F2FS code that 
> restricts the length of an atomic write to a power of two? I haven't 
> found it. The only power-of-two check that I found in F2FS is the 
> following (maybe I overlooked something):
> 
> $ git grep -nH is_power fs/f2fs
> fs/f2fs/super.c:3914:    if (!is_power_of_2(zone_sectors)) {

Any use cases which we know of require a power-of-2 block size.

Do you know of a requirement for other sizes? Or are you concerned that 
it is unnecessarily restrictive?

We have to deal with HW features like the atomic write boundary and FS
restrictions like extent and stripe alignment transparently; these are
almost always powers-of-2, so naturally we would want to work with
powers-of-2 for atomic write sizes.

The power-of-2 stuff could be dropped if that is what people want.
However we still want to provide a set of rules that makes the HW and
FS features mentioned above transparent to the user.

Thanks,
John
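
The rule set referred to above boils down to a check along these lines
(a standalone sketch with an invented helper name): the length must be a
power of two between the advertised minimum and maximum, and the file
offset must be naturally aligned to that length.

	#include <stdbool.h>
	#include <stdint.h>

	/* Illustration of the per-write rules discussed in this thread. */
	static bool atomic_write_ok(uint64_t pos, uint64_t len,
				    uint64_t unit_min, uint64_t unit_max)
	{
		if (len < unit_min || len > unit_max)
			return false;
		if (len & (len - 1))		/* power-of-two length */
			return false;
		if (pos & (len - 1))		/* naturally aligned offset */
			return false;
		return true;
	}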


* Re: [PATCH 21/21] nvme: Support atomic writes
       [not found]   ` <CGME20231004113943eucas1p23a51ce5ef06c36459f826101bb7b85fc@eucas1p2.samsung.com>
@ 2023-10-04 11:39     ` Pankaj Raghav
  2023-10-05 10:24       ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Pankaj Raghav @ 2023-10-04 11:39 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Alan Adamson, p.raghav

> +++ b/drivers/nvme/host/core.c
> @@ -1926,6 +1926,35 @@ static void nvme_update_disk_info(struct gendisk *disk,
>  	blk_queue_io_min(disk->queue, phys_bs);
>  	blk_queue_io_opt(disk->queue, io_opt);
>  
> +	atomic_bs = rounddown_pow_of_two(atomic_bs);
> +	if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf) {
> +		if (id->nabo) {
> +			dev_err(ns->ctrl->device, "Support atomic NABO=%x\n",
> +				id->nabo);
> +		} else {
> +			u32 boundary = 0;
> +
> +			if (le16_to_cpu(id->nabspf))
> +				boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
> +
> +			if (is_power_of_2(boundary) || !boundary) {
> +				blk_queue_atomic_write_max_bytes(disk->queue, atomic_bs);
> +				blk_queue_atomic_write_unit_min_sectors(disk->queue, 1);
> +				blk_queue_atomic_write_unit_max_sectors(disk->queue,
> +									atomic_bs / bs);
blk_queue_atomic_write_unit_[min|max]_sectors expects sectors (512-byte
units) as input, but no conversion from the device logical block size to
sectors is done here.
> +				blk_queue_atomic_write_boundary_bytes(disk->queue, boundary);
> +			} else {
> +				dev_err(ns->ctrl->device, "Unsupported atomic boundary=0x%x\n",
> +					boundary);
> +			}
> 
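
One possible adjustment along the lines pointed out above (illustrative
only, not the posted patch) would be to convert the byte values to
512-byte sectors before calling the _sectors helpers:

	/* bs and atomic_bs are byte counts in the hunk above */
	blk_queue_atomic_write_unit_min_sectors(disk->queue,
						bs >> SECTOR_SHIFT);
	blk_queue_atomic_write_unit_max_sectors(disk->queue,
						atomic_bs >> SECTOR_SHIFT);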

* Re: [PATCH 09/21] block: Add checks to merging of atomic writes
  2023-10-02 22:50     ` Nathan Chancellor
@ 2023-10-04 11:40       ` John Garry
  0 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-10-04 11:40 UTC (permalink / raw)
  To: Nathan Chancellor, kernel test robot
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, llvm, oe-kbuild-all,
	linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 02/10/2023 23:50, Nathan Chancellor wrote:
>>>> ld.lld: error: undefined symbol: __moddi3
>>     >>> referenced by blk-merge.c
>>     >>>               block/blk-merge.o:(ll_back_merge_fn) in archive vmlinux.a
>>     >>> referenced by blk-merge.c
>>     >>>               block/blk-merge.o:(ll_back_merge_fn) in archive vmlinux.a
>>     >>> referenced by blk-merge.c
>>     >>>               block/blk-merge.o:(bio_attempt_front_merge) in archive vmlinux.a
>>     >>> referenced 3 more times
> This does not appear to be clang specific, I can reproduce it with GCC
> 12.3.0 and the same configuration target.

Yeah, I just need to stop using the modulo operator for 64-bit values,
which I had already been advised to do :|

Thanks,
John

* Re: [PATCH 03/21] fs/bdev: Add atomic write support info to statx
  2023-10-03 15:46               ` Darrick J. Wong
@ 2023-10-04 14:19                 ` John Garry
  0 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-10-04 14:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dave Chinner, Bart Van Assche, Eric Biggers, axboe, kbusch, hch,
	sagi, jejb, martin.petersen, viro, brauner, chandan.babu,
	dchinner, linux-block, linux-kernel, linux-nvme, linux-xfs,
	linux-fsdevel, tytso, jbongio, linux-api, Prasad Singamsetty

On 03/10/2023 16:46, Darrick J. Wong wrote:
>>      stat->result_mask |= STATX_WRITE_ATOMIC;
> The result_mask (which becomes the statx stx_mask) needs to have
> STATX_WRITE_ATOMIC set any time a filesystem responds to
> STATX_WRITE_ATOMIC being set in the request_mask, even if the response
> is "not supported".
> 
> The attributes_mask also needs to have STATX_ATTR_WRITE_ATOMIC set if
> the filesystem+file can support the flag, even if it's not currently set
> for that file.  This should get turned into a generic vfs helper for the
> next fs that wants to support atomic write units:
> 
> static void generic_fill_statx_atomic_writes(struct kstat *stat,
> 		struct block_device *bdev)
> {
> 	u64 min_bytes;
> 
> 	/* Confirm that the fs driver knows about this statx request */
> 	stat->result_mask |= STATX_WRITE_ATOMIC;
> 
> 	/* Confirm that the file attribute is known to the fs. */
> 	stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
> 
> 	/* Fill out the rest of the atomic write fields if supported */
> 	min_bytes = queue_atomic_write_unit_min_bytes(bdev->bd_queue);
> 	if (min_bytes == 0)
> 		return;
> 
> 	stat->atomic_write_unit_min = min_bytes;
> 	stat->atomic_write_unit_max =
> 			queue_atomic_write_unit_max_bytes(bdev->bd_queue);
> 
> 	/* Atomic writes actually supported on this file. */
> 	stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
> }
> 
> and then:
> 
> 	if (request_mask & STATX_WRITE_ATOMIC)
> 		generic_fill_statx_atomic_writes(stat, bdev);
> 
> 

That looks sensible, but, if used by an FS, we would still need a method 
to include extra FS restrictions, like extent alignment as in 15/21.

Thanks,
John
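
For completeness, the consumer side of such a helper is a plain statx()
call. A rough userspace sketch follows; it assumes the new mask and
attribute constants and the stx_atomic_write_unit_{min,max} field names
land as proposed, and the numeric values below are guesses used only for
illustration.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/stat.h>

	#ifndef STATX_WRITE_ATOMIC
	#define STATX_WRITE_ATOMIC	0x00010000	/* guessed value */
	#define STATX_ATTR_WRITE_ATOMIC	0x00400000	/* guessed value */
	#endif

	int main(int argc, char **argv)
	{
		struct statx stx;

		if (argc < 2 || statx(AT_FDCWD, argv[1], 0,
				      STATX_BASIC_STATS | STATX_WRITE_ATOMIC, &stx))
			return 1;

		if (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC)
			printf("atomic write unit min/max: %u/%u bytes\n",
			       stx.stx_atomic_write_unit_min,
			       stx.stx_atomic_write_unit_max);
		else
			printf("atomic writes not supported on this file\n");
		return 0;
	}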

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-04  2:53             ` Martin K. Petersen
@ 2023-10-04 17:22               ` Bart Van Assche
  2023-10-04 18:17                 ` Martin K. Petersen
  0 siblings, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-04 17:22 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 10/3/23 19:53, Martin K. Petersen wrote:
> 
> Bart,
> 
>> I'm still wondering whether we really should support storage
>> devices that report an ATOMIC TRANSFER LENGTH GRANULARITY that is
>> larger than the logical block size.
> 
> We should. The common case is that the device reports an ATOMIC
> TRANSFER LENGTH GRANULARITY matching the reported physical block
> size. I.e. a logical block size of 512 bytes and a physical block
> size of 4KB. In that scenario a write of a single logical block would
> require read-modify-write of a physical block.

Block devices must serialize read-modify-write operations internally
that happen when there are multiple logical blocks per physical block.
Otherwise it is not guaranteed that a READ command returns the most
recently written data to the same LBA. I think we can ignore concurrent
and overlapping writes in this discussion since these can be considered
as bugs in host software.

In other words, also for the above example it is guaranteed that writes
of a single logical block (512 bytes) are atomic, no matter what value
is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.

>> Is my understanding correct that the NVMe specification makes it 
>> mandatory to support single logical block atomic writes since the 
>> smallest value that can be reported as the AWUN parameter is one 
>> logical block because this parameter is a 0's based value? Is my 
>> understanding correct that SCSI devices that report an ATOMIC
>> TRANSFER LENGTH GRANULARITY that is larger than the logical block
>> size are not able to support the NVMe protocol?
> 
> That's correct. There are obviously things you can express in SCSI
> that you can't in NVMe. And the other way around. Our intent is to
> support both protocols.

How about aligning the features of the two protocols as much as
possible? My understanding is that all long-term T10 contributors are
in favor of this.

Thanks,

Bart.

* Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits
  2023-10-04  3:00     ` Martin K. Petersen
@ 2023-10-04 17:28       ` Bart Van Assche
  2023-10-04 18:26         ` Martin K. Petersen
  2023-10-04 21:00       ` Bart Van Assche
  1 sibling, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-04 17:28 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Himanshu Madhani

On 10/3/23 20:00, Martin K. Petersen wrote:
> 
> Bart,
> 
>> also that there are no guarantees that the data written by an 
>> atomic write will survive a power failure. See also the difference 
>> between the NVMe parameters AWUN and AWUPF.
> 
> We only care about *PF. The *N variants were cut from the same cloth 
> as TRIM and UNMAP.

In my opinion there is a contradiction between the above reply and patch
19/21 of this series. Data written with the SCSI WRITE ATOMIC command is
not guaranteed to survive a power failure. The following quote from
SBC-5 makes this clear:

"4.29.2 Atomic write operations that do not complete

If the device server is not able to successfully complete an atomic
write operation (e.g., the command is terminated or aborted), then the
device server shall ensure that none of the LBAs specified by the atomic
write operation have been altered by any logical block data from the
atomic write operation (i.e., the specified LBAs return logical block
data as if the atomic write operation had not occurred).

If a power loss causes loss of logical block data from an atomic write
operation in a volatile write cache that has not yet been stored on the
medium, then the device server shall ensure that none of the LBAs
specified by the atomic write operation have been altered by any logical
block data from the atomic write operation (i.e., the specified LBAs
return logical block data as if the atomic write operation had not
occurred and writes from the cache to the medium preserve the specified
atomicity)."

In other words, if a power failure occurs, SCSI devices are allowed to
discard the data written with a WRITE ATOMIC command if no SYNCHRONIZE
CACHE command has been submitted after that WRITE ATOMIC command or if
the SYNCHRONIZE CACHE command did not complete before the power failure.

Thanks,

Bart.

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-04  9:14             ` John Garry
@ 2023-10-04 17:34               ` Bart Van Assche
  2023-10-04 21:59                 ` Dave Chinner
  0 siblings, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-04 17:34 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 10/4/23 02:14, John Garry wrote:
> On 03/10/2023 17:45, Bart Van Assche wrote:
>> On 10/3/23 01:37, John Garry wrote:
>>> I don't think that is_power_of_2(write length) is specific to XFS.
>>
>> I think this is specific to XFS. Can you show me the F2FS code that 
>> restricts the length of an atomic write to a power of two? I haven't 
>> found it. The only power-of-two check that I found in F2FS is the 
>> following (maybe I overlooked something):
>>
>> $ git grep -nH is_power fs/f2fs
>> fs/f2fs/super.c:3914:    if (!is_power_of_2(zone_sectors)) {
> 
> Any use cases which we know of require a power-of-2 block size.
> 
> Do you know of a requirement for other sizes? Or are you concerned that 
> it is unnecessarily restrictive?
> 
> We have to deal with HW features like the atomic write boundary and FS
> restrictions like extent and stripe alignment transparently; these are
> almost always powers-of-2, so naturally we would want to work with
> powers-of-2 for atomic write sizes.
> 
> The power-of-2 stuff could be dropped if that is what people want.
> However we still want to provide a set of rules that makes the HW and
> FS features mentioned above transparent to the user.

Hi John,

My concern is that the power-of-2 requirements are only needed for
traditional filesystems and not for log-structured filesystems (BTRFS,
F2FS, BCACHEFS).

What I'd like to see is that each filesystem declares its atomic write
requirements (in struct address_space_operations?) and that
blkdev_atomic_write_valid() checks the filesystem-specific atomic write
requirements.

Thanks,

Bart.

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-04 17:22               ` Bart Van Assche
@ 2023-10-04 18:17                 ` Martin K. Petersen
  2023-10-05 17:10                   ` Bart Van Assche
  0 siblings, 1 reply; 124+ messages in thread
From: Martin K. Petersen @ 2023-10-04 18:17 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Martin K. Petersen, John Garry, axboe, kbusch, hch, sagi, jejb,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api


Hi Bart!

> In other words, also for the above example it is guaranteed that
> writes of a single logical block (512 bytes) are atomic, no matter
> what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.

There is no formal guarantee that a disk drive sector read-modify-write
operation results in a readable sector after a power failure. We have
definitely seen blocks being mangled in the field.

Wrt. supporting SCSI atomic block operations, the device rejects the
WRITE ATOMIC(16) command if you attempt to use a transfer length smaller
than the reported granularity. If we want to support WRITE ATOMIC(16) we
have to abide by the values reported by the device. It is not optional.

Besides, the whole point of this patch set is to increase the
"observable atomic block size" beyond the physical block size to
facilitate applications that prefer to use blocks in the 8-64KB range.
IOW, using the logical block size is not particularly interesting. The
objective is to prevent tearing of much larger blocks.

> How about aligning the features of the two protocols as much as
> possible? My understanding is that all long-term T10 contributors are
> all in favor of this.

That is exactly what this patch set does. Out of the 5-6 different
"atomic" modes of operation permitted by SCSI and NVMe, our exposed
semantics are carefully chosen to permit all compliant devices to be
used, based on only two reported queue limits (FWIW, we started with way
more than that; I believe that complexity was part of the first RFC we
posted). This series hides most of the complexity of the various
unfortunate protocol quirks behind a simple interface: your tear-proof
writes can't be smaller than X bytes or larger than Y bytes, and they
must be naturally aligned. This simplified things substantially from an
application perspective.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits
  2023-10-04 17:28       ` Bart Van Assche
@ 2023-10-04 18:26         ` Martin K. Petersen
  0 siblings, 0 replies; 124+ messages in thread
From: Martin K. Petersen @ 2023-10-04 18:26 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Martin K. Petersen, John Garry, axboe, kbusch, hch, sagi, jejb,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api, Himanshu Madhani


Bart,

> In my opinion there is a contradiction between the above reply and
> patch 19/21 of this series. Data written with the SCSI WRITE ATOMIC
> command is not guaranteed to survive a power failure.

That is not the intent. The intent is to ensure that for any given
application block (say 16KB), the application block on media will
contain either 100% old data or 100% new data. Always.

If a storage device offers no such guarantee across a power failure,
then it is not suitable for use by applications which do not tolerate
torn writes. That is why the writes-are-atomic-unless-there's-a-problem
variant of the values reports in NVMe are of no interest.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits
  2023-10-04  3:00     ` Martin K. Petersen
  2023-10-04 17:28       ` Bart Van Assche
@ 2023-10-04 21:00       ` Bart Van Assche
  2023-10-05  8:22         ` John Garry
  1 sibling, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-04 21:00 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Himanshu Madhani

On 10/3/23 20:00, Martin K. Petersen wrote:
> 
> Bart,
> 
>>    also that there are no guarantees that the data written by an atomic
>>    write will survive a power failure. See also the difference between
>>    the NVMe parameters AWUN and AWUPF.
> 
> We only care about *PF. The *N variants were cut from the same cloth as
> TRIM and UNMAP.

Hi Martin,

Has the following approach been considered? RWF_ATOMIC only guarantees 
atomicity. Persistence is not guaranteed without fsync() / fdatasync().

I think this would be more friendly towards battery-powered devices
(smartphones). On these devices it can be safe to skip fsync() / 
fdatasync() if the battery level is high enough.

Thanks,

Bart.




^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-04 17:34               ` Bart Van Assche
@ 2023-10-04 21:59                 ` Dave Chinner
  0 siblings, 0 replies; 124+ messages in thread
From: Dave Chinner @ 2023-10-04 21:59 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api

On Wed, Oct 04, 2023 at 10:34:13AM -0700, Bart Van Assche wrote:
> On 10/4/23 02:14, John Garry wrote:
> > On 03/10/2023 17:45, Bart Van Assche wrote:
> > > On 10/3/23 01:37, John Garry wrote:
> > > > I don't think that is_power_of_2(write length) is specific to XFS.
> > > 
> > > I think this is specific to XFS. Can you show me the F2FS code that
> > > restricts the length of an atomic write to a power of two? I haven't
> > > found it. The only power-of-two check that I found in F2FS is the
> > > following (maybe I overlooked something):
> > > 
> > > $ git grep -nH is_power fs/f2fs
> > > fs/f2fs/super.c:3914:    if (!is_power_of_2(zone_sectors)) {
> > 
> > Any use cases which we know of require a power-of-2 block size.
> > 
> > Do you know of a requirement for other sizes? Or are you concerned that
> > it is unnecessarily restrictive?
> > 
> > We have to deal with HW features like atomic write boundary and FS
> > restrictions like extent and stripe alignment transparently, which are
> > almost always powers-of-2, so naturally we would want to work with
> > powers-of-2 for atomic write sizes.
> > 
> > The power-of-2 stuff could be dropped if that is what people want.
> > However we still want to provide a set of rules to the user to make
> > those HW and FS features mentioned transparent to the user.
> 
> Hi John,
> 
> My concern is that the power-of-2 requirements are only needed for
> traditional filesystems and not for log-structured filesystems (BTRFS,
> F2FS, BCACHEFS).

Filesystems that support copy-on-write data (needed for arbitrary
filesystem block aligned RWF_ATOMIC support) are not necessarily log
structured. For example: XFS.

All three of the filesystems you list above still use power-of-2
block sizes for most of their metadata structures and for large data
extents. Hence once you go above a certain file size they are going
to be doing full power-of-2 block size aligned IO anyway. Hence the
constraint of atomic writes needing to be power-of-2 block size
aligned to avoid RMW cycles doesn't really change for these
filesystems.

In which case, they can just set their minimum atomic IO size to be
the same as their block size (e.g. 4kB) and set the maximum to
something they can guarantee gets COW'd in a single atomic
transaction. What the hardware can do with REQ_ATOMIC IO is
completely irrelevant at this point....

> What I'd like to see is that each filesystem declares its atomic write
> requirements (in struct address_space_operations?) and that
> blkdev_atomic_write_valid() checks the filesystem-specific atomic write
> requirements.

That seems unworkable to me - IO constraints propagate from the
bottom up, not from the top down.

Consider multi-device filesystems (btrfs and XFS), where different
devices might have different atomic write parameters.  Which
set of bdev parameters does the filesystem report to the querying
bdev?  (And doesn't that question just sound completely wrong?)

It also doesn't work for filesystems that can configure extent
allocation alignment at an individual inode level (like XFS) - what
does the filesystem report to the device when it doesn't know what
alignment constraints individual on-disk inodes might be using?

That's why statx() vectors through filesystems to allow them to set
their own parameters based on the inode statx() is being called on.
If the filesystem has a native RWF_ATOMIC implementation, it can put
its own parameters in the statx min/max atomic write size fields.
If the fs doesn't have its own native support, but can do physical
file offset/LBA alignment, then it publishes the block device atomic
support parameters or overrides them with its internal allocation
alignment constraints. If the bdev doesn't support REQ_ATOMIC, the
filesystem says "atomic writes are not supported".
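
For illustration only, that decision might look like the following rough
sketch in a filesystem's statx path. The kstat field names follow the
cover letter and the zero-means-unsupported convention is an assumption
for this example; the queue_atomic_write_unit_*_bytes() helpers are as
added by patch 01/21:

#include <linux/blkdev.h>
#include <linux/fs.h>
#include <linux/stat.h>

/* Hypothetical sketch of a filesystem with no native RWF_ATOMIC support
 * filling the statx atomic write fields: pass through the bdev limits
 * (a real fs would further restrict them by its allocation alignment),
 * or report zeroes if the bdev has no atomic write support at all. */
static void fill_atomic_write_attrs(struct kstat *stat, struct inode *inode)
{
	struct request_queue *q = bdev_get_queue(inode->i_sb->s_bdev);

	if (!queue_atomic_write_unit_max_bytes(q)) {
		stat->atomic_write_unit_min = 0;
		stat->atomic_write_unit_max = 0;
		return;
	}
	stat->atomic_write_unit_min = queue_atomic_write_unit_min_bytes(q);
	stat->atomic_write_unit_max = queue_atomic_write_unit_max_bytes(q);
}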

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits
  2023-10-04 21:00       ` Bart Van Assche
@ 2023-10-05  8:22         ` John Garry
  0 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-10-05  8:22 UTC (permalink / raw)
  To: Bart Van Assche, Martin K. Petersen
  Cc: axboe, kbusch, hch, sagi, jejb, djwong, viro, brauner,
	chandan.babu, dchinner, linux-block, linux-kernel, linux-nvme,
	linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Himanshu Madhani

On 04/10/2023 22:00, Bart Van Assche wrote:
>>
>> We only care about *PF. The *N variants were cut from the same cloth as
>> TRIM and UNMAP.
> 
> Hi Martin,
> 
> Has the following approach been considered? RWF_ATOMIC only guarantees 
> atomicity. Persistence is not guaranteed without fsync() / fdatasync().

This is the approach taken. Please consult the proposed man pages, where 
we say that persistence is not guaranteed without 
O_SYNC/O_DSYNC/fsync()/fdatasync().

The only thing which RWF_ATOMIC guarantees is that the write will not be 
torn.

If you see 2.1.4.2.2 Non-volatile requirements in the NVMe spec, it 
implies that the FUA bit or a flush command is required for persistence.

In 4.29.2 Atomic write operations that do not complete in SBC-4, we are 
told that atomic writes may pend in the device volatile cache and no 
atomic write data will be written if a power failure causes loss of data 
from the write.
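
For illustration, the resulting userspace pattern might look like this
minimal sketch: a hypothetical helper assuming the RWF_ATOMIC flag from
this series, a file opened with O_DIRECT, and a buffer/offset/length
that already satisfy the statx-reported unit and alignment rules:

#define _GNU_SOURCE
#include <sys/uio.h>
#include <unistd.h>

/* Write one 16KB application block with tear protection, then make it
 * durable. RWF_ATOMIC by itself only guarantees the block is not torn. */
static int write_block_durably(int fd, const void *buf, off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = 16384 };

	if (pwritev2(fd, &iov, 1, off, RWF_ATOMIC) != 16384)
		return -1;
	return fdatasync(fd);	/* persistence needs the explicit flush */
}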

> 
> I think this would be more friendly towards battery-powered devices
> (smartphones). On these devices it can be safe to skip fsync() / 
> fdatasync() if the battery level is high enough.

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 21/21] nvme: Support atomic writes
  2023-10-04 11:39     ` Pankaj Raghav
@ 2023-10-05 10:24       ` John Garry
  2023-10-05 13:32         ` Pankaj Raghav
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-10-05 10:24 UTC (permalink / raw)
  To: Pankaj Raghav, Alan Adamson
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 04/10/2023 12:39, Pankaj Raghav wrote:
>> +++ b/drivers/nvme/host/core.c
>> @@ -1926,6 +1926,35 @@ static void nvme_update_disk_info(struct gendisk *disk,
>>   	blk_queue_io_min(disk->queue, phys_bs);
>>   	blk_queue_io_opt(disk->queue, io_opt);
>>   
>> +	atomic_bs = rounddown_pow_of_two(atomic_bs);
>> +	if (id->nsfeat & NVME_NS_FEAT_ATOMICS && id->nawupf) {
>> +		if (id->nabo) {
>> +			dev_err(ns->ctrl->device, "Support atomic NABO=%x\n",
>> +				id->nabo);
>> +		} else {
>> +			u32 boundary = 0;
>> +
>> +			if (le16_to_cpu(id->nabspf))
>> +				boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
>> +
>> +			if (is_power_of_2(boundary) || !boundary) {

note to self/Alan: boundary just needs to be a multiple of atomic write 
unit max, and not necessarily a power-of-2

>> +				blk_queue_atomic_write_max_bytes(disk->queue, atomic_bs);
>> +				blk_queue_atomic_write_unit_min_sectors(disk->queue, 1);
>> +				blk_queue_atomic_write_unit_max_sectors(disk->queue,
>> +									atomic_bs / bs);
> blk_queue_atomic_write_unit_[min| max]_sectors expects sectors (512 bytes unit)
> as input but no conversion is done here from device logical block size
> to SECTORs.

Yeah, you are right. I think that we can just use:

blk_queue_atomic_write_unit_max_sectors(disk->queue,
atomic_bs >> SECTOR_SHIFT);

Thanks,
John

>> +				blk_queue_atomic_write_boundary_bytes(disk->queue, boundary);
>> +			} else {
>> +				dev_err(ns->ctrl->device, "Unsupported atomic boundary=0x%x\n",
>> +					boundary);
>> +			}
>>


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 21/21] nvme: Support atomic writes
  2023-10-05 10:24       ` John Garry
@ 2023-10-05 13:32         ` Pankaj Raghav
  2023-10-05 15:05           ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Pankaj Raghav @ 2023-10-05 13:32 UTC (permalink / raw)
  To: John Garry, Alan Adamson
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

>>> +                blk_queue_atomic_write_max_bytes(disk->queue, atomic_bs);
>>> +                blk_queue_atomic_write_unit_min_sectors(disk->queue, 1);
>>> +                blk_queue_atomic_write_unit_max_sectors(disk->queue,
>>> +                                    atomic_bs / bs);
>> blk_queue_atomic_write_unit_[min| max]_sectors expects sectors (512 bytes unit)
>> as input but no conversion is done here from device logical block size
>> to SECTORs.
> 
> Yeah, you are right. I think that we can just use:
> 
> blk_queue_atomic_write_unit_max_sectors(disk->queue,
> atomic_bs >> SECTOR_SHIFT);
> 

Makes sense.
I still don't grok the difference between max_bytes and unit_max_sectors here.
(Maybe NVMe spec does not differentiate it?)

I assume min_sectors should be as follows instead of setting it to 1 (512 bytes)?

blk_queue_atomic_write_unit_min_sectors(disk->queue, bs >> SECTOR_SHIFT);


> Thanks,
> John
> 
>>> +                blk_queue_atomic_write_boundary_bytes(disk->queue, boundary);
>>> +            } else {
>>> +                dev_err(ns->ctrl->device, "Unsupported atomic boundary=0x%x\n",
>>> +                    boundary);
>>> +            }
>>>
> 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 21/21] nvme: Support atomic writes
  2023-10-05 13:32         ` Pankaj Raghav
@ 2023-10-05 15:05           ` John Garry
  0 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-10-05 15:05 UTC (permalink / raw)
  To: Pankaj Raghav, Alan Adamson
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 05/10/2023 14:32, Pankaj Raghav wrote:
>>> te_unit_[min| max]_sectors expects sectors (512 bytes unit)
>>> as input but no conversion is done here from device logical block size
>>> to SECTORs.
>> Yeah, you are right. I think that we can just use:
>>
>> blk_queue_atomic_write_unit_max_sectors(disk->queue,
>> atomic_bs >> SECTOR_SHIFT);
>>
> Makes sense.
> I still don't grok the difference between max_bytes and unit_max_sectors here.
> (Maybe NVMe spec does not differentiate it?)

I think that max_bytes does not need to be a power-of-2 and could be 
relaxed.

Having said that, max_bytes comes into play for merging of bios - so if 
we are in a scenario with no merging, then we may as well leave 
atomic_write_max_bytes == atomic_write_unit_max.

But let us check this proposal to relax it.

> 
> I assume min_sectors should be as follows instead of setting it to 1 (512 bytes)?
> 
> blk_queue_atomic_write_unit_min_sectors(disk->queue, bs >> SECTOR_SHIFT);

Yeah, right, we want unit_min to be the logical block size.
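
Putting the two corrections together, the limits setup in
nvme_update_disk_info() would then look something like this sketch
(same helpers as in the posted patch, surrounding checks unchanged):

	/* sketch only: unit min is one logical block, unit max is the
	 * power-of-2-rounded atomic size, both expressed in 512B sectors */
	blk_queue_atomic_write_max_bytes(disk->queue, atomic_bs);
	blk_queue_atomic_write_unit_min_sectors(disk->queue, bs >> SECTOR_SHIFT);
	blk_queue_atomic_write_unit_max_sectors(disk->queue, atomic_bs >> SECTOR_SHIFT);
	blk_queue_atomic_write_boundary_bytes(disk->queue, boundary);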

Thanks,
John



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-04 18:17                 ` Martin K. Petersen
@ 2023-10-05 17:10                   ` Bart Van Assche
  2023-10-05 22:36                     ` Dave Chinner
  0 siblings, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-05 17:10 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 10/4/23 11:17, Martin K. Petersen wrote:
> 
> Hi Bart!
> 
>> In other words, also for the above example it is guaranteed that 
>> writes of a single logical block (512 bytes) are atomic, no matter
>> what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.
> 
> There is no formal guarantee that a disk drive sector 
> read-modify-write operation results in a readable sector after a 
> power failure. We have definitely seen blocks being mangled in the 
> field.

Aren't block devices expected to use a capacitor that provides enough
power to handle power failures cleanly?

How about blacklisting block devices that mangle blocks if a power
failure occurs? I think such block devices are not compatible with
journaling filesystems nor with log-structured filesystems.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-05 17:10                   ` Bart Van Assche
@ 2023-10-05 22:36                     ` Dave Chinner
  2023-10-05 22:58                       ` Bart Van Assche
  0 siblings, 1 reply; 124+ messages in thread
From: Dave Chinner @ 2023-10-05 22:36 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Martin K. Petersen, John Garry, axboe, kbusch, hch, sagi, jejb,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api

On Thu, Oct 05, 2023 at 10:10:45AM -0700, Bart Van Assche wrote:
> On 10/4/23 11:17, Martin K. Petersen wrote:
> > 
> > Hi Bart!
> > 
> > > In other words, also for the above example it is guaranteed that
> > > writes of a single logical block (512 bytes) are atomic, no matter
> > > what value is reported as the ATOMIC TRANSFER LENGTH GRANULARITY.
> > 
> > There is no formal guarantee that a disk drive sector read-modify-write
> > operation results in a readable sector after a power failure. We have
> > definitely seen blocks being mangled in the field.
> 
> Aren't block devices expected to use a capacitor that provides enough
> power to handle power failures cleanly?

Nope.

Any block device that says it operates in writeback cache mode (i.e.
almost every single consumer SATA and NVMe drive ever made) has a
volatile write back cache and so does not provide any power fail
data integrity guarantees. Simple to check, my less-than-1-yr-old
workstation tells me:

$ lspci |grep -i nvme
03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
$ cat /sys/block/nvme*n1/queue/write_cache
write back
write back
$

That they have volatile writeback caches....

> How about blacklisting block devices that mangle blocks if a power
> failure occurs? I think such block devices are not compatible with
> journaling filesystems nor with log-structured filesystems.

Statements like this from people working on storage hardware really
worry me. It demonstrates a lack of understanding of how filesystems
actually work, not to mention the fact that this architectural
problem (i.e. handling volatile device write caches correctly) was
solved in the Linux IO stack a couple of decades ago. This isn't
even 'state of the art' knowledge - this is foundational knowledge
that everyone working on storage should know.

The tl;dr summary is that filesystems will issue a cache flush
request (REQ_PREFLUSH) and/or write-through to stable storage
semantics (REQ_FUA) for any data, metadata or journal IO that has
data integrity and/or ordering requirements associated with it. The
block layer will then do the most optimal correct thing with that
request (e.g. ignore them for IO being directed at WC disabled
devices), but it guarantees the flush/fua semantics for those IOs
will be provided by all layers in the stack right down to the
persistent storage media itself. Hence all the filesystem has to do
is get its IO and cache flush ordering correct, and everything
just works regardless of the underlying storage capabilities.
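
For anyone unfamiliar with the mechanism, a minimal sketch (not lifted
from any particular filesystem) of what such a commit write looks like
against the current bio API:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* REQ_PREFLUSH flushes the device's volatile cache before the write,
 * REQ_FUA ensures the write itself reaches stable media before the bio
 * completes; both become no-ops on devices without a volatile cache. */
static int write_commit_block(struct block_device *bdev, struct page *page,
			      sector_t sector)
{
	struct bio *bio = bio_alloc(bdev, 1,
			REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA, GFP_NOFS);
	int ret;

	bio->bi_iter.bi_sector = sector;
	__bio_add_page(bio, page, PAGE_SIZE, 0);
	ret = submit_bio_wait(bio);
	bio_put(bio);
	return ret;
}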

And, yes, any storage device with volatile caches that doesn't
implement cache flushes correctly is considered broken and will get
black listed....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-05 22:36                     ` Dave Chinner
@ 2023-10-05 22:58                       ` Bart Van Assche
  2023-10-06  4:31                         ` Dave Chinner
  0 siblings, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-05 22:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, John Garry, axboe, kbusch, hch, sagi, jejb,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api

On 10/5/23 15:36, Dave Chinner wrote:
> $ lspci |grep -i nvme
> 03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> 06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> $ cat /sys/block/nvme*n1/queue/write_cache
> write back
> write back
> $
> 
> That they have volatile writeback caches....

It seems like what I wrote has been misunderstood completely. With
"handling a power failure cleanly" I meant that power cycling a block device
does not result in read errors nor in reading data that has never been written.
Although it is hard to find information about this topic, here is what I found
online:
* About certain SSDs with power loss protection:
   https://us.transcend-info.com/embedded/technology/power-loss-protection-plp
* About another class of SSDs with power loss protection:
   https://www.kingston.com/en/blog/servers-and-data-centers/ssd-power-loss-protection
* About yet another class of SSDs with power loss protection:
   https://phisonblog.com/avoiding-ssd-data-loss-with-phisons-power-loss-protection-2/

So far I haven't found any information about hard disks and power failure
handling. What I found is that most current hard disks protect data with ECC.
The ECC mechanism should provide good protection against reading data that
has never been written. If a power failure occurs while a hard disk is writing
a physical block, can this result in a read error after power is restored? If
so, is this behavior allowed by storage standards?

Bart.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-05 22:58                       ` Bart Van Assche
@ 2023-10-06  4:31                         ` Dave Chinner
  2023-10-06 17:22                           ` Bart Van Assche
  0 siblings, 1 reply; 124+ messages in thread
From: Dave Chinner @ 2023-10-06  4:31 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Martin K. Petersen, John Garry, axboe, kbusch, hch, sagi, jejb,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api

On Thu, Oct 05, 2023 at 03:58:38PM -0700, Bart Van Assche wrote:
> On 10/5/23 15:36, Dave Chinner wrote:
> > $ lspci |grep -i nvme
> > 03:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> > 06:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
> > $ cat /sys/block/nvme*n1/queue/write_cache
> > write back
> > write back
> > $
> > 
> > That they have volatile writeback caches....
> 
> It seems like what I wrote has been misunderstood completely. With
> "handling a power failure cleanly" I meant that power cycling a block device
> does not result in read errors nor in reading data that has never been written.

Then I don't see what your concern is. 

Single sector writes are guaranteed atomic and have been for as long
as I've worked in this game. OTOH, multi-sector writes are not
guaranteed to be atomic - they can get torn on sector boundaries,
but the individual sectors within that write are guaranteed to be
all-or-nothing. 

Any hardware device that does not guarantee single sector write
atomicity (i.e. tears in the middle of a sector) is, by definition,
broken. And we all know that broken hardware means nothing in the
storage stack works as it should, so I just don't see what point you
are trying to make...

> Although it is hard to find information about this topic, here is what I found
> online:
> * About certain SSDs with power loss protection:
>   https://us.transcend-info.com/embedded/technology/power-loss-protection-plp
> * About another class of SSDs with power loss protection:
>   https://www.kingston.com/en/blog/servers-and-data-centers/ssd-power-loss-protection
> * About yet another class of SSDs with power loss protection:
>   https://phisonblog.com/avoiding-ssd-data-loss-with-phisons-power-loss-protection-2/

Yup, devices that behave as if they have non-volatile write caches.
Such devices have been around for more than 30 years, they operate
the same as devices without caches at all.

> So far I haven't found any information about hard disks and power failure
> handling. What I found is that most current hard disks protect data with ECC.
> The ECC mechanism should provide good protection against reading data that
> has never been written.  If a power failure occurs while a hard disk is writing
> a physical block, can this result in a read error after power is restored? If
> so, is this behavior allowed by storage standards?

If a power fail results in read errors from the storage media being
reported to the OS instead of the data that was present in the
sector before the power failure, then the device is broken. If there
is no data in the region being read because it has never been
written, then it should return zeros (no data) rather than stale
data or a media read error.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-06  4:31                         ` Dave Chinner
@ 2023-10-06 17:22                           ` Bart Van Assche
  2023-10-07  1:21                             ` Martin K. Petersen
  0 siblings, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-06 17:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Martin K. Petersen, John Garry, axboe, kbusch, hch, sagi, jejb,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api

On 10/5/23 21:31, Dave Chinner wrote:
> Then I don't see what your concern is.
> 
> Single sector writes are guaranteed atomic and have been for as long 
> as I've worked in this game. OTOH, multi-sector writes are not 
> guaranteed to be atomic - they can get torn on sector boundaries, but
> the individual sectors within that write are guaranteed to be 
> all-or-nothing.
> 
> Any hardware device that does not guarantee single sector write 
> atomicity (i.e. tears in the middle of a sector) is, by definition, 
> broken. And we all know that broken hardware means nothing in the 
> storage stack works as it should, so I just don't see what point you 
> are trying to make...

Do you agree that the above implies that it is not useful in patch 01/21
of this series to track atomic_write_unit_min_bytes in the block layer
nor to export this information to user space? The above implies that
this parameter will always be equal to the logical block size. Writes to
a single physical block happen atomically. If there are multiple logical
blocks per physical block, the block device must serialize 
read/modify/write cycles internally.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 18/21] scsi: sd: Support reading atomic properties from block limits VPD
  2023-10-02 11:27     ` John Garry
@ 2023-10-06 17:52       ` Bart Van Assche
  2023-10-06 23:48         ` Martin K. Petersen
  0 siblings, 1 reply; 124+ messages in thread
From: Bart Van Assche @ 2023-10-06 17:52 UTC (permalink / raw)
  To: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner
  Cc: linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api

On 10/2/23 04:27, John Garry wrote:
> On 29/09/2023 18:54, Bart Van Assche wrote:
>> On 9/29/23 03:27, John Garry wrote:
>>> +static void sd_config_atomic(struct scsi_disk *sdkp)
>>> +{
>>> +    unsigned int logical_block_size = sdkp->device->sector_size;
>>> +    struct request_queue *q = sdkp->disk->queue;
>>> +
>>> +    if (sdkp->max_atomic) {
>>
>> Please use the "return early" style here to keep the indentation
>> level in this function low.
> 
> ok, fine.
> 
>>
>>> +        unsigned int max_atomic = max_t(unsigned int,
>>> +            rounddown_pow_of_two(sdkp->max_atomic),
>>> +            rounddown_pow_of_two(sdkp->max_atomic_with_boundary));
>>> +        unsigned int unit_min = sdkp->atomic_granularity ?
>>> +            rounddown_pow_of_two(sdkp->atomic_granularity) :
>>> +            physical_block_size_sectors;
>>> +        unsigned int unit_max = max_atomic;
>>> +
>>> +        if (sdkp->max_atomic_boundary)
>>> +            unit_max = min_t(unsigned int, unit_max,
>>> +                rounddown_pow_of_two(sdkp->max_atomic_boundary));
>>
>> Why does "rounddown_pow_of_two()" occur in the above code?
> 
> I assume that you are talking about all the code above to calculate 
> atomic write values for the device.
> 
> The reason is that atomic write unit min and max are always a power-of-2 
> - see the rules described earlier - and that is why we round down to a 
> power-of-2.

 From SBC-5: "The ATOMIC ALIGNMENT field indicates the required alignment
of the starting LBA in an atomic write command. If the ATOMIC ALIGNMENT
field is set to 0000_0000h, then there is no alignment requirement for
atomic write commands.

The ATOMIC TRANSFER LENGTH GRANULARITY field indicates the minimum
transfer length for an atomic write command. Atomic write operations are
required to have a transfer length that is a multiple of the atomic
transfer length granularity. An ATOMIC TRANSFER LENGTH GRANULARITY field
set to 0000_0000h indicates that there is no atomic transfer length
granularity requirement."

I think the above means that it is wrong to round down the ATOMIC
TRANSFER LENGTH GRANULARITY or the ATOMIC BOUNDARY values.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 04/21] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support
  2023-09-29 10:27 ` [PATCH 04/21] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
@ 2023-10-06 18:15   ` Jeremy Bongio
  2023-10-09 22:02     ` Dave Chinner
  0 siblings, 1 reply; 124+ messages in thread
From: Jeremy Bongio @ 2023-10-06 18:15 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, linux-api,
	Prasad Singamsetty

What is the advantage of using write flags instead of using an atomic
open flag (O_ATOMIC)? With an open flag, write, writev, pwritev would
all be supported for atomic writes. And this would potentially require
fewer application changes to take advantage of atomic writes.

On Fri, Sep 29, 2023 at 3:28 AM John Garry <john.g.garry@oracle.com> wrote:
>
> From: Prasad Singamsetty <prasad.singamsetty@oracle.com>
>
> Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
> write is to be issued with torn write prevention, according to special
> alignment and length rules.
>
> Torn write prevention means that for a power or any other HW failure, all
> or none of the data will be committed to storage, but never a mix of old
> and new.
>
> For any syscall interface utilizing struct iocb, add IOCB_ATOMIC for
> iocb->ki_flags field to indicate the same.
>
> A call to statx will give the relevant atomic write info:
> - atomic_write_unit_min
> - atomic_write_unit_max
>
> Both values are a power-of-2.
>
> Applications can avail of atomic write feature by ensuring that the total
> length of a write is a power-of-2 in size and also sized between
> atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
> must ensure that the write is at a naturally-aligned offset in the file
> wrt the total write length.
>
> Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  include/linux/fs.h      | 1 +
>  include/uapi/linux/fs.h | 5 ++++-
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b528f063e8ff..898952dee8eb 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -328,6 +328,7 @@ enum rw_hint {
>  #define IOCB_SYNC              (__force int) RWF_SYNC
>  #define IOCB_NOWAIT            (__force int) RWF_NOWAIT
>  #define IOCB_APPEND            (__force int) RWF_APPEND
> +#define IOCB_ATOMIC            (__force int) RWF_ATOMIC
>
>  /* non-RWF related bits - start at 16 */
>  #define IOCB_EVENTFD           (1 << 16)
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..e3b4f5bc6860 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -301,8 +301,11 @@ typedef int __bitwise __kernel_rwf_t;
>  /* per-IO O_APPEND */
>  #define RWF_APPEND     ((__force __kernel_rwf_t)0x00000010)
>
> +/* Atomic Write */
> +#define RWF_ATOMIC     ((__force __kernel_rwf_t)0x00000020)
> +
>  /* mask of flags supported by the kernel */
>  #define RWF_SUPPORTED  (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
> -                        RWF_APPEND)
> +                        RWF_APPEND | RWF_ATOMIC)
>
>  #endif /* _UAPI_LINUX_FS_H */
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 18/21] scsi: sd: Support reading atomic properties from block limits VPD
  2023-10-06 17:52       ` Bart Van Assche
@ 2023-10-06 23:48         ` Martin K. Petersen
  0 siblings, 0 replies; 124+ messages in thread
From: Martin K. Petersen @ 2023-10-06 23:48 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api


Bart,

> I think the above means that it is wrong to round down the ATOMIC
> TRANSFER LENGTH GRANULARITY or the ATOMIC BOUNDARY values.

It was a deliberate design choice to facilitate supporting MAM. John
will look into relaxing the constraints.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-10-06 17:22                           ` Bart Van Assche
@ 2023-10-07  1:21                             ` Martin K. Petersen
  0 siblings, 0 replies; 124+ messages in thread
From: Martin K. Petersen @ 2023-10-07  1:21 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Dave Chinner, Martin K. Petersen, John Garry, axboe, kbusch, hch,
	sagi, jejb, djwong, viro, brauner, chandan.babu, dchinner,
	linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api


Bart,

> The above implies that this parameter will always be equal to the
> logical block size.

It does not. Being able to write each individual block in an I/O without
tearing does not imply that a device can write two blocks as a single
atomic operation.

> Writes to a single physical block happen atomically. If there are
> multiple logical blocks per physical block, the block device must
> serialize read/modify/write cycles internally.

This is what SBC has to say:

"If any write command that is not an atomic write command, does not
complete successfully (e.g., the command completed with CHECK CONDITION
status, or the command was being processed at the time of a power loss
or an incorrect demount of a removable medium), then any data in the
logical blocks referenced by the LBAs specified by that command is
indeterminate."

SBC defines "atomic write command" like this:

"An atomic write command performs one or more atomic write operations.
 The following write commands are atomic write commands:

 a) WRITE ATOMIC (16) (see 5.48); and
 b) WRITE ATOMIC (32) (see 5.49)."

You will note that none of the regular WRITE commands appear in that
list.

Now, in practice we obviously rely heavily on the fact that most devices
are implemented in a sane fashion which doesn't mess up individual
logical blocks on power fail. But the spec does not guarantee this; it
is device implementation dependent. And again, we have seen both hard
disk drives and SSDs that cause collateral damage to an entire physical
block when power is lost at the wrong time.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 04/21] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support
  2023-10-06 18:15   ` Jeremy Bongio
@ 2023-10-09 22:02     ` Dave Chinner
  0 siblings, 0 replies; 124+ messages in thread
From: Dave Chinner @ 2023-10-09 22:02 UTC (permalink / raw)
  To: Jeremy Bongio
  Cc: John Garry, axboe, kbusch, hch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	linux-api, Prasad Singamsetty

On Fri, Oct 06, 2023 at 11:15:11AM -0700, Jeremy Bongio wrote:
> What is the advantage of using write flags instead of using an atomic
> open flag (O_ATOMIC)? With an open flag, write, writev, pwritev would
> all be supported for atomic writes. And this would potentially require
> fewer application changes to take advantage of atomic writes.

Atomic writes are not a property of the file or even the inode
itself, they are an attribute of the specific IO being issued by
the application.

Most applications that want atomic writes are using it as a
performance optimisation. They are likely already using DIO with
either AIO, pwritev2 or io_uring and so are already using the
interfaces that support per-IO attributes. Not every IO to every
file needs to be atomic, so a per-IO attribute makes a lot of sense
for these applications.
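
As an illustration of how naturally a per-IO flag slots into those
interfaces, a minimal liburing sketch, assuming the RWF_ATOMIC flag
from this series is accepted through the per-IO rw_flags field just
like the existing RWF_* flags:

#include <liburing.h>
#include <sys/uio.h>

/* Queue one tear-protected write alongside ordinary writes on the same
 * ring; only this submission carries the atomic semantics. */
static int queue_atomic_write(struct io_uring *ring, int fd,
			      const struct iovec *iov, off_t off)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	io_uring_prep_writev2(sqe, fd, iov, 1, off, RWF_ATOMIC);
	return io_uring_submit(ring);
}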

Add to that that implementing atomic IO semantics in the generic IO
paths (e.g. for buffered writes) is much more difficult. It's
not an unsolvable problem (especially now with high-order folio
support in the page cache), it's just way outside the scope of this
patchset.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 16/21] fs: iomap: Atomic write support
  2023-10-03  4:24   ` Dave Chinner
  2023-10-03 12:55     ` John Garry
  2023-10-03 16:47     ` Darrick J. Wong
@ 2023-10-24 12:59     ` John Garry
  2 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-10-24 12:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 03/10/2023 05:24, Dave Chinner wrote:

I don't think that this was ever responded to - apologies for that.

>> 		n = bio->bi_iter.bi_size;
>> +		if (atomic_write && n != length) {
>> +			/* This bio should have covered the complete length */
>> +			ret = -EINVAL;
>> +			bio_put(bio);
>> +			goto out;
> Why? The actual bio can be any length that meets the aligned
> criteria between min and max, yes?
> So it's valid to split a
> RWF_ATOMIC write request up into multiple min unit sized bios, is it
> not?

It is not.

> I mean, that's the whole point of the min/max unit setup, isn't
> it?

atomic write unit min/max are lower and upper limits for the atomic 
write length only.

> That the max sized write only guarantees that it will tear at
> min unit boundaries, not within those min unit boundaries?

We will never split an atomic write nor create multiple bios for an 
atomic write. Unit min is the minimum size supported for an atomic write 
length; it is not also a boundary size at which we may split a write. An 
atomic write will only ever produce, at most, a single bio per IO operation. 
We do support merging of atomic writes in the block layer, but this is 
transparent to the user.

Please let me know if 
https://lore.kernel.org/linux-api/20230929093717.2972367-1-john.g.garry@oracle.com/T/#mb48328cf84b1643b651b5f1293f443e26f18fbb5 
needs to be improved to make this clear.

> If
> I've understood this correctly, then why does this "single bio for
> large atomic write" constraint need to exist?
> 
> 

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits
  2023-09-29 10:27 ` [PATCH 01/21] block: Add atomic write operations to request_queue limits John Garry
  2023-10-03 16:40   ` Bart Van Assche
@ 2023-11-09 15:10   ` Christoph Hellwig
  2023-11-09 17:01     ` John Garry
  1 sibling, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2023-11-09 15:10 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Himanshu Madhani

On Fri, Sep 29, 2023 at 10:27:06AM +0000, John Garry wrote:
> From: Himanshu Madhani <himanshu.madhani@oracle.com>
> 
> Add the following limits:
> - atomic_write_boundary_bytes
> - atomic_write_max_bytes
> - atomic_write_unit_max_bytes
> - atomic_write_unit_min_bytes
> 
> All atomic writes limits are initialised to 0 to indicate no atomic write
> support. Stacked devices are just not supported either for now.
> 
> Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> #jpg: Heavy rewrite
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  Documentation/ABI/stable/sysfs-block | 42 +++++++++++++++++++
>  block/blk-settings.c                 | 60 ++++++++++++++++++++++++++++
>  block/blk-sysfs.c                    | 33 +++++++++++++++
>  include/linux/blkdev.h               | 33 +++++++++++++++
>  4 files changed, 168 insertions(+)
> 
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index 1fe9a553c37b..05df7f74cbc1 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -21,6 +21,48 @@ Description:
>  		device is offset from the internal allocation unit's
>  		natural alignment.
>  
> +What:		/sys/block/<disk>/atomic_write_max_bytes
> +Date:		May 2023
> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> +Description:
> +		[RO] This parameter specifies the maximum atomic write
> +		size reported by the device. An atomic write operation
> +		must not exceed this number of bytes.

> +What:		/sys/block/<disk>/atomic_write_unit_max_bytes
> +Date:		January 2023
> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> +Description:
> +		[RO] This parameter defines the largest block which can be
> +		written atomically with an atomic write operation. This
> +		value must be a multiple of atomic_write_unit_min and must
> +		be a power-of-two.

What is the difference between these two values?


> +Date:		May 2023
> +Contact:	Himanshu Madhani <himanshu.madhani@oracle.com>
> +Description:
> +		[RO] This parameter specifies the smallest block which can
> +		be written atomically with an atomic write operation. All
> +		atomic write operations must begin at a
> +		atomic_write_unit_min boundary and must be multiples of
> +		atomic_write_unit_min. This value must be a power-of-two.

How can the minimum unit be anything but one logical block?

> +extern void blk_queue_atomic_write_max_bytes(struct request_queue *q,
> +					     unsigned int bytes);

Please don't add pointless externs to prototypes in headers.

> +static inline unsigned int queue_atomic_write_unit_max_bytes(const struct request_queue *q)

.. and please avoid the overly long lines.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 02/21] block: Limit atomic writes according to bio and queue limits
  2023-09-29 10:27 ` [PATCH 02/21] block: Limit atomic writes according to bio and queue limits John Garry
@ 2023-11-09 15:13   ` Christoph Hellwig
  2023-11-09 17:41     ` John Garry
  2023-12-04  3:19   ` Ming Lei
  1 sibling, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2023-11-09 15:13 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Fri, Sep 29, 2023 at 10:27:07AM +0000, John Garry wrote:
> We rely on the block layer always being able to send a bio of size
> atomic_write_unit_max without being required to split it due to request
> queue or other bio limits.
> 
> A bio may contain min(BIO_MAX_VECS, limits->max_segments) vectors,
> and each vector is at worst case the device logical block size from
> direct IO alignment requirement.

A bio can have more than BIO_MAX_VECS if you use bio_init.

> +static unsigned int blk_queue_max_guaranteed_bio_size_sectors(
> +					struct request_queue *q)
> +{
> +	struct queue_limits *limits = &q->limits;
> +	unsigned int max_segments = min_t(unsigned int, BIO_MAX_VECS,
> +					limits->max_segments);
> +	/*  Limit according to dev sector size as we only support direct-io */

Who is "we", and how tells the caller to only ever use direct I/O?
And how would a type of userspace I/O even matter for low-level
block code.  What if I wanted to use this for file system metadata?


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 03/21] fs/bdev: Add atomic write support info to statx
  2023-10-03  0:28           ` Martin K. Petersen
@ 2023-11-09 15:15             ` Christoph Hellwig
  0 siblings, 0 replies; 124+ messages in thread
From: Christoph Hellwig @ 2023-11-09 15:15 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Bart Van Assche, John Garry, Eric Biggers, axboe, kbusch, hch,
	sagi, jejb, djwong, viro, brauner, chandan.babu, dchinner,
	linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, Prasad Singamsetty

On Mon, Oct 02, 2023 at 08:28:04PM -0400, Martin K. Petersen wrote:
> 
> Bart,
> 
> > Neither the SCSI SBC standard nor the NVMe standard defines a "minimum
> > atomic write unit". So why to introduce something in the Linux kernel
> > that is not defined in common storage standards?
> 
> >From SBC-5:
> 
> "The ATOMIC TRANSFER LENGTH GRANULARITY field indicates the minimum
> transfer length for an atomic write command."

I would suggest that we don't try to claim any atomic write capability
if this is not a logical block as such devices are completely useless.
In fact I'd add a big warning to the kernel log if a device claims this,
as this breaks all the implicit assumptions that a single logical
block write is atomic.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 12/21] fs: xfs: Introduce FORCEALIGN inode flag
  2023-09-29 10:27 ` [PATCH 12/21] fs: xfs: Introduce FORCEALIGN inode flag John Garry
@ 2023-11-09 15:24   ` Christoph Hellwig
  0 siblings, 0 replies; 124+ messages in thread
From: Christoph Hellwig @ 2023-11-09 15:24 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Fri, Sep 29, 2023 at 10:27:17AM +0000, John Garry wrote:
> From: "Darrick J. Wong" <djwong@kernel.org>
> 
> Add a new inode flag to require that all file data extent mappings must
> be aligned (both the file offset range and the allocated space itself)
> to the extent size hint.  Having a separate COW extent size hint is no
> longer allowed.
> 
> The goal here is to enable sysadmins and users to mandate that all space
> mappings in a file must have a startoff/blockcount that are aligned to
> (say) a 2MB alignment and that the startblock/blockcount will follow the
> same alignment.

This needs a good explanation of why someone would want this.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-09-29 10:27 ` [PATCH 17/21] fs: xfs: iomap atomic " John Garry
@ 2023-11-09 15:26   ` Christoph Hellwig
  2023-11-10 10:42     ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2023-11-09 15:26 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Fri, Sep 29, 2023 at 10:27:22AM +0000, John Garry wrote:
> Ensure that when creating a mapping that we adhere to all the atomic
> write rules.
> 
> We check that the mapping covers the complete range of the write to ensure
> that we'll be just creating a single mapping.
> 
> Currently minimum granularity is the FS block size, but it should be
> possible to support lower in future.

I really dislike how this forces aligned allocations.  Aligned
allocations are a nice optimization to offload some of the work
to the storage hard/firmware, but we need to support it in general.
And I think with out of place writes into the COW fork, and atomic
transactions to swap it in we can do that pretty easily.

That should also allow to get rid of the horrible forcealign mode,
as we can still try align if possible and just fall back to the
out of place writes.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 21/21] nvme: Support atomic writes
  2023-09-29 10:27 ` [PATCH 21/21] nvme: Support atomic writes John Garry
       [not found]   ` <CGME20231004113943eucas1p23a51ce5ef06c36459f826101bb7b85fc@eucas1p2.samsung.com>
@ 2023-11-09 15:36   ` Christoph Hellwig
  2023-11-09 15:42     ` Matthew Wilcox
  1 sibling, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2023-11-09 15:36 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Alan Adamson

> +			if (le16_to_cpu(id->nabspf))
> +				boundary = (le16_to_cpu(id->nabspf) + 1) * bs;
> +
> +			if (is_power_of_2(boundary) || !boundary) {
> +				blk_queue_atomic_write_max_bytes(disk->queue, atomic_bs);
> +				blk_queue_atomic_write_unit_min_sectors(disk->queue, 1);
> +				blk_queue_atomic_write_unit_max_sectors(disk->queue,
> +									atomic_bs / bs);
> +				blk_queue_atomic_write_boundary_bytes(disk->queue, boundary);
> +			} else {
> +				dev_err(ns->ctrl->device, "Unsupported atomic boundary=0x%x\n",
> +					boundary);
> +			}

Please figure out a way to split the atomic configuration into a
helper and avoid all those crazy long lines, preferably also avoiding
the double calls to the block helpers while you're at it.

Also I really want a check in the NVMe I/O path that any request
with the atomic flag set actually adhers to the limits to at least
partially paper over the annoying lack of a separate write atomic
command in nvme.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 21/21] nvme: Support atomic writes
  2023-11-09 15:36   ` Christoph Hellwig
@ 2023-11-09 15:42     ` Matthew Wilcox
  2023-11-09 15:46       ` Christoph Hellwig
  0 siblings, 1 reply; 124+ messages in thread
From: Matthew Wilcox @ 2023-11-09 15:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: John Garry, axboe, kbusch, sagi, jejb, martin.petersen, djwong,
	viro, brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Alan Adamson

On Thu, Nov 09, 2023 at 04:36:03PM +0100, Christoph Hellwig wrote:
> Also I really want a check in the NVMe I/O path that any request
> with the atomic flag set actually adhers to the limits to at least
> partially paper over the annoying lack of a separate write atomic
> command in nvme.

That wasn't the model we had in mind.  In our thinking, it was fine to
send a write that crossed the atomic write limit, but the drive wouldn't
guarantee that it was atomic except at the atomic write boundary.
Eg with an AWUN of 16kB, you could send five 16kB writes, combine them
into a single 80kB write, and if the power failed midway through, the
drive would guarantee that it had written 0, 16kB, 32kB, 48kB, 64kB or
all 80kB.  Not necessarily in order; it might have written bytes 16-32kB,
64-80kB and not the other three.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 21/21] nvme: Support atomic writes
  2023-11-09 15:42     ` Matthew Wilcox
@ 2023-11-09 15:46       ` Christoph Hellwig
  2023-11-09 19:08         ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2023-11-09 15:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, John Garry, axboe, kbusch, sagi, jejb,
	martin.petersen, djwong, viro, brauner, chandan.babu, dchinner,
	linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, Alan Adamson

On Thu, Nov 09, 2023 at 03:42:40PM +0000, Matthew Wilcox wrote:
> That wasn't the model we had in mind.  In our thinking, it was fine to
> send a write that crossed the atomic write limit, but the drive wouldn't
> guarantee that it was atomic except at the atomic write boundary.
> Eg with an AWUN of 16kB, you could send five 16kB writes, combine them
> into a single 80kB write, and if the power failed midway through, the
> drive would guarantee that it had written 0, 16kB, 32kB, 48kB, 64kB or
> all 80kB.  Not necessarily in order; it might have written bytes 16-32kB,
> 64-80kB and not the other three.

I can see some use for that, but I'm really worried that debugging
problems in the I/O merging and splitting will be absolute hell.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits
  2023-11-09 15:10   ` Christoph Hellwig
@ 2023-11-09 17:01     ` John Garry
  2023-11-10  6:23       ` Christoph Hellwig
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-11-09 17:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Himanshu Madhani

On 09/11/2023 15:10, Christoph Hellwig wrote:
>> Documentation/ABI/stable/sysfs-block | 42 +++++++++++++++++++
>>   block/blk-settings.c                 | 60 ++++++++++++++++++++++++++++
>>   block/blk-sysfs.c                    | 33 +++++++++++++++
>>   include/linux/blkdev.h               | 33 +++++++++++++++
>>   4 files changed, 168 insertions(+)
>>
>> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
>> index 1fe9a553c37b..05df7f74cbc1 100644
>> --- a/Documentation/ABI/stable/sysfs-block
>> +++ b/Documentation/ABI/stable/sysfs-block
>> @@ -21,6 +21,48 @@ Description:
>>   		device is offset from the internal allocation unit's
>>   		natural alignment.
>>   
>> +What:		/sys/block/<disk>/atomic_write_max_bytes
>> +Date:		May 2023
>> +Contact:	Himanshu Madhani<himanshu.madhani@oracle.com>
>> +Description:
>> +		[RO] This parameter specifies the maximum atomic write
>> +		size reported by the device. An atomic write operation
>> +		must not exceed this number of bytes.
>> +What:		/sys/block/<disk>/atomic_write_unit_max_bytes
>> +Date:		January 2023
>> +Contact:	Himanshu Madhani<himanshu.madhani@oracle.com>
>> +Description:
>> +		[RO] This parameter defines the largest block which can be
>> +		written atomically with an atomic write operation. This
>> +		value must be a multiple of atomic_write_unit_min and must
>> +		be a power-of-two.
> What is the difference between these two values?

Generally they come from the same device property. Then since 
atomic_write_unit_max_bytes must be a power-of-2 (and 
atomic_write_max_bytes may not be), they may be different. In addition, 
atomic_write_unit_max_bytes is required to be limited by whatever is 
guaranteed to be able to fit in a bio.

atomic_write_max_bytes is really only relevant for merging writes. Maybe 
we should not even expose via sysfs.

BTW, I do still wonder whether all these values should be limited by 
max_sectors_kb (which they aren't currently).

> 
> 
>> +Date:		May 2023
>> +Contact:	Himanshu Madhani<himanshu.madhani@oracle.com>
>> +Description:
>> +		[RO] This parameter specifies the smallest block which can
>> +		be written atomically with an atomic write operation. All
>> +		atomic write operations must begin at a
>> +		atomic_write_unit_min boundary and must be multiples of
>> +		atomic_write_unit_min. This value must be a power-of-two.
> How can the minimum unit be anythіng but one logical block?
> 
>> +extern void blk_queue_atomic_write_max_bytes(struct request_queue *q,
>> +					     unsigned int bytes);
> Please don't add pointless externs to prototypes in headers.

ok, fine - blkdev.h seems to have a mix of declarations with and 
without extern, so at least we would be consistently inconsistent.

> 
>> +static inline unsigned int queue_atomic_write_unit_max_bytes(const struct request_queue *q)
> .. and please avoid the overly long lines.

ok

Thanks,
John

> 


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 02/21] block: Limit atomic writes according to bio and queue limits
  2023-11-09 15:13   ` Christoph Hellwig
@ 2023-11-09 17:41     ` John Garry
  0 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-11-09 17:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 09/11/2023 15:13, Christoph Hellwig wrote:
> On Fri, Sep 29, 2023 at 10:27:07AM +0000, John Garry wrote:
>> We rely on the block layer always being able to send a bio of size
>> atomic_write_unit_max without being required to split it due to request
>> queue or other bio limits.
>>
>> A bio may contain min(BIO_MAX_VECS, limits->max_segments) vectors,
>> and each vector is at worst case the device logical block size from
>> direct IO alignment requirement.
> A bio can have more than BIO_MAX_VECS if you use bio_init.

Right, FWIW we are only concerned with codepaths which use BIO_MAX_VECS, 
but I suppose that is not good enough as a guarantee.

> 
>> +static unsigned int blk_queue_max_guaranteed_bio_size_sectors(
>> +					struct request_queue *q)
>> +{
>> +	struct queue_limits *limits = &q->limits;
>> +	unsigned int max_segments = min_t(unsigned int, BIO_MAX_VECS,
>> +					limits->max_segments);
>> +	/*  Limit according to dev sector size as we only support direct-io */
> Who is "we", and how tells the caller to only ever use direct I/O?

I think that this can be dropped as a comment. My earlier series used 
PAGE_SIZE and not sector size here, which I think was proper.

> And how would a type of userspace I/O even matter for low-level
> block code.

It shouldn't do, but we still need to limit according to request queue 
limits.

>  What if I wanted to use this for file system metadata?
> 

As mentioned, I think that the direct-IO comment can be dropped.

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 21/21] nvme: Support atomic writes
  2023-11-09 15:46       ` Christoph Hellwig
@ 2023-11-09 19:08         ` John Garry
  2023-11-10  6:29           ` Christoph Hellwig
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-11-09 19:08 UTC (permalink / raw)
  To: Christoph Hellwig, Matthew Wilcox
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Alan Adamson

On 09/11/2023 15:46, Christoph Hellwig wrote:
> On Thu, Nov 09, 2023 at 03:42:40PM +0000, Matthew Wilcox wrote:
>> That wasn't the model we had in mind.  In our thinking, it was fine to
>> send a write that crossed the atomic write limit, but the drive wouldn't
>> guarantee that it was atomic except at the atomic write boundary.
>> Eg with an AWUN of 16kB, you could send five 16kB writes, combine them
>> into a single 80kB write, and if the power failed midway through, the
>> drive would guarantee that it had written 0, 16kB, 32kB, 48kB, 64kB or
>> all 80kB.  Not necessarily in order; it might have written bytes 16-32kB,
>> 64-80kB and not the other three.

I didn't think that there are any atomic write guarantees at all if we 
ever exceed AWUN or AWUPF or cross the atomic write boundary (if any).

> I can see some use for that, but I'm really worried that debugging
> problems in the I/O merging and splitting will be absolute hell.

Even if bios were merged for NVMe the total request length still should 
not exceed AWUPF. However a check can be added to ensure this for a 
submitted atomic write request.
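
As a rough sketch of such a check (REQ_ATOMIC is the flag added by this 
series; the limits field name here is an assumption):

static inline bool blk_atomic_write_len_ok(struct request *rq)
{
	/*
	 * Reject an atomic write request which has somehow grown beyond
	 * the advertised limit (e.g. derived from AWUPF for NVMe).
	 */
	if (!(rq->cmd_flags & REQ_ATOMIC))
		return true;

	return blk_rq_bytes(rq) <= rq->q->limits.atomic_write_max_bytes;
}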

As for splitting, it is not permitted for atomic writes and only a 
single bio is permitted to be created per write. Are more integrity 
checks required?

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits
  2023-11-09 17:01     ` John Garry
@ 2023-11-10  6:23       ` Christoph Hellwig
  2023-11-10  9:04         ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2023-11-10  6:23 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, axboe, kbusch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api, Himanshu Madhani

On Thu, Nov 09, 2023 at 05:01:10PM +0000, John Garry wrote:
> Generally they come from the same device property. Then since 
> atomic_write_unit_max_bytes must be a power-of-2 (and 
> atomic_write_max_bytes may not be), they may be different.

How much do we care about supporting the additional slack over the
power of two version?  

> In addition, 
> atomic_write_unit_max_bytes is required to be limited by whatever is 
> guaranteed to be able to fit in a bio.

The limit of what fits into a bio is UINT_MAX, not sure that matters :)

> atomic_write_max_bytes is really only relevant for merging writes. Maybe we 
> should not even expose via sysfs.

Or we need to have a good separate discussion on even supporting any
merges.  Willy chimed in that supporting merges was intentional,
but I'd really like to see numbers justifying it.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 21/21] nvme: Support atomic writes
  2023-11-09 19:08         ` John Garry
@ 2023-11-10  6:29           ` Christoph Hellwig
  2023-11-10  8:44             ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2023-11-10  6:29 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Matthew Wilcox, axboe, kbusch, sagi, jejb,
	martin.petersen, djwong, viro, brauner, chandan.babu, dchinner,
	linux-block, linux-kernel, linux-nvme, linux-xfs, linux-fsdevel,
	tytso, jbongio, linux-api, Alan Adamson

On Thu, Nov 09, 2023 at 07:08:40PM +0000, John Garry wrote:
>>> send a write that crossed the atomic write limit, but the drive wouldn't
>>> guarantee that it was atomic except at the atomic write boundary.
>>> Eg with an AWUN of 16kB, you could send five 16kB writes, combine them
>>> into a single 80kB write, and if the power failed midway through, the
>>> drive would guarantee that it had written 0, 16kB, 32kB, 48kB, 64kB or
>>> all 80kB.  Not necessarily in order; it might have written bytes 16-32kB,
>>> 64-80kB and not the other three.
>
> I didn't think that there are any atomic write guarantees at all if we ever 
> exceed AWUN or AWUPF or cross the atomic write boundary (if any).

You're quoting a few mails before me, but I agree.

>> I can see some use for that, but I'm really worried that debugging
>> problems in the I/O merging and splitting will be absolute hell.
>
> Even if bios were merged for NVMe the total request length still should not 
> exceed AWUPF. However a check can be added to ensure this for a submitted 
> atomic write request.

Yes.

> As for splitting, it is not permitted for atomic writes and only a single 
> bio is permitted to be created per write. Are more integrity checks 
> required?

I'm more worried about the problem where we accidentally add a split.
The whole bio merge/split path is convoluted and we had plenty of
bugs in the past by not looking at all the correct flags or opcodes.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 21/21] nvme: Support atomic writes
  2023-11-10  6:29           ` Christoph Hellwig
@ 2023-11-10  8:44             ` John Garry
  0 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-11-10  8:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Matthew Wilcox, axboe, kbusch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api, Alan Adamson

On 10/11/2023 06:29, Christoph Hellwig wrote:
> Yes.
> 
>> As for splitting, it is not permitted for atomic writes and only a single
>> bio is permitted to be created per write. Are more integrity checks
>> required?
> I'm more worried about the problem where we accidentally add a split.
> The whole bio merge/split path is convoluted and we had plenty of
> bugs in the past by not looking at all the correct flags or opcodes.

Yes, this is always a concern.

Some thoughts on things which could be done:
- For no merging, ensure the request length is a power-of-2 when enqueuing 
to the block driver (sketched below). This is simple but not watertight.
- Create a per-bio checksum when the bio is created for the atomic write 
and verify integrity when queuing to the block driver.
- Add a new block layer datapath which ensures no merging or splitting, but 
this seems a bit OTT.
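
A minimal sketch of the first idea above, assuming the REQ_ATOMIC flag 
from this series (and, as said, simple rather than watertight):

static inline void blk_atomic_write_sanity_check(struct request *rq)
{
	if (!(rq->cmd_flags & REQ_ATOMIC))
		return;

	/* Atomic writes are a power-of-2 in size and naturally aligned */
	WARN_ON_ONCE(!is_power_of_2(blk_rq_bytes(rq)));
	WARN_ON_ONCE(blk_rq_pos(rq) & (blk_rq_sectors(rq) - 1));
}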

BTW, on topic of splitting, that NVMe virt boundary is a pain and I hope 
that we could ignore/avoid it for atomic writes.

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 01/21] block: Add atomic write operations to request_queue limits
  2023-11-10  6:23       ` Christoph Hellwig
@ 2023-11-10  9:04         ` John Garry
  0 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-11-10  9:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	Himanshu Madhani

On 10/11/2023 06:23, Christoph Hellwig wrote:
> On Thu, Nov 09, 2023 at 05:01:10PM +0000, John Garry wrote:
>> Generally they come from the same device property. Then since
>> atomic_write_unit_max_bytes must be a power-of-2 (and
>> atomic_write_max_bytes may not be), they may be different.
> How much do we care about supporting the additional slack over the
> power of two version?

I'm not sure yet. It depends on any merging support and splitting 
safeguards introduced.

> 
>> In addition,
>> atomic_write_unit_max_bytes is required to be limited by whatever is
>> guaranteed to be able to fit in a bio.
> The limit of what fits into a bio is UINT_MAX, not sure that matters :)

I am talking about what we can guarantee to always fit in a bio 
according to the request queue limits and the bio vector count, e.g. if 
the request queue limits us to only 8 segments, then we can't guarantee 
to fit much in (without splitting) and need to limit 
atomic_write_unit_max accordingly.

> 
>> atomic_write_max_bytes is really only relevant for merging writes. Maybe we
>> should not even expose via sysfs.
> Or we need to have a good separate discussion on even supporting any
> merges.  Willy chimed in that supporting merges was intentional,
> but I'd really like to see numbers justifying it.
> 

So far I have tested in an environment where the data rates are not high 
and any merging benefit was minimal to non-existent. But that is not to 
say it could not help elsewhere.

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-11-09 15:26   ` Christoph Hellwig
@ 2023-11-10 10:42     ` John Garry
  2023-11-28  8:56       ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-11-10 10:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 09/11/2023 15:26, Christoph Hellwig wrote:
> On Fri, Sep 29, 2023 at 10:27:22AM +0000, John Garry wrote:
>> Ensure that when creating a mapping that we adhere to all the atomic
>> write rules.
>>
>> We check that the mapping covers the complete range of the write to ensure
>> that we'll be just creating a single mapping.
>>
>> Currently minimum granularity is the FS block size, but it should be
>> possible to support lower in future.
> I really dislike how this forces aligned allocations.  Aligned
> allocations are a nice optimization to offload some of the work
> to the storage hard/firmware, but we need to support it in general.
> And I think with out of place writes into the COW fork, and atomic
> transactions to swap it in we can do that pretty easily.
> 
> That should also allow to get rid of the horrible forcealign mode,
> as we can still try align if possible and just fall back to the
> out of place writes.
> 
> 

How could we try to align? Do you mean that we try to align up to some 
stage in the block allocator search? That seems like some middle ground 
between no alignment and forcealign.

And what would we be aligning to?

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-11-10 10:42     ` John Garry
@ 2023-11-28  8:56       ` John Garry
  2023-11-28 13:56         ` Christoph Hellwig
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-11-28  8:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

Hi Christoph,

>>>
>>> Currently minimum granularity is the FS block size, but it should be
>>> possible to support lower in future.
>> I really dislike how this forces aligned allocations.  Aligned
>> allocations are a nice optimization to offload some of the work
>> to the storage hard/firmware, but we need to support it in general.
>> And I think with out of place writes into the COW fork, and atomic
>> transactions to swap it in we can do that pretty easily.
>>
>> That should also allow to get rid of the horrible forcealign mode,
>> as we can still try align if possible and just fall back to the
>> out of place writes.
>>

Can you try to explain your idea a bit more? This is blocking us.

Are you suggesting some sort of hybrid between the atomic write series 
you had a few years ago and this solution?

To me that would be continuing with the following:
- per-IO RWF_ATOMIC (and not O_ATOMIC semantics of nothing is written 
until some data sync)
- writes must be a power-of-two and at a naturally-aligned offset
- relying on atomic write HW support always

But for extents which are misaligned, we CoW to a new extent? I suppose 
we would align that extent to the alignment of the write (i.e. the length of the write).

BTW, we also have rtvol support which does not use forcealign as it 
already can guarantee alignment, but still does rely on the same 
principle of requiring alignment - would you want CoW support there also?

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-11-28  8:56       ` John Garry
@ 2023-11-28 13:56         ` Christoph Hellwig
  2023-11-28 17:42           ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2023-11-28 13:56 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, axboe, kbusch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api

On Tue, Nov 28, 2023 at 08:56:37AM +0000, John Garry wrote:
> Are you suggesting some sort of hybrid between the atomic write series you 
> had a few years ago and this solution?

Very roughly, yes.

> To me that would be continuing with the following:
> - per-IO RWF_ATOMIC (and not O_ATOMIC semantics of nothing is written until 
> some data sync)

Yes.

> - writes must be a power-of-two and at a naturally-aligned offset

Where offset is offset in the file?  It would not require it.  You
probably want to do it for optimal performance, but requiring it
feels rather limited.

> - relying on atomic write HW support always

And I think that's where we have different opinions.  I think the hw
offload is a nice optimization and we should use it wherever we can.
But building the entire userspace API around it feels like a mistake.

> BTW, we also have rtvol support which does not use forcealign as it already 
> can guarantee alignment, but still does rely on the same principle of 
> requiring alignment - would you want CoW support there also?

Upstream doesn't have out of place write support for the RT subvolume
yet.  But Darrick has a series for it and we're actively working on
upstreaming it.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-11-28 13:56         ` Christoph Hellwig
@ 2023-11-28 17:42           ` John Garry
  2023-11-29  2:45             ` Martin K. Petersen
  2023-12-04 13:45             ` Christoph Hellwig
  0 siblings, 2 replies; 124+ messages in thread
From: John Garry @ 2023-11-28 17:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 28/11/2023 13:56, Christoph Hellwig wrote:
> On Tue, Nov 28, 2023 at 08:56:37AM +0000, John Garry wrote:
>> Are you suggesting some sort of hybrid between the atomic write series you
>> had a few years ago and this solution?
> Very roughly, yes.
> 
>> To me that would be continuing with the following:
>> - per-IO RWF_ATOMIC (and not O_ATOMIC semantics of nothing is written until
>> some data sync)
> Yes.
> 
>> - writes must be a power-of-two and at a naturally-aligned offset
> Where offset is offset in the file? 

ok, fine, it would not be required for XFS with CoW. Some concerns still:
a. device atomic write boundary, if any
b. other FSes which do not have CoW support. ext4 is already being used 
for "atomic writes" in the field - see dubious amazon torn-write prevention.

About b., we could add the pow-of-2 and file offset alignment 
requirement for other FSes, but then need to add some method to 
advertise that restriction.

> It would not require it.  You
> probably want to do it for optimal performance, but requiring it
> feels rather limited.
> 
>> - relying on atomic write HW support always
> And I think that's where we have different opinions.

I'm just trying to understand your idea and that is not necessarily my 
final opinion.

>  I think the hw
> offload is a nice optimization and we should use it wherever we can.

Sure, but to me it is a concern that we have two paths to make robust: 
a. offload via HW, which may involve CoW; b. no HW support, i.e. CoW always.

And for no HW support, if we don't follow the O_ATOMIC model of 
committing nothing until a SYNC is issued, we would allocate, write, and 
later free a new extent for each write, right?

> But building the entire userspace API around it feels like a mistake.
> 

ok, but FWIW it works for the usecases which we know.

>> BTW, we also have rtvol support which does not use forcealign as it already
>> can guarantee alignment, but still does rely on the same principle of
>> requiring alignment - would you want CoW support there also?
> Upstream doesn't have out of place write support for the RT subvolume
> yet.  But Darrick has a series for it and we're actively working on
> upstreaming it.
Yeah, I thought that I heard this.

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-11-28 17:42           ` John Garry
@ 2023-11-29  2:45             ` Martin K. Petersen
  2023-12-04 13:45             ` Christoph Hellwig
  1 sibling, 0 replies; 124+ messages in thread
From: Martin K. Petersen @ 2023-11-29  2:45 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, axboe, kbusch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api


> b. other FSes which do not have CoW support. ext4 is already being
> used for "atomic writes" in the field

We also need raw block device access to work within the constraints
required by the hardware.

>> probably want to do it for optimal performance, but requiring it
>> feels rather limited.

The application developers we are working with generally prefer an error
when things are not aligned properly. Predictable performance is key.
Removing the performance variability of doing double writes is the
reason for supporting atomics in the first place.

I think there is value in providing a more generic (file-centric) atomic
user API. And I think the I/O stack plumbing we provide would be useful
in supporting such an endeavor. But I am not convinced that atomic
operations in general should be limited to the couple of filesystems
that can do CoW.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-09-29 10:27 ` [PATCH 10/21] block: Add fops atomic write support John Garry
  2023-09-29 17:51   ` Bart Van Assche
@ 2023-12-04  2:30   ` Ming Lei
  2023-12-04  9:27     ` John Garry
  1 sibling, 1 reply; 124+ messages in thread
From: Ming Lei @ 2023-12-04  2:30 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	ming.lei

On Fri, Sep 29, 2023 at 10:27:15AM +0000, John Garry wrote:
> Add support for atomic writes, as follows:
> - Ensure that the IO follows all the atomic writes rules, like must be
>   naturally aligned
> - Set REQ_ATOMIC
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  block/fops.c | 42 +++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/block/fops.c b/block/fops.c
> index acff3d5d22d4..516669ad69e5 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -41,6 +41,29 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
>  		!bdev_iter_is_aligned(bdev, iter);
>  }
>  
> +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
> +			      struct iov_iter *iter)
> +{
> +	unsigned int atomic_write_unit_min_bytes =
> +			queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
> +	unsigned int atomic_write_unit_max_bytes =
> +			queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
> +
> +	if (!atomic_write_unit_min_bytes)
> +		return false;

The above check should be moved to the limit-setting code path.

> +	if (pos % atomic_write_unit_min_bytes)
> +		return false;
> +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> +		return false;
> +	if (!is_power_of_2(iov_iter_count(iter)))
> +		return false;
> +	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
> +		return false;
> +	if (pos % iov_iter_count(iter))
> +		return false;

I am a bit confused about relation between atomic_write_unit_max_bytes and
atomic_write_max_bytes.

Here the max IO length is limited to be <= atomic_write_unit_max_bytes,
so looks userspace can only submit IO with write-atomic-unit naturally
aligned IO(such as, 4k, 8k, 16k, 32k, ...), but these user IOs are
allowed to be merged to big one if naturally alignment is respected and
the merged IO size is <= atomic_write_max_bytes.

Is my understanding right? If yes, I'd suggest to document the point,
and the last two checks could be changed to:

	/* naturally aligned */
	if (pos % iov_iter_count(iter))
		return false;

	if (iov_iter_count(iter) > atomic_write_max_bytes)
		return false;

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 02/21] block: Limit atomic writes according to bio and queue limits
  2023-09-29 10:27 ` [PATCH 02/21] block: Limit atomic writes according to bio and queue limits John Garry
  2023-11-09 15:13   ` Christoph Hellwig
@ 2023-12-04  3:19   ` Ming Lei
  2023-12-04  3:55     ` Ming Lei
  1 sibling, 1 reply; 124+ messages in thread
From: Ming Lei @ 2023-12-04  3:19 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	ming.lei

On Fri, Sep 29, 2023 at 10:27:07AM +0000, John Garry wrote:
> We rely on the block layer always being able to send a bio of size
> atomic_write_unit_max without being required to split it due to request
> queue or other bio limits.
> 
> A bio may contain min(BIO_MAX_VECS, limits->max_segments) vectors,
> and each vector is at worst case the device logical block size from
> direct IO alignment requirement.

Both unit_max and unit_min are applied to FS bio, which is built over
single userspace buffer, so only the 1st and last vector can include
partial page, and the other vectors should always cover whole page,
then the minimal size could be:

	(max_segments - 2) * PAGE_SIZE + 2 * queue_logical_block_size(q)


Thanks,
Ming


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 02/21] block: Limit atomic writes according to bio and queue limits
  2023-12-04  3:19   ` Ming Lei
@ 2023-12-04  3:55     ` Ming Lei
  2023-12-04  9:35       ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Ming Lei @ 2023-12-04  3:55 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Mon, Dec 04, 2023 at 11:19:20AM +0800, Ming Lei wrote:
> On Fri, Sep 29, 2023 at 10:27:07AM +0000, John Garry wrote:
> > We rely on the block layer always being able to send a bio of size
> > atomic_write_unit_max without being required to split it due to request
> > queue or other bio limits.
> > 
> > A bio may contain min(BIO_MAX_VECS, limits->max_segments) vectors,
> > and each vector is at worst case the device logical block size from
> > direct IO alignment requirement.
> 
> Both unit_max and unit_min are applied to FS bio, which is built over
> single userspace buffer, so only the 1st and last vector can include

Actually it isn't true for pwritev, and sorry for the noise.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-12-04  2:30   ` Ming Lei
@ 2023-12-04  9:27     ` John Garry
  2023-12-04 12:18       ` Ming Lei
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-12-04  9:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 04/12/2023 02:30, Ming Lei wrote:

Hi Ming,

>> +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
>> +			      struct iov_iter *iter)
>> +{
>> +	unsigned int atomic_write_unit_min_bytes =
>> +			queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
>> +	unsigned int atomic_write_unit_max_bytes =
>> +			queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
>> +
>> +	if (!atomic_write_unit_min_bytes)
>> +		return false;
> The above check should be moved to the limit-setting code path.

Sorry, I didn't fully understand your point.

I added this here (as opposed to the caller), as I was not really 
worried about speeding up the failure path. Are you saying to call even 
earlier in submission path?

> 
>> +	if (pos % atomic_write_unit_min_bytes)
>> +		return false;
>> +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
>> +		return false;
>> +	if (!is_power_of_2(iov_iter_count(iter)))
>> +		return false;
>> +	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
>> +		return false;
>> +	if (pos % iov_iter_count(iter))
>> +		return false;
> I am a bit confused about relation between atomic_write_unit_max_bytes and
> atomic_write_max_bytes.

I think that naming could be improved. Or even just drop merging (and 
atomic_write_max_bytes concept) until we show it to improve performance.

So generally atomic_write_unit_max_bytes will be same as 
atomic_write_max_bytes, however it could be different if:
a. request queue nr hw segments or other request queue limits needs to 
restrict atomic_write_unit_max_bytes
b. atomic_write_unit_max_bytes does not need to be a power-of-2 and 
atomic_write_max_bytes does. So essentially:
atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)

> 
> Here the max IO length is limited to be <= atomic_write_unit_max_bytes,
> so looks userspace can only submit IO with write-atomic-unit naturally
> aligned IO(such as, 4k, 8k, 16k, 32k, ...), 

correct

> but these user IOs are
> allowed to be merged to big one if naturally alignment is respected and
> the merged IO size is <= atomic_write_max_bytes.

correct, but the resultant merged IO does not have to be naturally 
aligned.

> 
> Is my understanding right? 

Yes, but...

> If yes, I'd suggest to document the point,
> and the last two checks could be changed to:
> 
> 	/* naturally aligned */
> 	if (pos % iov_iter_count(iter))
> 		return false;
> 
> 	if (iov_iter_count(iter) > atomic_write_max_bytes)
> 		return false;

.. we would not be merging at this point as this is just IO submission 
to the block layer, so atomic_write_max_bytes does not come into play 
yet. If you check patch 7/21, you will see that we limit IO size to 
atomic_write_max_bytes, which is relevant to merging.

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 02/21] block: Limit atomic writes according to bio and queue limits
  2023-12-04  3:55     ` Ming Lei
@ 2023-12-04  9:35       ` John Garry
  0 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-12-04  9:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 04/12/2023 03:55, Ming Lei wrote:

Hi Ming,

> On Mon, Dec 04, 2023 at 11:19:20AM +0800, Ming Lei wrote:
>> On Fri, Sep 29, 2023 at 10:27:07AM +0000, John Garry wrote:
>>> We rely on the block layer always being able to send a bio of size
>>> atomic_write_unit_max without being required to split it due to request
>>> queue or other bio limits.
>>>
>>> A bio may contain min(BIO_MAX_VECS, limits->max_segments) vectors,
>>> and each vector is at worst case the device logical block size from
>>> direct IO alignment requirement.
>> Both unit_max and unit_min are applied to FS bio, which is built over
>> single userspace buffer, so only the 1st and last vector can include
> Actually it isn't true for pwritev, and sorry for the noise.

Yeah, I think that it should be:

(max_segments - 2) * PAGE_SIZE

And we need to enforce that any middle vectors are PAGE-aligned.
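
In code that bound might look roughly like the below (a sketch of the 
formula above, not the exact patch):

static unsigned int guaranteed_bio_bytes(const struct queue_limits *limits)
{
	unsigned int max_segs = min_t(unsigned int, BIO_MAX_VECS,
				      limits->max_segments);

	/*
	 * Count only the middle vectors, which must be whole, page-aligned
	 * pages; ignore any contribution from the possibly-partial first
	 * and last vectors.
	 */
	return (max_segs - 2) * PAGE_SIZE;
}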

Thanks,
John


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-12-04  9:27     ` John Garry
@ 2023-12-04 12:18       ` Ming Lei
  2023-12-04 13:13         ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Ming Lei @ 2023-12-04 12:18 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api,
	ming.lei

On Mon, Dec 04, 2023 at 09:27:00AM +0000, John Garry wrote:
> On 04/12/2023 02:30, Ming Lei wrote:
> 
> Hi Ming,
> 
> > > +static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
> > > +			      struct iov_iter *iter)
> > > +{
> > > +	unsigned int atomic_write_unit_min_bytes =
> > > +			queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
> > > +	unsigned int atomic_write_unit_max_bytes =
> > > +			queue_atomic_write_unit_max_bytes(bdev_get_queue(bdev));
> > > +
> > > +	if (!atomic_write_unit_min_bytes)
> > > +		return false;
> > The above check should be moved to the limit-setting code path.
> 
> Sorry, I didn't fully understand your point.
> 
> I added this here (as opposed to the caller), as I was not really worried
> about speeding up the failure path. Are you saying to call even earlier in
> submission path?

atomic_write_unit_min is one hardware property, and it should be checked
in blk_queue_atomic_write_unit_min_sectors() from the beginning, then you
can avoid this check everywhere else.

> 
> > 
> > > +	if (pos % atomic_write_unit_min_bytes)
> > > +		return false;
> > > +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> > > +		return false;
> > > +	if (!is_power_of_2(iov_iter_count(iter)))
> > > +		return false;
> > > +	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
> > > +		return false;
> > > +	if (pos % iov_iter_count(iter))
> > > +		return false;
> > I am a bit confused about relation between atomic_write_unit_max_bytes and
> > atomic_write_max_bytes.
> 
> I think that naming could be improved. Or even just drop merging (and
> atomic_write_max_bytes concept) until we show it to improve performance.
> 
> So generally atomic_write_unit_max_bytes will be same as
> atomic_write_max_bytes, however it could be different if:
> a. request queue nr hw segments or other request queue limits needs to
> restrict atomic_write_unit_max_bytes
> b. atomic_write_unit_max_bytes does not need to be a power-of-2 and
> atomic_write_max_bytes does. So essentially:
> atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)
> 

plug merge often improves sequential IO perf, so if the hardware supports
this way, I think 'atomic_write_max_bytes' should be supported from the
beginning, such as:

- user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can
be merged to single IO request, which is issued to driver.

Or 

- user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can
be merged to single IO request, which is issued to driver.

The hardware should recognize unit size by start LBA, and check if length is
valid, so probably the interface might be relaxed to:

1) start lba is unit aligned, and this unit is in the supported unit
range(power_2 in [unit_min, unit_max])

2) length needs to be:

- N * this_unit_size
- <= atomic_write_max_bytes
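
A quick sketch of that relaxed check (hypothetical helper, byte units 
rather than LBAs):

static bool atomic_write_relaxed_ok(loff_t pos, size_t len,
				    unsigned int unit_min,
				    unsigned int unit_max,
				    unsigned int max_bytes)
{
	unsigned int unit;

	if (!len || len > max_bytes)
		return false;

	/*
	 * Look for a supported power-of-2 unit which both aligns the start
	 * position and divides the length, i.e. the IO is a whole number
	 * of that unit starting on a unit boundary.
	 */
	for (unit = unit_max; unit && unit >= unit_min; unit >>= 1) {
		if (!(pos % unit) && !(len % unit))
			return true;
	}
	return false;
}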


> > 
> > Here the max IO length is limited to be <= atomic_write_unit_max_bytes,
> > so looks userspace can only submit IO with write-atomic-unit naturally
> > aligned IO(such as, 4k, 8k, 16k, 32k, ...),
> 
> correct
> 
> > but these user IOs are
> > allowed to be merged to big one if naturally alignment is respected and
> > the merged IO size is <= atomic_write_max_bytes.
> 
> correct, but the resultant merged IO does not have to be naturally
> aligned.
> 
> > 
> > Is my understanding right?
> 
> Yes, but...
> 
> > If yes, I'd suggest to document the point,
> > and the last two checks could be changed to:
> > 
> > 	/* naturally aligned */
> > 	if (pos % iov_iter_count(iter))
> > 		return false;
> > 
> > 	if (iov_iter_count(iter) > atomic_write_max_bytes)
> > 		return false;
> 
> .. we would not be merging at this point as this is just IO submission to
> the block layer, so atomic_write_max_bytes does not come into play yet. If
> you check patch 7/21, you will see that we limit IO size to
> atomic_write_max_bytes, which is relevant to merging.

I know the motivation of atomic_write_max_bytes, and now I am wondering
whether atomic_write_max_bytes should be exported to userspace for the
sake of atomic write performance.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-12-04 12:18       ` Ming Lei
@ 2023-12-04 13:13         ` John Garry
  2023-12-05  1:45           ` Ming Lei
  0 siblings, 1 reply; 124+ messages in thread
From: John Garry @ 2023-12-04 13:13 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api


>>
>> I added this here (as opposed to the caller), as I was not really worried
>> about speeding up the failure path. Are you saying to call even earlier in
>> submission path?
> atomic_write_unit_min is one hardware property, and it should be checked
> in blk_queue_atomic_write_unit_min_sectors() from the beginning, then you
> can avoid this check everywhere else.

ok, but we still need to ensure in the submission path that the block 
device actually supports atomic writes - this was the initial check.

> 
>>>> +	if (pos % atomic_write_unit_min_bytes)
>>>> +		return false;
>>>> +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
>>>> +		return false;
>>>> +	if (!is_power_of_2(iov_iter_count(iter)))
>>>> +		return false;
>>>> +	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
>>>> +		return false;
>>>> +	if (pos % iov_iter_count(iter))
>>>> +		return false;
>>> I am a bit confused about relation between atomic_write_unit_max_bytes and
>>> atomic_write_max_bytes.
>> I think that naming could be improved. Or even just drop merging (and
>> atomic_write_max_bytes concept) until we show it to improve performance.
>>
>> So generally atomic_write_unit_max_bytes will be same as
>> atomic_write_max_bytes, however it could be different if:
>> a. request queue nr hw segments or other request queue limits needs to
>> restrict atomic_write_unit_max_bytes
>> b. atomic_write_unit_max_bytes does not need to be a power-of-2 and
>> atomic_write_max_bytes does. So essentially:
>> atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)
>>
> plug merge often improves sequential IO perf, so if the hardware supports
> this way, I think 'atomic_write_max_bytes' should be supported from the
> beginning, such as:
> 
> - user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can
> be merged to single IO request, which is issued to driver.
> 
> Or
> 
> - user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can
> be merged to single IO request, which is issued to driver.

Right, we do expect userspace to use a fixed block size, but we give 
scope in the API to use variable size.
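
For reference, a minimal userspace sketch of the intended usage (fd 
opened with O_DIRECT; RWF_ATOMIC comes from this series' uapi headers; 
block size and offset must respect the advertised unit_{min,max} and 
natural alignment):

#define _GNU_SOURCE
#include <sys/uio.h>

/* RWF_ATOMIC is provided by the updated uapi headers from this series */

static ssize_t write_block_atomic(int fd, const void *buf, size_t blksz,
				  off_t off)
{
	struct iovec iov = {
		.iov_base = (void *)buf,
		.iov_len = blksz,	/* power-of-2, within unit_{min,max} */
	};

	/* off must be naturally aligned to blksz */
	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
}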

> 
> The hardware should recognize unit size by start LBA, and check if length is
> valid, so probably the interface might be relaxed to:
> 
> 1) start lba is unit aligned, and this unit is in the supported unit
> range(power_2 in [unit_min, unit_max])
> 
> 2) length needs to be:
> 
> - N * this_unit_size
> - <= atomic_write_max_bytes

Please note that we also need to consider:
- any atomic write boundary (from NVMe)
- virt boundary (from NVMe)

And, as I mentioned elsewhere, I am still not 100% comfortable that we 
don't pay attention to regular max_sectors_kb...

> 
> 
>>> Here the max IO length is limited to be <= atomic_write_unit_max_bytes,
>>> so looks userspace can only submit IO with write-atomic-unit naturally
>>> aligned IO(such as, 4k, 8k, 16k, 32k, ...),
>> correct
>>
>>> but these user IOs are
>>> allowed to be merged to big one if naturally alignment is respected and
>>> the merged IO size is <= atomic_write_max_bytes.
>> correct, but the resultant merged IO does not have to be naturally
>> aligned.
>>
>>> Is my understanding right?
>> Yes, but...
>>
>>> If yes, I'd suggest to document the point,
>>> and the last two checks could be changed to:
>>>
>>> 	/* naturally aligned */
>>> 	if (pos % iov_iter_count(iter))
>>> 		return false;
>>>
>>> 	if (iov_iter_count(iter) > atomic_write_max_bytes)
>>> 		return false;
>> .. we would not be merging at this point as this is just IO submission to
>> the block layer, so atomic_write_max_bytes does not come into play yet. If
>> you check patch 7/21, you will see that we limit IO size to
>> atomic_write_max_bytes, which is relevant to merging.
> I know the motivation of atomic_write_max_bytes, and now I am wondering
> whether atomic_write_max_bytes should be exported to userspace for the
> sake of atomic write performance.

It is available from sysfs for the request queue, but in an earlier 
series Dave Chinner suggested doing more to expose it to the application 
programmer. So here that would mean a statx member. I'm still not 
sure... it just didn't seem like a detail which the user would need to 
know or be able to do much with.

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-11-28 17:42           ` John Garry
  2023-11-29  2:45             ` Martin K. Petersen
@ 2023-12-04 13:45             ` Christoph Hellwig
  2023-12-04 15:19               ` John Garry
  1 sibling, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2023-12-04 13:45 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, axboe, kbusch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api

On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
> ok, fine, it would not be required for XFS with CoW. Some concerns still:
> a. device atomic write boundary, if any
> b. other FSes which do not have CoW support. ext4 is already being used for 
> "atomic writes" in the field - see dubious amazon torn-write prevention.

What is the 'dubious amazon torn-write prevention'?

> About b., we could add the pow-of-2 and file offset alignment requirement 
> for other FSes, but then need to add some method to advertise that 
> restriction.

We really need a better way to communicate I/O limitations anyway.
Something like XFS_IOC_DIOINFO on steroids.

> Sure, but to me it is a concern that we have two paths to make robust: 
> a. offload via HW, which may involve CoW; b. no HW support, i.e. CoW always.

Relying just on the hardware seems very limited, especially as there is
plenty of hardware that won't guarantee anything larger than 4k, and
plenty of NVMe hardware that has some other small limit like 32k
because it doesn't support multiple atomicity modes.

> And for no HW support, if we don't follow the O_ATOMIC model of committing 
> nothing until a SYNC is issued, we would allocate, write, and later free a 
> new extent for each write, right?

Yes. Then again if you do data journalling you do that anyway, and as
one little project I'm doing right now shows, data journalling is
often the fastest thing we can do for very small writes.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-12-04 13:45             ` Christoph Hellwig
@ 2023-12-04 15:19               ` John Garry
  2023-12-04 15:39                 ` Christoph Hellwig
                                   ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: John Garry @ 2023-12-04 15:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 04/12/2023 13:45, Christoph Hellwig wrote:
> On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
>> ok, fine, it would not be required for XFS with CoW. Some concerns still:
>> a. device atomic write boundary, if any
>> b. other FSes which do not have CoW support. ext4 is already being used for
>> "atomic writes" in the field - see dubious amazon torn-write prevention.
> 
> What is the 'dubious amazon torn-write prevention'?

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html

AFAICS, this is without any kernel changes, so there is no guarantee 
against unwanted splitting or merging of bios.

Anyway, there will still be !CoW FSes which people want to support.

> 
>> About b., we could add the pow-of-2 and file offset alignment requirement
>> for other FSes, but then need to add some method to advertise that
>> restriction.
> 
> We really need a better way to communicate I/O limitations anyway.
> Something like XFS_IOC_DIOINFO on steroids.
> 
>> Sure, but to me it is a concern that we have two paths to make robust:
>> a. offload via HW, which may involve CoW; b. no HW support, i.e. CoW always.
> 
> Relying just on the hardware seems very limited, especially as there is
> plenty of hardware that won't guarantee anything larger than 4k, and
> plenty of NVMe hardware that has some other small limit like 32k
> because it doesn't support multiple atomicity modes.

So what would you propose as the next step? Would it to be first achieve 
atomic write support for XFS with HW support + CoW to ensure contiguous 
extents (and without XFS forcealign)?

> 
>> And for no HW support, if we don't follow the O_ATOMIC model of committing
>> nothing until a SYNC is issued, we would allocate, write, and later free a
>> new extent for each write, right?
> 
> Yes. Then again if you do data journalling you do that anyway, and as
> one little project I'm doing right now shows, data journalling is
> often the fastest thing we can do for very small writes.

Ignoring FSes, then how is this supposed to work for block devices? We 
just always need HW support, right?

Thanks,
John


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-12-04 15:19               ` John Garry
@ 2023-12-04 15:39                 ` Christoph Hellwig
  2023-12-04 18:06                   ` John Garry
  2023-12-05  4:55                 ` Theodore Ts'o
  2023-12-05 13:59                 ` Ming Lei
  2 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2023-12-04 15:39 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, axboe, kbusch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api

On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> On 04/12/2023 13:45, Christoph Hellwig wrote:
>> On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
>>> ok, fine, it would not be required for XFS with CoW. Some concerns still:
>>> a. device atomic write boundary, if any
>>> b. other FSes which do not have CoW support. ext4 is already being used for
>>> "atomic writes" in the field - see dubious amazon torn-write prevention.
>>
>> What is the 'dubious amazon torn-write prevention'?
>
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
>
> AFAICS, this is without any kernel changes, so there is no guarantee 
> against unwanted splitting or merging of bios.
>
> Anyway, there will still be !CoW FSes which people want to support.

Ugg, so they badly reimplement NVMe atomic write support and use it
without software stack enablement.  Calling it dubious is way too
gentle...

>> Relying just on the hardware seems very limited, especially as there is
>> plenty of hardware that won't guarantee anything larger than 4k, and
>> plenty of NVMe hardware that has some other small limit like 32k
>> because it doesn't support multiple atomicity modes.
>
> So what would you propose as the next step? Would it to be first achieve 
> atomic write support for XFS with HW support + CoW to ensure contiguous 
> extents (and without XFS forcealign)?

I think the very first priority is just block device support without
any fs enablement.  We just need to make sure the API isn't too limited
for additional use cases.

> Ignoring FSes, then how is this supposed to work for block devices? We just 
> always need HW support, right?

Yes.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-12-04 15:39                 ` Christoph Hellwig
@ 2023-12-04 18:06                   ` John Garry
  0 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-12-04 18:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, kbusch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 04/12/2023 15:39, Christoph Hellwig wrote:
>> So what would you propose as the next step? Would it to be first achieve
>> atomic write support for XFS with HW support + CoW to ensure contiguous
>> extents (and without XFS forcealign)?
> I think the very first priority is just block device support without
> any fs enablement.  We just need to make sure the API isn't too limited
> for additional use cases.

Sounds ok

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-12-04 13:13         ` John Garry
@ 2023-12-05  1:45           ` Ming Lei
  2023-12-05 10:49             ` John Garry
  0 siblings, 1 reply; 124+ messages in thread
From: Ming Lei @ 2023-12-05  1:45 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On Mon, Dec 04, 2023 at 01:13:55PM +0000, John Garry wrote:
> 
> > > 
> > > I added this here (as opposed to the caller), as I was not really worried
> > > about speeding up the failure path. Are you saying to call even earlier in
> > > submission path?
> > atomic_write_unit_min is one hardware property, and it should be checked
> > in blk_queue_atomic_write_unit_min_sectors() from the beginning, then you
> > can avoid this check everywhere else.
> 
> ok, but we still need to ensure in the submission path that the block device
> actually supports atomic writes - this was the initial check.

Then you may add one helper bdev_support_atomic_write().

> 
> > 
> > > > > +	if (pos % atomic_write_unit_min_bytes)
> > > > > +		return false;
> > > > > +	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> > > > > +		return false;
> > > > > +	if (!is_power_of_2(iov_iter_count(iter)))
> > > > > +		return false;
> > > > > +	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
> > > > > +		return false;
> > > > > +	if (pos % iov_iter_count(iter))
> > > > > +		return false;
> > > > I am a bit confused about relation between atomic_write_unit_max_bytes and
> > > > atomic_write_max_bytes.
> > > I think that naming could be improved. Or even just drop merging (and
> > > atomic_write_max_bytes concept) until we show it to improve performance.
> > > 
> > > So generally atomic_write_unit_max_bytes will be same as
> > > atomic_write_max_bytes, however it could be different if:
> > > a. request queue nr hw segments or other request queue limits needs to
> > > restrict atomic_write_unit_max_bytes
> > > b. atomic_write_unit_max_bytes does not need to be a power-of-2 and
> > > atomic_write_max_bytes does. So essentially:
> > > atomic_write_unit_max_bytes = rounddown_pow_of_2(atomic_write_max_bytes)
> > > 
> > plug merge often improves sequential IO perf, so if the hardware supports
> > this way, I think 'atomic_write_max_bytes' should be supported from the
> > beginning, such as:
> > 
> > - user space submits sequential N * (4k, 8k, 16k, ...) atomic writes, all can
> > be merged to single IO request, which is issued to driver.
> > 
> > Or
> > 
> > - user space submits sequential 4k, 4k, 8k, 16K, 32k, 64k atomic writes, all can
> > be merged to single IO request, which is issued to driver.
> 
> Right, we do expect userspace to use a fixed block size, but we give scope
> in the API to use variable size.

Maybe it is enough to just take atomic_write_unit_min_bytes
only, and allow length to be N * atomic_write_unit_min_bytes.

But it may violate atomic write boundary?

> 
> > 
> > The hardware should recognize unit size by start LBA, and check if length is
> > valid, so probably the interface might be relaxed to:
> > 
> > 1) start lba is unit aligned, and this unit is in the supported unit
> > range(power_2 in [unit_min, unit_max])
> > 
> > 2) length needs to be:
> > 
> > - N * this_unit_size
> > - <= atomic_write_max_bytes
> 
> Please note that we also need to consider:
> - any atomic write boundary (from NVMe)

Can you provide actual NVMe boundary value?

Firstly, a naturally aligned write won't cross a boundary, so the boundary
should be >= write_unit_max; see the code below from patch 10/21:

+static bool bio_straddles_atomic_write_boundary(loff_t bi_sector,
+				unsigned int bi_size,
+				unsigned int boundary)
+{
+	loff_t start = bi_sector << SECTOR_SHIFT;
+	loff_t end = start + bi_size;
+	loff_t start_mod = start % boundary;
+	loff_t end_mod = end % boundary;
+
+	if (end - start > boundary)
+		return true;
+	if ((start_mod > end_mod) && (start_mod && end_mod))
+		return true;
+
+	return false;
+}
+

Then if the WRITE size is <= boundary, the above function should return
false, right? Looks like it is power_of(2) & aligned atomic_write_max_bytes?

> - virt boundary (from NVMe)

virt boundary is applied on bv_offset and bv_len, and NVMe's virt
boundary is (4k - 1), so it shouldn't be an issue in reality.

> 
> And, as I mentioned elsewhere, I am still not 100% comfortable that we don't
> pay attention to regular max_sectors_kb...

max_sectors_kb should be bigger than atomic_write_max_bytes actually,
then what is your concern?



Thanks,
Ming


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-12-04 15:19               ` John Garry
  2023-12-04 15:39                 ` Christoph Hellwig
@ 2023-12-05  4:55                 ` Theodore Ts'o
  2023-12-05 11:09                   ` John Garry
  2023-12-05 13:59                 ` Ming Lei
  2 siblings, 1 reply; 124+ messages in thread
From: Theodore Ts'o @ 2023-12-05  4:55 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, axboe, kbusch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, jbongio,
	linux-api

On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> > 
> > What is the 'dubious amazon torn-write prevention'?
> 
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
> 
> AFAICS, this is without any kernel changes, so there is no guarantee
> against unwanted splitting or merging of bios.

Well, more than one company has audited the kernel paths, and it turns
out that for selected kernel versions, after doing desk-check
verification of the relevant kernel paths, as well as experimental
verification via testing to try to find torn writes in the kernel, we
can make it safe for specific kernel versions which might be used in
hosted MySQL instances where we control the kernel, the mysql server,
and the emulated block device (and we know the database is doing
Direct I/O writes --- this won't work for PostgreSQL).  I gave a talk
about this at Google I/O Next '18, five years ago[1].

[1] https://www.youtube.com/watch?v=gIeuiGg-_iw

Given the performance gains (see the comparison in the talk at time
19:31 and at 29:57) --- it's quite compelling.

Of course, I wouldn't recommend this approach for a naive sysadmin,
since most database administrators won't know how to audit kernel code
(see the discussion at time 35:10 of the video), and reverify the
entire software stack before every kernel upgrade.  The challenge is
how to do this safely.

The fact remains that both Amazon's EBS and Google's Persistent Disk
products are implemented in such a way that writes will not be torn
below the virtual machine, and the guarantees are in fact quite a bit
stronger than what we will probably end up advertising via NVMe and/or
SCSI.  It wouldn't surprise me if this is the case (or could be made
to be the case) for Oracle Cloud as well.

The question is how to make this guarantee so that the kernel knows
when various cloud-provided block devices do provide these greater
guarantees, and then how to make it an architected feature, as
opposed to a happy implementation detail that has to be verified at
every kernel upgrade.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 10/21] block: Add fops atomic write support
  2023-12-05  1:45           ` Ming Lei
@ 2023-12-05 10:49             ` John Garry
  0 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-12-05 10:49 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, kbusch, hch, sagi, jejb, martin.petersen, djwong, viro,
	brauner, chandan.babu, dchinner, linux-block, linux-kernel,
	linux-nvme, linux-xfs, linux-fsdevel, tytso, jbongio, linux-api

On 05/12/2023 01:45, Ming Lei wrote:
>> Right, we do expect userspace to use a fixed block size, but we give scope
>> in the API to use variable size.
> Maybe it is enough to just take atomic_write_unit_min_bytes
> only, and allow length to be N * atomic_write_unit_min_bytes.
> 
> But it may violate atomic write boundary?

About atomic boundary, we just don't allow a merge which will result in 
a write which will straddle a boundary as there are no guarantees of 
atomicity then.

Having said this, atomic write boundary is just relevant to NVMe, so if 
we don't have merges there, then we could just omit this code.
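
For illustration, that merge-time rule might look roughly like the below 
(hypothetical helper; it assumes a back merge where the bio directly 
follows the request, and reuses bio_straddles_atomic_write_boundary() 
from patch 10/21, quoted further down):

static bool atomic_write_back_merge_ok(struct request *rq, struct bio *bio,
				       unsigned int boundary_bytes)
{
	/* No boundary reported, nothing to straddle */
	if (!boundary_bytes)
		return true;

	return !bio_straddles_atomic_write_boundary(blk_rq_pos(rq),
			blk_rq_bytes(rq) + bio->bi_iter.bi_size,
			boundary_bytes);
}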

> 
>>> The hardware should recognize unit size by start LBA, and check if length is
>>> valid, so probably the interface might be relaxed to:
>>>
>>> 1) start lba is unit aligned, and this unit is in the supported unit
>>> range(power_2 in [unit_min, unit_max])
>>>
>>> 2) length needs to be:
>>>
>>> - N * this_unit_size
>>> - <= atomic_write_max_bytes
>> Please note that we also need to consider:
>> - any atomic write boundary (from NVMe)
> Can you provide actual NVMe boundary value?
> 
> Firstly, a naturally aligned write won't cross a boundary, so the boundary
> should be >= write_unit_max,

Correct

> see the code below from patch 10/21:
> 
> +static bool bio_straddles_atomic_write_boundary(loff_t bi_sector,
> +				unsigned int bi_size,
> +				unsigned int boundary)
> +{
> +	loff_t start = bi_sector << SECTOR_SHIFT;
> +	loff_t end = start + bi_size;
> +	loff_t start_mod = start % boundary;
> +	loff_t end_mod = end % boundary;
> +
> +	if (end - start > boundary)
> +		return true;
> +	if ((start_mod > end_mod) && (start_mod && end_mod))
> +		return true;
> +
> +	return false;
> +}
> +
> 
> Then if the WRITE size is <= boundary, the above function should return
> false, right? 

Actually, if the WRITE size is greater than the boundary then we must be
crossing a boundary and should return true, which is what the first
condition checks.

However, two naturally-aligned atomic writes could together be less than
atomic_write_max_bytes and yet still straddle a boundary if merged.
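
To make that concrete, here is a quick worked example (a sketch only; the
64KB boundary and the 16KB write size are assumed values, not taken from
the patches). With a 64KB atomic write boundary, two naturally-aligned
16KB atomic writes at offsets 48KB and 64KB are each fine on their own,
but merging them gives a 32KB write at offset 48KB spanning 48KB-80KB,
which crosses the 64KB boundary even though 32KB is below
atomic_write_max_bytes. A userspace re-implementation of the check above
(taking a byte offset rather than a sector number) shows this:

#include <stdbool.h>
#include <stdio.h>

/* same logic as the quoted bio_straddles_atomic_write_boundary(), but
 * taking a byte offset directly instead of bi_sector */
static bool straddles(long long start, unsigned int size, unsigned int boundary)
{
	long long end = start + size;
	long long start_mod = start % boundary;
	long long end_mod = end % boundary;

	if (end - start > boundary)
		return true;
	if ((start_mod > end_mod) && (start_mod && end_mod))
		return true;
	return false;
}

int main(void)
{
	unsigned int boundary = 64 * 1024;

	/* each naturally-aligned 16KB write on its own: no straddle */
	printf("%d\n", straddles(48 * 1024, 16 * 1024, boundary));	/* 0 */
	printf("%d\n", straddles(64 * 1024, 16 * 1024, boundary));	/* 0 */
	/* the merged 32KB write at offset 48KB straddles the boundary */
	printf("%d\n", straddles(48 * 1024, 32 * 1024, boundary));	/* 1 */
	return 0;
}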

> Looks like it is power_of(2) & aligned atomic_write_max_bytes?
> 
>> - virt boundary (from NVMe)
> virt boundary is applied on bv_offset and bv_len, and NVMe's virt
> boundary is (4k - 1), so it shouldn't be an issue in reality.

On a related topic, as I understand it, for NVMe we just split bios
according to the virt boundary for PRP, but we won't always use PRP. So is
there value in not splitting bios according to the PRP virt boundary if
SGL will actually be used?

> 
>> And, as I mentioned elsewhere, I am still not 100% comfortable that we don't
>> pay attention to regular max_sectors_kb...
> max_sectors_kb should actually be bigger than atomic_write_max_bytes,
> so what is your concern?

My concern is that we don't enforce that, so we may issue atomic writes
which exceed max_sectors_kb.

If we did enforce it, then atomic_write_unit_min_bytes,
atomic_write_unit_max_bytes, and atomic_write_max_bytes would all need to
be limited according to max_sectors_kb.
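
For illustration, here is a minimal sketch of what that enforcement could
look like when the limits are set up. The helper name, and the assumption
that these limits live in struct queue_limits as unsigned int fields with
exactly these names, are mine and may not match the series:

#include <linux/blkdev.h>
#include <linux/log2.h>
#include <linux/minmax.h>

/* Sketch only: clamp the atomic write limits to the queue's max_sectors
 * so an atomic write can never exceed max_sectors_kb. */
static void blk_clamp_atomic_write_limits(struct request_queue *q)
{
	struct queue_limits *lim = &q->limits;
	unsigned int max_bytes = queue_max_sectors(q) << SECTOR_SHIFT;

	lim->atomic_write_max_bytes =
		min(lim->atomic_write_max_bytes, max_bytes);
	/* the unit limits must remain powers-of-2, so round down */
	lim->atomic_write_unit_max_bytes =
		min(lim->atomic_write_unit_max_bytes,
		    (unsigned int)rounddown_pow_of_two(max_bytes));
	lim->atomic_write_unit_min_bytes =
		min(lim->atomic_write_unit_min_bytes,
		    lim->atomic_write_unit_max_bytes);
}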

Thanks,
John

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-12-05  4:55                 ` Theodore Ts'o
@ 2023-12-05 11:09                   ` John Garry
  0 siblings, 0 replies; 124+ messages in thread
From: John Garry @ 2023-12-05 11:09 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christoph Hellwig, axboe, kbusch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, jbongio,
	linux-api

On 05/12/2023 04:55, Theodore Ts'o wrote:
>> AFAICS, this is without any kernel changes, so no guarantee against
>> unwanted splitting or merging of bios.
> Well, more than one company has audited the kernel paths, and it turns
> out that for selected kernel versions, after doing desk-check
> verification of the relevant kernel paths, as well as experimental
> verification via testing to try to find torn writes in the kernel, we
> can make it safe for specific kernel versions which might be used in
> hosted MySQL instances where we control the kernel, the mysql server,
> and the emulated block device (and we know the database is doing
> Direct I/O writes --- this won't work for PostgreSQL).  I gave a talk
> about this at Google I/O Next '18, five years ago[1].
> 
> [1] https://www.youtube.com/watch?v=gIeuiGg-_iw
> 
> Given the performance gains (see the comparison in the talk at time 19:31
> and at 29:57) --- it's quite compelling.
> 
> Of course, I wouldn't recommend this approach for a naive sysadmin,
> since most database administrators won't know how to audit kernel code
> (see the discussion at time 35:10 of the video), and reverify the
> entire software stack before every kernel upgrade.

Sure

>  The challenge is
> how to do this safely.

Right, and that is why I would be concerned about advertising torn-write
protection support when someone has not gone through the auditing and
verification effort to ensure that torn writes never happen in their
software stack.

> 
> The fact remains that both Amazon's EBS and Google's Persistent Disk
> products are implemented in such a way that writes will not be torn
> below the virtual machine, and the guarantees are in fact quite a bit
> stronger than what we will probably end up advertising via NVMe and/or
> SCSI.  It wouldn't surprise me if this is the case (or could be made
> to be the case) for Oracle Cloud as well.
> 
> The question is how to make this guarantee so that the kernel knows
> when various cloud-provided block devices do provide these greater
> guarantees, and then how to make it be an architected feature, as
> opposed to a happy implementation detail that has to be verified at
> every kernel upgrade.

The kernel can only judge atomic write support from what the HW product
data tells us, so cloud-provided block devices need to provide that
information as accurately as possible when emulating some storage
technology.
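
As a rough illustration of what that product data means for NVMe (this is
my reading of the spec fields, not code from this series, and the numbers
are assumed): an emulated namespace with 4KB logical blocks that
guarantees untorn 16KB writes could report NAWUPF (atomic write unit,
power fail, a 0's based count of logical blocks) and NABSPF (atomic
boundary size, power fail) in its Identify Namespace data, from which the
host would derive byte limits roughly like this:

#include <stdio.h>

int main(void)
{
	unsigned int lba_size = 4096;	/* assumed 4KB logical block size */
	unsigned int nawupf = 3;	/* 0's based -> 4 LBAs = 16KB */
	unsigned int nabspf = 0;	/* 0 -> no atomic boundary */

	unsigned int unit_max = (nawupf + 1) * lba_size;
	unsigned int boundary = nabspf ? nabspf * lba_size : 0;

	printf("atomic write unit max: %u bytes\n", unit_max);	/* 16384 */
	printf("atomic write boundary: %u bytes (0 = none)\n", boundary);
	return 0;
}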

Thanks,
John


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH 17/21] fs: xfs: iomap atomic write support
  2023-12-04 15:19               ` John Garry
  2023-12-04 15:39                 ` Christoph Hellwig
  2023-12-05  4:55                 ` Theodore Ts'o
@ 2023-12-05 13:59                 ` Ming Lei
  2 siblings, 0 replies; 124+ messages in thread
From: Ming Lei @ 2023-12-05 13:59 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, axboe, kbusch, sagi, jejb, martin.petersen,
	djwong, viro, brauner, chandan.babu, dchinner, linux-block,
	linux-kernel, linux-nvme, linux-xfs, linux-fsdevel, tytso,
	jbongio, linux-api

On Mon, Dec 04, 2023 at 03:19:15PM +0000, John Garry wrote:
> On 04/12/2023 13:45, Christoph Hellwig wrote:
> > On Tue, Nov 28, 2023 at 05:42:10PM +0000, John Garry wrote:
> > > ok, fine, it would not be required for XFS with CoW. Some concerns still:
> > > a. device atomic write boundary, if any
> > > b. other FSes which do not have CoW support. ext4 is already being used for
> > > "atomic writes" in the field - see dubious amazon torn-write prevention.
> > 
> > What is the 'dubious amazon torn-write prevention'?
> 
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
> 
> AFAICS, this is without any kernel changes, so no guarantee against
> unwanted splitting or merging of bios.
> 
> Anyway, there will still be !CoW FSes which people want to support.
> 
> > 
> > > About b., we could add the pow-of-2 and file offset alignment requirement
> > > for other FSes, but then need to add some method to advertise that
> > > restriction.
> > 
> > We really need a better way to communicate I/O limitations anyway.
> > Something like XFS_IOC_DIOINFO on steroids.
> > 
> > > Sure, but to me it is a concern that we have 2x paths to make robust: a.
> > > offload via HW, which may involve CoW; b. no HW support, i.e. CoW always
> > 
> > Relying just on the hardware seems very limited, especially as there is
> > plenty of hardware that won't guarantee anything larger than 4k, and
> > plenty of NVMe hardware that has some other small limit like 32k
> > because it doesn't support multiple atomicity modes.
> 
> So what would you propose as the next step? Would it be to first achieve
> atomic write support for XFS with HW support + CoW to ensure contiguous
> extents (and without XFS forcealign)?
> 
> > 
> > > And for no HW support, if we don't follow the O_ATOMIC model of committing
> > > nothing until a SYNC is issued, we would allocate, write, and later free a
> > > new extent for each write, right?
> > 
> > Yes. Then again, if you do data journalling you do that anyway, and as
> > one little project I'm doing right now shows, data journalling is
> > often the fastest thing we can do for very small writes.
> 
> Ignoring FSes, then how is this supposed to work for block devices? We just
> always need HW support, right?

It looks like the HW support could be minimized, just like what Google and
Amazon did: a 16KB physical block size with proper queue limit settings.

Now it seems easy to make such a device with ublk-loop by:

- using one backing disk with a 16KB/32KB/... physical block size
- exposing the proper physical block size, chunk_sectors and max sectors
  queue limits

Then any 16KB-aligned direct WRITE with a length of N*16KB (N in [1, 8]
with 256 chunk_sectors) can be an atomic write.
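
As a quick userspace illustration of those constraints (a sketch only;
/dev/ublkb0 is an assumed device name for the ublk-loop setup, and the
16KB unit / 256-sector chunk figures are just the example values above):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define AWU		(16 * 1024)	/* assumed 16KB physical block size */
#define CHUNK_BYTES	(256 * 512)	/* chunk_sectors = 256 -> 128KB */

int main(void)
{
	/* assumed device name for the ublk-loop device described above */
	int fd = open("/dev/ublkb0", O_WRONLY | O_DIRECT);
	off_t offset = 3 * AWU;		/* 16KB-aligned start */
	size_t len = 2 * AWU;		/* N = 2, within [1, 8] */
	void *buf;

	if (fd < 0)
		return 1;
	/* reject anything not 16KB aligned/sized or crossing a 128KB chunk,
	 * since such an I/O could be split and would then no longer map to
	 * whole physical blocks */
	if (offset % AWU || len % AWU || len > CHUNK_BYTES ||
	    offset / CHUNK_BYTES != (off_t)(offset + len - 1) / CHUNK_BYTES)
		return 1;
	if (posix_memalign(&buf, AWU, len))
		return 1;
	memset(buf, 0xab, len);
	if (pwrite(fd, buf, len, offset) != (ssize_t)len)
		return 1;
	free(buf);
	close(fd);
	return 0;
}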



Thanks,
Ming


^ permalink raw reply	[flat|nested] 124+ messages in thread

end of thread, other threads:[~2023-12-05 13:59 UTC | newest]

Thread overview: 124+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-29 10:27 [PATCH 00/21] block atomic writes John Garry
2023-09-29 10:27 ` [PATCH 01/21] block: Add atomic write operations to request_queue limits John Garry
2023-10-03 16:40   ` Bart Van Assche
2023-10-04  3:00     ` Martin K. Petersen
2023-10-04 17:28       ` Bart Van Assche
2023-10-04 18:26         ` Martin K. Petersen
2023-10-04 21:00       ` Bart Van Assche
2023-10-05  8:22         ` John Garry
2023-11-09 15:10   ` Christoph Hellwig
2023-11-09 17:01     ` John Garry
2023-11-10  6:23       ` Christoph Hellwig
2023-11-10  9:04         ` John Garry
2023-09-29 10:27 ` [PATCH 02/21] block: Limit atomic writes according to bio and queue limits John Garry
2023-11-09 15:13   ` Christoph Hellwig
2023-11-09 17:41     ` John Garry
2023-12-04  3:19   ` Ming Lei
2023-12-04  3:55     ` Ming Lei
2023-12-04  9:35       ` John Garry
2023-09-29 10:27 ` [PATCH 03/21] fs/bdev: Add atomic write support info to statx John Garry
2023-09-29 22:49   ` Eric Biggers
2023-10-01 13:23     ` Bart Van Assche
2023-10-02  9:51       ` John Garry
2023-10-02 18:39         ` Bart Van Assche
2023-10-03  0:28           ` Martin K. Petersen
2023-11-09 15:15             ` Christoph Hellwig
2023-10-03  1:51         ` Dave Chinner
2023-10-03  2:57           ` Darrick J. Wong
2023-10-03  7:23             ` John Garry
2023-10-03 15:46               ` Darrick J. Wong
2023-10-04 14:19                 ` John Garry
2023-09-29 10:27 ` [PATCH 04/21] fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support John Garry
2023-10-06 18:15   ` Jeremy Bongio
2023-10-09 22:02     ` Dave Chinner
2023-09-29 10:27 ` [PATCH 05/21] block: Add REQ_ATOMIC flag John Garry
2023-09-29 10:27 ` [PATCH 06/21] block: Pass blk_queue_get_max_sectors() a request pointer John Garry
2023-09-29 10:27 ` [PATCH 07/21] block: Limit atomic write IO size according to atomic_write_max_sectors John Garry
2023-09-29 10:27 ` [PATCH 08/21] block: Error an attempt to split an atomic write bio John Garry
2023-09-29 10:27 ` [PATCH 09/21] block: Add checks to merging of atomic writes John Garry
2023-09-30 13:40   ` kernel test robot
2023-10-02 22:50     ` Nathan Chancellor
2023-10-04 11:40       ` John Garry
2023-09-29 10:27 ` [PATCH 10/21] block: Add fops atomic write support John Garry
2023-09-29 17:51   ` Bart Van Assche
2023-10-02 10:10     ` John Garry
2023-10-02 19:12       ` Bart Van Assche
2023-10-03  0:48         ` Martin K. Petersen
2023-10-03 16:55           ` Bart Van Assche
2023-10-04  2:53             ` Martin K. Petersen
2023-10-04 17:22               ` Bart Van Assche
2023-10-04 18:17                 ` Martin K. Petersen
2023-10-05 17:10                   ` Bart Van Assche
2023-10-05 22:36                     ` Dave Chinner
2023-10-05 22:58                       ` Bart Van Assche
2023-10-06  4:31                         ` Dave Chinner
2023-10-06 17:22                           ` Bart Van Assche
2023-10-07  1:21                             ` Martin K. Petersen
2023-10-03  8:37         ` John Garry
2023-10-03 16:45           ` Bart Van Assche
2023-10-04  9:14             ` John Garry
2023-10-04 17:34               ` Bart Van Assche
2023-10-04 21:59                 ` Dave Chinner
2023-12-04  2:30   ` Ming Lei
2023-12-04  9:27     ` John Garry
2023-12-04 12:18       ` Ming Lei
2023-12-04 13:13         ` John Garry
2023-12-05  1:45           ` Ming Lei
2023-12-05 10:49             ` John Garry
2023-09-29 10:27 ` [PATCH 11/21] fs: xfs: Don't use low-space allocator for alignment > 1 John Garry
2023-10-03  1:16   ` Dave Chinner
2023-10-03  3:00     ` Darrick J. Wong
2023-10-03  4:34       ` Dave Chinner
2023-10-03 10:22       ` John Garry
2023-09-29 10:27 ` [PATCH 12/21] fs: xfs: Introduce FORCEALIGN inode flag John Garry
2023-11-09 15:24   ` Christoph Hellwig
2023-09-29 10:27 ` [PATCH 13/21] fs: xfs: Make file data allocations observe the 'forcealign' flag John Garry
2023-10-03  1:42   ` Dave Chinner
2023-10-03 10:13     ` John Garry
2023-09-29 10:27 ` [PATCH 14/21] fs: xfs: Enable file data forcealign feature John Garry
2023-09-29 10:27 ` [PATCH 15/21] fs: xfs: Support atomic write for statx John Garry
2023-10-03  3:32   ` Dave Chinner
2023-10-03 10:56     ` John Garry
2023-10-03 16:10       ` Darrick J. Wong
2023-09-29 10:27 ` [PATCH 16/21] fs: iomap: Atomic write support John Garry
2023-10-03  4:24   ` Dave Chinner
2023-10-03 12:55     ` John Garry
2023-10-03 16:47     ` Darrick J. Wong
2023-10-04  1:16       ` Dave Chinner
2023-10-24 12:59     ` John Garry
2023-09-29 10:27 ` [PATCH 17/21] fs: xfs: iomap atomic " John Garry
2023-11-09 15:26   ` Christoph Hellwig
2023-11-10 10:42     ` John Garry
2023-11-28  8:56       ` John Garry
2023-11-28 13:56         ` Christoph Hellwig
2023-11-28 17:42           ` John Garry
2023-11-29  2:45             ` Martin K. Petersen
2023-12-04 13:45             ` Christoph Hellwig
2023-12-04 15:19               ` John Garry
2023-12-04 15:39                 ` Christoph Hellwig
2023-12-04 18:06                   ` John Garry
2023-12-05  4:55                 ` Theodore Ts'o
2023-12-05 11:09                   ` John Garry
2023-12-05 13:59                 ` Ming Lei
2023-09-29 10:27 ` [PATCH 18/21] scsi: sd: Support reading atomic properties from block limits VPD John Garry
2023-09-29 17:54   ` Bart Van Assche
2023-10-02 11:27     ` John Garry
2023-10-06 17:52       ` Bart Van Assche
2023-10-06 23:48         ` Martin K. Petersen
2023-09-29 10:27 ` [PATCH 19/21] scsi: sd: Add WRITE_ATOMIC_16 support John Garry
2023-09-29 17:59   ` Bart Van Assche
2023-10-02 11:36     ` John Garry
2023-10-02 19:21       ` Bart Van Assche
2023-09-29 10:27 ` [PATCH 20/21] scsi: scsi_debug: Atomic write support John Garry
2023-09-29 10:27 ` [PATCH 21/21] nvme: Support atomic writes John Garry
     [not found]   ` <CGME20231004113943eucas1p23a51ce5ef06c36459f826101bb7b85fc@eucas1p2.samsung.com>
2023-10-04 11:39     ` Pankaj Raghav
2023-10-05 10:24       ` John Garry
2023-10-05 13:32         ` Pankaj Raghav
2023-10-05 15:05           ` John Garry
2023-11-09 15:36   ` Christoph Hellwig
2023-11-09 15:42     ` Matthew Wilcox
2023-11-09 15:46       ` Christoph Hellwig
2023-11-09 19:08         ` John Garry
2023-11-10  6:29           ` Christoph Hellwig
2023-11-10  8:44             ` John Garry
2023-09-29 14:58 ` [PATCH 00/21] block " Bart Van Assche
