linux-fsdevel.vger.kernel.org archive mirror
* [PATCH RFC 0/7] buffered block atomic writes
@ 2024-04-22 14:39 John Garry
  2024-04-22 14:39 ` [PATCH RFC 1/7] fs: Rename STATX{_ATTR}_WRITE_ATOMIC -> STATX{_ATTR}_WRITE_ATOMIC_DIO John Garry
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: John Garry @ 2024-04-22 14:39 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, akpm, willy, dchinner, tytso,
	hch, martin.petersen, nilay, ritesh.list, mcgrof
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm,
	ojaswin, p.raghav, jbongio, okiselev, John Garry

This series introduces a proof-of-concept for buffered block atomic
writes.

There is a requirement for userspace to be able to issue a write which
will not be torn due to HW or some other failure. A solution is presented
in [0] and [1].

The series mentioned above only support atomic writes for direct IO. The
primary target of atomic (or untorn) writes is DBs like InnoDB/MySQL,
which require direct IO support. However, as mentioned in [2], there is
a desire to support atomic writes for DBs which use buffered writes, like
Postgres.

The issue raised in [2] was that the API proposed is not suitable for
buffered atomic writes. Specifically, since the API permits a range of
sizes of atomic writes, it is too difficult to track in the pagecache the
geometry of atomic writes which overlap with other atomic writes of
differing sizes and alignment. In addition, tracking and handling
overlapping atomic and non-atomic writes is also difficult.

In this series, buffered atomic writes are supported based upon the
following principles:
- A buffered atomic write requires the RWF_ATOMIC flag to be set, the same
  as for direct IO. The other atomic write rules also apply, like
  power-of-2 size and natural alignment.
- For an inode, only a single size of buffered atomic write is allowed. So
  statx, atomic_write_unit_min = atomic_write_unit_max always for
  buffered atomic writes.
- A single folio maps to an atomic write in the pagecache. Folios match
  atomic writes well, as an atomic write must be a power-of-2 in size and
  naturally aligned.
- A folio is tagged as "atomic" when atomically written. If any part of an
  "atomic" folio is fully or partially overwritten with a non-atomic
  write, the folio loses its atomicity. Indeed, issuing a non-atomic write
  over an atomic write would typically be seen as a userspace bug.
- If userspace wants to guarantee a buffered atomic write is written to
  media atomically after the write syscall returns, it must use RWF_SYNC
  or similar (along with RWF_ATOMIC).

This series just supports buffered atomic writes for XFS. I do have some
patches for buffered atomic writes for bdev file operations, but I did not
include them, as:
a. I don't know of any requirement for this support
b. atomic_write_unit_min and atomic_write_unit_max would be fixed at
   PAGE_SIZE there. This is very limiting. However an API like BLKBSZSET
   could be added to allow userspace to program the values for
   atomic_write_unit_{min, max}.
c. We may want to support atomic_write_unit_{min, max} < PAGE_SIZE, and
   this becomes more complicated to support.
d. I would like to first see how the bs > ps work develops there.

This series is just an early proof-of-concept, to prove that the API
proposed for block atomic writes can work for buffered IO. I would like to
unblock that direct IO series and have it merged.

Patches are based on [0], [1], and [3] (the bs > ps series). For the bs >
ps series, I had to borrow an earlier filemap change which allows the
folio min and max order to be selected.

All patches can be found at:
https://github.com/johnpgarry/linux/tree/atomic-writes-v6.9-v6-fs-v2-buffered

[0] https://lore.kernel.org/linux-block/20240326133813.3224593-1-john.g.garry@oracle.com/
[1] https://lore.kernel.org/linux-block/20240304130428.13026-1-john.g.garry@oracle.com/
[2] https://lore.kernel.org/linux-fsdevel/20240228061257.GA106651@mit.edu/
[3] https://lore.kernel.org/linux-xfs/20240313170253.2324812-1-kernel@pankajraghav.com/

John Garry (7):
  fs: Rename STATX{_ATTR}_WRITE_ATOMIC -> STATX{_ATTR}_WRITE_ATOMIC_DIO
  filemap: Change mapping_set_folio_min_order() ->
    mapping_set_folio_orders()
  mm: Add PG_atomic
  fs: Add initial buffered atomic write support info to statx
  fs: iomap: buffered atomic write support
  fs: xfs: buffered atomic writes statx support
  fs: xfs: Enable buffered atomic writes

 block/bdev.c                   |  9 +++---
 fs/iomap/buffered-io.c         | 53 +++++++++++++++++++++++++++++-----
 fs/iomap/trace.h               |  3 +-
 fs/stat.c                      | 26 ++++++++++++-----
 fs/xfs/libxfs/xfs_inode_buf.c  |  8 +++++
 fs/xfs/xfs_file.c              | 12 ++++++--
 fs/xfs/xfs_icache.c            | 10 ++++---
 fs/xfs/xfs_ioctl.c             |  3 ++
 fs/xfs/xfs_iops.c              | 11 +++++--
 include/linux/fs.h             |  3 +-
 include/linux/iomap.h          |  1 +
 include/linux/page-flags.h     |  5 ++++
 include/linux/pagemap.h        | 20 ++++++++-----
 include/trace/events/mmflags.h |  3 +-
 include/uapi/linux/stat.h      |  6 ++--
 mm/filemap.c                   |  8 ++++-
 16 files changed, 141 insertions(+), 40 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH RFC 1/7] fs: Rename STATX{_ATTR}_WRITE_ATOMIC -> STATX{_ATTR}_WRITE_ATOMIC_DIO
  2024-04-22 14:39 [PATCH RFC 0/7] buffered block atomic writes John Garry
@ 2024-04-22 14:39 ` John Garry
  2024-04-22 14:39 ` [PATCH RFC 2/7] filemap: Change mapping_set_folio_min_order() -> mapping_set_folio_orders() John Garry
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: John Garry @ 2024-04-22 14:39 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, akpm, willy, dchinner, tytso,
	hch, martin.petersen, nilay, ritesh.list, mcgrof
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm,
	ojaswin, p.raghav, jbongio, okiselev, John Garry

Rename STATX_WRITE_ATOMIC -> STATX_WRITE_ATOMIC_DIO and
STATX_ATTR_WRITE_ATOMIC -> STATX_ATTR_WRITE_ATOMIC_DIO, to make it clear
that they are only relevant to direct IO.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/bdev.c              |  6 +++---
 fs/stat.c                 | 10 +++++-----
 fs/xfs/xfs_iops.c         |  2 +-
 include/uapi/linux/stat.h |  4 ++--
 4 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index f3dd9f3c8838..e2a9951bd2e9 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1182,14 +1182,14 @@ void sync_bdevs(bool wait)
 }
 
 /*
- * Handle STATX_{DIOALIGN, WRITE_ATOMIC} for block devices.
+ * Handle STATX_{DIOALIGN, WRITE_ATOMIC_DIO} for block devices.
  */
 void bdev_statx(struct inode *backing_inode, struct kstat *stat,
 		u32 request_mask)
 {
 	struct block_device *bdev;
 
-	if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC)))
+	if (!(request_mask & (STATX_DIOALIGN | STATX_WRITE_ATOMIC_DIO)))
 		return;
 
 	/*
@@ -1208,7 +1208,7 @@ void bdev_statx(struct inode *backing_inode, struct kstat *stat,
 		stat->result_mask |= STATX_DIOALIGN;
 	}
 
-	if (request_mask & STATX_WRITE_ATOMIC && bdev_can_atomic_write(bdev)) {
+	if (request_mask & STATX_WRITE_ATOMIC_DIO && bdev_can_atomic_write(bdev)) {
 		struct request_queue *bd_queue = bdev->bd_queue;
 
 		generic_fill_statx_atomic_writes(stat,
diff --git a/fs/stat.c b/fs/stat.c
index 0e296925a56b..0c0c4c22c563 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -95,18 +95,18 @@ EXPORT_SYMBOL(generic_fill_statx_attr);
  * @unit_min:	Minimum supported atomic write length in bytes
  * @unit_max:	Maximum supported atomic write length in bytes
  *
- * Fill in the STATX{_ATTR}_WRITE_ATOMIC flags in the kstat structure from
- * atomic write unit_min and unit_max values.
+ * Fill in the STATX{_ATTR}_WRITE_ATOMIC_DIO flags in the kstat structure
+ * from atomic write unit_min and unit_max values.
  */
 void generic_fill_statx_atomic_writes(struct kstat *stat,
 				      unsigned int unit_min,
 				      unsigned int unit_max)
 {
 	/* Confirm that the request type is known */
-	stat->result_mask |= STATX_WRITE_ATOMIC;
+	stat->result_mask |= STATX_WRITE_ATOMIC_DIO;
 
 	/* Confirm that the file attribute type is known */
-	stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
+	stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC_DIO;
 
 	if (unit_min) {
 		stat->atomic_write_unit_min = unit_min;
@@ -115,7 +115,7 @@ void generic_fill_statx_atomic_writes(struct kstat *stat,
 		stat->atomic_write_segments_max = 1;
 
 		/* Confirm atomic writes are actually supported */
-		stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
+		stat->attributes |= STATX_ATTR_WRITE_ATOMIC_DIO;
 	}
 }
 EXPORT_SYMBOL_GPL(generic_fill_statx_atomic_writes);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 2de0a9c4e6e0..37076176db67 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -650,7 +650,7 @@ xfs_vn_getattr(
 			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
 			stat->dio_offset_align = bdev_logical_block_size(bdev);
 		}
-		if (request_mask & STATX_WRITE_ATOMIC) {
+		if (request_mask & STATX_WRITE_ATOMIC_DIO) {
 			unsigned int unit_min, unit_max;
 
 			xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 319ef4afb89e..05f9720d4030 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -160,7 +160,7 @@ struct statx {
 #define STATX_MNT_ID		0x00001000U	/* Got stx_mnt_id */
 #define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
 #define STATX_MNT_ID_UNIQUE	0x00004000U	/* Want/got extended stx_mount_id */
-#define STATX_WRITE_ATOMIC	0x00008000U	/* Want/got atomic_write_* fields */
+#define STATX_WRITE_ATOMIC_DIO	0x00008000U	/* Want/got atomic_write_* fields for dio */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
@@ -196,7 +196,7 @@ struct statx {
 #define STATX_ATTR_MOUNT_ROOT		0x00002000 /* Root of a mount */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
 #define STATX_ATTR_DAX			0x00200000 /* File is currently in DAX state */
-#define STATX_ATTR_WRITE_ATOMIC		0x00400000 /* File supports atomic write operations */
+#define STATX_ATTR_WRITE_ATOMIC_DIO	0x00400000 /* File supports atomic write dio operations */
 
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.31.1



* [PATCH RFC 2/7] filemap: Change mapping_set_folio_min_order() -> mapping_set_folio_orders()
  2024-04-22 14:39 [PATCH RFC 0/7] buffered block atomic writes John Garry
  2024-04-22 14:39 ` [PATCH RFC 1/7] fs: Rename STATX{_ATTR}_WRITE_ATOMIC -> STATX{_ATTR}_WRITE_ATOMIC_DIO John Garry
@ 2024-04-22 14:39 ` John Garry
  2024-04-25 14:47   ` Pankaj Raghav (Samsung)
  2024-04-22 14:39 ` [PATCH RFC 3/7] mm: Add PG_atomic John Garry
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: John Garry @ 2024-04-22 14:39 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, akpm, willy, dchinner, tytso,
	hch, martin.petersen, nilay, ritesh.list, mcgrof
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm,
	ojaswin, p.raghav, jbongio, okiselev, John Garry

Borrowed from:

https://lore.kernel.org/linux-fsdevel/20240213093713.1753368-2-kernel@pankajraghav.com/
(credit given in due course)

We will need to use only a single folio order for buffered atomic
writes, so allow both the mapping folio order min and max to be set.

We still have the restriction of not being able to support order-1
folios - this limit will need to be lifted at some stage.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_icache.c     | 10 ++++++----
 include/linux/pagemap.h | 20 +++++++++++++-------
 mm/filemap.c            |  8 +++++++-
 3 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8fb5cf0f5a09..6186887bd6ff 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -89,8 +89,9 @@ xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
-	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
-				    M_IGEO(mp)->min_folio_order);
+	mapping_set_folio_orders(VFS_I(ip)->i_mapping,
+				    M_IGEO(mp)->min_folio_order,
+				    MAX_PAGECACHE_ORDER);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -325,8 +326,9 @@ xfs_reinit_inode(
 	inode->i_rdev = dev;
 	inode->i_uid = uid;
 	inode->i_gid = gid;
-	mapping_set_folio_min_order(inode->i_mapping,
-				    M_IGEO(mp)->min_folio_order);
+	mapping_set_folio_orders(inode->i_mapping,
+				    M_IGEO(mp)->min_folio_order,
+				    MAX_PAGECACHE_ORDER);
 	return error;
 }
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index fc8eb9c94e9c..c22455fa28a1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -363,9 +363,10 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 #endif
 
 /*
- * mapping_set_folio_min_order() - Set the minimum folio order
+ * mapping_set_folio_orders() - Set the minimum and max folio order
  * @mapping: The address_space.
  * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ * @max: Maximum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
  *
  * The filesystem should call this function in its inode constructor to
  * indicate which base size of folio the VFS can use to cache the contents
@@ -376,15 +377,20 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  * Context: This should not be called while the inode is active as it
  * is non-atomic.
  */
-static inline void mapping_set_folio_min_order(struct address_space *mapping,
-					       unsigned int min)
+
+static inline void mapping_set_folio_orders(struct address_space *mapping,
+					    unsigned int min, unsigned int max)
 {
-	if (min > MAX_PAGECACHE_ORDER)
-		min = MAX_PAGECACHE_ORDER;
+	if (min == 1)
+		min = 2;
+	if (max < min)
+		max = min;
+	if (max > MAX_PAGECACHE_ORDER)
+		max = MAX_PAGECACHE_ORDER;
 
 	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
 			 (min << AS_FOLIO_ORDER_MIN) |
-			 (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
+			 (max << AS_FOLIO_ORDER_MAX);
 }
 
 /**
@@ -400,7 +406,7 @@ static inline void mapping_set_folio_min_order(struct address_space *mapping,
  */
 static inline void mapping_set_large_folios(struct address_space *mapping)
 {
-	mapping_set_folio_min_order(mapping, 0);
+	mapping_set_folio_orders(mapping, 0, MAX_PAGECACHE_ORDER);
 }
 
 static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
diff --git a/mm/filemap.c b/mm/filemap.c
index d81530b0aac0..d5effe50ddcb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1898,9 +1898,15 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
 		unsigned int min_order = mapping_min_folio_order(mapping);
-		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
+		unsigned int max_order = mapping_max_folio_order(mapping);
+		unsigned int order = FGF_GET_ORDER(fgp_flags);
 		int err;
 
+		if (order > max_order)
+			order = max_order;
+		else if (order < min_order)
+			order = min_order;
+
 		index = mapping_align_start_index(mapping, index);
 
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
-- 
2.31.1



* [PATCH RFC 3/7] mm: Add PG_atomic
  2024-04-22 14:39 [PATCH RFC 0/7] buffered block atomic writes John Garry
  2024-04-22 14:39 ` [PATCH RFC 1/7] fs: Rename STATX{_ATTR}_WRITE_ATOMIC -> STATX{_ATTR}_WRITE_ATOMIC_DIO John Garry
  2024-04-22 14:39 ` [PATCH RFC 2/7] filemap: Change mapping_set_folio_min_order() -> mapping_set_folio_orders() John Garry
@ 2024-04-22 14:39 ` John Garry
  2024-04-22 14:39 ` [PATCH RFC 4/7] fs: Add initial buffered atomic write support info to statx John Garry
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: John Garry @ 2024-04-22 14:39 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, akpm, willy, dchinner, tytso,
	hch, martin.petersen, nilay, ritesh.list, mcgrof
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm,
	ojaswin, p.raghav, jbongio, okiselev, John Garry

Add page flag PG_atomic, meaning that a folio needs to be written back
atomically.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/linux/page-flags.h     | 5 +++++
 include/trace/events/mmflags.h | 3 ++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 652d77805e99..e777b2e7daaf 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -119,6 +119,7 @@ enum pageflags {
 	PG_reclaim,		/* To be reclaimed asap */
 	PG_swapbacked,		/* Page is backed by RAM/swap */
 	PG_unevictable,		/* Page is "unevictable"  */
+	PG_atomic,
 #ifdef CONFIG_MMU
 	PG_mlocked,		/* Page is vma mlocked */
 #endif
@@ -579,6 +580,10 @@ PAGEFLAG(Unevictable, unevictable, PF_HEAD)
 	__CLEARPAGEFLAG(Unevictable, unevictable, PF_HEAD)
 	TESTCLEARFLAG(Unevictable, unevictable, PF_HEAD)
 
+PAGEFLAG(Atomic, atomic, PF_HEAD)
+	__CLEARPAGEFLAG(Atomic, atomic, PF_HEAD)
+	TESTCLEARFLAG(Atomic, atomic, PF_HEAD)
+
 #ifdef CONFIG_MMU
 PAGEFLAG(Mlocked, mlocked, PF_NO_TAIL)
 	__CLEARPAGEFLAG(Mlocked, mlocked, PF_NO_TAIL)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index d801409b33cf..3c83e7b93898 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -118,7 +118,8 @@
 	DEF_PAGEFLAG_NAME(mappedtodisk),				\
 	DEF_PAGEFLAG_NAME(reclaim),					\
 	DEF_PAGEFLAG_NAME(swapbacked),					\
-	DEF_PAGEFLAG_NAME(unevictable)					\
+	DEF_PAGEFLAG_NAME(unevictable),					\
+	DEF_PAGEFLAG_NAME(atomic)					\
 IF_HAVE_PG_MLOCK(mlocked)						\
 IF_HAVE_PG_UNCACHED(uncached)						\
 IF_HAVE_PG_HWPOISON(hwpoison)						\
-- 
2.31.1



* [PATCH RFC 4/7] fs: Add initial buffered atomic write support info to statx
  2024-04-22 14:39 [PATCH RFC 0/7] buffered block atomic writes John Garry
                   ` (2 preceding siblings ...)
  2024-04-22 14:39 ` [PATCH RFC 3/7] mm: Add PG_atomic John Garry
@ 2024-04-22 14:39 ` John Garry
  2024-04-22 14:39 ` [PATCH RFC 5/7] fs: iomap: buffered atomic write support John Garry
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: John Garry @ 2024-04-22 14:39 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, akpm, willy, dchinner, tytso,
	hch, martin.petersen, nilay, ritesh.list, mcgrof
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm,
	ojaswin, p.raghav, jbongio, okiselev, John Garry

Extend statx system call to return additional info for buffered atomic
write support for a file. Currently only direct IO is supported.

New flags STATX_WRITE_ATOMIC_BUF and STATX_ATTR_WRITE_ATOMIC_BUF are for
indicating whether the file knows and supports buffered atomic writes.

Structure statx members stx_atomic_write_unit_{min, max, segments_max} will
be reused for buffered atomic writes. Flags STATX_WRITE_ATOMIC_DIO and
STATX_WRITE_ATOMIC_BUF are mutually exclusive: if both are set in the
request, neither flag will be set in statx.result_mask.

For buffered atomic writes, stx_atomic_write_unit_{min, max} must hold the
same value.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 block/bdev.c              |  3 ++-
 fs/stat.c                 | 26 ++++++++++++++++++--------
 fs/xfs/xfs_iops.c         |  2 +-
 include/linux/fs.h        |  3 ++-
 include/uapi/linux/stat.h |  2 ++
 5 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index e2a9951bd2e9..b80c78aed9f6 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1213,7 +1213,8 @@ void bdev_statx(struct inode *backing_inode, struct kstat *stat,
 
 		generic_fill_statx_atomic_writes(stat,
 			queue_atomic_write_unit_min_bytes(bd_queue),
-			queue_atomic_write_unit_max_bytes(bd_queue));
+			queue_atomic_write_unit_max_bytes(bd_queue),
+			true);
 	}
 
 	blkdev_put_no_open(bdev);
diff --git a/fs/stat.c b/fs/stat.c
index 0c0c4c22c563..cb8283534616 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -94,19 +94,26 @@ EXPORT_SYMBOL(generic_fill_statx_attr);
  * @stat:	Where to fill in the attribute flags
  * @unit_min:	Minimum supported atomic write length in bytes
  * @unit_max:	Maximum supported atomic write length in bytes
+ * @dio:	Whether filling in the fields for direct-IO
  *
- * Fill in the STATX{_ATTR}_WRITE_ATOMIC_DIO flags in the kstat structure
- * from atomic write unit_min and unit_max values.
+ * Fill in the STATX{_ATTR}_WRITE_ATOMIC_{DIO, BUF} flags in the kstat
+ * structure from atomic write unit_min and unit_max values.
  */
 void generic_fill_statx_atomic_writes(struct kstat *stat,
 				      unsigned int unit_min,
-				      unsigned int unit_max)
+				      unsigned int unit_max,
+				      bool dio)
 {
 	/* Confirm that the request type is known */
-	stat->result_mask |= STATX_WRITE_ATOMIC_DIO;
-
-	/* Confirm that the file attribute type is known */
-	stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC_DIO;
+	if (dio) {
+		/* Confirm that the request type is known */
+		stat->result_mask |= STATX_WRITE_ATOMIC_DIO;
+		/* Confirm that the file attribute type is known */
+		stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC_DIO;
+	} else {
+		stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC_BUF;
+		stat->result_mask |= STATX_WRITE_ATOMIC_BUF;
+	}
 
 	if (unit_min) {
 		stat->atomic_write_unit_min = unit_min;
@@ -115,7 +122,10 @@ void generic_fill_statx_atomic_writes(struct kstat *stat,
 		stat->atomic_write_segments_max = 1;
 
 		/* Confirm atomic writes are actually supported */
-		stat->attributes |= STATX_ATTR_WRITE_ATOMIC_DIO;
+		if (dio)
+			stat->attributes |= STATX_ATTR_WRITE_ATOMIC_DIO;
+		else
+			stat->attributes |= STATX_ATTR_WRITE_ATOMIC_BUF;
 	}
 }
 EXPORT_SYMBOL_GPL(generic_fill_statx_atomic_writes);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 37076176db67..05b20c88ff77 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -655,7 +655,7 @@ xfs_vn_getattr(
 
 			xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
 			generic_fill_statx_atomic_writes(stat,
-				unit_min, unit_max);
+				unit_min, unit_max, true);
 		}
 		fallthrough;
 	default:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 08a0b9a4da93..1147d031d5bd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3243,7 +3243,8 @@ void generic_fillattr(struct mnt_idmap *, u32, struct inode *, struct kstat *);
 void generic_fill_statx_attr(struct inode *inode, struct kstat *stat);
 void generic_fill_statx_atomic_writes(struct kstat *stat,
 				      unsigned int unit_min,
-				      unsigned int unit_max);
+				      unsigned int unit_max,
+				      bool dio);
 extern int vfs_getattr_nosec(const struct path *, struct kstat *, u32, unsigned int);
 extern int vfs_getattr(const struct path *, struct kstat *, u32, unsigned int);
 void __inode_add_bytes(struct inode *inode, loff_t bytes);
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 05f9720d4030..9eef921b3942 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -161,6 +161,7 @@ struct statx {
 #define STATX_DIOALIGN		0x00002000U	/* Want/got direct I/O alignment info */
 #define STATX_MNT_ID_UNIQUE	0x00004000U	/* Want/got extended stx_mount_id */
 #define STATX_WRITE_ATOMIC_DIO	0x00008000U	/* Want/got atomic_write_* fields for dio */
+#define STATX_WRITE_ATOMIC_BUF	0x00010000U	/* Want/got atomic_write_* fields for non-dio */
 
 #define STATX__RESERVED		0x80000000U	/* Reserved for future struct statx expansion */
 
@@ -197,6 +198,7 @@ struct statx {
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
 #define STATX_ATTR_DAX			0x00200000 /* File is currently in DAX state */
 #define STATX_ATTR_WRITE_ATOMIC_DIO	0x00400000 /* File supports atomic write dio operations */
+#define STATX_ATTR_WRITE_ATOMIC_BUF	0x00800000 /* File supports atomic write non-dio operations */
 
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.31.1



* [PATCH RFC 5/7] fs: iomap: buffered atomic write support
  2024-04-22 14:39 [PATCH RFC 0/7] buffered block atomic writes John Garry
                   ` (3 preceding siblings ...)
  2024-04-22 14:39 ` [PATCH RFC 4/7] fs: Add initial buffered atomic write support info to statx John Garry
@ 2024-04-22 14:39 ` John Garry
  2024-04-22 15:03   ` Matthew Wilcox
  2024-04-22 14:39 ` [PATCH RFC 6/7] fs: xfs: buffered atomic writes statx support John Garry
  2024-04-22 14:39 ` [PATCH RFC 7/7] fs: xfs: Enable buffered atomic writes John Garry
  6 siblings, 1 reply; 12+ messages in thread
From: John Garry @ 2024-04-22 14:39 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, akpm, willy, dchinner, tytso,
	hch, martin.petersen, nilay, ritesh.list, mcgrof
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm,
	ojaswin, p.raghav, jbongio, okiselev, John Garry

Add special handling of PG_atomic flag to iomap buffered write path.

To flag an iomap iter for an atomic write, set IOMAP_ATOMIC.

For a folio associated with a write which has IOMAP_ATOMIC set, set
PG_atomic.

Otherwise, when IOMAP_ATOMIC is unset, clear PG_atomic.

This means that an "atomic" folio which has not yet been written back
loses its "atomicity" when overwritten non-atomically. So if userspace
issues a write with RWF_ATOMIC set and then another write with RWF_ATOMIC
unset which fully or partially overwrites the same region as the first
write, that folio is not written back atomically. Such a scenario would
be considered a userspace usage error.

To ensure that a buffered atomic write is written back atomically when
the write syscall returns, RWF_SYNC or similar needs to be used (in
conjunction with RWF_ATOMIC).

As a safety check, when getting a folio for an atomic write in
iomap_get_folio(), ensure that the length matches the inode mapping folio
order-limit.

Only a single BIO should ever be submitted for an atomic write. So modify
iomap_add_to_ioend() to ensure that we don't try to write back an atomic
folio as part of a larger mixed-atomicity BIO.

In iomap_alloc_ioend(), handle an atomic write by setting REQ_ATOMIC for
the allocated BIO.

When a folio is written back, again clear PG_atomic, as it is no longer
required. I assume it will not be needlessly written back a second time...

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/buffered-io.c | 53 ++++++++++++++++++++++++++++++++++++------
 fs/iomap/trace.h       |  3 ++-
 include/linux/iomap.h  |  1 +
 3 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e8e41c8b3c0..ac2a014c91a9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -586,13 +586,25 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  */
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
+	struct address_space *mapping = iter->inode->i_mapping;
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
 	fgp |= fgf_set_order(len);
 
-	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
+	if (iter->flags & IOMAP_ATOMIC) {
+		unsigned int min_order = mapping_min_folio_order(mapping);
+		unsigned int max_order = mapping_max_folio_order(mapping);
+		unsigned int order = FGF_GET_ORDER(fgp);
+
+		if (order != min_order)
+			return ERR_PTR(-EINVAL);
+		if (order != max_order)
+			return ERR_PTR(-EINVAL);
+	}
+
+	return __filemap_get_folio(mapping, pos >> PAGE_SHIFT,
 			fgp, mapping_gfp_mask(iter->inode->i_mapping));
 }
 EXPORT_SYMBOL_GPL(iomap_get_folio);
@@ -769,6 +781,7 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 {
 	const struct iomap_folio_ops *folio_ops = iter->iomap.folio_ops;
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
+	bool is_atomic = iter->flags & IOMAP_ATOMIC;
 	struct folio *folio;
 	int status = 0;
 
@@ -786,6 +799,11 @@ static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
 	if (IS_ERR(folio))
 		return PTR_ERR(folio);
 
+	if (is_atomic)
+		folio_set_atomic(folio);
+	else
+		folio_clear_atomic(folio);
+
 	/*
 	 * Now we have a locked folio, before we do anything with it we need to
 	 * check that the iomap we have cached is not stale. The inode extent
@@ -1010,6 +1028,8 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *i,
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iter.flags |= IOMAP_NOWAIT;
+	if (iocb->ki_flags & IOCB_ATOMIC)
+		iter.flags |= IOMAP_ATOMIC;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
 		iter.processed = iomap_write_iter(&iter, i);
@@ -1499,8 +1519,10 @@ static void iomap_finish_folio_write(struct inode *inode, struct folio *folio,
 	WARN_ON_ONCE(i_blocks_per_folio(inode, folio) > 1 && !ifs);
 	WARN_ON_ONCE(ifs && atomic_read(&ifs->write_bytes_pending) <= 0);
 
-	if (!ifs || atomic_sub_and_test(len, &ifs->write_bytes_pending))
+	if (!ifs || atomic_sub_and_test(len, &ifs->write_bytes_pending)) {
+		folio_clear_atomic(folio);
 		folio_end_writeback(folio);
+	}
 }
 
 /*
@@ -1679,14 +1701,18 @@ static int iomap_submit_ioend(struct iomap_writepage_ctx *wpc, int error)
 }
 
 static struct iomap_ioend *iomap_alloc_ioend(struct iomap_writepage_ctx *wpc,
-		struct writeback_control *wbc, struct inode *inode, loff_t pos)
+		struct writeback_control *wbc, struct inode *inode, loff_t pos,
+		bool atomic)
 {
+	blk_opf_t opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
 	struct iomap_ioend *ioend;
 	struct bio *bio;
 
+	if (atomic)
+		opf |= REQ_ATOMIC;
+
 	bio = bio_alloc_bioset(wpc->iomap.bdev, BIO_MAX_VECS,
-			       REQ_OP_WRITE | wbc_to_write_flags(wbc),
-			       GFP_NOFS, &iomap_ioend_bioset);
+			       opf, GFP_NOFS, &iomap_ioend_bioset);
 	bio->bi_iter.bi_sector = iomap_sector(&wpc->iomap, pos);
 	bio->bi_end_io = iomap_writepage_end_bio;
 	wbc_init_bio(wbc, bio);
@@ -1744,14 +1770,27 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 {
 	struct iomap_folio_state *ifs = folio->private;
 	size_t poff = offset_in_folio(folio, pos);
+	bool is_atomic = folio_test_atomic(folio);
 	int error;
 
-	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
+	if (!wpc->ioend || is_atomic || !iomap_can_add_to_ioend(wpc, pos)) {
 new_ioend:
 		error = iomap_submit_ioend(wpc, 0);
 		if (error)
 			return error;
-		wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos);
+		wpc->ioend = iomap_alloc_ioend(wpc, wbc, inode, pos, is_atomic);
+	}
+
+	/* We must not append anything later if atomic, so submit now */
+	if (is_atomic) {
+		if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff))
+			return -EINVAL;
+		wpc->ioend->io_size = len;
+		wbc_account_cgroup_owner(wbc, &folio->page, len);
+		if (ifs)
+			atomic_add(len, &ifs->write_bytes_pending);
+
+		return iomap_submit_ioend(wpc, 0);
 	}
 
 	if (!bio_add_folio(&wpc->ioend->io_bio, folio, len, poff))
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 0a991c4ce87d..4118a42cdab0 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
 	{ IOMAP_REPORT,		"REPORT" }, \
 	{ IOMAP_FAULT,		"FAULT" }, \
 	{ IOMAP_DIRECT,		"DIRECT" }, \
-	{ IOMAP_NOWAIT,		"NOWAIT" }
+	{ IOMAP_NOWAIT,		"NOWAIT" }, \
+	{ IOMAP_ATOMIC,		"ATOMIC" }
 
 #define IOMAP_F_FLAGS_STRINGS \
 	{ IOMAP_F_NEW,		"NEW" }, \
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f726f0058fd6..2f50abe06f27 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -179,6 +179,7 @@ struct iomap_folio_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
+#define IOMAP_ATOMIC	(1 << 9)
 
 struct iomap_ops {
 	/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH RFC 6/7] fs: xfs: buffered atomic writes statx support
  2024-04-22 14:39 [PATCH RFC 0/7] buffered block atomic writes John Garry
                   ` (4 preceding siblings ...)
  2024-04-22 14:39 ` [PATCH RFC 5/7] fs: iomap: buffered atomic write support John Garry
@ 2024-04-22 14:39 ` John Garry
  2024-04-22 14:39 ` [PATCH RFC 7/7] fs: xfs: Enable buffered atomic writes John Garry
  6 siblings, 0 replies; 12+ messages in thread
From: John Garry @ 2024-04-22 14:39 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, akpm, willy, dchinner, tytso,
	hch, martin.petersen, nilay, ritesh.list, mcgrof
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm,
	ojaswin, p.raghav, jbongio, okiselev, John Garry

For filling in the statx fields, we use the extent alignment as the
buffered atomic write size. Only a single size is permitted.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iops.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 05b20c88ff77..d2226df567ca 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -650,12 +650,19 @@ xfs_vn_getattr(
 			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
 			stat->dio_offset_align = bdev_logical_block_size(bdev);
 		}
-		if (request_mask & STATX_WRITE_ATOMIC_DIO) {
+		if (request_mask & STATX_WRITE_ATOMIC_DIO &&
+		    !(request_mask & STATX_WRITE_ATOMIC_BUF)) {
 			unsigned int unit_min, unit_max;
 
 			xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
 			generic_fill_statx_atomic_writes(stat,
 				unit_min, unit_max, true);
+		} else if (request_mask & STATX_WRITE_ATOMIC_BUF) {
+			unsigned int unit_min, unit_max;
+
+			xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+			generic_fill_statx_atomic_writes(stat,
+				unit_max, unit_max, false);
 		}
 		fallthrough;
 	default:
-- 
2.31.1



* [PATCH RFC 7/7] fs: xfs: Enable buffered atomic writes
  2024-04-22 14:39 [PATCH RFC 0/7] buffered block atomic writes John Garry
                   ` (5 preceding siblings ...)
  2024-04-22 14:39 ` [PATCH RFC 6/7] fs: xfs: buffered atomic writes statx support John Garry
@ 2024-04-22 14:39 ` John Garry
  6 siblings, 0 replies; 12+ messages in thread
From: John Garry @ 2024-04-22 14:39 UTC (permalink / raw)
  To: axboe, brauner, djwong, viro, jack, akpm, willy, dchinner, tytso,
	hch, martin.petersen, nilay, ritesh.list, mcgrof
  Cc: linux-block, linux-kernel, linux-xfs, linux-fsdevel, linux-mm,
	ojaswin, p.raghav, jbongio, okiselev, John Garry

Enable support for buffered atomic writes, in addition to the already
supported direct IO atomic writes.

The folio mapping min and max order are both set to the same size for an
inode with FS_XFLAG_ATOMICWRITES set. That size is the extent alignment
size.

Atomic write support depends on forcealign. For forcealign, extent sizes
need to be a power-of-2 and naturally aligned, and this matches folios
nicely.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_inode_buf.c |  8 ++++++++
 fs/xfs/xfs_file.c             | 12 ++++++++++--
 fs/xfs/xfs_ioctl.c            |  3 +++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index abaef1137b97..38e058756b1e 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -181,6 +181,7 @@ xfs_inode_from_disk(
 	struct inode		*inode = VFS_I(ip);
 	int			error;
 	xfs_failaddr_t		fa;
+	struct xfs_mount	*mp = ip->i_mount;
 
 	ASSERT(ip->i_cowfp == NULL);
 
@@ -261,6 +262,13 @@ xfs_inode_from_disk(
 	}
 	if (xfs_is_reflink_inode(ip))
 		xfs_ifork_init_cow(ip);
+
+	if (xfs_inode_atomicwrites(ip)) {
+		unsigned int folio_order = ffs(XFS_B_TO_FSB(mp, ip->i_extsize)) - 1;
+
+		mapping_set_folio_orders(VFS_I(ip)->i_mapping, folio_order, folio_order);
+	}
+
 	return 0;
 
 out_destroy_data_fork:
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2fbefd60d753..d35869b5e4ce 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -782,6 +782,16 @@ xfs_file_buffered_write(
 	ssize_t			ret;
 	bool			cleared_space = false;
 	unsigned int		iolock;
+	struct xfs_mount	*mp = ip->i_mount;
+
+	if (iocb->ki_flags & IOCB_ATOMIC) {
+		unsigned int extsz_bytes = XFS_FSB_TO_B(mp, ip->i_extsize);
+
+		if (!generic_atomic_write_valid_size(iocb->ki_pos, from,
+			extsz_bytes, extsz_bytes)) {
+			return -EINVAL;
+		}
+	}
 
 write_retry:
 	iolock = XFS_IOLOCK_EXCL;
@@ -1241,8 +1251,6 @@ static bool xfs_file_open_can_atomicwrite(
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
 
-	if (!(file->f_flags & O_DIRECT))
-		return false;
 
 	if (!xfs_inode_atomicwrites(ip))
 		return false;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d115f2601921..d6b146c999f6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1169,10 +1169,13 @@ xfs_ioctl_setattr_xflags(
 	}
 
 	if (atomic_writes) {
+		unsigned int folio_order = ffs(XFS_B_TO_FSB(mp, fa->fsx_extsize)) - 1;
+
 		if (!xfs_has_atomicwrites(mp))
 			return -EINVAL;
 		if (!(fa->fsx_xflags & FS_XFLAG_FORCEALIGN))
 			return -EINVAL;
+		mapping_set_folio_orders(VFS_I(ip)->i_mapping, folio_order, folio_order);
 	}
 
 	ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
-- 
2.31.1



* Re: [PATCH RFC 5/7] fs: iomap: buffered atomic write support
  2024-04-22 14:39 ` [PATCH RFC 5/7] fs: iomap: buffered atomic write support John Garry
@ 2024-04-22 15:03   ` Matthew Wilcox
  2024-04-22 16:02     ` John Garry
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2024-04-22 15:03 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, djwong, viro, jack, akpm, dchinner, tytso, hch,
	martin.petersen, nilay, ritesh.list, mcgrof, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm, ojaswin,
	p.raghav, jbongio, okiselev

On Mon, Apr 22, 2024 at 02:39:21PM +0000, John Garry wrote:
> Add special handling of PG_atomic flag to iomap buffered write path.
> 
> To flag an iomap iter for an atomic write, set IOMAP_ATOMIC.
> 
> For a folio associated with a write which has IOMAP_ATOMIC set, set
> PG_atomic.
> 
> Otherwise, when IOMAP_ATOMIC is unset, clear PG_atomic.
> 
> This means that for an "atomic" folio which has not been written back, it
> loses its "atomicity". So if userspace issues a write with RWF_ATOMIC set
> and another write with RWF_ATOMIC unset and which fully or partially
> overwrites that same region as the first write, that folio is not written
> back atomically. For such a scenario to occur, it would be considered a
> userspace usage error.
> 
> To ensure that a buffered atomic write is written back atomically when
> the write syscall returns, RWF_SYNC or similar needs to be used (in
> conjunction with RWF_ATOMIC).
> 
> As a safety check, when getting a folio for an atomic write in
> iomap_get_folio(), ensure that the length matches the inode mapping folio
> order-limit.
> 
> Only a single BIO should ever be submitted for an atomic write. So modify
> iomap_add_to_ioend() to ensure that we don't try to write back an atomic
> folio as part of a larger mixed-atomicity BIO.
> 
> In iomap_alloc_ioend(), handle an atomic write by setting REQ_ATOMIC for
> the allocated BIO.
> 
> When a folio is written back, again clear PG_atomic, as it is no longer
> required. I assume it will not be needlessly written back a second time...

I'm not taking a position on the mechanism yet; need to think about it
some more.  But there's a hole here I also don't have a solution to,
so we can all start thinking about it.

In iomap_write_iter(), we call copy_folio_from_iter_atomic().  Through no
fault of the application, if the range crosses a page boundary, we might
partially copy the bytes from the first page, then take a page fault on
the second page, hence doing a short write into the folio.  And there's
nothing preventing writeback from writing back a partially copied folio.

Now, if it's not dirty, then it can't be written back.  So if we're
doing an atomic write, we could clear the dirty bit after calling
iomap_write_begin() (given the usage scenarios we've discussed, it should
always be clear ...)

We need to prevent the "fall back to a short copy" logic in
iomap_write_iter() as well.  But then we also need to make sure we don't
get stuck in a loop, so maybe go three times around, and if it's still
not readable as a chunk, -EFAULT?


* Re: [PATCH RFC 5/7] fs: iomap: buffered atomic write support
  2024-04-22 15:03   ` Matthew Wilcox
@ 2024-04-22 16:02     ` John Garry
  0 siblings, 0 replies; 12+ messages in thread
From: John Garry @ 2024-04-22 16:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: axboe, brauner, djwong, viro, jack, akpm, dchinner, tytso, hch,
	martin.petersen, nilay, ritesh.list, mcgrof, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm, ojaswin,
	p.raghav, jbongio, okiselev

On 22/04/2024 16:03, Matthew Wilcox wrote:
> On Mon, Apr 22, 2024 at 02:39:21PM +0000, John Garry wrote:
>> Add special handling of PG_atomic flag to iomap buffered write path.
>>
>> To flag an iomap iter for an atomic write, set IOMAP_ATOMIC.
>>
>> For a folio associated with a write which has IOMAP_ATOMIC set, set
>> PG_atomic.
>>
>> Otherwise, when IOMAP_ATOMIC is unset, clear PG_atomic.
>>
>> This means that for an "atomic" folio which has not been written back, it
>> loses its "atomicity". So if userspace issues a write with RWF_ATOMIC set
>> and another write with RWF_ATOMIC unset and which fully or partially
>> overwrites that same region as the first write, that folio is not written
>> back atomically. For such a scenario to occur, it would be considered a
>> userspace usage error.
>>
>> To ensure that a buffered atomic write is written back atomically when
>> the write syscall returns, RWF_SYNC or similar needs to be used (in
>> conjunction with RWF_ATOMIC).
>>
>> As a safety check, when getting a folio for an atomic write in
>> iomap_get_folio(), ensure that the length matches the inode mapping folio
>> order-limit.
>>
>> Only a single BIO should ever be submitted for an atomic write. So modify
>> iomap_add_to_ioend() to ensure that we don't try to write back an atomic
>> folio as part of a larger mixed-atomicity BIO.
>>
>> In iomap_alloc_ioend(), handle an atomic write by setting REQ_ATOMIC for
>> the allocated BIO.
>>
>> When a folio is written back, again clear PG_atomic, as it is no longer
>> required. I assume it will not be needlessly written back a second time...
> 
> I'm not taking a position on the mechanism yet; need to think about it
> some more.  But there's a hole here I also don't have a solution to,
> so we can all start thinking about it.
> 
> In iomap_write_iter(), we call copy_folio_from_iter_atomic().  Through no
> fault of the application, if the range crosses a page boundary, we might
> partially copy the bytes from the first page, then take a page fault on
> the second page, hence doing a short write into the folio.  And there's
> nothing preventing writeback from writing back a partially copied folio.
> 
> Now, if it's not dirty, then it can't be written back.  So if we're
> doing an atomic write, we could clear the dirty bit after calling
> iomap_write_begin() (given the usage scenarios we've discussed, it should
> always be clear ...)
> 
> We need to prevent the "fall back to a short copy" logic in
> iomap_write_iter() as well.  But then we also need to make sure we don't
> get stuck in a loop, so maybe go three times around, and if it's still
> not readable as a chunk, -EFAULT?
This idea sounds reasonable. So at what stage would the dirty flag be 
set? Would it be only when all bytes are copied successfully as a single 
chunk?

FWIW, we do have somewhat equivalent handling in the direct IO path: if 
the iomap iter loops more than once, such that we would need to create 
more than one bio in the DIO bio submission handler, then we return 
-EINVAL, as something has gone wrong. But that's not so relevant here.

Thanks,
John


* Re: [PATCH RFC 2/7] filemap: Change mapping_set_folio_min_order() -> mapping_set_folio_orders()
  2024-04-22 14:39 ` [PATCH RFC 2/7] filemap: Change mapping_set_folio_min_order() -> mapping_set_folio_orders() John Garry
@ 2024-04-25 14:47   ` Pankaj Raghav (Samsung)
  2024-04-26  8:02     ` John Garry
  0 siblings, 1 reply; 12+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 14:47 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, brauner, djwong, viro, jack, akpm, willy, dchinner, tytso,
	hch, martin.petersen, nilay, ritesh.list, mcgrof, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm, ojaswin,
	p.raghav, jbongio, okiselev

On Mon, Apr 22, 2024 at 02:39:18PM +0000, John Garry wrote:
> Borrowed from:
> 
> https://lore.kernel.org/linux-fsdevel/20240213093713.1753368-2-kernel@pankajraghav.com/
> (credit given in due course)
> 
> We will need to be able to only use a single folio order for buffered
> atomic writes, so allow the mapping folio order min and max to be set.

> 
> We still have the restriction of not being able to support order-1
> folios - it will be required to lift this limit at some stage.

This is already supported upstream for file-backed folios:
commit: 8897277acfef7f70fdecc054073bea2542fc7a1b

> index fc8eb9c94e9c..c22455fa28a1 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -363,9 +363,10 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>  #endif
>  
>  /*
> - * mapping_set_folio_min_order() - Set the minimum folio order
> + * mapping_set_folio_orders() - Set the minimum and max folio order

In the new series (sorry forgot to CC you), I added a new helper called
mapping_set_folio_order_range() which does something similar to avoid
confusion based on willy's suggestion:
https://lore.kernel.org/linux-xfs/20240425113746.335530-3-kernel@pankajraghav.com/

mapping_set_folio_min_order() also sets the max folio order to
MAX_PAGECACHE_ORDER anyway. So there is no need to call it explicitly
here?

>  /**
> @@ -400,7 +406,7 @@ static inline void mapping_set_folio_min_order(struct address_space *mapping,
>   */
>  static inline void mapping_set_large_folios(struct address_space *mapping)
>  {
> -	mapping_set_folio_min_order(mapping, 0);
> +	mapping_set_folio_orders(mapping, 0, MAX_PAGECACHE_ORDER);
>  }
>  
>  static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d81530b0aac0..d5effe50ddcb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1898,9 +1898,15 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  no_page:
>  	if (!folio && (fgp_flags & FGP_CREAT)) {
>  		unsigned int min_order = mapping_min_folio_order(mapping);
> -		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
> +		unsigned int max_order = mapping_max_folio_order(mapping);
> +		unsigned int order = FGF_GET_ORDER(fgp_flags);
>  		int err;
>  
> +		if (order > max_order)
> +			order = max_order;
> +		else if (order < min_order)
> +			order = max_order;

order = min_order; ?
--
Pankaj


* Re: [PATCH RFC 2/7] filemap: Change mapping_set_folio_min_order() -> mapping_set_folio_orders()
  2024-04-25 14:47   ` Pankaj Raghav (Samsung)
@ 2024-04-26  8:02     ` John Garry
  0 siblings, 0 replies; 12+ messages in thread
From: John Garry @ 2024-04-26  8:02 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: axboe, brauner, djwong, viro, jack, akpm, willy, dchinner, tytso,
	hch, martin.petersen, nilay, ritesh.list, mcgrof, linux-block,
	linux-kernel, linux-xfs, linux-fsdevel, linux-mm, ojaswin,
	p.raghav, jbongio, okiselev

On 25/04/2024 15:47, Pankaj Raghav (Samsung) wrote:
> On Mon, Apr 22, 2024 at 02:39:18PM +0000, John Garry wrote:
>> Borrowed from:
>>
>> https://lore.kernel.org/linux-fsdevel/20240213093713.1753368-2-kernel@pankajraghav.com/
>> (credit given in due course)
>>
>> We will need to be able to only use a single folio order for buffered
>> atomic writes, so allow the mapping folio order min and max to be set.
> 
>>
>> We still have the restriction of not being able to support order-1
>> folios - it will be required to lift this limit at some stage.
> 
> This is already supported upstream for file-backed folios:
> commit: 8897277acfef7f70fdecc054073bea2542fc7a1b

ok

> 
>> index fc8eb9c94e9c..c22455fa28a1 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -363,9 +363,10 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>>   #endif
>>   
>>   /*
>> - * mapping_set_folio_min_order() - Set the minimum folio order
>> + * mapping_set_folio_orders() - Set the minimum and max folio order
> 
> In the new series (sorry forgot to CC you),

no worries, I saw it

> I added a new helper called
> mapping_set_folio_order_range() which does something similar to avoid
> confusion based on willy's suggestion:
> https://lore.kernel.org/linux-xfs/20240425113746.335530-3-kernel@pankajraghav.com/
> 

Fine, I can include that

> mapping_set_folio_min_order() also sets the max folio order to
> MAX_PAGECACHE_ORDER anyway. So there is no need to call it explicitly
> here?
> 

Here mapping_set_folio_min_order() is being replaced with 
mapping_set_folio_order_range(), so I'm not sure why you mention that. 
Regardless, I'll use your mapping_set_folio_order_range().

>>   /**
>> @@ -400,7 +406,7 @@ static inline void mapping_set_folio_min_order(struct address_space *mapping,
>>    */
>>   static inline void mapping_set_large_folios(struct address_space *mapping)
>>   {
>> -	mapping_set_folio_min_order(mapping, 0);
>> +	mapping_set_folio_orders(mapping, 0, MAX_PAGECACHE_ORDER);
>>   }
>>   
>>   static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index d81530b0aac0..d5effe50ddcb 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1898,9 +1898,15 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>>   no_page:
>>   	if (!folio && (fgp_flags & FGP_CREAT)) {
>>   		unsigned int min_order = mapping_min_folio_order(mapping);
>> -		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
>> +		unsigned int max_order = mapping_max_folio_order(mapping);
>> +		unsigned int order = FGF_GET_ORDER(fgp_flags);
>>   		int err;
>>   
>> +		if (order > max_order)
>> +			order = max_order;
>> +		else if (order < min_order)
>> +			order = max_order;
> 
> order = min_order; ?

right

Thanks,
John


end of thread, other threads:[~2024-04-26  8:03 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-22 14:39 [PATCH RFC 0/7] buffered block atomic writes John Garry
2024-04-22 14:39 ` [PATCH RFC 1/7] fs: Rename STATX{_ATTR}_WRITE_ATOMIC -> STATX{_ATTR}_WRITE_ATOMIC_DIO John Garry
2024-04-22 14:39 ` [PATCH RFC 2/7] filemap: Change mapping_set_folio_min_order() -> mapping_set_folio_orders() John Garry
2024-04-25 14:47   ` Pankaj Raghav (Samsung)
2024-04-26  8:02     ` John Garry
2024-04-22 14:39 ` [PATCH RFC 3/7] mm: Add PG_atomic John Garry
2024-04-22 14:39 ` [PATCH RFC 4/7] fs: Add initial buffered atomic write support info to statx John Garry
2024-04-22 14:39 ` [PATCH RFC 5/7] fs: iomap: buffered atomic write support John Garry
2024-04-22 15:03   ` Matthew Wilcox
2024-04-22 16:02     ` John Garry
2024-04-22 14:39 ` [PATCH RFC 6/7] fs: xfs: buffered atomic writes statx support John Garry
2024-04-22 14:39 ` [PATCH RFC 7/7] fs: xfs: Enable buffered atomic writes John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).