linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/6] block atomic writes for XFS
@ 2024-01-24 14:26 John Garry
  2024-01-24 14:26 ` [PATCH 1/6] fs: iomap: Atomic write support John Garry
                   ` (8 more replies)
  0 siblings, 9 replies; 68+ messages in thread
From: John Garry @ 2024-01-24 14:26 UTC (permalink / raw)
  To: hch, djwong, viro, brauner, dchinner, jack, chandan.babu
  Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin, John Garry

This series expands atomic write support to filesystems, specifically
XFS. Since XFS rtvol already supports extent alignment, support is
initially added there. Once the XFS forcealign feature is merged, we can
similarly support atomic writes for a non-rtvol filesystem.

Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.

For XFS rtvol, support can be enabled through xfs_io command:
$xfs_io -c "chattr +W" filename
$xfs_io -c "lsattr -v" filename
[realtime, atomic-writes] filename

The FS needs to be formatted with a specific extent alignment size, like:
mkfs.xfs -r rtdev=/dev/sdb,extsize=16K -d rtinherit=1 /dev/sda

This enables 16K atomic write support. Note that enabling atomic writes
with xfs_io does not check whether the underlying HW actually supports
16K atomic writes, so statx must be issued for a file to discover its
atomic write limits.

Supporting a non-rtvol FS will require forcealign to be enabled. As such,
a dedicated xfs_io command to enable atomic writes for a regular FS may
be useful; it would enable FS_XFLAG_ATOMICWRITES, enable forcealign,
and set an extent alignment hint.

Baseline is following series (which is based on v6.8-rc1):
https://lore.kernel.org/linux-nvme/20240124113841.31824-1-john.g.garry@oracle.com/T/#m4ad28b480a8e12eb51467e17208d98ca50041ff2

Basic xfsprogs support at:
https://github.com/johnpgarry/xfsprogs-dev/tree/atomicwrites

John Garry (6):
  fs: iomap: Atomic write support
  fs: Add FS_XFLAG_ATOMICWRITES flag
  fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol
  fs: xfs: Support atomic write for statx
  fs: xfs: iomap atomic write support
  fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set

 fs/iomap/direct-io.c       | 21 +++++++++++++++++-
 fs/iomap/trace.h           |  3 ++-
 fs/xfs/libxfs/xfs_format.h |  8 +++++--
 fs/xfs/libxfs/xfs_sb.c     |  2 ++
 fs/xfs/xfs_file.c          |  2 ++
 fs/xfs/xfs_inode.c         | 22 +++++++++++++++++++
 fs/xfs/xfs_inode.h         |  7 ++++++
 fs/xfs/xfs_ioctl.c         | 19 ++++++++++++++--
 fs/xfs/xfs_iomap.c         | 41 ++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iops.c          | 45 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iops.h          |  4 ++++
 fs/xfs/xfs_mount.h         |  2 ++
 fs/xfs/xfs_super.c         |  4 ++++
 include/linux/iomap.h      |  1 +
 include/uapi/linux/fs.h    |  1 +
 15 files changed, 176 insertions(+), 6 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH 1/6] fs: iomap: Atomic write support
  2024-01-24 14:26 [PATCH 0/6] block atomic writes for XFS John Garry
@ 2024-01-24 14:26 ` John Garry
  2024-02-02 17:25   ` Darrick J. Wong
  2024-02-05 15:20   ` Pankaj Raghav (Samsung)
  2024-01-24 14:26 ` [PATCH 2/6] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 68+ messages in thread
From: John Garry @ 2024-01-24 14:26 UTC (permalink / raw)
  To: hch, djwong, viro, brauner, dchinner, jack, chandan.babu
  Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin, John Garry

Add flag IOMAP_ATOMIC_WRITE to indicate to the FS that an atomic write
bio is being created and all the rules there need to be followed.

It is the task of the FS iomap iter callbacks to ensure that the mapping
created adheres to those rules, like size is power-of-2, is at a
naturally-aligned offset, etc. However, checking for a single iovec, i.e.
iter type is ubuf, is done in __iomap_dio_rw().

A write should only produce a single bio, so error when it doesn't.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/iomap/direct-io.c  | 21 ++++++++++++++++++++-
 fs/iomap/trace.h      |  3 ++-
 include/linux/iomap.h |  1 +
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..25736d01b857 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -275,10 +275,12 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
 static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		struct iomap_dio *dio)
 {
+	bool atomic_write = iter->flags & IOMAP_ATOMIC;
 	const struct iomap *iomap = &iter->iomap;
 	struct inode *inode = iter->inode;
 	unsigned int fs_block_size = i_blocksize(inode), pad;
 	loff_t length = iomap_length(iter);
+	const size_t iter_len = iter->len;
 	loff_t pos = iter->pos;
 	blk_opf_t bio_opf;
 	struct bio *bio;
@@ -381,6 +383,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 					  GFP_KERNEL);
 		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
 		bio->bi_ioprio = dio->iocb->ki_ioprio;
+		if (atomic_write)
+			bio->bi_opf |= REQ_ATOMIC;
+
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
 
@@ -397,6 +402,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		}
 
 		n = bio->bi_iter.bi_size;
+		if (atomic_write && n != iter_len) {
+			/* This bio should have covered the complete length */
+			ret = -EINVAL;
+			bio_put(bio);
+			goto out;
+		}
 		if (dio->flags & IOMAP_DIO_WRITE) {
 			task_io_account_write(n);
 		} else {
@@ -554,12 +565,17 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	struct blk_plug plug;
 	struct iomap_dio *dio;
 	loff_t ret = 0;
+	bool is_read = iov_iter_rw(iter) == READ;
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
 
 	trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
 
 	if (!iomi.len)
 		return NULL;
 
+	if (atomic_write && !iter_is_ubuf(iter))
+		return ERR_PTR(-EINVAL);
+
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	if (!dio)
 		return ERR_PTR(-ENOMEM);
@@ -579,7 +595,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		iomi.flags |= IOMAP_NOWAIT;
 
-	if (iov_iter_rw(iter) == READ) {
+	if (is_read) {
 		/* reads can always complete inline */
 		dio->flags |= IOMAP_DIO_INLINE_COMP;
 
@@ -605,6 +621,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		if (iocb->ki_flags & IOCB_DIO_CALLER_COMP)
 			dio->flags |= IOMAP_DIO_CALLER_COMP;
 
+		if (atomic_write)
+			iomi.flags |= IOMAP_ATOMIC;
+
 		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
 			ret = -EAGAIN;
 			if (iomi.pos >= dio->i_size ||
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index c16fd55f5595..c95576420bca 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
 	{ IOMAP_REPORT,		"REPORT" }, \
 	{ IOMAP_FAULT,		"FAULT" }, \
 	{ IOMAP_DIRECT,		"DIRECT" }, \
-	{ IOMAP_NOWAIT,		"NOWAIT" }
+	{ IOMAP_NOWAIT,		"NOWAIT" }, \
+	{ IOMAP_ATOMIC,		"ATOMIC" }
 
 #define IOMAP_F_FLAGS_STRINGS \
 	{ IOMAP_F_NEW,		"NEW" }, \
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 96dd0acbba44..9eac704a0d6f 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -178,6 +178,7 @@ struct iomap_folio_ops {
 #else
 #define IOMAP_DAX		0
 #endif /* CONFIG_FS_DAX */
+#define IOMAP_ATOMIC		(1 << 9)
 
 struct iomap_ops {
 	/*
-- 
2.31.1



* [PATCH 2/6] fs: Add FS_XFLAG_ATOMICWRITES flag
  2024-01-24 14:26 [PATCH 0/6] block atomic writes for XFS John Garry
  2024-01-24 14:26 ` [PATCH 1/6] fs: iomap: Atomic write support John Garry
@ 2024-01-24 14:26 ` John Garry
  2024-02-02 17:57   ` Darrick J. Wong
  2024-01-24 14:26 ` [PATCH 3/6] fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol John Garry
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-01-24 14:26 UTC (permalink / raw)
  To: hch, djwong, viro, brauner, dchinner, jack, chandan.babu
  Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin, John Garry

Add a flag indicating that a regular file is enabled for atomic writes.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 include/uapi/linux/fs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index a0975ae81e64..b5b4e1db9576 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -140,6 +140,7 @@ struct fsxattr {
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
+#define FS_XFLAG_ATOMICWRITES	0x00020000	/* atomic writes enabled */
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is
-- 
2.31.1



* [PATCH 3/6] fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol
  2024-01-24 14:26 [PATCH 0/6] block atomic writes for XFS John Garry
  2024-01-24 14:26 ` [PATCH 1/6] fs: iomap: Atomic write support John Garry
  2024-01-24 14:26 ` [PATCH 2/6] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
@ 2024-01-24 14:26 ` John Garry
  2024-02-02 17:52   ` Darrick J. Wong
  2024-01-24 14:26 ` [PATCH 4/6] fs: xfs: Support atomic write for statx John Garry
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-01-24 14:26 UTC (permalink / raw)
  To: hch, djwong, viro, brauner, dchinner, jack, chandan.babu
  Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin, John Garry

Add initial support for FS_XFLAG_ATOMICWRITES in rtvol.

Current kernel support for atomic writes is based on HW support. As such,
extent alignment with atomic_write_unit_max must be ensured so that an
atomic write can result in a single HW-compliant IO operation.

rtvol already guarantees extent alignment, so initially add support there.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h |  8 ++++++--
 fs/xfs/libxfs/xfs_sb.c     |  2 ++
 fs/xfs/xfs_inode.c         | 22 ++++++++++++++++++++++
 fs/xfs/xfs_inode.h         |  7 +++++++
 fs/xfs/xfs_ioctl.c         | 19 +++++++++++++++++--
 fs/xfs/xfs_mount.h         |  2 ++
 fs/xfs/xfs_super.c         |  4 ++++
 7 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 382ab1e71c0b..79fb0d4adeda 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -353,11 +353,13 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
 #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
+#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 29)	/* aligned file data extents */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
 		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
-		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
+		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
+		 XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
@@ -1085,16 +1087,18 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
 #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
 #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
+#define XFS_DIFLAG2_ATOMICWRITES_BIT 6
 
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
 #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
 #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
 #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
+#define XFS_DIFLAG2_ATOMICWRITES	(1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
 
 #define XFS_DIFLAG2_ANY \
 	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
-	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
+	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_ATOMICWRITES)
 
 static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
 {
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 4a9e8588f4c9..28a98130a56d 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -163,6 +163,8 @@ xfs_sb_version_to_features(
 		features |= XFS_FEAT_REFLINK;
 	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
 		features |= XFS_FEAT_INOBTCNT;
+	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
+		features |= XFS_FEAT_ATOMICWRITES;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
 		features |= XFS_FEAT_FTYPE;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 1fd94958aa97..0b0f525fd043 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -65,6 +65,26 @@ xfs_get_extsz_hint(
 	return 0;
 }
 
+/*
+ * helper function to extract extent size
+ */
+xfs_extlen_t
+xfs_get_extsz(
+	struct xfs_inode	*ip)
+{
+	/*
+	 * No point in aligning allocations if we need to COW to actually
+	 * write to them.
+	 */
+	if (xfs_is_always_cow_inode(ip))
+		return 0;
+
+	if (XFS_IS_REALTIME_INODE(ip))
+		return ip->i_mount->m_sb.sb_rextsize;
+
+	return 1;
+}
+
 /*
  * Helper function to extract CoW extent size hint from inode.
  * Between the extent size hint and the CoW extent size hint, we
@@ -629,6 +649,8 @@ xfs_ip2xflags(
 			flags |= FS_XFLAG_DAX;
 		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
 			flags |= FS_XFLAG_COWEXTSIZE;
+		if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
+			flags |= FS_XFLAG_ATOMICWRITES;
 	}
 
 	if (xfs_inode_has_attr_fork(ip))
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 97f63bacd4c2..0e0a21d9d30f 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -305,6 +305,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
 	return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
 }
 
+static inline bool xfs_inode_atomicwrites(struct xfs_inode *ip)
+{
+	return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
+}
+
 /*
  * Return the buftarg used for data allocations on a given inode.
  */
@@ -542,7 +547,9 @@ void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
 				struct xfs_inode *ip1, uint ip1_mode);
 
 xfs_extlen_t	xfs_get_extsz_hint(struct xfs_inode *ip);
+xfs_extlen_t	xfs_get_extsz(struct xfs_inode *ip);
 xfs_extlen_t	xfs_get_cowextsz_hint(struct xfs_inode *ip);
+xfs_extlen_t	xfs_get_atomicwrites_size(struct xfs_inode *ip);
 
 int xfs_init_new_inode(struct mnt_idmap *idmap, struct xfs_trans *tp,
 		struct xfs_inode *pip, xfs_ino_t ino, umode_t mode,
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index f02b6e558af5..c380a3055be7 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
 		di_flags2 |= XFS_DIFLAG2_DAX;
 	if (xflags & FS_XFLAG_COWEXTSIZE)
 		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
+	if (xflags & FS_XFLAG_ATOMICWRITES)
+		di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
 
 	return di_flags2;
 }
@@ -1122,10 +1124,12 @@ xfs_ioctl_setattr_xflags(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	bool			rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
+	bool			atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
 	uint64_t		i_flags2;
 
-	if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
-		/* Can't change realtime flag if any extents are allocated. */
+
+	if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
+	    atomic_writes != xfs_inode_atomicwrites(ip)) {
 		if (ip->i_df.if_nextents || ip->i_delayed_blks)
 			return -EINVAL;
 	}
@@ -1146,6 +1150,17 @@ xfs_ioctl_setattr_xflags(
 	if (i_flags2 && !xfs_has_v3inodes(mp))
 		return -EINVAL;
 
+	if (atomic_writes) {
+		if (!xfs_has_atomicwrites(mp))
+			return -EINVAL;
+
+		if (!rtflag)
+			return -EINVAL;
+
+		if (!is_power_of_2(mp->m_sb.sb_rextsize))
+			return -EINVAL;
+	}
+
 	ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
 	ip->i_diflags2 = i_flags2;
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 503fe3c7edbf..bcd591f52925 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -289,6 +289,7 @@ typedef struct xfs_mount {
 #define XFS_FEAT_BIGTIME	(1ULL << 24)	/* large timestamps */
 #define XFS_FEAT_NEEDSREPAIR	(1ULL << 25)	/* needs xfs_repair */
 #define XFS_FEAT_NREXT64	(1ULL << 26)	/* large extent counters */
+#define XFS_FEAT_ATOMICWRITES	(1ULL << 28)	/* atomic writes support */
 
 /* Mount features */
 #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
@@ -352,6 +353,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT)
 __XFS_HAS_FEAT(bigtime, BIGTIME)
 __XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
 __XFS_HAS_FEAT(large_extent_counts, NREXT64)
+__XFS_HAS_FEAT(atomicwrites, ATOMICWRITES)
 
 /*
  * Mount features
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index aff20ddd4a9f..263404f683d6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1696,6 +1696,10 @@ xfs_fs_fill_super(
 		mp->m_features &= ~XFS_FEAT_DISCARD;
 	}
 
+	if (xfs_has_atomicwrites(mp))
+		xfs_warn(mp,
+"EXPERIMENTAL atomic writes feature in use. Use at your own risk!");
+
 	if (xfs_has_reflink(mp)) {
 		if (mp->m_sb.sb_rblocks) {
 			xfs_alert(mp,
-- 
2.31.1



* [PATCH 4/6] fs: xfs: Support atomic write for statx
  2024-01-24 14:26 [PATCH 0/6] block atomic writes for XFS John Garry
                   ` (2 preceding siblings ...)
  2024-01-24 14:26 ` [PATCH 3/6] fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol John Garry
@ 2024-01-24 14:26 ` John Garry
  2024-02-02 18:05   ` Darrick J. Wong
  2024-02-09  7:00   ` Ojaswin Mujoo
  2024-01-24 14:26 ` [PATCH RFC 5/6] fs: xfs: iomap atomic write support John Garry
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 68+ messages in thread
From: John Garry @ 2024-01-24 14:26 UTC (permalink / raw)
  To: hch, djwong, viro, brauner, dchinner, jack, chandan.babu
  Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin, John Garry

Support providing info on atomic write unit min and max for an inode.

For simplicity, we currently limit the min to the FS block size, but a
lower limit could be supported in future.

The atomic write unit min and max is limited by the guaranteed extent
alignment for the inode.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_iops.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iops.h |  4 ++++
 2 files changed, 49 insertions(+)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a0d77f5f512e..0890d2f70f4d 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -546,6 +546,44 @@ xfs_stat_blksize(
 	return PAGE_SIZE;
 }
 
+void xfs_get_atomic_write_attr(
+	struct xfs_inode *ip,
+	unsigned int *unit_min,
+	unsigned int *unit_max)
+{
+	xfs_extlen_t		extsz = xfs_get_extsz(ip);
+	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
+	struct block_device	*bdev = target->bt_bdev;
+	unsigned int		awu_min, awu_max, align;
+	struct request_queue	*q = bdev->bd_queue;
+	struct xfs_mount	*mp = ip->i_mount;
+
+	/*
+	 * Convert to multiples of the BLOCKSIZE (as we support a minimum
+	 * atomic write unit of BLOCKSIZE).
+	 */
+	awu_min = queue_atomic_write_unit_min_bytes(q);
+	awu_max = queue_atomic_write_unit_max_bytes(q);
+
+	awu_min &= ~mp->m_blockmask;
+	awu_max &= ~mp->m_blockmask;
+
+	align = XFS_FSB_TO_B(mp, extsz);
+
+	if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
+	    !is_power_of_2(align)) {
+		*unit_min = 0;
+		*unit_max = 0;
+	} else {
+		if (awu_min)
+			*unit_min = min(awu_min, align);
+		else
+			*unit_min = mp->m_sb.sb_blocksize;
+
+		*unit_max = min(awu_max, align);
+	}
+}
+
 STATIC int
 xfs_vn_getattr(
 	struct mnt_idmap	*idmap,
@@ -619,6 +657,13 @@ xfs_vn_getattr(
 			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
 			stat->dio_offset_align = bdev_logical_block_size(bdev);
 		}
+		if (request_mask & STATX_WRITE_ATOMIC) {
+			unsigned int unit_min, unit_max;
+
+			xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+			generic_fill_statx_atomic_writes(stat,
+				unit_min, unit_max);
+		}
 		fallthrough;
 	default:
 		stat->blksize = xfs_stat_blksize(ip);
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 7f84a0843b24..76dd4c3687aa 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -19,4 +19,8 @@ int xfs_vn_setattr_size(struct mnt_idmap *idmap,
 int xfs_inode_init_security(struct inode *inode, struct inode *dir,
 		const struct qstr *qstr);
 
+void xfs_get_atomic_write_attr(struct xfs_inode *ip,
+		unsigned int *unit_min,
+		unsigned int *unit_max);
+
 #endif /* __XFS_IOPS_H__ */
-- 
2.31.1



* [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-01-24 14:26 [PATCH 0/6] block atomic writes for XFS John Garry
                   ` (3 preceding siblings ...)
  2024-01-24 14:26 ` [PATCH 4/6] fs: xfs: Support atomic write for statx John Garry
@ 2024-01-24 14:26 ` John Garry
  2024-02-02 18:47   ` Darrick J. Wong
  2024-01-24 14:26 ` [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set John Garry
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-01-24 14:26 UTC (permalink / raw)
  To: hch, djwong, viro, brauner, dchinner, jack, chandan.babu
  Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin, John Garry

Ensure that, when creating a mapping, we adhere to all the atomic write
rules.

We check that the mapping covers the complete range of the write to ensure
that we create just a single mapping.

Currently the minimum granularity is the FS block size, but it should be
possible to support a lower granularity in future.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
I am setting this as an RFC as I am not sure on the change in
xfs_iomap_write_direct() - it gives the desired result AFAICS.

 fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 18c8f168b153..758dc1c90a42 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -289,6 +289,9 @@ xfs_iomap_write_direct(
 		}
 	}
 
+	if (xfs_inode_atomicwrites(ip))
+		bmapi_flags = XFS_BMAPI_ZERO;
+
 	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
 			rblocks, force, &tp);
 	if (error)
@@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
 	if (error)
 		goto out_unlock;
 
+	if (flags & IOMAP_ATOMIC) {
+		xfs_filblks_t unit_min_fsb, unit_max_fsb;
+		unsigned int unit_min, unit_max;
+
+		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
+		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
+		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
+
+		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
+			error = -EINVAL;
+			goto out_unlock;
+		}
+
+		if ((offset & mp->m_blockmask) ||
+		    (length & mp->m_blockmask)) {
+			error = -EINVAL;
+			goto out_unlock;
+		}
+
+		if (imap.br_blockcount == unit_min_fsb ||
+		    imap.br_blockcount == unit_max_fsb) {
+			/* ok if exactly min or max */
+		} else if (imap.br_blockcount < unit_min_fsb ||
+			   imap.br_blockcount > unit_max_fsb) {
+			error = -EINVAL;
+			goto out_unlock;
+		} else if (!is_power_of_2(imap.br_blockcount)) {
+			error = -EINVAL;
+			goto out_unlock;
+		}
+
+		if (imap.br_startoff &&
+		    imap.br_startoff & (imap.br_blockcount - 1)) {
+			error = -EINVAL;
+			goto out_unlock;
+		}
+	}
+
 	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
 		error = -EAGAIN;
 		if (flags & IOMAP_NOWAIT)
-- 
2.31.1



* [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set
  2024-01-24 14:26 [PATCH 0/6] block atomic writes for XFS John Garry
                   ` (4 preceding siblings ...)
  2024-01-24 14:26 ` [PATCH RFC 5/6] fs: xfs: iomap atomic write support John Garry
@ 2024-01-24 14:26 ` John Garry
  2024-02-02 18:06   ` Darrick J. Wong
  2024-02-09  7:14 ` [PATCH 0/6] block atomic writes for XFS Ojaswin Mujoo
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-01-24 14:26 UTC (permalink / raw)
  To: hch, djwong, viro, brauner, dchinner, jack, chandan.babu
  Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin, John Garry

When an inode is enabled for atomic writes, set the FMODE_CAN_ATOMIC_WRITE
flag.

Signed-off-by: John Garry <john.g.garry@oracle.com>
---
 fs/xfs/xfs_file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e33e5e13b95f..1375d0089806 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1232,6 +1232,8 @@ xfs_file_open(
 		return -EIO;
 	file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
 			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
+	if (xfs_inode_atomicwrites(XFS_I(inode)))
+		file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
 	return generic_file_open(inode, file);
 }
 
-- 
2.31.1



* Re: [PATCH 1/6] fs: iomap: Atomic write support
  2024-01-24 14:26 ` [PATCH 1/6] fs: iomap: Atomic write support John Garry
@ 2024-02-02 17:25   ` Darrick J. Wong
  2024-02-05 11:29     ` John Garry
  2024-02-05 15:20   ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-02 17:25 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Wed, Jan 24, 2024 at 02:26:40PM +0000, John Garry wrote:
> Add flag IOMAP_ATOMIC_WRITE to indicate to the FS that an atomic write
> bio is being created and all the rules there need to be followed.
> 
> It is the task of the FS iomap iter callbacks to ensure that the mapping
> created adheres to those rules, like size is power-of-2, is at a
> naturally-aligned offset, etc. However, checking for a single iovec, i.e.
> iter type is ubuf, is done in __iomap_dio_rw().
> 
> A write should only produce a single bio, so error when it doesn't.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/iomap/direct-io.c  | 21 ++++++++++++++++++++-
>  fs/iomap/trace.h      |  3 ++-
>  include/linux/iomap.h |  1 +
>  3 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..25736d01b857 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -275,10 +275,12 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		struct iomap_dio *dio)
>  {
> +	bool atomic_write = iter->flags & IOMAP_ATOMIC;
>  	const struct iomap *iomap = &iter->iomap;
>  	struct inode *inode = iter->inode;
>  	unsigned int fs_block_size = i_blocksize(inode), pad;
>  	loff_t length = iomap_length(iter);
> +	const size_t iter_len = iter->len;
>  	loff_t pos = iter->pos;
>  	blk_opf_t bio_opf;
>  	struct bio *bio;
> @@ -381,6 +383,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  					  GFP_KERNEL);
>  		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>  		bio->bi_ioprio = dio->iocb->ki_ioprio;
> +		if (atomic_write)
> +			bio->bi_opf |= REQ_ATOMIC;

This really ought to be in iomap_dio_bio_opflags.  Unless you can't pass
REQ_ATOMIC to bio_alloc*, in which case there ought to be a comment
about why.

Also, what's the meaning of REQ_OP_READ | REQ_ATOMIC?  Does that
actually work?  I don't know what that means, and "block: Add REQ_ATOMIC
flag" says that's not a valid combination.  I'll complain about this
more below.

> +
>  		bio->bi_private = dio;
>  		bio->bi_end_io = iomap_dio_bio_end_io;
>  
> @@ -397,6 +402,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		}
>  
>  		n = bio->bi_iter.bi_size;
> +		if (atomic_write && n != iter_len) {

s/iter_len/orig_len/ ?

> +			/* This bio should have covered the complete length */
> +			ret = -EINVAL;
> +			bio_put(bio);
> +			goto out;
> +		}
>  		if (dio->flags & IOMAP_DIO_WRITE) {
>  			task_io_account_write(n);
>  		} else {
> @@ -554,12 +565,17 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	struct blk_plug plug;
>  	struct iomap_dio *dio;
>  	loff_t ret = 0;
> +	bool is_read = iov_iter_rw(iter) == READ;
> +	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;

Hrmm.  So if the caller passes in an IOCB_ATOMIC iocb with a READ iter,
we'll silently drop IOCB_ATOMIC and do the read anyway?  That seems like
a nonsense combination, but is that ok for some reason?

>  	trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
>  
>  	if (!iomi.len)
>  		return NULL;
>  
> +	if (atomic_write && !iter_is_ubuf(iter))
> +		return ERR_PTR(-EINVAL);

Does !iter_is_ubuf actually happen?  Why don't we support any of the
other ITER_ types?  Is it because hardware doesn't want vectored
buffers?

I really wish there was more commenting on /why/ we do things here:

	if (iocb->ki_flags & IOCB_ATOMIC) {
		/* atomic reads do not make sense */
		if (iov_iter_rw(iter) == READ)
			return ERR_PTR(-EINVAL);

		/*
 		 * block layer doesn't want to handle vectors of
		 * buffers when performing an atomic write i guess?
		 */
		if (!iter_is_ubuf(iter))
			return ERR_PTR(-EINVAL);

		iomi.flags |= IOMAP_ATOMIC;
	}

> +
>  	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
>  	if (!dio)
>  		return ERR_PTR(-ENOMEM);
> @@ -579,7 +595,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		iomi.flags |= IOMAP_NOWAIT;
>  
> -	if (iov_iter_rw(iter) == READ) {
> +	if (is_read) {
>  		/* reads can always complete inline */
>  		dio->flags |= IOMAP_DIO_INLINE_COMP;
>  
> @@ -605,6 +621,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		if (iocb->ki_flags & IOCB_DIO_CALLER_COMP)
>  			dio->flags |= IOMAP_DIO_CALLER_COMP;
>  
> +		if (atomic_write)
> +			iomi.flags |= IOMAP_ATOMIC;
> +
>  		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
>  			ret = -EAGAIN;
>  			if (iomi.pos >= dio->i_size ||
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index c16fd55f5595..c95576420bca 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
>  	{ IOMAP_REPORT,		"REPORT" }, \
>  	{ IOMAP_FAULT,		"FAULT" }, \
>  	{ IOMAP_DIRECT,		"DIRECT" }, \
> -	{ IOMAP_NOWAIT,		"NOWAIT" }
> +	{ IOMAP_NOWAIT,		"NOWAIT" }, \
> +	{ IOMAP_ATOMIC,		"ATOMIC" }
>  
>  #define IOMAP_F_FLAGS_STRINGS \
>  	{ IOMAP_F_NEW,		"NEW" }, \
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 96dd0acbba44..9eac704a0d6f 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -178,6 +178,7 @@ struct iomap_folio_ops {
>  #else
>  #define IOMAP_DAX		0
>  #endif /* CONFIG_FS_DAX */
> +#define IOMAP_ATOMIC		(1 << 9)
>  
>  struct iomap_ops {
>  	/*
> -- 
> 2.31.1
> 
> 


* Re: [PATCH 3/6] fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol
  2024-01-24 14:26 ` [PATCH 3/6] fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol John Garry
@ 2024-02-02 17:52   ` Darrick J. Wong
  2024-02-03  7:40     ` Ojaswin Mujoo
  2024-02-05 12:51     ` John Garry
  0 siblings, 2 replies; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-02 17:52 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Wed, Jan 24, 2024 at 02:26:42PM +0000, John Garry wrote:
> Add initial support for FS_XFLAG_ATOMICWRITES in rtvol.
> 
> Current kernel support for atomic writes is based on HW support (for atomic
> writes). As such, it is required to ensure extent alignment with
> atomic_write_unit_max so that an atomic write can result in a single
> HW-compliant IO operation.
> 
> rtvol already guarantees extent alignment, so initially add support there.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_format.h |  8 ++++++--
>  fs/xfs/libxfs/xfs_sb.c     |  2 ++
>  fs/xfs/xfs_inode.c         | 22 ++++++++++++++++++++++
>  fs/xfs/xfs_inode.h         |  7 +++++++
>  fs/xfs/xfs_ioctl.c         | 19 +++++++++++++++++--
>  fs/xfs/xfs_mount.h         |  2 ++
>  fs/xfs/xfs_super.c         |  4 ++++
>  7 files changed, 60 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 382ab1e71c0b..79fb0d4adeda 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -353,11 +353,13 @@ xfs_sb_has_compat_feature(
>  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
>  #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
>  #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
> +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 29)	/* aligned file data extents */

I thought FORCEALIGN was going to signal aligned file data extent
allocations being mandatory?

This flag (AFAICT) simply marks the inode as something that gets
FMODE_CAN_ATOMIC_WRITES, right?

>  #define XFS_SB_FEAT_RO_COMPAT_ALL \
>  		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
>  		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
>  		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
> -		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
> +		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
> +		 XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
>  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
>  static inline bool
>  xfs_sb_has_ro_compat_feature(
> @@ -1085,16 +1087,18 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
>  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
>  #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
>  #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
> +#define XFS_DIFLAG2_ATOMICWRITES_BIT 6

Needs a comment here ("files flagged for atomic writes").  Also not sure
why you skipped bit 5, though I'm guessing it's because the forcealign
series is/was using it?

>  #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
>  #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
>  #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
>  #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
>  #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
> +#define XFS_DIFLAG2_ATOMICWRITES	(1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
>  
>  #define XFS_DIFLAG2_ANY \
>  	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> -	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
> +	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_ATOMICWRITES)
>  
>  static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
>  {
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index 4a9e8588f4c9..28a98130a56d 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -163,6 +163,8 @@ xfs_sb_version_to_features(
>  		features |= XFS_FEAT_REFLINK;
>  	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
>  		features |= XFS_FEAT_INOBTCNT;
> +	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> +		features |= XFS_FEAT_ATOMICWRITES;
>  	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
>  		features |= XFS_FEAT_FTYPE;
>  	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 1fd94958aa97..0b0f525fd043 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -65,6 +65,26 @@ xfs_get_extsz_hint(
>  	return 0;
>  }
>  
> +/*
> + * helper function to extract extent size

How does that differ from xfs_get_extsz_hint?

> + */
> +xfs_extlen_t
> +xfs_get_extsz(
> +	struct xfs_inode	*ip)
> +{
> +	/*
> +	 * No point in aligning allocations if we need to COW to actually
> +	 * write to them.

What does alwayscow have to do with untorn writes?

> +	 */
> +	if (xfs_is_always_cow_inode(ip))
> +		return 0;
> +
> +	if (XFS_IS_REALTIME_INODE(ip))
> +		return ip->i_mount->m_sb.sb_rextsize;
> +
> +	return 1;
> +}

Does this function exist to return the allocation unit for a given file?
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=b8ddcef3df8da02ed2c4aacbed1d811e60372006

> +
>  /*
>   * Helper function to extract CoW extent size hint from inode.
>   * Between the extent size hint and the CoW extent size hint, we
> @@ -629,6 +649,8 @@ xfs_ip2xflags(
>  			flags |= FS_XFLAG_DAX;
>  		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
>  			flags |= FS_XFLAG_COWEXTSIZE;
> +		if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
> +			flags |= FS_XFLAG_ATOMICWRITES;
>  	}
>  
>  	if (xfs_inode_has_attr_fork(ip))
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 97f63bacd4c2..0e0a21d9d30f 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -305,6 +305,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
>  	return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
>  }
>  
> +static inline bool xfs_inode_atomicwrites(struct xfs_inode *ip)

I think this predicate wants a verb in its name, the rest of them have
"is" or "has" somewhere:

"xfs_inode_has_atomicwrites"

> +{
> +	return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
> +}
> +
>  /*
>   * Return the buftarg used for data allocations on a given inode.
>   */
> @@ -542,7 +547,9 @@ void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
>  				struct xfs_inode *ip1, uint ip1_mode);
>  
>  xfs_extlen_t	xfs_get_extsz_hint(struct xfs_inode *ip);
> +xfs_extlen_t	xfs_get_extsz(struct xfs_inode *ip);
>  xfs_extlen_t	xfs_get_cowextsz_hint(struct xfs_inode *ip);
> +xfs_extlen_t	xfs_get_atomicwrites_size(struct xfs_inode *ip);
>  
>  int xfs_init_new_inode(struct mnt_idmap *idmap, struct xfs_trans *tp,
>  		struct xfs_inode *pip, xfs_ino_t ino, umode_t mode,
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index f02b6e558af5..c380a3055be7 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
>  		di_flags2 |= XFS_DIFLAG2_DAX;
>  	if (xflags & FS_XFLAG_COWEXTSIZE)
>  		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
> +	if (xflags & FS_XFLAG_ATOMICWRITES)
> +		di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
>  
>  	return di_flags2;
>  }
> @@ -1122,10 +1124,12 @@ xfs_ioctl_setattr_xflags(
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	bool			rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
> +	bool			atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
>  	uint64_t		i_flags2;
>  
> -	if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
> -		/* Can't change realtime flag if any extents are allocated. */

Please augment this comment ("Can't change realtime or atomicwrites
flags if any extents are allocated") instead of deleting it.  This is
validation code, the requirements should be spelled out in English.

> +
> +	if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
> +	    atomic_writes != xfs_inode_atomicwrites(ip)) {
>  		if (ip->i_df.if_nextents || ip->i_delayed_blks)
>  			return -EINVAL;
>  	}
> @@ -1146,6 +1150,17 @@ xfs_ioctl_setattr_xflags(
>  	if (i_flags2 && !xfs_has_v3inodes(mp))
>  		return -EINVAL;
>  
> +	if (atomic_writes) {
> +		if (!xfs_has_atomicwrites(mp))
> +			return -EINVAL;
> +
> +		if (!rtflag)
> +			return -EINVAL;
> +
> +		if (!is_power_of_2(mp->m_sb.sb_rextsize))
> +			return -EINVAL;

Shouldn't we check sb_rextsize w.r.t. the actual block device queue
limits here?  I keep seeing similar validation logic open-coded
throughout both atomic write patchsets:

	if (l < queue_atomic_write_unit_min_bytes())
		/* fail */
	if (l > queue_atomic_write_unit_max_bytes())
		/* fail */
	if (!is_power_of_2(l))
		/* fail */
	/* ok */

which really should be a common helper somewhere.

		/*
		 * Don't set atomic write if the allocation unit doesn't
		 * align with the device requirements.
		 */
		if (!bdev_validate_atomic_write(<target blockdev>,
				XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize))
			return -EINVAL;

Too bad we have to figure out the target blockdev and file allocation
unit based on the ioctl in-params and can't use the xfs_inode helpers
here.

--D

> +	}
> +
>  	ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
>  	ip->i_diflags2 = i_flags2;
>  
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 503fe3c7edbf..bcd591f52925 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -289,6 +289,7 @@ typedef struct xfs_mount {
>  #define XFS_FEAT_BIGTIME	(1ULL << 24)	/* large timestamps */
>  #define XFS_FEAT_NEEDSREPAIR	(1ULL << 25)	/* needs xfs_repair */
>  #define XFS_FEAT_NREXT64	(1ULL << 26)	/* large extent counters */
> +#define XFS_FEAT_ATOMICWRITES	(1ULL << 28)	/* atomic writes support */
>  
>  /* Mount features */
>  #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
> @@ -352,6 +353,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT)
>  __XFS_HAS_FEAT(bigtime, BIGTIME)
>  __XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
>  __XFS_HAS_FEAT(large_extent_counts, NREXT64)
> +__XFS_HAS_FEAT(atomicwrites, ATOMICWRITES)
>  
>  /*
>   * Mount features
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index aff20ddd4a9f..263404f683d6 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1696,6 +1696,10 @@ xfs_fs_fill_super(
>  		mp->m_features &= ~XFS_FEAT_DISCARD;
>  	}
>  
> +	if (xfs_has_atomicwrites(mp))
> +		xfs_warn(mp,
> +"EXPERIMENTAL atomic writes feature in use. Use at your own risk!");
> +
>  	if (xfs_has_reflink(mp)) {
>  		if (mp->m_sb.sb_rblocks) {
>  			xfs_alert(mp,
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 2/6] fs: Add FS_XFLAG_ATOMICWRITES flag
  2024-01-24 14:26 ` [PATCH 2/6] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
@ 2024-02-02 17:57   ` Darrick J. Wong
  2024-02-05 12:58     ` John Garry
  0 siblings, 1 reply; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-02 17:57 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Wed, Jan 24, 2024 at 02:26:41PM +0000, John Garry wrote:
> Add a flag indicating that a regular file is enabled for atomic writes.

This is a file attribute that mirrors an ondisk inode flag.  Actual
support for untorn file writes (for now) depends on both the iflag and
the underlying storage devices, which we can only really check at statx
and pwrite time.  This is the same story as FS_XFLAG_DAX, which signals
to the fs that we should try to enable the fsdax IO path on the file
(instead of the regular page cache), but applications have to query
STAT_ATTR_DAX to find out if they really got that IO path.

"try to enable atomic writes", perhaps?

(and the comment for FS_XFLAG_DAX ought to read "try to use DAX for IO")

--D 

> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  include/uapi/linux/fs.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index a0975ae81e64..b5b4e1db9576 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -140,6 +140,7 @@ struct fsxattr {
>  #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
>  #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
>  #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
> +#define FS_XFLAG_ATOMICWRITES	0x00020000	/* atomic writes enabled */
>  #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
>  
>  /* the read-only stuff doesn't really belong here, but any other place is
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] fs: xfs: Support atomic write for statx
  2024-01-24 14:26 ` [PATCH 4/6] fs: xfs: Support atomic write for statx John Garry
@ 2024-02-02 18:05   ` Darrick J. Wong
  2024-02-05 13:10     ` John Garry
  2024-02-09  7:00   ` Ojaswin Mujoo
  1 sibling, 1 reply; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-02 18:05 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Wed, Jan 24, 2024 at 02:26:43PM +0000, John Garry wrote:
> Support providing info on atomic write unit min and max for an inode.
> 
> For simplicity, currently we limit the min at the FS block size, but a
> lower limit could be supported in future.
> 
> The atomic write unit min and max is limited by the guaranteed extent
> alignment for the inode.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_iops.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_iops.h |  4 ++++
>  2 files changed, 49 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a0d77f5f512e..0890d2f70f4d 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -546,6 +546,44 @@ xfs_stat_blksize(
>  	return PAGE_SIZE;
>  }
>  
> +void xfs_get_atomic_write_attr(

static void?

> +	struct xfs_inode *ip,
> +	unsigned int *unit_min,
> +	unsigned int *unit_max)

Weird indenting here.

> +{
> +	xfs_extlen_t		extsz = xfs_get_extsz(ip);
> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> +	struct block_device	*bdev = target->bt_bdev;
> +	unsigned int		awu_min, awu_max, align;
> +	struct request_queue	*q = bdev->bd_queue;
> +	struct xfs_mount	*mp = ip->i_mount;
> +
> +	/*
> +	 * Convert to multiples of the BLOCKSIZE (as we support a minimum
> +	 * atomic write unit of BLOCKSIZE).
> +	 */
> +	awu_min = queue_atomic_write_unit_min_bytes(q);
> +	awu_max = queue_atomic_write_unit_max_bytes(q);
> +
> +	awu_min &= ~mp->m_blockmask;

Why do you round /down/ the awu_min value here?

> +	awu_max &= ~mp->m_blockmask;

Actually -- since the atomic write units have to be powers of 2, why is
rounding needed here at all?

> +
> +	align = XFS_FSB_TO_B(mp, extsz);
> +
> +	if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
> +	    !is_power_of_2(align)) {

...and if you take my suggestion to make a common helper to validate the
atomic write unit parameters, this can collapse into:

	alloc_unit_bytes = xfs_inode_alloc_unitsize(ip);
	if (!xfs_inode_has_atomicwrites(ip) ||
	    !bdev_validate_atomic_write(bdev, alloc_unit_bytes)) {
		/* not supported, return zeroes */
		*unit_min = 0;
		*unit_max = 0;
		return;
	}

	*unit_min = max(alloc_unit_bytes, awu_min);
	*unit_max = min(alloc_unit_bytes, awu_max);

--D

> +		*unit_min = 0;
> +		*unit_max = 0;
> +	} else {
> +		if (awu_min)
> +			*unit_min = min(awu_min, align);
> +		else
> +			*unit_min = mp->m_sb.sb_blocksize;
> +
> +		*unit_max = min(awu_max, align);
> +	}
> +}
> +
>  STATIC int
>  xfs_vn_getattr(
>  	struct mnt_idmap	*idmap,
> @@ -619,6 +657,13 @@ xfs_vn_getattr(
>  			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
>  			stat->dio_offset_align = bdev_logical_block_size(bdev);
>  		}
> +		if (request_mask & STATX_WRITE_ATOMIC) {
> +			unsigned int unit_min, unit_max;
> +
> +			xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> +			generic_fill_statx_atomic_writes(stat,
> +				unit_min, unit_max);
> +		}
>  		fallthrough;
>  	default:
>  		stat->blksize = xfs_stat_blksize(ip);
> diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
> index 7f84a0843b24..76dd4c3687aa 100644
> --- a/fs/xfs/xfs_iops.h
> +++ b/fs/xfs/xfs_iops.h
> @@ -19,4 +19,8 @@ int xfs_vn_setattr_size(struct mnt_idmap *idmap,
>  int xfs_inode_init_security(struct inode *inode, struct inode *dir,
>  		const struct qstr *qstr);
>  
> +void xfs_get_atomic_write_attr(struct xfs_inode *ip,
> +		unsigned int *unit_min,
> +		unsigned int *unit_max);
> +
>  #endif /* __XFS_IOPS_H__ */
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set
  2024-01-24 14:26 ` [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set John Garry
@ 2024-02-02 18:06   ` Darrick J. Wong
  2024-02-05 10:26     ` John Garry
  0 siblings, 1 reply; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-02 18:06 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Wed, Jan 24, 2024 at 02:26:45PM +0000, John Garry wrote:
> For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
> flag.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_file.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e33e5e13b95f..1375d0089806 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1232,6 +1232,8 @@ xfs_file_open(
>  		return -EIO;
>  	file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
>  			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
> +	if (xfs_inode_atomicwrites(XFS_I(inode)))

Shouldn't we check that the device supports AWU at all before turning on
the FMODE flag?

--D

> +		file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
>  	return generic_file_open(inode, file);
>  }
>  
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-01-24 14:26 ` [PATCH RFC 5/6] fs: xfs: iomap atomic write support John Garry
@ 2024-02-02 18:47   ` Darrick J. Wong
  2024-02-05 13:36     ` John Garry
  0 siblings, 1 reply; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-02 18:47 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote:
> Ensure that when creating a mapping that we adhere to all the atomic
> write rules.
> 
> We check that the mapping covers the complete range of the write to ensure
> that we'll be just creating a single mapping.
> 
> Currently the minimum granularity is the FS block size, but it should be
> possible to support lower in future.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
> I am setting this as an RFC as I am not sure on the change in
> xfs_iomap_write_direct() - it gives the desired result AFAICS.
> 
>  fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 41 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 18c8f168b153..758dc1c90a42 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
>  		}
>  	}
>  
> +	if (xfs_inode_atomicwrites(ip))
> +		bmapi_flags = XFS_BMAPI_ZERO;

Why do we want to write zeroes to the disk if we're allocating space
even if we're not sending an atomic write?

(This might want an explanation for why we're doing this at all -- it's
to avoid unwritten extent conversion, which defeats hardware untorn
writes.)

I think we should support IOCB_ATOMIC when the mapping is unwritten --
the data will land on disk in an untorn fashion, the unwritten extent
conversion on IO completion is itself atomic, and callers still have to
set O_DSYNC to persist anything.  Then we can avoid the cost of
BMAPI_ZERO, because double-writes aren't free.

> +
>  	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
>  			rblocks, force, &tp);
>  	if (error)
> @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
>  	if (error)
>  		goto out_unlock;
>  
> +	if (flags & IOMAP_ATOMIC) {
> +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
> +		unsigned int unit_min, unit_max;
> +
> +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
> +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
> +
> +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
> +			error = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		if ((offset & mp->m_blockmask) ||
> +		    (length & mp->m_blockmask)) {
> +			error = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		if (imap.br_blockcount == unit_min_fsb ||
> +		    imap.br_blockcount == unit_max_fsb) {
> +			/* ok if exactly min or max */
> +		} else if (imap.br_blockcount < unit_min_fsb ||
> +			   imap.br_blockcount > unit_max_fsb) {
> +			error = -EINVAL;
> +			goto out_unlock;
> +		} else if (!is_power_of_2(imap.br_blockcount)) {
> +			error = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		if (imap.br_startoff &&
> +		    imap.br_startoff & (imap.br_blockcount - 1)) {

Not sure why we care about the file position, it's br_startblock that
gets passed into the bio, not br_startoff.

I'm also still not convinced that any of this validation is useful here.
The block device stack underneath the filesystem can change at any time
without any particular notice to the fs, so the only way to find out if
the proposed IO would meet the alignment constraints is to submit_bio
and see what happens.

(The "one bio per untorn write request" thing in the direct-io.c patch
sound sane to me though.)

--D

> +			error = -EINVAL;
> +			goto out_unlock;
> +		}
> +	}
> +
>  	if (imap_needs_cow(ip, flags, &imap, nimaps)) {
>  		error = -EAGAIN;
>  		if (flags & IOMAP_NOWAIT)
> -- 
> 2.31.1
> 
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/6] fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol
  2024-02-02 17:52   ` Darrick J. Wong
@ 2024-02-03  7:40     ` Ojaswin Mujoo
  2024-02-05 12:51     ` John Garry
  1 sibling, 0 replies; 68+ messages in thread
From: Ojaswin Mujoo @ 2024-02-03  7:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio

On Fri, Feb 02, 2024 at 09:52:25AM -0800, Darrick J. Wong wrote:
> On Wed, Jan 24, 2024 at 02:26:42PM +0000, John Garry wrote:
> 
> [... full patch and earlier review comments quoted in the original reply trimmed ...]
> 
> Shouldn't we check sb_rextsize w.r.t. the actual block device queue
> limits here?  I keep seeing similar validation logic open-coded
> throughout both atomic write patchsets:
> 
> 	if (l < queue_atomic_write_unit_min_bytes())
> 		/* fail */
> 	if (l > queue_atomic_write_unit_max_bytes())
> 		/* fail */
> 	if (!is_power_of_2(l))
> 		/* fail */
> 	/* ok */
> 
> which really should be a common helper somewhere.
> 
> 		/*
> 		 * Don't set atomic write if the allocation unit doesn't
> 		 * align with the device requirements.
> 		 */
> 		if (!bdev_validate_atomic_write(<target blockdev>,
> 				XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize))
> 			return -EINVAL;
> 
> Too bad we have to figure out the target blockdev and file allocation
> unit based on the ioctl in-params and can't use the xfs_inode helpers
> here.
> 
> --D

Hey John, Darrick,

I agree that we should have a common helper to validate block device
limits. I tried to do so by exporting blkdev_atomic_write_valid() in the
ext4 series [1].

There was also some discussion on moving this to VFS, where we can check
against the len and off of the write and then we can make fs specific
checks (eg if off,len align with rt extsize/bigalloc size) later in the
fs layer.

[1]
https://lore.kernel.org/linux-ext4/b53609d0d4b97eb9355987ac5ec03d4e89293b43.1701339358.git.ojaswin@linux.ibm.com/


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set
  2024-02-02 18:06   ` Darrick J. Wong
@ 2024-02-05 10:26     ` John Garry
  2024-02-13 17:59       ` Darrick J. Wong
  0 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-02-05 10:26 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 02/02/2024 18:06, Darrick J. Wong wrote:
> On Wed, Jan 24, 2024 at 02:26:45PM +0000, John Garry wrote:
>> For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
>> flag.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_file.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index e33e5e13b95f..1375d0089806 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -1232,6 +1232,8 @@ xfs_file_open(
>>   		return -EIO;
>>   	file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
>>   			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
>> +	if (xfs_inode_atomicwrites(XFS_I(inode)))

Note to self: This should also check if O_DIRECT is set

> 
> Shouldn't we check that the device supports AWU at all before turning on
> the FMODE flag?

Can we easily get this sort of bdev info here?

Currently if we do try to issue an atomic write and AWU for the bdev is 
zero, then XFS iomap code will reject it.

Thanks,
John

> 
> --D
> 
>> +		file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
>>   	return generic_file_open(inode, file);
>>   }
>>   
>> -- 
>> 2.31.1
>>
>>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] fs: iomap: Atomic write support
  2024-02-02 17:25   ` Darrick J. Wong
@ 2024-02-05 11:29     ` John Garry
  2024-02-13  6:55       ` Christoph Hellwig
  2024-02-13 18:08       ` Darrick J. Wong
  0 siblings, 2 replies; 68+ messages in thread
From: John Garry @ 2024-02-05 11:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 02/02/2024 17:25, Darrick J. Wong wrote:
> On Wed, Jan 24, 2024 at 02:26:40PM +0000, John Garry wrote:
>> Add flag IOMAP_ATOMIC_WRITE to indicate to the FS that an atomic write
>> bio is being created and all the rules there need to be followed.
>>
>> It is the task of the FS iomap iter callbacks to ensure that the mapping
>> created adheres to those rules, like size is power-of-2, is at a
>> naturally-aligned offset, etc. However, checking for a single iovec, i.e.
>> iter type is ubuf, is done in __iomap_dio_rw().
>>
>> A write should only produce a single bio, so error when it doesn't.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/iomap/direct-io.c  | 21 ++++++++++++++++++++-
>>   fs/iomap/trace.h      |  3 ++-
>>   include/linux/iomap.h |  1 +
>>   3 files changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index bcd3f8cf5ea4..25736d01b857 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -275,10 +275,12 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>>   static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   		struct iomap_dio *dio)
>>   {
>> +	bool atomic_write = iter->flags & IOMAP_ATOMIC;
>>   	const struct iomap *iomap = &iter->iomap;
>>   	struct inode *inode = iter->inode;
>>   	unsigned int fs_block_size = i_blocksize(inode), pad;
>>   	loff_t length = iomap_length(iter);
>> +	const size_t iter_len = iter->len;
>>   	loff_t pos = iter->pos;
>>   	blk_opf_t bio_opf;
>>   	struct bio *bio;
>> @@ -381,6 +383,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   					  GFP_KERNEL);
>>   		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
>>   		bio->bi_ioprio = dio->iocb->ki_ioprio;
>> +		if (atomic_write)
>> +			bio->bi_opf |= REQ_ATOMIC;
> 
> This really ought to be in iomap_dio_bio_opflags.  Unless you can't pass
> REQ_ATOMIC to bio_alloc*, in which case there ought to be a comment
> about why.

I think that should be ok

> 
> Also, what's the meaning of REQ_OP_READ | REQ_ATOMIC? 

REQ_ATOMIC will be ignored for REQ_OP_READ. I'm following the same 
policy as something like RWF_SYNC for a read.

However, if FMODE_CAN_ATOMIC_WRITE is unset, then REQ_ATOMIC will be 
rejected for both REQ_OP_READ and REQ_OP_WRITE.

> Does that
> actually work?  I don't know what that means, and "block: Add REQ_ATOMIC
> flag" says that's not a valid combination.  I'll complain about this
> more below.

Please note that I do mention that this flag is only meaningful for 
pwritev2(), like RWF_SYNC, here:
https://lore.kernel.org/linux-api/20240124112731.28579-3-john.g.garry@oracle.com/

> 
>> +
>>   		bio->bi_private = dio;
>>   		bio->bi_end_io = iomap_dio_bio_end_io;
>>   
>> @@ -397,6 +402,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   		}
>>   
>>   		n = bio->bi_iter.bi_size;
>> +		if (atomic_write && n != iter_len) {
> 
> s/iter_len/orig_len/ ?

ok, I can change the name if you prefer

> 
>> +			/* This bio should have covered the complete length */
>> +			ret = -EINVAL;
>> +			bio_put(bio);
>> +			goto out;
>> +		}
>>   		if (dio->flags & IOMAP_DIO_WRITE) {
>>   			task_io_account_write(n);
>>   		} else {
>> @@ -554,12 +565,17 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>>   	struct blk_plug plug;
>>   	struct iomap_dio *dio;
>>   	loff_t ret = 0;
>> +	bool is_read = iov_iter_rw(iter) == READ;
>> +	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
> 
> Hrmm.  So if the caller passes in an IOCB_ATOMIC iocb with a READ iter,
> we'll silently drop IOCB_ATOMIC and do the read anyway?  That seems like
> a nonsense combination, but is that ok for some reason?

Please see above

> 
>>   	trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
>>   
>>   	if (!iomi.len)
>>   		return NULL;
>>   
>> +	if (atomic_write && !iter_is_ubuf(iter))
>> +		return ERR_PTR(-EINVAL);
> 
> Does !iter_is_ubuf actually happen? 

Sure, if someone uses iovcnt > 1 for pwritev2

Please see __import_iovec(): only when iovcnt == 1 do we create 
iter_type == ITER_UBUF; if > 1, then we have iter_type == ITER_IOVEC

> Why don't we support any of the
> other ITER_ types?  Is it because hardware doesn't want vectored
> buffers?
It's related how we can determine atomic_write_unit_max for the bdev.

We want to give a definitive max write value which we can guarantee to 
always fit in a BIO, but not mandate any extra special iovec 
length/alignment rules.

Without any iovec length or alignment rules (apart from the direct IO 
rules that an iovec needs to be bdev logical block size aligned in both 
address and length), if a user provides many iovecs, then we may only be 
able to fit bdev LBS of data (typically 512B) in each BIO vector, and 
thus we need to give a pessimistically low atomic_write_unit_max value.

If we say that iovcnt max == 1, then we know that we can fit PAGE size 
of data in each BIO vector (ignoring first/last vectors), and this will 
give a reasonably large atomic_write_unit_max value.

Note that we do now provide this iovcnt max value via statx, but always 
return 1 for now. This was agreed with Christoph, please see:
https://lore.kernel.org/linux-nvme/20240117150200.GA30112@lst.de/

> 
> I really wish there was more commenting on /why/ we do things here:
> 
> 	if (iocb->ki_flags & IOCB_ATOMIC) {
> 		/* atomic reads do not make sense */
> 		if (iov_iter_rw(iter) == READ)
> 			return ERR_PTR(-EINVAL);
> 
> 		/*
> 		 * block layer doesn't want to handle handle vectors of
> 		 * buffers when performing an atomic write i guess?
> 		 */
> 		if (!iter_is_ubuf(iter))
> 			return ERR_PTR(-EINVAL);
> 
> 		iomi.flags |= IOMAP_ATOMIC;
> 	}

ok, I can make this more clear.

Note: It would be nice if we could check this in 
xfs_iomap_write_direct() or a common VFS helper (which 
xfs_iomap_write_direct() calls), but iter is not available there.

I could just check iter_is_ubuf() on its own in the vfs rw path, but I 
would like to keep the checks as close together as possible.

Thanks,
John


* Re: [PATCH 3/6] fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol
  2024-02-02 17:52   ` Darrick J. Wong
  2024-02-03  7:40     ` Ojaswin Mujoo
@ 2024-02-05 12:51     ` John Garry
  2024-02-13 17:22       ` Darrick J. Wong
  1 sibling, 1 reply; 68+ messages in thread
From: John Garry @ 2024-02-05 12:51 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 02/02/2024 17:52, Darrick J. Wong wrote:
> On Wed, Jan 24, 2024 at 02:26:42PM +0000, John Garry wrote:
>> Add initial support for FS_XFLAG_ATOMICWRITES in rtvol.
>>
>> Current kernel support for atomic writes is based on HW support (for atomic
>> writes). As such, it is required to ensure extent alignment with
>> atomic_write_unit_max so that an atomic write can result in a single
>> HW-compliant IO operation.
>>
>> rtvol already guarantees extent alignment, so initially add support there.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/libxfs/xfs_format.h |  8 ++++++--
>>   fs/xfs/libxfs/xfs_sb.c     |  2 ++
>>   fs/xfs/xfs_inode.c         | 22 ++++++++++++++++++++++
>>   fs/xfs/xfs_inode.h         |  7 +++++++
>>   fs/xfs/xfs_ioctl.c         | 19 +++++++++++++++++--
>>   fs/xfs/xfs_mount.h         |  2 ++
>>   fs/xfs/xfs_super.c         |  4 ++++
>>   7 files changed, 60 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
>> index 382ab1e71c0b..79fb0d4adeda 100644
>> --- a/fs/xfs/libxfs/xfs_format.h
>> +++ b/fs/xfs/libxfs/xfs_format.h
>> @@ -353,11 +353,13 @@ xfs_sb_has_compat_feature(
>>   #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
>>   #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
>>   #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
>> +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 29)	/* aligned file data extents */
> 
> I thought FORCEALIGN was going to signal aligned file data extent
> allocations being mandatory?

Right, I'll fix that comment

> 
> This flag (AFAICT) simply marks the inode as something that gets
> FMODE_CAN_ATOMIC_WRITES, right?

Correct

> 
>>   #define XFS_SB_FEAT_RO_COMPAT_ALL \
>>   		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
>>   		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
>>   		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
>> -		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
>> +		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
>> +		 XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
>>   #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
>>   static inline bool
>>   xfs_sb_has_ro_compat_feature(
>> @@ -1085,16 +1087,18 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
>>   #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
>>   #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
>>   #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
>> +#define XFS_DIFLAG2_ATOMICWRITES_BIT 6
> 
> Needs a comment here ("files flagged for atomic writes"). 

ok

> Also not sure
> why you skipped bit 5, though I'm guessing it's because the forcealign
> series is/was using it?

Right, I'll fix that

> 
>>   #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
>>   #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
>>   #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
>>   #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
>>   #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
>> +#define XFS_DIFLAG2_ATOMICWRITES	(1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
>>   
>>   #define XFS_DIFLAG2_ANY \
>>   	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
>> -	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
>> +	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_ATOMICWRITES)
>>   
>>   static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
>>   {
>> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
>> index 4a9e8588f4c9..28a98130a56d 100644
>> --- a/fs/xfs/libxfs/xfs_sb.c
>> +++ b/fs/xfs/libxfs/xfs_sb.c
>> @@ -163,6 +163,8 @@ xfs_sb_version_to_features(
>>   		features |= XFS_FEAT_REFLINK;
>>   	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
>>   		features |= XFS_FEAT_INOBTCNT;
>> +	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
>> +		features |= XFS_FEAT_ATOMICWRITES;
>>   	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
>>   		features |= XFS_FEAT_FTYPE;
>>   	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 1fd94958aa97..0b0f525fd043 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -65,6 +65,26 @@ xfs_get_extsz_hint(
>>   	return 0;
>>   }
>>   
>> +/*
>> + * helper function to extract extent size
> 
> How does that differ from xfs_get_extsz_hint?

The idea of this function is to return the guaranteed extent alignment, 
and not just the hint

> 
>> + */
>> +xfs_extlen_t
>> +xfs_get_extsz(
>> +	struct xfs_inode	*ip)
>> +{
>> +	/*
>> +	 * No point in aligning allocations if we need to COW to actually
>> +	 * write to them.
> 
> What does alwayscow have to do with untorn writes?

Nothing at the moment, so I'll remove this.

> 
>> +	 */
>> +	if (xfs_is_always_cow_inode(ip))
>> +		return 0;
>> +
>> +	if (XFS_IS_REALTIME_INODE(ip))
>> +		return ip->i_mount->m_sb.sb_rextsize;
>> +
>> +	return 1;
>> +}
> 
> Does this function exist to return the allocation unit for a given file?
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=b8ddcef3df8da02ed2c4aacbed1d811e60372006
> 

Yes, something like xfs_inode_alloc_unitsize() there.

What's the upstream status for that change? I see it mentioned in 
linux-xfs lore and it seems to be part of a mega patchset.

>> +
>>   /*
>>    * Helper function to extract CoW extent size hint from inode.
>>    * Between the extent size hint and the CoW extent size hint, we
>> @@ -629,6 +649,8 @@ xfs_ip2xflags(
>>   			flags |= FS_XFLAG_DAX;
>>   		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
>>   			flags |= FS_XFLAG_COWEXTSIZE;
>> +		if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
>> +			flags |= FS_XFLAG_ATOMICWRITES;
>>   	}
>>   
>>   	if (xfs_inode_has_attr_fork(ip))
>> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
>> index 97f63bacd4c2..0e0a21d9d30f 100644
>> --- a/fs/xfs/xfs_inode.h
>> +++ b/fs/xfs/xfs_inode.h
>> @@ -305,6 +305,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
>>   	return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
>>   }
>>   
>> +static inline bool xfs_inode_atomicwrites(struct xfs_inode *ip)
> 
> I think this predicate wants a verb in its name, the rest of them have
> "is" or "has" somewhere:
> 
> "xfs_inode_has_atomicwrites"

ok, fine.

Note that I was copying xfs_inode_forcealign() in terms of naming.

> 
>> +{
>> +	return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
>> +}
>> +
>>   /*
>>    * Return the buftarg used for data allocations on a given inode.
>>    */
>> @@ -542,7 +547,9 @@ void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
>>   				struct xfs_inode *ip1, uint ip1_mode);
>>   
>>   xfs_extlen_t	xfs_get_extsz_hint(struct xfs_inode *ip);
>> +xfs_extlen_t	xfs_get_extsz(struct xfs_inode *ip);
>>   xfs_extlen_t	xfs_get_cowextsz_hint(struct xfs_inode *ip);
>> +xfs_extlen_t	xfs_get_atomicwrites_size(struct xfs_inode *ip);
>>   
>>   int xfs_init_new_inode(struct mnt_idmap *idmap, struct xfs_trans *tp,
>>   		struct xfs_inode *pip, xfs_ino_t ino, umode_t mode,
>> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
>> index f02b6e558af5..c380a3055be7 100644
>> --- a/fs/xfs/xfs_ioctl.c
>> +++ b/fs/xfs/xfs_ioctl.c
>> @@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
>>   		di_flags2 |= XFS_DIFLAG2_DAX;
>>   	if (xflags & FS_XFLAG_COWEXTSIZE)
>>   		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
>> +	if (xflags & FS_XFLAG_ATOMICWRITES)
>> +		di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
>>   
>>   	return di_flags2;
>>   }
>> @@ -1122,10 +1124,12 @@ xfs_ioctl_setattr_xflags(
>>   {
>>   	struct xfs_mount	*mp = ip->i_mount;
>>   	bool			rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
>> +	bool			atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
>>   	uint64_t		i_flags2;
>>   
>> -	if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
>> -		/* Can't change realtime flag if any extents are allocated. */
> 
> Please augment this comment ("Can't change realtime or atomicwrites
> flags if any extents are allocated") instead of deleting it.

I wasn't supposed to delete that - will remedy.

>  This is
> validation code, the requirements should be spelled out in English.
> 
>> +
>> +	if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
>> +	    atomic_writes != xfs_inode_atomicwrites(ip)) {
>>   		if (ip->i_df.if_nextents || ip->i_delayed_blks)
>>   			return -EINVAL;
>>   	}
>> @@ -1146,6 +1150,17 @@ xfs_ioctl_setattr_xflags(
>>   	if (i_flags2 && !xfs_has_v3inodes(mp))
>>   		return -EINVAL;
>>   
>> +	if (atomic_writes) {
>> +		if (!xfs_has_atomicwrites(mp))
>> +			return -EINVAL;
>> +
>> +		if (!rtflag)
>> +			return -EINVAL;
>> +
>> +		if (!is_power_of_2(mp->m_sb.sb_rextsize))
>> +			return -EINVAL;
> 
> Shouldn't we check sb_rextsize w.r.t. the actual block device queue
> limits here?  I keep seeing similar validation logic open-coded
> throughout both atomic write patchsets:
> 
> 	if (l < queue_atomic_write_unit_min_bytes())
> 		/* fail */
> 	if (l > queue_atomic_write_unit_max_bytes())
> 		/* fail */
> 	if (!is_power_of_2(l))
> 		/* fail */
> 	/* ok */
> 
> which really should be a common helper somewhere.

I think that it is a reasonable comment about the duplication of the 
atomic write checks for the bdev and iomap write paths - I can try to 
improve that.

But the is_power_of_2(mp->m_sb.sb_rextsize) check is to ensure that the 
extent size is suitable for enabling atomic writes. I don't see a point 
in checking the bdev queue limits here.

> 
> 		/*
> 		 * Don't set atomic write if the allocation unit doesn't
> 		 * align with the device requirements.
> 		 */
> 		if (!bdev_validate_atomic_write(<target blockdev>,
> 				XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize))
> 			return -EINVAL;
> 
> Too bad we have to figure out the target blockdev and file allocation
> unit based on the ioctl in-params and can't use the xfs_inode helpers
> here.

I am not sure what bdev_validate_atomic_write() would even do. If 
sb_rextsize exceeded the bdev atomic write unit max, then we just cap 
reported atomic write unit max in statx to that which the bdev reports 
and vice-versa.

And didn't we previously have a concern that it is possible to change 
the geometry of the device? If so, not much point in this check.

Thanks,
John



* Re: [PATCH 2/6] fs: Add FS_XFLAG_ATOMICWRITES flag
  2024-02-02 17:57   ` Darrick J. Wong
@ 2024-02-05 12:58     ` John Garry
  2024-02-13  6:56       ` Christoph Hellwig
  2024-02-13 17:08       ` Darrick J. Wong
  0 siblings, 2 replies; 68+ messages in thread
From: John Garry @ 2024-02-05 12:58 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 02/02/2024 17:57, Darrick J. Wong wrote:
> On Wed, Jan 24, 2024 at 02:26:41PM +0000, John Garry wrote:
>> Add a flag indicating that a regular file is enabled for atomic writes.
> 
> This is a file attribute that mirrors an ondisk inode flag.  Actual
> support for untorn file writes (for now) depends on both the iflag and
> the underlying storage devices, which we can only really check at statx
> and pwrite time.  This is the same story as FS_XFLAG_DAX, which signals
> to the fs that we should try to enable the fsdax IO path on the file
> (instead of the regular page cache), but applications have to query
> STAT_ATTR_DAX to find out if they really got that IO path.

To be clear, are you suggesting to add this info to the commit message?

> 
> "try to enable atomic writes", perhaps? >
> (and the comment for FS_XFLAG_DAX ought to read "try to use DAX for IO")

To me that sounds like "try to use DAX for IO, and, if not possible, 
fall back on some other method" - is that really what that flag does?

Thanks,
John

> 
> --D
> 
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   include/uapi/linux/fs.h | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>> index a0975ae81e64..b5b4e1db9576 100644
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -140,6 +140,7 @@ struct fsxattr {
>>   #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
>>   #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
>>   #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
>> +#define FS_XFLAG_ATOMICWRITES	0x00020000	/* atomic writes enabled */
>>   #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
>>   
>>   /* the read-only stuff doesn't really belong here, but any other place is
>> -- 
>> 2.31.1
>>
>>



* Re: [PATCH 4/6] fs: xfs: Support atomic write for statx
  2024-02-02 18:05   ` Darrick J. Wong
@ 2024-02-05 13:10     ` John Garry
  2024-02-13 17:37       ` Darrick J. Wong
  0 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-02-05 13:10 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 02/02/2024 18:05, Darrick J. Wong wrote:
> On Wed, Jan 24, 2024 at 02:26:43PM +0000, John Garry wrote:
>> Support providing info on atomic write unit min and max for an inode.
>>
>> For simplicity, currently we limit the min at the FS block size, but a
>> lower limit could be supported in future.
>>
>> The atomic write unit min and max is limited by the guaranteed extent
>> alignment for the inode.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/xfs/xfs_iops.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>   fs/xfs/xfs_iops.h |  4 ++++
>>   2 files changed, 49 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>> index a0d77f5f512e..0890d2f70f4d 100644
>> --- a/fs/xfs/xfs_iops.c
>> +++ b/fs/xfs/xfs_iops.c
>> @@ -546,6 +546,44 @@ xfs_stat_blksize(
>>   	return PAGE_SIZE;
>>   }
>>   
>> +void xfs_get_atomic_write_attr(
> 
> static void?

We use this in the iomap and statx code

> 
>> +	struct xfs_inode *ip,
>> +	unsigned int *unit_min,
>> +	unsigned int *unit_max)
> 
> Weird indenting here.

hmmm... I thought that this was the XFS style

Can you show how it should look?

> 
>> +{
>> +	xfs_extlen_t		extsz = xfs_get_extsz(ip);
>> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
>> +	struct block_device	*bdev = target->bt_bdev;
>> +	unsigned int		awu_min, awu_max, align;
>> +	struct request_queue	*q = bdev->bd_queue;
>> +	struct xfs_mount	*mp = ip->i_mount;
>> +
>> +	/*
>> +	 * Convert to multiples of the BLOCKSIZE (as we support a minimum
>> +	 * atomic write unit of BLOCKSIZE).
>> +	 */
>> +	awu_min = queue_atomic_write_unit_min_bytes(q);
>> +	awu_max = queue_atomic_write_unit_max_bytes(q);
>> +
>> +	awu_min &= ~mp->m_blockmask;
> 
> Why do you round /down/ the awu_min value here?

This is just to ensure that we return *unit_min >= BLOCKSIZE

For example, if the bdev reports awu_min, awu_max of 1K, 64K, then after 
masking we have 0 and 64K. And the code below this gives us awu_min, 
awu_max of 4K, 64K.

Maybe there is a more logical way of doing this.

> 
>> +	awu_max &= ~mp->m_blockmask;
> 
> Actually -- since the atomic write units have to be powers of 2, why is
> rounding needed here at all?

Sure, but the bdev can report a awu_min < BLOCKSIZE

> 
>> +
>> +	align = XFS_FSB_TO_B(mp, extsz);
>> +
>> +	if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
>> +	    !is_power_of_2(align)) {
> 
> ...and if you take my suggestion to make a common helper to validate the
> atomic write unit parameters, this can collapse into:
> 
> 	alloc_unit_bytes = xfs_inode_alloc_unitsize(ip);
> 	if (!xfs_inode_has_atomicwrites(ip) ||
> 	    !bdev_validate_atomic_write(bdev, alloc_unit_bytes))  > 		/* not supported, return zeroes */
> 		*unit_min = 0;
> 		*unit_max = 0;
> 		return;
> 	}
> 
> 	*unit_min = max(alloc_unit_bytes, awu_min);
> 	*unit_max = min(alloc_unit_bytes, awu_max);

Again, we need to ensure that *unit_min >= BLOCKSIZE

Thanks,
John



* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-02 18:47   ` Darrick J. Wong
@ 2024-02-05 13:36     ` John Garry
  2024-02-06  1:15       ` Dave Chinner
  2024-02-13 17:50       ` Darrick J. Wong
  0 siblings, 2 replies; 68+ messages in thread
From: John Garry @ 2024-02-05 13:36 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 02/02/2024 18:47, Darrick J. Wong wrote:
> On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote:
>> Ensure that when creating a mapping that we adhere to all the atomic
>> write rules.
>>
>> We check that the mapping covers the complete range of the write to ensure
>> that we'll be just creating a single mapping.
>>
>> Currently minimum granularity is the FS block size, but it should be
>> possibly to support lower in future.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>> I am setting this as an RFC as I am not sure on the change in
>> xfs_iomap_write_direct() - it gives the desired result AFAICS.
>>
>>   fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 41 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 18c8f168b153..758dc1c90a42 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
>>   		}
>>   	}
>>   
>> +	if (xfs_inode_atomicwrites(ip))
>> +		bmapi_flags = XFS_BMAPI_ZERO;
> 
> Why do we want to write zeroes to the disk if we're allocating space
> even if we're not sending an atomic write?
> 
> (This might want an explanation for why we're doing this at all -- it's
> to avoid unwritten extent conversion, which defeats hardware untorn
> writes.)

It's to handle the scenario where we have a partially written extent, 
and then try to issue an atomic write which covers the complete extent. 
In this scenario, the iomap code will issue 2x IOs, which is 
unacceptable. So we ensure that the extent is completely written 
whenever we allocate it. At least that is my idea.

> 
> I think we should support IOCB_ATOMIC when the mapping is unwritten --
> the data will land on disk in an untorn fashion, the unwritten extent
> conversion on IO completion is itself atomic, and callers still have to
> set O_DSYNC to persist anything. 

But does this work for the scenario above?

> Then we can avoid the cost of
> BMAPI_ZERO, because double-writes aren't free.

About double-writes not being free, I thought that this was acceptable 
since we only write zeroes when initially allocating the extent, so it 
should not add too much overhead in practice, i.e. it's a one-off.

> 
>> +
>>   	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
>>   			rblocks, force, &tp);
>>   	if (error)
>> @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
>>   	if (error)
>>   		goto out_unlock;
>>   
>> +	if (flags & IOMAP_ATOMIC) {
>> +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
>> +		unsigned int unit_min, unit_max;
>> +
>> +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
>> +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
>> +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
>> +
>> +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
>> +			error = -EINVAL;
>> +			goto out_unlock;
>> +		}
>> +
>> +		if ((offset & mp->m_blockmask) ||
>> +		    (length & mp->m_blockmask)) {
>> +			error = -EINVAL;
>> +			goto out_unlock;
>> +		}
>> +
>> +		if (imap.br_blockcount == unit_min_fsb ||
>> +		    imap.br_blockcount == unit_max_fsb) {
>> +			/* ok if exactly min or max */
>> +		} else if (imap.br_blockcount < unit_min_fsb ||
>> +			   imap.br_blockcount > unit_max_fsb) {
>> +			error = -EINVAL;
>> +			goto out_unlock;
>> +		} else if (!is_power_of_2(imap.br_blockcount)) {
>> +			error = -EINVAL;
>> +			goto out_unlock;
>> +		}
>> +
>> +		if (imap.br_startoff &&
>> +		    imap.br_startoff & (imap.br_blockcount - 1)) {
> 
> Not sure why we care about the file position, it's br_startblock that
> gets passed into the bio, not br_startoff.

We just want to ensure that the length of the write is valid w.r.t. the 
offset within the extent, and br_startoff would be the offset within 
the aligned extent.

> 
> I'm also still not convinced that any of this validation is useful here.
> The block device stack underneath the filesystem can change at any time
> without any particular notice to the fs, so the only way to find out if
> the proposed IO would meet the alignment constraints is to submit_bio
> and see what happens.

I am not sure what submit_bio() would do differently. If the block 
device is changing underneath the block layer, then that is where these 
things need to be checked.

> 
> (The "one bio per untorn write request" thing in the direct-io.c patch
> sound sane to me though.)
> 

ok

Thanks,
John


* Re: [PATCH 1/6] fs: iomap: Atomic write support
  2024-01-24 14:26 ` [PATCH 1/6] fs: iomap: Atomic write support John Garry
  2024-02-02 17:25   ` Darrick J. Wong
@ 2024-02-05 15:20   ` Pankaj Raghav (Samsung)
  2024-02-05 15:41     ` John Garry
  1 sibling, 1 reply; 68+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-05 15:20 UTC (permalink / raw)
  To: John Garry
  Cc: hch, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin, p.raghav

On Wed, Jan 24, 2024 at 02:26:40PM +0000, John Garry wrote:
> Add flag IOMAP_ATOMIC_WRITE to indicate to the FS that an atomic write
> bio is being created and all the rules there need to be followed.
> 
> It is the task of the FS iomap iter callbacks to ensure that the mapping
> created adheres to those rules, like size is power-of-2, is at a
> naturally-aligned offset, etc. However, checking for a single iovec, i.e.
> iter type is ubuf, is done in __iomap_dio_rw().
> 
> A write should only produce a single bio, so error when it doesn't.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/iomap/direct-io.c  | 21 ++++++++++++++++++++-
>  fs/iomap/trace.h      |  3 ++-
>  include/linux/iomap.h |  1 +
>  3 files changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..25736d01b857 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -275,10 +275,12 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		struct iomap_dio *dio)
>  {
> +	bool atomic_write = iter->flags & IOMAP_ATOMIC;

Minor nit: the commit says IOMAP_ATOMIC_WRITE and you set the enum as
IOMAP_ATOMIC in the code.

As the atomic semantics only apply to write, the commit could be just
reworded to reflect the code?

<snip>
> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
> index c16fd55f5595..c95576420bca 100644
> --- a/fs/iomap/trace.h
> +++ b/fs/iomap/trace.h
> @@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
>  	{ IOMAP_REPORT,		"REPORT" }, \
>  	{ IOMAP_FAULT,		"FAULT" }, \
>  	{ IOMAP_DIRECT,		"DIRECT" }, \
> -	{ IOMAP_NOWAIT,		"NOWAIT" }
> +	{ IOMAP_NOWAIT,		"NOWAIT" }, \
> +	{ IOMAP_ATOMIC,		"ATOMIC" }
>  


* Re: [PATCH 1/6] fs: iomap: Atomic write support
  2024-02-05 15:20   ` Pankaj Raghav (Samsung)
@ 2024-02-05 15:41     ` John Garry
  0 siblings, 0 replies; 68+ messages in thread
From: John Garry @ 2024-02-05 15:41 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: hch, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin, p.raghav

On 05/02/2024 15:20, Pankaj Raghav (Samsung) wrote:
> On Wed, Jan 24, 2024 at 02:26:40PM +0000, John Garry wrote:
>> Add flag IOMAP_ATOMIC_WRITE to indicate to the FS that an atomic write
>> bio is being created and all the rules there need to be followed.
>>
>> It is the task of the FS iomap iter callbacks to ensure that the mapping
>> created adheres to those rules, like size is power-of-2, is at a
>> naturally-aligned offset, etc. However, checking for a single iovec, i.e.
>> iter type is ubuf, is done in __iomap_dio_rw().
>>
>> A write should only produce a single bio, so error when it doesn't.
>>
>> Signed-off-by: John Garry <john.g.garry@oracle.com>
>> ---
>>   fs/iomap/direct-io.c  | 21 ++++++++++++++++++++-
>>   fs/iomap/trace.h      |  3 ++-
>>   include/linux/iomap.h |  1 +
>>   3 files changed, 23 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index bcd3f8cf5ea4..25736d01b857 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -275,10 +275,12 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
>>   static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   		struct iomap_dio *dio)
>>   {
>> +	bool atomic_write = iter->flags & IOMAP_ATOMIC;
> 
> Minor nit: the commit says IOMAP_ATOMIC_WRITE and you set the enum as
> IOMAP_ATOMIC in the code.

Thanks for spotting this

> 
> As the atomic semantics only apply to write, the commit could be just
> reworded to reflect the code?

Yes, so I was advised to change IOMAP_ATOMIC_WRITE -> IOMAP_ATOMIC, as 
this flag is just a write modifier. I just didn't update the commit message.

Thanks,
John

> 
> <snip>
>> diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
>> index c16fd55f5595..c95576420bca 100644
>> --- a/fs/iomap/trace.h
>> +++ b/fs/iomap/trace.h
>> @@ -98,7 +98,8 @@ DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
>>   	{ IOMAP_REPORT,		"REPORT" }, \
>>   	{ IOMAP_FAULT,		"FAULT" }, \
>>   	{ IOMAP_DIRECT,		"DIRECT" }, \
>> -	{ IOMAP_NOWAIT,		"NOWAIT" }
>> +	{ IOMAP_NOWAIT,		"NOWAIT" }, \
>> +	{ IOMAP_ATOMIC,		"ATOMIC" }
>>   



* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-05 13:36     ` John Garry
@ 2024-02-06  1:15       ` Dave Chinner
  2024-02-06  9:53         ` John Garry
  2024-02-13 17:50       ` Darrick J. Wong
  1 sibling, 1 reply; 68+ messages in thread
From: Dave Chinner @ 2024-02-06  1:15 UTC (permalink / raw)
  To: John Garry
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

On Mon, Feb 05, 2024 at 01:36:03PM +0000, John Garry wrote:
> On 02/02/2024 18:47, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote:
> > > Ensure that when creating a mapping that we adhere to all the atomic
> > > write rules.
> > > 
> > > We check that the mapping covers the complete range of the write to ensure
> > > that we'll be just creating a single mapping.
> > > 
> > > Currently minimum granularity is the FS block size, but it should be
> > > possible to support smaller sizes in future.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > > I am setting this as an RFC as I am not sure on the change in
> > > xfs_iomap_write_direct() - it gives the desired result AFAICS.
> > > 
> > >   fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 41 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 18c8f168b153..758dc1c90a42 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
> > >   		}
> > >   	}
> > > +	if (xfs_inode_atomicwrites(ip))
> > > +		bmapi_flags = XFS_BMAPI_ZERO;

We really, really don't want to be doing this during allocation
unless we can avoid it. If the filesystem block size is 64kB, we
could be allocating up to 96GB per extent, and that becomes an
uninterruptable write stream inside a transaction context that holds
inode metadata locked.

IOWs, if the inode is already dirty, this data zeroing effectively
pins the tail of the journal until the data writes complete, and
hence can potentially stall the entire filesystem for that length of
time.

Historical note: XFS_BMAPI_ZERO was introduced for DAX where
unwritten extents are not used for initial allocation because the
direct zeroing overhead is typically much lower than unwritten
extent conversion overhead.  It was not intended as a general
purpose "zero data at allocation time" solution primarily because of
how easy it would be to DOS the storage with a single, unkillable
fallocate() call on slow storage.

> > Why do we want to write zeroes to the disk if we're allocating space
> > even if we're not sending an atomic write?
> > 
> > (This might want an explanation for why we're doing this at all -- it's
> > to avoid unwritten extent conversion, which defeats hardware untorn
> > writes.)
> 
> It's to handle the scenario where we have a partially written extent, and
> then try to issue an atomic write which covers the complete extent.

When/how would that ever happen with the forcealign bits being set
preventing unaligned allocation and writes?

> In this
> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
> ensure that the extent is completely written whenever we allocate it. At
> least that is my idea.

So return an unaligned extent, and then the IOMAP_ATOMIC checks you
add below say "no" and then the application has to do things the
slow, safe way....

> > I think we should support IOCB_ATOMIC when the mapping is unwritten --
> > the data will land on disk in an untorn fashion, the unwritten extent
> > conversion on IO completion is itself atomic, and callers still have to
> > set O_DSYNC to persist anything.
> 
> But does this work for the scenario above?

Probably not, but if we want the mapping to return a single
contiguous extent mapping that spans both unwritten and written
states, then we should directly code that behaviour for atomic
IO and not try to hack around it via XFS_BMAPI_ZERO.

Unwritten extent conversion will already do the right thing in that
it will only convert unwritten regions to written in the larger
range that is passed to it, but if there are multiple regions that
need conversion then the conversion won't be atomic.

> > Then we can avoid the cost of
> > BMAPI_ZERO, because double-writes aren't free.
> 
> About double-writes not being free, I thought that this was acceptable to
> just have this write zero when initially allocating the extent as it should
> not add too much overhead in practice, i.e. it's one off.

The whole point about atomic writes is they are a performance
optimisation. If the cost of enabling atomic writes is that we
double the amount of IO we are doing, then we've lost more
performance than we gained by using atomic writes. That doesn't
seem desirable....

> 
> > 
> > > +
> > >   	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
> > >   			rblocks, force, &tp);
> > >   	if (error)
> > > @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
> > >   	if (error)
> > >   		goto out_unlock;
> > > +	if (flags & IOMAP_ATOMIC) {
> > > +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
> > > +		unsigned int unit_min, unit_max;
> > > +
> > > +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> > > +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
> > > +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
> > > +
> > > +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		if ((offset & mp->m_blockmask) ||
> > > +		    (length & mp->m_blockmask)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}

That belongs in the iomap DIO setup code, not here. It's also only
checking the data offset/length is filesystem block aligned, not
atomic IO aligned, too.

> > > +
> > > +		if (imap.br_blockcount == unit_min_fsb ||
> > > +		    imap.br_blockcount == unit_max_fsb) {
> > > +			/* ok if exactly min or max */

Why? Exact sizing doesn't imply alignment is correct.

> > > +		} else if (imap.br_blockcount < unit_min_fsb ||
> > > +			   imap.br_blockcount > unit_max_fsb) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;

Why do this after an exact check?

> > > +		} else if (!is_power_of_2(imap.br_blockcount)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;

Why does this matter? If the extent mapping spans a range larger
than was asked for, who cares what size it is as the infrastructure
is only going to do IO for the sub-range in the mapping the user
asked for....

> > > +		}
> > > +
> > > +		if (imap.br_startoff &&
> > > +		    imap.br_startoff & (imap.br_blockcount - 1)) {
> > 
> > Not sure why we care about the file position, it's br_startblock that
> > gets passed into the bio, not br_startoff.
> 
> We just want to ensure that the length of the write is valid w.r.t. to the
> offset within the extent, and br_startoff would be the offset within the
> aligned extent.

I'm not sure why the filesystem extent mapping code needs to care
about IOMAP_ATOMIC like this - the extent allocation behaviour is
determined by the inode forcealign flag, not IOMAP_ATOMIC.
Everything else we have to do is just mapping the offset/len that
was passed to it from the iomap DIO layer. As long as we allocate
with correct alignment and return a mapping that spans the start
offset of the requested range, we've done our job here.

Actually determining if the mapping returned for IO is suitable for
the type of IO we are doing (i.e. IOMAP_ATOMIC) is the
responsibility of the iomap infrastructure. The same checks will
have to be done for every filesystem that implements atomic writes,
so these checks belong in the generic code, not the filesystem
mapping callouts.

-Dave
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-06  1:15       ` Dave Chinner
@ 2024-02-06  9:53         ` John Garry
  2024-02-07  0:06           ` Dave Chinner
  0 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-02-06  9:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

Hi Dave,

>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>>>> index 18c8f168b153..758dc1c90a42 100644
>>>> --- a/fs/xfs/xfs_iomap.c
>>>> +++ b/fs/xfs/xfs_iomap.c
>>>> @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
>>>>    		}
>>>>    	}
>>>> +	if (xfs_inode_atomicwrites(ip))
>>>> +		bmapi_flags = XFS_BMAPI_ZERO;
> 
> We really, really don't want to be doing this during allocation
> unless we can avoid it. If the filesystem block size is 64kB, we
> could be allocating up to 96GB per extent, and that becomes an
> uninterruptable write stream inside a transaction context that holds
> inode metadata locked.

Where does that 96GB figure come from?

> 
> IOWs, if the inode is already dirty, this data zeroing effectively
> pins the tail of the journal until the data writes complete, and
> hence can potentially stall the entire filesystem for that length of
> time.
> 
> Historical note: XFS_BMAPI_ZERO was introduced for DAX where
> unwritten extents are not used for initial allocation because the
> direct zeroing overhead is typically much lower than unwritten
> extent conversion overhead.  It was not intended as a general
> purpose "zero data at allocation time" solution primarily because of
> how easy it would be to DOS the storage with a single, unkillable
> fallocate() call on slow storage.

Understood

> 
>>> Why do we want to write zeroes to the disk if we're allocating space
>>> even if we're not sending an atomic write?
>>>
>>> (This might want an explanation for why we're doing this at all -- it's
>>> to avoid unwritten extent conversion, which defeats hardware untorn
>>> writes.)
>>
>> It's to handle the scenario where we have a partially written extent, and
>> then try to issue an atomic write which covers the complete extent.
> 
> When/how would that ever happen with the forcealign bits being set
> preventing unaligned allocation and writes?

Consider this scenario:

# mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda
# mount /dev/sda mnt -o rtdev=/dev/sdb
# touch  mnt/file
# /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic
write, 4096B at pos 0
# filefrag -v mnt/file
Filesystem type is: 58465342
File size of mnt/file is 4096 (1 block of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected:
flags:
     0:        0..       0:         24..        24:      1:
last,eof
mnt/file: 1 extent found
# /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file
wrote -1 bytes at pos 0 write_size=16384
#

For the 2nd write, which would cover a 16KB extent, the iomap code will 
iter twice and produce 2x BIOs, which we don't want - that's why it 
errors there.

With the change in this patch, instead we have something like this after 
the first write:

# /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
wrote 4096 bytes at pos 0 write_size=4096
# filefrag -v mnt/file
Filesystem type is: 58465342
File size of mnt/file is 4096 (1 block of 4096 bytes)
   ext:     logical_offset:        physical_offset: length:   expected:
flags:
     0:        0..       3:         24..        27:      4:
last,eof
mnt/file: 1 extent found
#

So the 16KB extent is in written state and the 2nd 16KB write would iter 
once, producing a single BIO.

> 
>> In this
>> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
>> ensure that the extent is completely written whenever we allocate it. At
>> least that is my idea.
> 
> So return an unaligned extent, and then the IOMAP_ATOMIC checks you
> add below say "no" and then the application has to do things the
> slow, safe way....

We have been porting atomic write support to some database apps and they 
(database developers) have had to do something like manually zero the 
complete file to get around this issue, but that's not a good user 
experience.

Note that in their case the first 4KB write is non-atomic, but that does 
not change things. They do these 4KB writes in some DB setup phase.

> 
>>> I think we should support IOCB_ATOMIC when the mapping is unwritten --
>>> the data will land on disk in an untorn fashion, the unwritten extent
>>> conversion on IO completion is itself atomic, and callers still have to
>>> set O_DSYNC to persist anything.
>>
>> But does this work for the scenario above?
> 
> Probably not, but if we want the mapping to return a single
> contiguous extent mapping that spans both unwritten and written
> states, then we should directly code that behaviour for atomic
> IO and not try to hack around it via XFS_BMAPI_ZERO.
> 
> Unwritten extent conversion will already do the right thing in that
> it will only convert unwritten regions to written in the larger
> range that is passed to it, but if there are multiple regions that
> need conversion then the conversion won't be atomic.

We would need something atomic.

> 
>>> Then we can avoid the cost of
>>> BMAPI_ZERO, because double-writes aren't free.
>>
>> About double-writes not being free, I thought that this was acceptable to
>> just have this write zero when initially allocating the extent as it should
>> not add too much overhead in practice, i.e. it's one off.
> 
> The whole point about atomic writes is they are a performance
> optimisation. If the cost of enabling atomic writes is that we
> double the amount of IO we are doing, then we've lost more
> performance than we gained by using atomic writes. That doesn't
> seem desirable....

But the zeroing is a one-off per extent allocation, right? I would 
expect just an initial overhead when the database is being created/extended.

Anyway, I did mark this as an RFC for this same reason.

> 
>>
>>>
>>>> +
>>>>    	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
>>>>    			rblocks, force, &tp);
>>>>    	if (error)
>>>> @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
>>>>    	if (error)
>>>>    		goto out_unlock;
>>>> +	if (flags & IOMAP_ATOMIC) {
>>>> +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
>>>> +		unsigned int unit_min, unit_max;
>>>> +
>>>> +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
>>>> +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
>>>> +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
>>>> +
>>>> +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
>>>> +			error = -EINVAL;
>>>> +			goto out_unlock;
>>>> +		}
>>>> +
>>>> +		if ((offset & mp->m_blockmask) ||
>>>> +		    (length & mp->m_blockmask)) {
>>>> +			error = -EINVAL;
>>>> +			goto out_unlock;
>>>> +		}
> 
> That belongs in the iomap DIO setup code, not here. It's also only
> checking the data offset/length is filesystem block aligned, not
> atomic IO aligned, too.

hmmm... I'm not sure about that. Initially XFS will only support writes 
whose size is a multiple of FS block size, and this is what we are 
checking here, even if it is not obvious.

The idea is that we can first ensure size is a multiple of FS blocksize, 
and then can use br_blockcount directly, below.

> 
>>>> +
>>>> +		if (imap.br_blockcount == unit_min_fsb ||
>>>> +		    imap.br_blockcount == unit_max_fsb) {
>>>> +			/* ok if exactly min or max */
> 
> Why? Exact sizing doesn't imply alignment is correct.

We're not checking alignment specifically, but just checking that the 
size is ok.

> 
>>>> +		} else if (imap.br_blockcount < unit_min_fsb ||
>>>> +			   imap.br_blockcount > unit_max_fsb) {
>>>> +			error = -EINVAL;
>>>> +			goto out_unlock;
> 
> Why do this after an exact check?

And this is a continuation of the size check.

> 
>>>> +		} else if (!is_power_of_2(imap.br_blockcount)) {
>>>> +			error = -EINVAL;
>>>> +			goto out_unlock;
> 
> Why does this matter? If the extent mapping spans a range larger
> than was asked for, who cares what size it is as the infrastructure
> is only going to do IO for the sub-range in the mapping the user
> asked for....

ok, so where would be a better check for power-of-2 write length? In 
iomap DIO code?

I was thinking of doing that, but I'm not so happy with scattering the checks.

> 
>>>> +		}
>>>> +
>>>> +		if (imap.br_startoff &&
>>>> +		    imap.br_startoff & (imap.br_blockcount - 1)) {
>>>
>>> Not sure why we care about the file position, it's br_startblock that
>>> gets passed into the bio, not br_startoff.
>>
>> We just want to ensure that the length of the write is valid w.r.t. to the
>> offset within the extent, and br_startoff would be the offset within the
>> aligned extent.
> 
> I'm not sure why the filesystem extent mapping code needs to care
> about IOMAP_ATOMIC like this - the extent allocation behaviour is
> determined by the inode forcealign flag, not IOMAP_ATOMIC.
> Everything else we have to do is just mapping the offset/len that
> was passed to it from the iomap DIO layer. As long as we allocate
> with correct alignment and return a mapping that spans the start
> offset of the requested range, we've done our job here.
> 
> Actually determining if the mapping returned for IO is suitable for
> the type of IO we are doing (i.e. IOMAP_ATOMIC) is the
> responsibility of the iomap infrastructure. The same checks will
> have to be done for every filesystem that implements atomic writes,
> so these checks belong in the generic code, not the filesystem
> mapping callouts.

We can move some of these checks to the core iomap code.

However, the core iomap code does not know FS atomic write min and max 
per inode, so we need some checks here.

Thanks,
John



* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-06  9:53         ` John Garry
@ 2024-02-07  0:06           ` Dave Chinner
  2024-02-07 14:13             ` John Garry
  0 siblings, 1 reply; 68+ messages in thread
From: Dave Chinner @ 2024-02-07  0:06 UTC (permalink / raw)
  To: John Garry
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

On Tue, Feb 06, 2024 at 09:53:11AM +0000, John Garry wrote:
> Hi Dave,
> 
> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > > index 18c8f168b153..758dc1c90a42 100644
> > > > > --- a/fs/xfs/xfs_iomap.c
> > > > > +++ b/fs/xfs/xfs_iomap.c
> > > > > @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
> > > > >    		}
> > > > >    	}
> > > > > +	if (xfs_inode_atomicwrites(ip))
> > > > > +		bmapi_flags = XFS_BMAPI_ZERO;
> > 
> > We really, really don't want to be doing this during allocation
> > unless we can avoid it. If the filesystem block size is 64kB, we
> > could be allocating up to 96GB per extent, and that becomes an
> > uninterruptable write stream inside a transaction context that holds
> > inode metadata locked.
> 
> Where does that 96GB figure come from?

My inability to do math. The actual number is 128GB.

Max extent size = XFS_MAX_BMBT_EXTLEN * fs block size.
	        = 2^21 * fs block size.

So for a 4kB block size filesystem, that's 8GB max extent length,
and that's the most we will allocate in a single transaction (i.e.
one new BMBT record).

For 64kB block size, we can get 128GB of space allocated in a single
transaction.

> > > > Why do we want to write zeroes to the disk if we're allocating space
> > > > even if we're not sending an atomic write?
> > > > 
> > > > (This might want an explanation for why we're doing this at all -- it's
> > > > to avoid unwritten extent conversion, which defeats hardware untorn
> > > > writes.)
> > > 
> > > It's to handle the scenario where we have a partially written extent, and
> > > then try to issue an atomic write which covers the complete extent.
> > 
> > When/how would that ever happen with the forcealign bits being set
> > preventing unaligned allocation and writes?
> 
> Consider this scenario:
> 
> # mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda
> # mount /dev/sda mnt -o rtdev=/dev/sdb
> # touch  mnt/file
> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic
> write, 4096B at pos 0

Please don't write one-off custom test programs to issue IO - please
use and enhance xfs_io so the test cases can then be put straight
into fstests without adding yet another "do some minor IO variant"
test program. This also means you don't need a random assortment of
other tools.

i.e.

# xfs_io -dc "pwrite -VA 0 4096" /root/mnt/file

Should do an RWF_ATOMIC IO, and

# xfs_io -dc "pwrite -VAD 0 4096" /root/mnt/file

should do an RWF_ATOMIC|RWF_DSYNC IO...


> # filefrag -v mnt/file

xfs_io -c "fiemap" mnt/file

> Filesystem type is: 58465342
> File size of mnt/file is 4096 (1 block of 4096 bytes)
>   ext:     logical_offset:        physical_offset: length:   expected:
> flags:
>     0:        0..       0:         24..        24:      1:
> last,eof
> mnt/file: 1 extent found
> # /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file
> wrote -1 bytes at pos 0 write_size=16384
> #

Whole test as one repeatable command:

# xfs_io -d -c "truncate 0" -c "chattr +r" \
	-c "pwrite -VAD 0 4096" \
	-c "fiemap" \
	-c "pwrite -VAD 0 16384" \
	/root/mnt/file
> 
> For the 2nd write, which would cover a 16KB extent, the iomap code will iter
> twice and produce 2x BIOs, which we don't want - that's why it errors there.

Yes, but I think that's a feature.  You've optimised the filesystem
layout for IO that is 64kB sized and aligned IO, but your test case
is mixing 4kB and 16KB IO. The filesystem should be telling you that
you're doing something that is sub-optimal for its configuration,
and refusing to do weird overlapping sub-rtextsize atomic IO is a
pretty good sign that you've got something wrong.

The whole reason for rtextsize existing is to optimise the rtextent
allocation to the typical minimum IO size done to that volume. If
all your IO is sub-rtextsize size and alignment, then all that has
been done is forcing the entire rt device IO into a corner it was
never really intended nor optimised for.

Why should we jump through crazy hoops to try to make filesystems
optimised for large IOs with mismatched, overlapping small atomic
writes?

> With the change in this patch, instead we have something like this after the
> first write:
> 
> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
> wrote 4096 bytes at pos 0 write_size=4096
> # filefrag -v mnt/file
> Filesystem type is: 58465342
> File size of mnt/file is 4096 (1 block of 4096 bytes)
>   ext:     logical_offset:        physical_offset: length:   expected:
> flags:
>     0:        0..       3:         24..        27:      4:
> last,eof
> mnt/file: 1 extent found
> #
> 
> So the 16KB extent is in written state and the 2nd 16KB write would iter
> once, producing a single BIO.

Sure, I know how it works. My point is that it's a terrible way to
go about allowing that second atomic write to succeed.

> > > In this
> > > scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
> > > ensure that the extent is completely written whenever we allocate it. At
> > > least that is my idea.
> > 
> > So return an unaligned extent, and then the IOMAP_ATOMIC checks you
> > add below say "no" and then the application has to do things the
> > slow, safe way....
> 
> We have been porting atomic write support to some database apps and they
> (database developers) have had to do something like manually zero the
> complete file to get around this issue, but that's not a good user
> experience.

Better the application zeros the file when it is being initialised
and doesn't have performance constraints rather than forcing the
filesystem to do it in the IO fast path when IO performance and
latency actually matters to the application.

There are production databases that already do this manual zero
initialisation to avoid unwritten extent conversion overhead during
runtime operation. That's because they want FUA writes to work, and
that gives 25% better IO performance over the same O_DSYNC writes
doing allocation and/or unwritten extent conversion after
fallocate() which requires journal flushes with O_DSYNC writes.

Using atomic writes is no different.

> Note that in their case the first 4KB write is non-atomic, but that does not
> change things. They do these 4KB writes in some DB setup phase.

And therein lies the problem.

If you are doing sub-rtextent IO at all, then you are forcing the
filesystem down the path of explicitly using unwritten extents and
requiring O_DSYNC direct IO to do journal flushes in IO completion
context, and then performance just goes downhill from there.

The requirement for unwritten extents to track sub-rtextsize written
regions is what you're trying to work around with XFS_BMAPI_ZERO so
that atomic writes will always see "atomic write aligned" allocated
regions.

Do you see the problem here? You've explicitly told the filesystem
that allocation is aligned to 64kB chunks, then because the
filesystem block size is 4kB, it's allowed to track unwritten
regions at 4kB boundaries. Then you do 4kB aligned file IO, which
then changes unwritten extents at 4kB boundaries. Then you do a
overlapping 16kB IO that *requires* 16kB allocation alignment, and
things go BOOM.

Yes, they should go BOOM.

This is a horrible configuration - it is incompatible with 16kB
aligned and sized atomic IO. Allocation is aligned to 64kB, written
region tracking is aligned to 4kB, and there's nothing to tell the
filesystem that it should be maintaining 16kB "written alignment" so
that 16kB atomic writes can always be issued atomically.

i.e. if we are going to do 16kB aligned atomic IO, then all the
allocation and unwritten tracking needs to be done in 16kB aligned
chunks, not 4kB. That means a 4KB write into an unwritten region or
a hole actually needs to zero the rest of the 16KB range it sits
within.

The direct IO code can do this, but it needs extension of the
unaligned IO serialisation in XFS (the alignment checks in
xfs_file_dio_write()) and the the sub-block zeroing in
iomap_dio_bio_iter() (the need_zeroing padding has to span the fs
allocation size, not the fsblock size) to do this safely.

Regardless of how we do it, all IO concurrency on this file is shot
if we have sub-rtextent sized IOs being done. That is true even with
this patch set - XFS_BMAPI_ZERO is done whilst holding the
XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the
zeroing is being done.

IOWs, anything to do with sub-rtextent IO really has to be treated
like sub-fsblock DIO - i.e. exclusive inode access until the
sub-rtextent zeroing has been completed.

> > > > I think we should support IOCB_ATOMIC when the mapping is unwritten --
> > > > the data will land on disk in an untorn fashion, the unwritten extent
> > > > conversion on IO completion is itself atomic, and callers still have to
> > > > set O_DSYNC to persist anything.
> > > 
> > > But does this work for the scenario above?
> > 
> > Probably not, but if we want the mapping to return a single
> > contiguous extent mapping that spans both unwritten and written
> > states, then we should directly code that behaviour for atomic
> > IO and not try to hack around it via XFS_BMAPI_ZERO.
> > 
> > Unwritten extent conversion will already do the right thing in that
> > it will only convert unwritten regions to written in the larger
> > range that is passed to it, but if there are multiple regions that
> > need conversion then the conversion won't be atomic.
> 
> We would need something atomic.
> 
> > 
> > > > Then we can avoid the cost of
> > > > BMAPI_ZERO, because double-writes aren't free.
> > > 
> > > About double-writes not being free, I thought that this was acceptable to
> > > just have this write zero when initially allocating the extent as it should
> > > not add too much overhead in practice, i.e. it's one off.
> > 
> > The whole point about atomic writes is they are a performance
> > optimisation. If the cost of enabling atomic writes is that we
> > double the amount of IO we are doing, then we've lost more
> > performance than we gained by using atomic writes. That doesn't
> > seem desirable....
> 
> But the zeroing is a one-off per extent allocation, right? I would expect
> just an initial overhead when the database is being created/extended.

So why can't the application do that manually like others already do
to enable FUA optimisations for O_DSYNC writes?

FWIW, we probably should look to extend fallocate() to allow
userspace to say "write real zeroes, not fake ones" so the
filesystem can call blkdev_issue_zeroout() after preallocation to
offload the zeroing to the hardware and then clear the unwritten
bits on the preallocated range...

> > > > >    	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
> > > > >    			rblocks, force, &tp);
> > > > >    	if (error)
> > > > > @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
> > > > >    	if (error)
> > > > >    		goto out_unlock;
> > > > > +	if (flags & IOMAP_ATOMIC) {
> > > > > +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
> > > > > +		unsigned int unit_min, unit_max;
> > > > > +
> > > > > +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> > > > > +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
> > > > > +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
> > > > > +
> > > > > +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
> > > > > +			error = -EINVAL;
> > > > > +			goto out_unlock;
> > > > > +		}
> > > > > +
> > > > > +		if ((offset & mp->m_blockmask) ||
> > > > > +		    (length & mp->m_blockmask)) {
> > > > > +			error = -EINVAL;
> > > > > +			goto out_unlock;
> > > > > +		}
> > 
> > That belongs in the iomap DIO setup code, not here. It's also only
> > checking the data offset/length is filesystem block aligned, not
> > atomic IO aligned, too.
> 
> hmmm... I'm not sure about that. Initially XFS will only support writes
> whose size is a multiple of FS block size, and this is what we are checking
> here, even if it is not obvious.

Which means, initially, iomap only supports writes that are a
multiple of the filesystem block size. Regardless, this should be
checked in the submission path, not in the extent mapping callback.

FWIW, we've already established above that iomap needs to handle
rtextsize chunks rather than fs block size for correct zeroing
behaviour for atomic writes, so this probably just needs to go away.

> The idea is that we can first ensure size is a multiple of FS blocksize, and
> then can use br_blockcount directly, below.

Yes, and we can do all these checks on the iomap that we return to
the iomap layer. All this is doing is running the checks on the XFS
imap before it is formatted into the iomap and returned to the
iomap layer. These checks can be done on the returned iomap in the
iomap layer if IOMAP_ATOMIC is set....

> However, the core iomap code does not know FS atomic write min and max per
> inode, so we need some checks here.

So maybe we should pass them down to iomap in the iocb when
IOCB_ATOMIC is set, or reject the IO at the filesystem layer when
checking atomic IO alignment before we pass the IO to the iomap
layer...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-07  0:06           ` Dave Chinner
@ 2024-02-07 14:13             ` John Garry
  2024-02-09  1:40               ` Dave Chinner
  0 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-02-07 14:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

On 07/02/2024 00:06, Dave Chinner wrote:
>>> We really, really don't want to be doing this during allocation
>>> unless we can avoid it. If the filesystem block size is 64kB, we
>>> could be allocating up to 96GB per extent, and that becomes an
>>> uninterruptable write stream inside a transaction context that holds
>>> inode metadata locked.
>> Where does that 96GB figure come from?
> My inability to do math. The actual number is 128GB.
> 
> Max extent size = XFS_MAX_BMBT_EXTLEN * fs block size.
> 	        = 2^21  * fs block size.
> 
> So for a 4kB block size filesystem, that's 8GB max extent length,
> and that's the most we will allocate in a single transaction (i.e.
> one new BMBT record).
> 
> For 64kB block size, we can get 128GB of space allocated in a single
> transaction.

atomic write unit max theoretical upper limit is 
rounddown_power_of_2(2^32 - 1) = 2GB

So this would be what is expected to be the largest extent size 
requested for atomic writes. I am not saying that 2GB is small, but 
certainly much smaller than 128GB.
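The two limits being compared can be checked directly: the max extent size is XFS_MAX_BMBT_EXTLEN (2^21 blocks) times the fs block size, and the atomic write unit cap quoted above is rounddown_power_of_2(2^32 - 1). A quick sketch of the arithmetic:

```python
XFS_MAX_BMBT_EXTLEN = 1 << 21  # max fs blocks in one BMBT extent record

def max_extent_bytes(fs_block_size):
    """Largest extent allocatable in a single transaction."""
    return XFS_MAX_BMBT_EXTLEN * fs_block_size

def rounddown_power_of_2(n):
    """Largest power of two <= n (kernel's rounddown_pow_of_two)."""
    return 1 << (n.bit_length() - 1)

assert max_extent_bytes(4096) == 8 << 30      # 8 GiB for 4 kB blocks
assert max_extent_bytes(65536) == 128 << 30   # 128 GiB for 64 kB blocks
assert rounddown_power_of_2(2**32 - 1) == 2 << 30  # 2 GiB atomic unit cap
```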

> 
>>>>> Why do we want to write zeroes to the disk if we're allocating space
>>>>> even if we're not sending an atomic write?
>>>>>
>>>>> (This might want an explanation for why we're doing this at all -- it's
>>>>> to avoid unwritten extent conversion, which defeats hardware untorn
>>>>> writes.)
>>>> It's to handle the scenario where we have a partially written extent, and
>>>> then try to issue an atomic write which covers the complete extent.
>>> When/how would that ever happen with the forcealign bits being set
>>> preventing unaligned allocation and writes?
>> Consider this scenario:
>>
>> # mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda
>> # mount /dev/sda mnt -o rtdev=/dev/sdb
>> # touch  mnt/file
>> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic
>> write, 4096B at pos 0
> Please don't write one-off custom test programs to issue IO - please
> use and enhance xfs_io so the test cases can then be put straight
> into fstests without adding yet another "do some minor IO variant"
> test program. This also means you don't need a random assortment of
> other tools.
> 
> i.e.
> 
> # xfs_io -dc "pwrite -VA 0 4096" /root/mnt/file
> 
> Should do an RWF_ATOMIC IO, and
> 
> # xfs_io -dc "pwrite -VAD 0 4096" /root/mnt/file
> 
> should do an RWF_ATOMIC|RWF_DSYNC IO...
> 
> 
>> # filefrag -v mnt/file
> xfs_io -c "fiemap" mnt/file

Fine, but I like using something generic for accessing block devices and 
also other FSes. I didn't think that xfs_io can do that.

Anyway, we can look to add atomic write support to xfs_io and any other 
xfs-progs

> 
>> Filesystem type is: 58465342
>> File size of mnt/file is 4096 (1 block of 4096 bytes)
>>    ext:     logical_offset:        physical_offset: length:   expected:
>> flags:
>>      0:        0..       0:         24..        24:      1:
>> last,eof
>> mnt/file: 1 extent found
>> # /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file
>> wrote -1 bytes at pos 0 write_size=16384
>> #
> Whole test as one repeatable command:
> 
> # xfs_io -d -c "truncate 0" -c "chattr +r" \
> 	-c "pwrite -VAD 0 4096" \
> 	-c "fiemap" \
> 	-c "pwrite -VAD 0 16384" \
> 	/mnt/root/file
>> For the 2nd write, which would cover a 16KB extent, the iomap code will iter
>> twice and produce 2x BIOs, which we don't want - that's why it errors there.
> Yes, but I think that's a feature.  You've optimised the filesystem
> layout for IO that is 64kB sized and aligned IO, but your test case
> is mixing 4kB and 16KB IO. The filesystem should be telling you that
> you're doing something that is sub-optimal for its configuration,
> and refusing to do weird overlapping sub-rtextsize atomic IO is a
> pretty good sign that you've got something wrong.

Then we really end up with a strange behavior for the user. I mean, the 
user may ask - "why did this 16KB atomic write pass and this one fail? 
I'm following all the rules", and then "No one said not to mix write 
sizes or not mix atomic and non-atomic writes, so should be ok. Indeed, 
that earlier 4K write for the same region passed".

Playing devil's advocate here, at least this behavior should be documented.

> 
> The whole reason for rtextsize existing is to optimise the rtextent
> allocation to the typical minimum IO size done to that volume. If
> all your IO is sub-rtextsize size and alignment, then all that has
> been done is forcing the entire rt device IO into a corner it was
> never really intended nor optimised for.

Sure, but just because we are optimized for a certain IO write size 
should not mean that other writes are disallowed or quite problematic.

> 
> Why should we jump through crazy hoops to try to make filesystems
> optimised for large IOs with mismatched, overlapping small atomic
> writes?

As mentioned, typically the atomic writes will be the same size, but we 
may have other writes of smaller size.

> 
>> With the change in this patch, instead we have something like this after the
>> first write:
>>
>> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
>> wrote 4096 bytes at pos 0 write_size=4096
>> # filefrag -v mnt/file
>> Filesystem type is: 58465342
>> File size of mnt/file is 4096 (1 block of 4096 bytes)
>>    ext:     logical_offset:        physical_offset: length:   expected:
>> flags:
>>      0:        0..       3:         24..        27:      4:
>> last,eof
>> mnt/file: 1 extent found
>> #
>>
>> So the 16KB extent is in written state and the 2nd 16KB write would iter
>> once, producing a single BIO.
> Sure, I know how it works. My point is that it's a terrible way to
> go about allowing that second atomic write to succeed.
I think 'terrible' is a bit too strong a word here. Indeed, you suggest 
manually zeroing the file to solve this problem, below, while this code 
change does the same thing automatically.

BTW, there was a request for behavior like in this patch. Please see 
this discussion on the ext4 atomic writes port:

https://lore.kernel.org/linux-ext4/ZXhb0tKFvAge%2FGWf@infradead.org/

So we should have some solution where the kernel automatically takes 
care of this unwritten extent issue.

> 
>>>> In this
>>>> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
>>>> ensure that the extent is completely written whenever we allocate it. At
>>>> least that is my idea.
>>> So return an unaligned extent, and then the IOMAP_ATOMIC checks you
>>> add below say "no" and then the application has to do things the
>>> slow, safe way....
>> We have been porting atomic write support to some database apps and they
>> (database developers) have had to do something like manually zero the
>> complete file to get around this issue, but that's not a good user
>> experience.
> Better the application zeros the file when it is being initialised
> and doesn't have performance constraints rather than forcing the
> filesystem to do it in the IO fast path when IO performance and
> latency actually matters to the application.

Can't we do both? I mean, the well-informed user can still pre-zero the 
file just to ensure we aren't doing this zero'ing with the extent 
allocation.

> 
> There are production databases that already do this manual zero
> initialisation to avoid unwritten extent conversion overhead during
> runtime operation. That's because they want FUA writes to work, and
> that gives 25% better IO performance over the same O_DSYNC writes
> doing allocation and/or unwritten extent conversion after
> fallocate() which requires journal flushes with O_DSYNC writes.
> 
> Using atomic writes is no different.

Sure, and as I said, they can do both. The user is already wise enough 
to zero the file for other performance reasons (like FUA writes).

> 
>> Note that in their case the first 4KB write is non-atomic, but that does not
>> change things. They do these 4KB writes in some DB setup phase.

JFYI, these 4K writes were in some compressed mode in the DB setup 
phase, hence the smaller size.

> And therein lies the problem.
> 
> If you are doing sub-rtextent IO at all, then you are forcing the
> filesystem down the path of explicitly using unwritten extents and
> requiring O_DSYNC direct IO to do journal flushes in IO completion
> context and then performance just goes down hill from them.
> 
> The requirement for unwritten extents to track sub-rtextsize written
> regions is what you're trying to work around with XFS_BMAPI_ZERO so
> that atomic writes will always see "atomic write aligned" allocated
> regions.
> 
> Do you see the problem here? You've explicitly told the filesystem
> that allocation is aligned to 64kB chunks, then because the
> filesystem block size is 4kB, it's allowed to track unwritten
> regions at 4kB boundaries. Then you do 4kB aligned file IO, which
> then changes unwritten extents at 4kB boundaries. Then you do a
> overlapping 16kB IO that*requires*  16kB allocation alignment, and
> things go BOOM.
> 
> Yes, they should go BOOM.
> 
> This is a horrible configuration - it is incompatible with 16kB
> aligned and sized atomic IO. 

Just because the DB may do 16KB atomic writes most of the time should 
not disallow it from any other form of writes.

Indeed at 
https://lore.kernel.org/linux-nvme/ZSR4jeSKlppLWjQy@dread.disaster.area/ 
you wrote "Not every IO to every file needs to be atomic." (sorry for 
quoting you)

So the user can do other regular writes, but you say that they should 
always be writing full extents. This just may not suit some DBs.

> Allocation is aligned to 64kB, written
> region tracking is aligned to 4kB, and there's nothing to tell the
> filesystem that it should be maintaining 16kB "written alignment" so
> that 16kB atomic writes can always be issued atomically.
> 
> i.e. if we are going to do 16kB aligned atomic IO, then all the
> allocation and unwritten tracking needs to be done in 16kB aligned
> chunks, not 4kB. That means a 4KB write into an unwritten region or
> a hole actually needs to zero the rest of the 16KB range it sits
> within.
> 
> The direct IO code can do this, but it needs extension of the
> unaligned IO serialisation in XFS (the alignment checks in
> xfs_file_dio_write()) and the sub-block zeroing in
> iomap_dio_bio_iter() (the need_zeroing padding has to span the fs
> allocation size, not the fsblock size) to do this safely.
> 
> Regardless of how we do it, all IO concurrency on this file is shot
> if we have sub-rtextent sized IOs being done. That is true even with
> this patch set - XFS_BMAPI_ZERO is done whilst holding the
> XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the
> zeroing is being done.
> 
> IOWs, anything to do with sub-rtextent IO really has to be treated
> like sub-fsblock DIO - i.e. exclusive inode access until the
> sub-rtextent zeroing has been completed.

I do understand that this is not perfect, in that we may have mixed block 
sizes being written, but I don't think that we should disallow it and 
throw an error.

I would actually think that the worst thing is that the user does not 
know this restriction.

> 
>>>>> I think we should support IOCB_ATOMIC when the mapping is unwritten --
>>>>> the data will land on disk in an untorn fashion, the unwritten extent
>>>>> conversion on IO completion is itself atomic, and callers still have to
>>>>> set O_DSYNC to persist anything.
>>>> But does this work for the scenario above?
>>> Probably not, but if we want the mapping to return a single
>>> contiguous extent mapping that spans both unwritten and written
>>> states, then we should directly code that behaviour for atomic
>>> IO and not try to hack around it via XFS_BMAPI_ZERO.
>>>
>>> Unwritten extent conversion will already do the right thing in that
>>> it will only convert unwritten regions to written in the larger
>>> range that is passed to it, but if there are multiple regions that
>>> need conversion then the conversion won't be atomic.
>> We would need something atomic.
>>
>>>>> Then we can avoid the cost of
>>>>> BMAPI_ZERO, because double-writes aren't free.
>>>> About double-writes not being free, I thought that this was acceptable to
>>>> just have this write zero when initially allocating the extent as it should
>>>> not add too much overhead in practice, i.e. it's one off.
>>> The whole point about atomic writes is they are a performance
>>> optimisation. If the cost of enabling atomic writes is that we
>>> double the amount of IO we are doing, then we've lost more
>>> performance than we gained by using atomic writes. That doesn't
>>> seem desirable....
>> But the zero'ing is a one off per extent allocation, right? I would expect
>> just an initial overhead when the database is being created/extended.
> So why can't the application do that manually like others already do
> to enable FUA optimisations for O_DSYNC writes?

Is this even officially documented as advice or a suggestion for users?

> 
> FWIW, we probably should look to extend fallocate() to allow
> userspace to say "write real zeroes, not fake ones" so the
> filesystem can call blkdev_issue_zeroout() after preallocation to
> offload the zeroing to the hardware and then clear the unwritten
> bits on the preallocated range...

ack

> 
>>>>>>     	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
>>>>>>     			rblocks, force, &tp);
>>>>>>     	if (error)
>>>>>> @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
>>>>>>     	if (error)
>>>>>>     		goto out_unlock;
>>>>>> +	if (flags & IOMAP_ATOMIC) {
>>>>>> +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
>>>>>> +		unsigned int unit_min, unit_max;
>>>>>> +
>>>>>> +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
>>>>>> +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
>>>>>> +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
>>>>>> +
>>>>>> +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
>>>>>> +			error = -EINVAL;
>>>>>> +			goto out_unlock;
>>>>>> +		}
>>>>>> +
>>>>>> +		if ((offset & mp->m_blockmask) ||
>>>>>> +		    (length & mp->m_blockmask)) {
>>>>>> +			error = -EINVAL;
>>>>>> +			goto out_unlock;
>>>>>> +		}
>>> That belongs in the iomap DIO setup code, not here. It's also only
>>> checking the data offset/length is filesystem block aligned, not
>>> atomic IO aligned, too.
>> hmmm... I'm not sure about that. Initially XFS will only support writes
>> whose size is a multiple of FS block size, and this is what we are checking
>> here, even if it is not obvious.
> Which means, initially, iomap only supports writes that are a
> multiple of the filesystem block size. Regardless, this should be
> checked in the submission path, not in the extent mapping callback.
> 
> FWIW, we've already established above that iomap needs to handle
> rtextsize chunks rather than fs block size for correct zeroing
> behaviour for atomic writes, so this probably just needs to go away.

Fine, I think that all this can be moved to iomap core / removed.

> 
>> The idea is that we can first ensure size is a multiple of FS blocksize, and
>> then can use br_blockcount directly, below.
> Yes, and we can do all these checks on the iomap that we return to
> the iomap layer. All this is doing is running the checks on the XFS
> imap before it is formatted into the iomap and returned to the
> iomap layer. These checks can be done on the returned iomap in the
> iomap layer if IOMAP_ATOMIC is set....
> 
>> However, the core iomap code does not know FS atomic write min and max per
>> inode, so we need some checks here.
> So maybe we should pass them down to iomap in the iocb when
> IOCB_ATOMIC is set, or reject the IO at the filesystem layer when
> checking atomic IO alignment before we pass the IO to the iomap
> layer...

Yes, I think that something like this is possible.

As for using kiocb, it's quite a small structure, so I can't imagine 
that it would be welcome to add all this atomic write stuff.

So we could add extra awu min and max args to iomap_dio_rw(), and these 
could be filled in by the calling FS. But that function already has 
enough args.

Alternatively we could add an iomap_ops method to look up awu min and 
max for the inode.
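One way to picture the ops-based option: the filesystem supplies a per-inode callback that the iomap layer invokes before building the bio. This is an entirely hypothetical sketch (iomap_ops is a C vtable in the kernel; names and the alignment policy here are mine, chosen to mirror the statx-reported unit min/max semantics):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AtomicWriteLimits:
    unit_min: int  # smallest untorn write the device/FS supports
    unit_max: int  # largest untorn write the device/FS supports

@dataclass
class IomapOps:
    # Hypothetical callback: FS reports per-inode atomic write limits.
    get_atomic_write_limits: Callable[[object], AtomicWriteLimits]

def dio_check_atomic(ops, inode, offset, length):
    """Submission-path check: power-of-two length within [min, max],
    naturally aligned offset -- the usual untorn-write constraints."""
    lim = ops.get_atomic_write_limits(inode)
    return (lim.unit_min <= length <= lim.unit_max
            and offset % length == 0
            and (length & (length - 1)) == 0)

ops = IomapOps(lambda inode: AtomicWriteLimits(4096, 65536))
assert dio_check_atomic(ops, None, 16384, 16384)       # aligned 16 kB: ok
assert not dio_check_atomic(ops, None, 0, 1024)        # below unit_min
```

With this shape, neither the kiocb nor the iomap_dio_rw() signature needs to grow; the limits are fetched only when IOCB_ATOMIC is set.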

Thanks,
John


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-07 14:13             ` John Garry
@ 2024-02-09  1:40               ` Dave Chinner
  2024-02-09 12:47                 ` John Garry
  0 siblings, 1 reply; 68+ messages in thread
From: Dave Chinner @ 2024-02-09  1:40 UTC (permalink / raw)
  To: John Garry
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

On Wed, Feb 07, 2024 at 02:13:23PM +0000, John Garry wrote:
> On 07/02/2024 00:06, Dave Chinner wrote:
> > > > We really, really don't want to be doing this during allocation
> > > > unless we can avoid it. If the filesystem block size is 64kB, we
> > > > could be allocating up to 96GB per extent, and that becomes an
> > > > uninterruptable write stream inside a transaction context that holds
> > > > inode metadata locked.
> > > Where does that 96GB figure come from?
> > My inability to do math. The actual number is 128GB.
> > 
> > Max extent size = XFS_MAX_BMBT_EXTLEN * fs block size.
> > 	        = 2^21  * fs block size.
> > 
> > So for a 4kB block size filesystem, that's 8GB max extent length,
> > and that's the most we will allocate in a single transaction (i.e.
> > one new BMBT record).
> > 
> > For 64kB block size, we can get 128GB of space allocated in a single
> > transaction.
> 
> atomic write unit max theoretical upper limit is rounddown_power_of_2(2^32 -
> 1) = 2GB
> 
> So this would be what is expected to be the largest extent size requested
> for atomic writes. I am not saying that 2GB is small, but certainly much
> smaller than 128GB.

*cough*

Extent size hints.

I'm a little disappointed that, after all these discussions about how
we decouple extent allocation size and alignment from the user IO
size and alignment with things like extent size hints, force align,
etc, you are still thinking that user IO size and alignment
directly drive extent allocation size and alignment....


> > > > > > Why do we want to write zeroes to the disk if we're allocating space
> > > > > > even if we're not sending an atomic write?
> > > > > > 
> > > > > > (This might want an explanation for why we're doing this at all -- it's
> > > > > > to avoid unwritten extent conversion, which defeats hardware untorn
> > > > > > writes.)
> > > > > It's to handle the scenario where we have a partially written extent, and
> > > > > then try to issue an atomic write which covers the complete extent.
> > > > When/how would that ever happen with the forcealign bits being set
> > > > preventing unaligned allocation and writes?
> > > Consider this scenario:
> > > 
> > > # mkfs.xfs -r rtdev=/dev/sdb,extsize=64K -d rtinherit=1 /dev/sda
> > > # mount /dev/sda mnt -o rtdev=/dev/sdb
> > > # touch  mnt/file
> > > # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file # direct IO, atomic
> > > write, 4096B at pos 0
> > Please don't write one-off custom test programs to issue IO - please
> > use and enhance xfs_io so the test cases can then be put straight
> > into fstests without adding yet another "do some minor IO variant"
> > test program. This also means you don't need a random assortment of
> > other tools.
> > 
> > i.e.
> > 
> > # xfs_io -dc "pwrite -VA 0 4096" /root/mnt/file
> > 
> > Should do an RWF_ATOMIC IO, and
> > 
> > # xfs_io -dc "pwrite -VAD 0 4096" /root/mnt/file
> > 
> > should do an RWF_ATOMIC|RWF_DSYNC IO...
> > 
> > 
> > > # filefrag -v mnt/file
> > xfs_io -c "fiemap" mnt/file
> 
> Fine, but I like using something generic for accessing block devices and
> also other FSes. I didn't think that xfs_io can do that.

Yes, it can. We use it extensively in fstests because it works
for any filesystem, not just XFS.

> Anyway, we can look to add atomic write support to xfs_io and any other
> xfs-progs

Please do, then the support is there for developers, users and
fstests without needing to write their own custom test programs.

> > > Filesystem type is: 58465342
> > > File size of mnt/file is 4096 (1 block of 4096 bytes)
> > >    ext:     logical_offset:        physical_offset: length:   expected:
> > > flags:
> > >      0:        0..       0:         24..        24:      1:
> > > last,eof
> > > mnt/file: 1 extent found
> > > # /test-pwritev2 -a -d -l 16384 -p 0 /root/mnt/file
> > > wrote -1 bytes at pos 0 write_size=16384
> > > #
> > Whole test as one repeatable command:
> > 
> > # xfs_io -d -c "truncate 0" -c "chattr +r" \
> > 	-c "pwrite -VAD 0 4096" \
> > 	-c "fiemap" \
> > 	-c "pwrite -VAD 0 16384" \
> > 	/mnt/root/file
> > > For the 2nd write, which would cover a 16KB extent, the iomap code will iter
> > > twice and produce 2x BIOs, which we don't want - that's why it errors there.
> > Yes, but I think that's a feature.  You've optimised the filesystem
> > layout for IO that is 64kB sized and aligned IO, but your test case
> > is mixing 4kB and 16KB IO. The filesystem should be telling you that
> you're doing something that is sub-optimal for its configuration,
> > and refusing to do weird overlapping sub-rtextsize atomic IO is a
> > pretty good sign that you've got something wrong.
> 
> Then we really end up with a strange behavior for the user. I mean, the user
> may ask - "why did this 16KB atomic write pass and this one fail? I'm
> following all the rules", and then "No one said not to mix write sizes or
> not mix atomic and non-atomic writes, so should be ok. Indeed, that earlier
> 4K write for the same region passed".
> 
> Playing devil's advocate here, at least this behavior should be documented.

That's what man pages are for, yes?

Are you expecting your deployments to be run on highly suboptimal
configurations and so the code needs to be optimised for this
behaviour, or are you expecting them to be run on correctly
configured systems which would never see these issues?


> > The whole reason for rtextsize existing is to optimise the rtextent
> > allocation to the typical minimum IO size done to that volume. If
> > all your IO is sub-rtextsize size and alignment, then all that has
> > been done is forcing the entire rt device IO into a corner it was
> > never really intended nor optimised for.
> 
> Sure, but just because we are optimized for a certain IO write size should
> not mean that other writes are disallowed or quite problematic.

Atomic writes are just "other writes". They are writes that are
*expected to fail* if they cannot be done atomically.

Application writers will quickly learn how to do sane, fast,
reliable atomic write IO if we reject anything that is going to
require some complex, sub-optimal workaround in the kernel to make
it work. The simplest solution is to -fail the write-, because
userspace *must* be prepared for *any* atomic write to fail.

> > Why should we jump through crazy hoops to try to make filesystems
> > optimised for large IOs with mismatched, overlapping small atomic
> > writes?
> 
> As mentioned, typically the atomic writes will be the same size, but we may
> have other writes of smaller size.

Then we need the tiny write to allocate and zero according to the
maximum sized atomic write bounds. Then we just don't care about
large atomic IO overlapping small IO, because the extent on disk
aligned to the large atomic IO is then always guaranteed to be the
correct size and shape.
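The rule described above amounts to rounding every sub-unit write out to atomic-write-unit boundaries before allocating and zeroing. A sketch of just that bookkeeping (pure arithmetic, not the kernel implementation; `unit` stands in for the max atomic write size):

```python
def round_out_to_unit(offset, length, unit):
    """Expand [offset, offset+length) to unit-aligned bounds. The caller
    would allocate and zero this whole range, so unwritten conversion
    only ever happens at atomic-write-unit granularity."""
    start = (offset // unit) * unit
    end = -(-(offset + length) // unit) * unit  # ceiling division
    return start, end

# A 4 kB write landing inside a 16 kB atomic-write unit must zero the
# rest of that unit:
assert round_out_to_unit(4096, 4096, 16384) == (0, 16384)
# An aligned, full-unit write needs no padding at all:
assert round_out_to_unit(16384, 16384, 16384) == (16384, 32768)
```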


> > > With the change in this patch, instead we have something like this after the
> > > first write:
> > > 
> > > # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
> > > wrote 4096 bytes at pos 0 write_size=4096
> > > # filefrag -v mnt/file
> > > Filesystem type is: 58465342
> > > File size of mnt/file is 4096 (1 block of 4096 bytes)
> > >    ext:     logical_offset:        physical_offset: length:   expected:
> > > flags:
> > >      0:        0..       3:         24..        27:      4:
> > > last,eof
> > > mnt/file: 1 extent found
> > > #
> > > 
> > > So the 16KB extent is in written state and the 2nd 16KB write would iter
> > > once, producing a single BIO.
> > Sure, I know how it works. My point is that it's a terrible way to
> > go about allowing that second atomic write to succeed.
> I think 'terrible' is a bit too strong a word here.

Doing anything in a way that a user can DOS the entire filesystem
is *terrible*. No ifs, buts or otherwise.

> Indeed, you suggest to
> manually zero the file to solve this problem, below, while this code change
> does the same thing automatically.

Yes, but I also outlined a way that it can be done automatically
without being terrible. There are multiple options here, I outlined
two different approaches that are acceptable.

> > > > > In this
> > > > > scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
> > > > > ensure that the extent is completely written whenever we allocate it. At
> > > > > least that is my idea.
> > > > So return an unaligned extent, and then the IOMAP_ATOMIC checks you
> > > > add below say "no" and then the application has to do things the
> > > > slow, safe way....
> > > We have been porting atomic write support to some database apps and they
> > > (database developers) have had to do something like manually zero the
> > > complete file to get around this issue, but that's not a good user
> > > experience.
> > Better the application zeros the file when it is being initialised
> > and doesn't have performance constraints rather than forcing the
> > filesystem to do it in the IO fast path when IO performance and
> > latency actually matters to the application.
> 
> Can't we do both? I mean, the well-informed user can still pre-zero the file
> just to ensure we aren't doing this zero'ing with the extent allocation.

I never said we can't do zeroing. I just said that it's normally
better when the application controls zeroing directly.

> > And therein lies the problem.
> > 
> > If you are doing sub-rtextent IO at all, then you are forcing the
> > filesystem down the path of explicitly using unwritten extents and
> > requiring O_DSYNC direct IO to do journal flushes in IO completion
> > context and then performance just goes down hill from them.
> > 
> > The requirement for unwritten extents to track sub-rtextsize written
> > regions is what you're trying to work around with XFS_BMAPI_ZERO so
> > that atomic writes will always see "atomic write aligned" allocated
> > regions.
> > 
> > Do you see the problem here? You've explicitly told the filesystem
> > that allocation is aligned to 64kB chunks, then because the
> > filesystem block size is 4kB, it's allowed to track unwritten
> > regions at 4kB boundaries. Then you do 4kB aligned file IO, which
> > then changes unwritten extents at 4kB boundaries. Then you do a
> > overlapping 16kB IO that*requires*  16kB allocation alignment, and
> > things go BOOM.
> > 
> > Yes, they should go BOOM.
> > 
> This is a horrible configuration - it is incompatible with 16kB
> > aligned and sized atomic IO.
> 
> Just because the DB may do 16KB atomic writes most of the time should not
> disallow it from any other form of writes.

That's not what I said. I said that using sub-rtextsize atomic writes
with single FSB unwritten extent tracking is horrible and
incompatible with doing 16kB atomic writes.

This setup will not work at all well with your patches and should go
BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup
has uncoordinated extent allocation and unwritten conversion
granularity.

That's the fundamental design problem with your approach - it allows
unwritten conversion at *minimum IO sizes* and that does not work
with atomic IOs with larger alignment requirements.

The fundamental design principle is this: for maximally sized atomic
writes to always succeed we require every allocation, zeroing and
unwritten conversion operation to use alignments and sizes that are
compatible with the maximum atomic write sizes being used.

i.e. atomic writes need to use max write size granularity for all IO
operations, not filesystem block granularity.

And that also means things like rtextsize and extsize hints need to
match these atomic write requirements, too....

> > Allocation is aligned to 64kB, written
> > region tracking is aligned to 4kB, and there's nothing to tell the
> > filesystem that it should be maintaining 16kB "written alignment" so
> > that 16kB atomic writes can always be issued atomically.
> > 
> > i.e. if we are going to do 16kB aligned atomic IO, then all the
> > allocation and unwritten tracking needs to be done in 16kB aligned
> > chunks, not 4kB. That means a 4KB write into an unwritten region or
> > a hole actually needs to zero the rest of the 16KB range it sits
> > within.
> > 
> > The direct IO code can do this, but it needs extension of the
> > unaligned IO serialisation in XFS (the alignment checks in
> > xfs_file_dio_write()) and the sub-block zeroing in
> > iomap_dio_bio_iter() (the need_zeroing padding has to span the fs
> > allocation size, not the fsblock size) to do this safely.
> > 
> > Regardless of how we do it, all IO concurrency on this file is shot
> > if we have sub-rtextent sized IOs being done. That is true even with
> > this patch set - XFS_BMAPI_ZERO is done whilst holding the
> > XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the
> > zeroing is being done.
> > 
> > IOWs, anything to do with sub-rtextent IO really has to be treated
> > like sub-fsblock DIO - i.e. exclusive inode access until the
> > sub-rtextent zeroing has been completed.
> 
> I do understand that this is not perfect in that we may have mixed block
> sizes being written, but I don't think that we should disallow it and
> throw an error.

Ummmm, did you read what you quoted?

The above is an outline of the IO path modifications that will allow
mixed IO sizes to be used with atomic writes without requiring the
XFS_BMAPI_ZERO hack. It pushes the sub-atomic write alignment
zeroing out to the existing DIO sub-block zeroing, hence ensuring
that we only ever convert unwritten extents on max sized atomic
write boundaries for atomic write enabled inodes.

At no point have I said "no mixed writes". I've said no to the
XFS_BMAPI_ZERO hack, but then I've explained the fundamental issue
that it works around and given you a decent amount of detail on how
to sanely implement mixed write support that will work (slowly)
with those configurations and IO patterns.

So it's your choice - you can continue to believe I don't want mixed
writes to work at all, or you can go back and try to understand the
IO path changes I've suggested that will allow mixed atomic writes
to work as well as they possibly can....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] fs: xfs: Support atomic write for statx
  2024-01-24 14:26 ` [PATCH 4/6] fs: xfs: Support atomic write for statx John Garry
  2024-02-02 18:05   ` Darrick J. Wong
@ 2024-02-09  7:00   ` Ojaswin Mujoo
  2024-02-09 17:30     ` John Garry
  1 sibling, 1 reply; 68+ messages in thread
From: Ojaswin Mujoo @ 2024-02-09  7:00 UTC (permalink / raw)
  To: John Garry
  Cc: hch, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio

Hi John,

Thanks for the patch, I've added some review comments and questions
below.

On Wed, Jan 24, 2024 at 02:26:43PM +0000, John Garry wrote:
> Support providing info on atomic write unit min and max for an inode.
> 
> For simplicity, currently we limit the min at the FS block size, but a
> lower limit could be supported in future.
> 
> The atomic write unit min and max is limited by the guaranteed extent
> alignment for the inode.
> 
> Signed-off-by: John Garry <john.g.garry@oracle.com>
> ---
>  fs/xfs/xfs_iops.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_iops.h |  4 ++++
>  2 files changed, 49 insertions(+)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a0d77f5f512e..0890d2f70f4d 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -546,6 +546,44 @@ xfs_stat_blksize(
>  	return PAGE_SIZE;
>  }
>  
> +void xfs_get_atomic_write_attr(
> +	struct xfs_inode *ip,
> +	unsigned int *unit_min,
> +	unsigned int *unit_max)
> +{
> +	xfs_extlen_t		extsz = xfs_get_extsz(ip);
> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> +	struct block_device	*bdev = target->bt_bdev;
> +	unsigned int		awu_min, awu_max, align;
> +	struct request_queue	*q = bdev->bd_queue;
> +	struct xfs_mount	*mp = ip->i_mount;
> +
> +	/*
> +	 * Convert to multiples of the BLOCKSIZE (as we support a minimum
> +	 * atomic write unit of BLOCKSIZE).
> +	 */
> +	awu_min = queue_atomic_write_unit_min_bytes(q);
> +	awu_max = queue_atomic_write_unit_max_bytes(q);
> +
> +	awu_min &= ~mp->m_blockmask;
> +	awu_max &= ~mp->m_blockmask;

I don't understand why we try to round down the awu_max to blocks size
here and not just have an explicit check of (awu_max < blocksize).

I think the issue with changing the awu_max is that we are using awu_max
to also indirectly reflect the alignment so as to ensure we don't cross
atomic boundaries set by the hw (eg we check uint_max % atomic alignment
== 0 in scsi). So once we change the awu_max, there's a chance that even
if an atomic write aligns to the new awu_max it still doesn't have the
right alignment and fails. 

It works right now since everything is a power of 2, but it could cause
issues in case we decide to remove that limitation. Anyway, I think
this implicit behavior of things working since everything is a power of 2
should at least be documented in a comment, so these things are
immediately clear.

> +
> +	align = XFS_FSB_TO_B(mp, extsz);
> +
> +	if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
> +	    !is_power_of_2(align)) {

Correct me if I'm wrong, but here as well, the is_power_of_2(align) is
essentially checking that align % unit_max == 0 (or vice versa if
unit_max is greater) so that an allocation of extsize will always align
nicely as needed by the device.

So maybe we should use the % expression explicitly so that the intention
is immediately clear.

> +		*unit_min = 0;
> +		*unit_max = 0;
> +	} else {
> +		if (awu_min)
> +			*unit_min = min(awu_min, align);

How will the min() here work? If awu_min is the minimum set by the
device, how can statx be allowed to advertise something smaller than
that?

If I understand correctly, right now the way we set awu_min in scsi and
nvme, the following should usually be true for a sane device:

 awu_min <= block size of fs <= align

 so the min() anyway becomes redundant, but if we do assume that there
 might be some weird devices with awu_min absurdly large (SCSI with
 high atomic granularity), we still can't actually advertise a min
 smaller than that of the device, or am I missing something here?

> +		else
> +			*unit_min = mp->m_sb.sb_blocksize;
> +
> +		*unit_max = min(awu_max, align);
> +	}
> +}
> +

Regards,
ojaswin

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/6] block atomic writes for XFS
  2024-01-24 14:26 [PATCH 0/6] block atomic writes for XFS John Garry
                   ` (5 preceding siblings ...)
  2024-01-24 14:26 ` [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set John Garry
@ 2024-02-09  7:14 ` Ojaswin Mujoo
  2024-02-09  9:22   ` John Garry
  2024-02-13  7:22 ` Christoph Hellwig
  2024-02-13  7:45 ` Ritesh Harjani
  8 siblings, 1 reply; 68+ messages in thread
From: Ojaswin Mujoo @ 2024-02-09  7:14 UTC (permalink / raw)
  To: John Garry
  Cc: hch, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio

On Wed, Jan 24, 2024 at 02:26:39PM +0000, John Garry wrote:
> This series expands atomic write support to filesystems, specifically
> XFS. Since XFS rtvol supports extent alignment already, support will
> initially be added there. When XFS forcealign feature is merged, then we
> can similarly support atomic writes for a non-rtvol filesystem.

Hi John,

Along with the rtvol check, we can also have a simple check to see if the
FS blocksize itself is big enough to satisfy the atomic requirements.
For example, on machines with a 64K page size, we can have say 16K or 64K
block sizes, which should be able to provide the required allocation
behavior for atomic writes. In such cases we don't need rtvol.

Regards,
ojaswin

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/6] block atomic writes for XFS
  2024-02-09  7:14 ` [PATCH 0/6] block atomic writes for XFS Ojaswin Mujoo
@ 2024-02-09  9:22   ` John Garry
  2024-02-12 12:06     ` Ojaswin Mujoo
  0 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-02-09  9:22 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: hch, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio

On 09/02/2024 07:14, Ojaswin Mujoo wrote:
> On Wed, Jan 24, 2024 at 02:26:39PM +0000, John Garry wrote:
>> This series expands atomic write support to filesystems, specifically
>> XFS. Since XFS rtvol supports extent alignment already, support will
>> initially be added there. When XFS forcealign feature is merged, then we
>> can similarly support atomic writes for a non-rtvol filesystem.
> 
> Hi John,
> 
> Along with the rtvol check, we can also have a simple check to see if the
> FS blocksize itself is big enough to satisfy the atomic requirements.
> For example, on machines with a 64K page size, we can have say 16K or 64K
> block sizes, which should be able to provide the required allocation
> behavior for atomic writes. In such cases we don't need rtvol.
>
I suppose we could do, but I would rather just concentrate on rtvol 
support initially, and there we do report atomic write unit min = FS 
block size (even if rt extsize is unset).

In addition, I plan to initially just support atomic write unit min = FS 
block size (for both rtvol and !rtvol).

Thanks,
John

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-09  1:40               ` Dave Chinner
@ 2024-02-09 12:47                 ` John Garry
  2024-02-13 23:41                   ` Dave Chinner
  0 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-02-09 12:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

>>
>> Playing devil's advocate here, at least this behavior should be documented.
> 
> That's what man pages are for, yes?
> 
> Are you expecting your deployments to be run on highly suboptimal
> configurations and so the code needs to be optimised for this
> behaviour, or are you expecting them to be run on correctly
> configured systems which would never see these issues?

The latter hopefully

> 
> 
>>> The whole reason for rtextsize existing is to optimise the rtextent
>>> allocation to the typical minimum IO size done to that volume. If
>>> all your IO is sub-rtextsize size and alignment, then all that has
>>> been done is forcing the entire rt device IO into a corner it was
>>> never really intended nor optimised for.
>>
>> Sure, but just because we are optimized for a certain IO write size should
>> not mean that other writes are disallowed or quite problematic.
> 
> Atomic writes are just "other writes". They are writes that are
> *expected to fail* if they cannot be done atomically.

Agreed

> 
> Application writers will quickly learn how to do sane, fast,
> reliable atomic write IO if we reject anything that is going to
> requires some complex, sub-optimal workaround in the kernel to make
> it work. The simplest solution is to -fail the write-, because
> userspace *must* be prepared for *any* atomic write to fail.

Sure, but it needs to be such that the application writer at least knows 
why it failed, which so far has not been documented.

> 
>>> Why should we jump through crazy hoops to try to make filesystems
>>> optimised for large IOs with mismatched, overlapping small atomic
>>> writes?
>>
>> As mentioned, typically the atomic writes will be the same size, but we may
>> have other writes of smaller size.
> 
> Then we need the tiny write to allocate and zero according to the
> maximum sized atomic write bounds. Then we just don't care about
> large atomic IO overlapping small IO, because the extent on disk
> aligned to the large atomic IO is then always guaranteed to be the
> correct size and shape.

I think it's worth mentioning that there is currently a separation 
between how we configure the FS extent size for atomic writes and what 
the bdev can actually support in terms of atomic writes.

Setting the rtvol extsize at mkfs time or enabling atomic writes 
FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do 
in terms of atomic writes.

This check is not done because what the bdev can do in terms of atomic
writes is not fixed - or, more specifically, what the request_queue
reports is not fixed. There are things which can change this. For
example, a FW update could change all the atomic write capabilities of a 
disk. Or even if we swapped a SCSI disk into another host the atomic 
write limits may change, as the atomic write unit max depends on the 
SCSI HBA DMA limits. Changing BIO_MAX_VECS - which could come from a 
kernel update - could also change what we report as atomic write limit 
in the request queue.

> 
> 
>>>> With the change in this patch, instead we have something like this after the
>>>> first write:
>>>>
>>>> # /test-pwritev2 -a -d -l 4096 -p 0 /root/mnt/file
>>>> wrote 4096 bytes at pos 0 write_size=4096
>>>> # filefrag -v mnt/file
>>>> Filesystem type is: 58465342
>>>> File size of mnt/file is 4096 (1 block of 4096 bytes)
>>>>     ext:     logical_offset:        physical_offset: length:   expected:
>>>> flags:
>>>>       0:        0..       3:         24..        27:      4:
>>>> last,eof
>>>> mnt/file: 1 extent found
>>>> #
>>>>
>>>> So the 16KB extent is in written state and the 2nd 16KB write would iter
>>>> once, producing a single BIO.
>>> Sure, I know how it works. My point is that it's a terrible way to
>>> go about allowing that second atomic write to succeed.
>> I think 'terrible' is a bit too strong a word here.
> 
> Doing anything in a way that a user can DOS the entire filesystem
> is *terrible*. No ifs, buts or otherwise.

Understood

> 
>> Indeed, you suggest to
>> manually zero the file to solve this problem, below, while this code change
>> does the same thing automatically.
> 
> Yes, but I also outlined a way that it can be done automatically
> without being terrible. There are multiple options here, I outlined
> two different approaches that are acceptible.

I think that I need to check these alternate solutions in more detail. 
More below.

> 
>>>>>> In this
>>>>>> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
>>>>>> ensure that the extent is completely written whenever we allocate it. At
>>>>>> least that is my idea.
>>>>> So return an unaligned extent, and then the IOMAP_ATOMIC checks you
>>>>> add below say "no" and then the application has to do things the
>>>>> slow, safe way....
>>>> We have been porting atomic write support to some database apps and they
>>>> (database developers) have had to do something like manually zero the
>>>> complete file to get around this issue, but that's not a good user
>>>> experience.
>>> Better the application zeros the file when it is being initialised
>>> and doesn't have performance constraints rather than forcing the
>>> filesystem to do it in the IO fast path when IO performance and
>>> latency actually matters to the application.
>>
>> Can't we do both? I mean, the well-informed user can still pre-zero the file
>> just to ensure we aren't doing this zero'ing with the extent allocation.
> 
> I never said we can't do zeroing. I just said that it's normally
> better when the application controls zeroing directly.

ok

> 
>>> And therein lies the problem.
>>>
>>> If you are doing sub-rtextent IO at all, then you are forcing the
>>> filesystem down the path of explicitly using unwritten extents and
>>> requiring O_DSYNC direct IO to do journal flushes in IO completion
>>> context and then performance just goes down hill from them.
>>>
>>> The requirement for unwritten extents to track sub-rtextsize written
>>> regions is what you're trying to work around with XFS_BMAPI_ZERO so
>>> that atomic writes will always see "atomic write aligned" allocated
>>> regions.
>>>
>>> Do you see the problem here? You've explicitly told the filesystem
>>> that allocation is aligned to 64kB chunks, then because the
>>> filesystem block size is 4kB, it's allowed to track unwritten
>>> regions at 4kB boundaries. Then you do 4kB aligned file IO, which
> >> then changes unwritten extents at 4kB boundaries. Then you do an
> >> overlapping 16kB IO that *requires* 16kB allocation alignment, and
>>> things go BOOM.
>>>
>>> Yes, they should go BOOM.
>>>
> >> This is a horrible configuration - it is incompatible with 16kB
> >> aligned and sized atomic IO.
>>
>> Just because the DB may do 16KB atomic writes most of the time should not
>> disallow it from any other form of writes.
> 
> That's not what I said. I said that using sub-rtextsize atomic writes
> with single FSB unwritten extent tracking is horrible and
> incompatible with doing 16kB atomic writes.
> 
> This setup will not work at all well with your patches and should go
> BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup
> has uncoordinated extent allocation and unwritten conversion
> granularity.
> 
> That's the fundamental design problem with your approach - it allows
> unwritten conversion at *minimum IO sizes* and that does not work
> with atomic IOs with larger alignment requirements.
> 
> The fundamental design principle is this: for maximally sized atomic
> writes to always succeed we require every allocation, zeroing and
> unwritten conversion operation to use alignments and sizes that are
> compatible with the maximum atomic write sizes being used.
> 

That sounds fine.

My question then is how we determine this max atomic write size granularity.

We don't explicitly tell the FS what atomic write size we want for a 
file. Rather we mkfs with some extsize value which should match our 
atomic write maximal value and then tell the FS we want to do atomic 
writes on a file, and if this is accepted then we can query the atomic 
write min and max unit size, and this would be [FS block size, min(bdev 
atomic write limit, rtexsize)].

If rtextsize is 16KB, then we have a good idea that we want 16KB atomic 
write support. So then we could use rtextsize as this max atomic write 
size. But I am not 100% sure that that is your idea (apologies if I am 
wrong - I am sincerely trying to follow your idea); rather it would be 
min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and 
the bdev atomic write limit is 16KB, then there is not much point in 
dealing in 1MB blocks for this unwritten extent conversion alignment. If 
so, then my concern is that the bdev atomic write upper limit is not 
fixed. This can be solved, but I would still like to be clear on this 
max atomic write size.

> i.e. atomic writes need to use max write size granularity for all IO
> operations, not filesystem block granularity.
> 
> And that also means things like rtextsize and extsize hints need to
> match these atomic write requirements, too....
> 

As above, I am not 100% sure if you mean these to be the atomic write 
maximal value.

>>> Allocation is aligned to 64kB, written
>>> region tracking is aligned to 4kB, and there's nothing to tell the
>>> filesystem that it should be maintaining 16kB "written alignment" so
>>> that 16kB atomic writes can always be issued atomically.

Please note that in my previous example the mkfs rtextsize arg should 
really have been 16KB, and that the intention would have been to enable 
16KB atomic writes. I used 64KB casually as I thought it should be 
possible to support sub-rtextsize atomic writes. The point which I was 
trying to make was that the 16KB atomic write and 4KB regular write 
intermixing was problematic.

>>>
>>> i.e. if we are going to do 16kB aligned atomic IO, then all the
>>> allocation and unwritten tracking needs to be done in 16kB aligned
>>> chunks, not 4kB. That means a 4KB write into an unwritten region or
>>> a hole actually needs to zero the rest of the 16KB range it sits
>>> within.
>>>
>>> The direct IO code can do this, but it needs extension of the
>>> unaligned IO serialisation in XFS (the alignment checks in
> >> xfs_file_dio_write()) and the sub-block zeroing in
>>> iomap_dio_bio_iter() (the need_zeroing padding has to span the fs
>>> allocation size, not the fsblock size) to do this safely.
>>>
>>> Regardless of how we do it, all IO concurrency on this file is shot
>>> if we have sub-rtextent sized IOs being done. That is true even with
>>> this patch set - XFS_BMAPI_ZERO is done whilst holding the
>>> XFS_ILOCK_EXCL, and so no other DIO can map extents whilst the
>>> zeroing is being done.
>>>
>>> IOWs, anything to do with sub-rtextent IO really has to be treated
>>> like sub-fsblock DIO - i.e. exclusive inode access until the
>>> sub-rtextent zeroing has been completed.
>>
> > I do understand that this is not perfect in that we may have mixed block
> > sizes being written, but I don't think that we should disallow it and
> > throw an error.
> 
> Ummmm, did you read what you quoted?
> 
> The above is an outline of the IO path modifications that will allow
> mixed IO sizes to be used with atomic writes without requiring the
> XFS_BMAPI_ZERO hack. It pushes the sub-atomic write alignment
> zeroing out to the existing DIO sub-block zeroing, hence ensuring
> that we only ever convert unwritten extents on max sized atomic
> write boundaries for atomic write enabled inodes.

ok, I get this idea. And, indeed, it does sound better than the 
XFS_BMAPI_ZERO proposal.

> 
> At no point have I said "no mixed writes".

For sure

> I've said no to the
> XFS_BMAPI_ZERO hack, but then I've explained the fundamental issue
> that it works around and given you a decent amount of detail on how
> to sanely implement mixed write support that will work (slowly)
> with those configurations and IO patterns.
> 
> So it's your choice - you can continue to believe I don't want mixed
> writes to work at all, or you can go back and try to understand the
> IO path changes I've suggested that will allow mixed atomic writes
> to work as well as they possibly can....
> 

Ack

Much appreciated,
John



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] fs: xfs: Support atomic write for statx
  2024-02-09  7:00   ` Ojaswin Mujoo
@ 2024-02-09 17:30     ` John Garry
  2024-02-12 11:48       ` Ojaswin Mujoo
  2024-02-12 12:05       ` Ojaswin Mujoo
  0 siblings, 2 replies; 68+ messages in thread
From: John Garry @ 2024-02-09 17:30 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: hch, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio


>> +void xfs_get_atomic_write_attr(
>> +	struct xfs_inode *ip,
>> +	unsigned int *unit_min,
>> +	unsigned int *unit_max)
>> +{
>> +	xfs_extlen_t		extsz = xfs_get_extsz(ip);
>> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
>> +	struct block_device	*bdev = target->bt_bdev;
>> +	unsigned int		awu_min, awu_max, align;
>> +	struct request_queue	*q = bdev->bd_queue;
>> +	struct xfs_mount	*mp = ip->i_mount;
>> +
>> +	/*
>> +	 * Convert to multiples of the BLOCKSIZE (as we support a minimum
>> +	 * atomic write unit of BLOCKSIZE).
>> +	 */
>> +	awu_min = queue_atomic_write_unit_min_bytes(q);
>> +	awu_max = queue_atomic_write_unit_max_bytes(q);
>> +
>> +	awu_min &= ~mp->m_blockmask;
>> +	awu_max &= ~mp->m_blockmask;
> 
> I don't understand why we try to round down the awu_max to blocks size
> here and not just have an explicit check of (awu_max < blocksize).
We have a later check for !awu_max:

if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
...

So what we are doing is ensuring that the awu_max which the device 
reports is at least FS block size. If it is not, then we cannot support 
atomic writes.

Indeed, my NVMe drive reports awu_max  = 4K. So for XFS configured for 
64K block size, we will say that we don't support atomic writes.

> 
> I think the issue with changing the awu_max is that we are using awu_max
> to also indirectly reflect the alignment so as to ensure we don't cross
> atomic boundaries set by the hw (eg we check uint_max % atomic alignment
> == 0 in scsi). So once we change the awu_max, there's a chance that even
> if an atomic write aligns to the new awu_max it still doesn't have the
> right alignment and fails.

All these values should be powers-of-2, so rounding down should not 
affect whether we would cross an atomic write boundary.

> 
> It works right now since eveything is power of 2 but it should cause
> issues incase we decide to remove that limitation. 

Sure, but that is a fundamental principle of this atomic write support. 
Not having the power-of-2 requirement for atomic writes would affect 
many things.

> Anyways, I think
> this implicit behavior of things working since eveything is a power of 2
> should atleast be documented in a comment, so these things are
> immediately clear.
> 
>> +
>> +	align = XFS_FSB_TO_B(mp, extsz);
>> +
>> +	if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
>> +	    !is_power_of_2(align)) {
> 
> Correct me if I'm wrong but here as well, the is_power_of_2(align) is
> esentially checking if the align % uinit_max == 0 (or vice versa if
> unit_max is greater) 

yes

>so that an allocation of extsize will always align
> nicely as needed by the device.
>

I'm trying to keep things simple now.

In theory we could allow, say, align == 12 FSB, and then we could say 
awu_max = 4.

The same goes for atomic write boundary in NVMe. Currently we say that 
it needs to be a power-of-2. However, it really just needs to be a 
multiple of awu_max. So if some HW did report a !power-of-2 atomic write 
boundary, we could reduce the reported awu_max until it fits the 
power-of-2 rule and also divides cleanly into the atomic write boundary. 
But that is just not what HW will report (I expect). We live in a 
power-of-2 data granularity world.

> So maybe we should use the % expression explicitly so that the intention
> is immediately clear.

As mentioned, I wanted to keep it simple. In addition, it's a bit of a 
mess for the FS block allocator to work with odd sizes, like 12. And it 
does not suit RAID stripe alignment, which works in powers-of-2.

> 
>> +		*unit_min = 0;
>> +		*unit_max = 0;
>> +	} else {
>> +		if (awu_min)
>> +			*unit_min = min(awu_min, align);
> 
> How will the min() here work? If awu_min is the minumum set by the
> device, how can statx be allowed to advertise something smaller than
> that?
The idea is that if the awu_min reported by the device is less than the 
FS block size, then we report awu_min = FS block size. We already know 
that awu_max >= FS block size, since we got this far, so saying that 
awu_min = FS block size is ok.

Otherwise it is the minimum of alignment and awu_min. I suppose that 
does not make much sense, and we should just always require awu_min = FS 
block size.

> 
> If I understand correctly, right now the way we set awu_min in scsi and
> nvme, the follwoing should usually be true for a sane device:
> 
>   awu_min <= block size of fs <= align
> 
>   so the min() anyways becomes redundant, but if we do assume that there
>   might be some weird devices with awu_min absurdly large (SCSI with
>   high atomic granularity) we still can't actually advertise a min
>   smaller than that of the device, or am I missing something here?

As above, I might just ensure that we can do awu_min = FS block size and 
not deal with crazy devices.

Thanks,
John


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] fs: xfs: Support atomic write for statx
  2024-02-09 17:30     ` John Garry
@ 2024-02-12 11:48       ` Ojaswin Mujoo
  2024-02-12 12:05       ` Ojaswin Mujoo
  1 sibling, 0 replies; 68+ messages in thread
From: Ojaswin Mujoo @ 2024-02-12 11:48 UTC (permalink / raw)
  To: John Garry
  Cc: hch, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio

On Fri, Feb 09, 2024 at 05:30:50PM +0000, John Garry wrote:
> 
> > > +void xfs_get_atomic_write_attr(
> > > +	struct xfs_inode *ip,
> > > +	unsigned int *unit_min,
> > > +	unsigned int *unit_max)
> > > +{
> > > +	xfs_extlen_t		extsz = xfs_get_extsz(ip);
> > > +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> > > +	struct block_device	*bdev = target->bt_bdev;
> > > +	unsigned int		awu_min, awu_max, align;
> > > +	struct request_queue	*q = bdev->bd_queue;
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +
> > > +	/*
> > > +	 * Convert to multiples of the BLOCKSIZE (as we support a minimum
> > > +	 * atomic write unit of BLOCKSIZE).
> > > +	 */
> > > +	awu_min = queue_atomic_write_unit_min_bytes(q);
> > > +	awu_max = queue_atomic_write_unit_max_bytes(q);
> > > +
> > > +	awu_min &= ~mp->m_blockmask;
> > > +	awu_max &= ~mp->m_blockmask;
> > 
> > I don't understand why we try to round down the awu_max to blocks size
> > here and not just have an explicit check of (awu_max < blocksize).
> We have later check for !awu_max:
> 
> if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
> ...
> 
> So what we are doing is ensuring that the awu_max which the device reports
> is at least FS block size. If it is not, then we cannot support atomic
> writes.
> 
> Indeed, my NVMe drive reports awu_max  = 4K. So for XFS configured for 64K
> block size, we will say that we don't support atomic writes.

> 
> > 
> > I think the issue with changing the awu_max is that we are using awu_max
> > to also indirectly reflect the alignment so as to ensure we don't cross
> > atomic boundaries set by the hw (eg we check uint_max % atomic alignment
> > == 0 in scsi). So once we change the awu_max, there's a chance that even
> > if an atomic write aligns to the new awu_max it still doesn't have the
> > right alignment and fails.
> 
> All these values should be powers-of-2, so rounding down should not affect
> whether we would cross an atomic write boundary.


> 
> > 
> > It works right now since everything is a power of 2, but it could cause
> > issues in case we decide to remove that limitation.
> 
> Sure, but that is a fundamental principle of this atomic write support.
> Not having the power-of-2 requirement for atomic writes would affect many
> things.
> 

Correct, so the only reason for the rounding down is to ensure that
awu_max is not smaller than our block size, but the way these checks are
written right now doesn't make that obvious. It also raises questions,
like why we are changing these min and max values, and especially why
we are rounding *down* the min.

I think we should just have explicit (unit_[min/max] < bs) checks
without trying to round down the values.

> > Anyway, I think
> > this implicit behavior of things working since everything is a power of 2
> > should at least be documented in a comment, so these things are
> > immediately clear.
> > 
> > > +
> > > +	align = XFS_FSB_TO_B(mp, extsz);
> > > +
> > > +	if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
> > > +	    !is_power_of_2(align)) {
> > 
> > Correct me if I'm wrong, but here as well, the is_power_of_2(align) is
> > essentially checking that align % unit_max == 0 (or vice versa if
> > unit_max is greater)
> 
> yes
> 
> > so that an allocation of extsize will always align
> > nicely as needed by the device.
> > 
> 
> I'm trying to keep things simple now.
> 
> In theory we could allow, say, align == 12 FSB, and then we could say
> awu_max = 4.
> 
> The same goes for atomic write boundary in NVMe. Currently we say that it
> needs to be a power-of-2. However, it really just needs to be a multiple of
> awu_max. So if some HW did report a !power-of-2 atomic write boundary, we
> could reduce the reported awu_max until it fits the power-of-2 rule and
> also divides cleanly into the atomic write boundary. But that is just not
> what HW will report (I expect). We live in a power-of-2 data granularity world.

True, we ideally won't expect the hw to report that, but why not just
make the check (awu_max % align) explicit so that:

1. the intention is immediately clear
2. it'll directly work for a non-power-of-2 boundary in future without
change.

Just my 2 cents.
> 
> > So maybe we should use the % expression explicitly so that the intention
> > is immediately clear.
> 
> As mentioned, I wanted to keep it simple. In addition, it's a bit of a mess
> for the FS block allocator to work with odd sizes, like 12. And it does not
> suit RAID stripe alignment, which works in powers-of-2.
> 
> > 
> > > +		*unit_min = 0;
> > > +		*unit_max = 0;
> > > +	} else {
> > > +		if (awu_min)
> > > +			*unit_min = min(awu_min, align);
> > 
> > How will the min() here work? If awu_min is the minimum set by the
> > device, how can statx be allowed to advertise something smaller than
> > that?
> The idea is that if the awu_min reported by the device is less than
> the FS block size, then we report awu_min = FS block size. We already know
> that awu_max >= FS block size, since we got this far, so saying that awu_min
> = FS block size is ok.
> 
> Otherwise it is the minimum of the alignment and awu_min. I suppose that does
> not make much sense, and we should just always require awu_min = FS block
> size.

Yep, that also works. We should just set the minimum to the FS block
size, since the min() operator there is confusing.

Regards,
ojaswin

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] fs: xfs: Support atomic write for statx
  2024-02-09 17:30     ` John Garry
  2024-02-12 11:48       ` Ojaswin Mujoo
@ 2024-02-12 12:05       ` Ojaswin Mujoo
  1 sibling, 0 replies; 68+ messages in thread
From: Ojaswin Mujoo @ 2024-02-12 12:05 UTC (permalink / raw)
  To: John Garry
  Cc: hch, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio

On Fri, Feb 09, 2024 at 05:30:50PM +0000, John Garry wrote:
> The same goes for atomic write boundary in NVMe. Currently we say that it
> needs to be a power-of-2. However, it really just needs to be a multiple of
> awu_max. So if some HW did report a !power-of-2 atomic write boundary, we

Hey John, sorry for the double reply, but can you point out where this
requirement is stated in the spec?

For example, in the NVMe 2.1.4.3 Command Set spec I can see that:

> The boundary size shall be greater than or equal to the corresponding
> atomic write size

However, I'm not able to find the multiple-of-unit-max requirement in the
spec. Maybe I'm missing something?

Regards,
ojaswin

> could reduce the reported awu_max until it fits the power-of-2 rule and is
> cleanly divisible into the atomic write boundary. But that is just not what HW
> will report (I expect). We live in a power-of-2 data granularity world.


* Re: [PATCH 0/6] block atomic writes for XFS
  2024-02-09  9:22   ` John Garry
@ 2024-02-12 12:06     ` Ojaswin Mujoo
  0 siblings, 0 replies; 68+ messages in thread
From: Ojaswin Mujoo @ 2024-02-12 12:06 UTC (permalink / raw)
  To: John Garry
  Cc: hch, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio

On Fri, Feb 09, 2024 at 09:22:20AM +0000, John Garry wrote:
> On 09/02/2024 07:14, Ojaswin Mujoo wrote:
> > On Wed, Jan 24, 2024 at 02:26:39PM +0000, John Garry wrote:
> > > This series expands atomic write support to filesystems, specifically
> > > XFS. Since XFS rtvol supports extent alignment already, support will
> > > initially be added there. When XFS forcealign feature is merged, then we
> > > can similarly support atomic writes for a non-rtvol filesystem.
> > 
> > Hi John,
> > 
> > Along with rtvol check, we can also have a simple check to see if the
> > FS blocksize itself is big enough to satisfy the atomic requirements.
> > For eg on machines with 64K page, we can have say 16k or 64k block sizes
> > which should be able to provide required allocation behavior for atomic
> > writes. In such cases we don't need rtvol.
> > 
> I suppose we could do, but I would rather just concentrate on rtvol support
> initially, and there we do report atomic write unit min = FS block size
> (even if rt extsize is unset).

Okay, understood.

Thanks,
ojaswin

> 
> In addition, I plan to initially just support atomic write unit min = FS
> block size (for both rtvol and !rtvol).
> 
> Thanks,
> John


* Re: [PATCH 1/6] fs: iomap: Atomic write support
  2024-02-05 11:29     ` John Garry
@ 2024-02-13  6:55       ` Christoph Hellwig
  2024-02-13  8:20         ` John Garry
  2024-02-13 18:08       ` Darrick J. Wong
  1 sibling, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2024-02-13  6:55 UTC (permalink / raw)
  To: John Garry
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

On Mon, Feb 05, 2024 at 11:29:57AM +0000, John Garry wrote:
>>
>> Also, what's the meaning of REQ_OP_READ | REQ_ATOMIC? 
>
> REQ_ATOMIC will be ignored for REQ_OP_READ. I'm following the same policy 
> as something like RWF_SYNC for a read.

We've been rather sloppy with these flags in the past, which isn't
a good thing.  Let's add proper checking for new interfaces.



* Re: [PATCH 2/6] fs: Add FS_XFLAG_ATOMICWRITES flag
  2024-02-05 12:58     ` John Garry
@ 2024-02-13  6:56       ` Christoph Hellwig
  2024-02-13 17:08       ` Darrick J. Wong
  1 sibling, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2024-02-13  6:56 UTC (permalink / raw)
  To: John Garry
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

On Mon, Feb 05, 2024 at 12:58:30PM +0000, John Garry wrote:
> To me that sounds like "try to use DAX for IO, and, if not possible, fall 
> back on some other method" - is that reality of what that flag does?

Yes.  Of course for a fallback on XFS we need Darrick's swapext log
item.  Which would be good to have..



* Re: [PATCH 0/6] block atomic writes for XFS
  2024-01-24 14:26 [PATCH 0/6] block atomic writes for XFS John Garry
                   ` (6 preceding siblings ...)
  2024-02-09  7:14 ` [PATCH 0/6] block atomic writes for XFS Ojaswin Mujoo
@ 2024-02-13  7:22 ` Christoph Hellwig
  2024-02-13 17:55   ` Darrick J. Wong
  2024-02-13 23:50   ` Dave Chinner
  2024-02-13  7:45 ` Ritesh Harjani
  8 siblings, 2 replies; 68+ messages in thread
From: Christoph Hellwig @ 2024-02-13  7:22 UTC (permalink / raw)
  To: John Garry
  Cc: hch, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

From reading the series and the discussions with Darrick and Dave,
I'm coming more and more back to my initial position that tying this
user-visible feature to hardware limits is wrong and will just keep
on creating ever more pain points in the future.

Based on that, I suspect that doing proper software-only atomic writes
using the swapext log item and a selective always-COW mode, and making
that work, should be the first step.  We can then avoid that overhead for
properly aligned writes if the hardware supports it.  For your Oracle
DB loads you'll set the alignment hints and maybe even check with
fiemap that everything is fine and will get the offload, but we also
provide a nice and useful API for less performance-critical applications
that don't have to care about all these details.


* Re: [PATCH 0/6] block atomic writes for XFS
  2024-01-24 14:26 [PATCH 0/6] block atomic writes for XFS John Garry
                   ` (7 preceding siblings ...)
  2024-02-13  7:22 ` Christoph Hellwig
@ 2024-02-13  7:45 ` Ritesh Harjani
  2024-02-13  8:41   ` John Garry
  8 siblings, 1 reply; 68+ messages in thread
From: Ritesh Harjani @ 2024-02-13  7:45 UTC (permalink / raw)
  To: John Garry, hch, djwong, viro, brauner, dchinner, jack, chandan.babu
  Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin, John Garry

John Garry <john.g.garry@oracle.com> writes:

> This series expands atomic write support to filesystems, specifically
> XFS. Since XFS rtvol supports extent alignment already, support will
> initially be added there. When XFS forcealign feature is merged, then we
> can similarly support atomic writes for a non-rtvol filesystem.
>
> Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.
>
> For XFS rtvol, support can be enabled through xfs_io command:
> $xfs_io -c "chattr +W" filename
> $xfs_io -c "lsattr -v" filename
> [realtime, atomic-writes] filename

Hi John,

I first took your block atomic write patch series [1] and then applied this
series on top. I also compiled xfsprogs with chattr atomic write support from [2]. 

[1]: https://lore.kernel.org/linux-nvme/20240124113841.31824-1-john.g.garry@oracle.com/T/#m4ad28b480a8e12eb51467e17208d98ca50041ff2
[2]: https://github.com/johnpgarry/xfsprogs-dev/commits/atomicwrites/


But while setting the +W attr, I see an "Invalid argument" error. Is there
anything I need to do first?

root@ubuntu:~# /root/xt/xfsprogs-dev/io/xfs_io -c "chattr +W" /mnt1/test/f1
xfs_io: cannot set flags on /mnt1/test/f1: Invalid argument

root@ubuntu:~# /root/xt/xfsprogs-dev/io/xfs_io -c "lsattr -v" /mnt1/test/f1
[realtime] /mnt1/test/f1

>
> The FS needs to be formatted with a specific extent alignment size, like:
> mkfs.xfs -r rtdev=/dev/sdb,extsize=16K -d rtinherit=1 /dev/sda
>
> This enables 16K atomic write support. There are no checks whether the
> underlying HW actually supports that for enabling atomic writes with
> xfs_io, though, so statx needs to be issued for a file to know atomic
> write limits.
>

Here you say that xfs_io does not check whether the underlying HW actually
supports atomic writes. So I am assuming xfs_io -c "chattr +W"
should have just worked?

Sorry, I am still in the process of going over the patches, but I thought
I would ask this first anyway.


-ritesh


* Re: [PATCH 1/6] fs: iomap: Atomic write support
  2024-02-13  6:55       ` Christoph Hellwig
@ 2024-02-13  8:20         ` John Garry
  2024-02-15 11:08           ` John Garry
  0 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-02-13  8:20 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 13/02/2024 06:55, Christoph Hellwig wrote:
> On Mon, Feb 05, 2024 at 11:29:57AM +0000, John Garry wrote:
>>> Also, what's the meaning of REQ_OP_READ | REQ_ATOMIC?
>> REQ_ATOMIC will be ignored for REQ_OP_READ. I'm following the same policy
>> as something like RWF_SYNC for a read.
> We've been rather sloppy with these flags in the past, which isn't
> a good thing.  Let's add proper checking for new interfaces.

ok, fine.

Thanks,
John


* Re: [PATCH 0/6] block atomic writes for XFS
  2024-02-13  7:45 ` Ritesh Harjani
@ 2024-02-13  8:41   ` John Garry
  2024-02-13  9:10     ` Ritesh Harjani
  2024-02-13 22:49     ` Dave Chinner
  0 siblings, 2 replies; 68+ messages in thread
From: John Garry @ 2024-02-13  8:41 UTC (permalink / raw)
  To: Ritesh Harjani (IBM),
	hch, djwong, viro, brauner, dchinner, jack, chandan.babu
  Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 13/02/2024 07:45, Ritesh Harjani (IBM) wrote:
> John Garry <john.g.garry@oracle.com> writes:
> 
>> This series expands atomic write support to filesystems, specifically
>> XFS. Since XFS rtvol supports extent alignment already, support will
>> initially be added there. When XFS forcealign feature is merged, then we
>> can similarly support atomic writes for a non-rtvol filesystem.
>>
>> Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.
>>
>> For XFS rtvol, support can be enabled through xfs_io command:
>> $xfs_io -c "chattr +W" filename
>> $xfs_io -c "lsattr -v" filename
>> [realtime, atomic-writes] filename
> 
> Hi John,
> 
> I first took your block atomic write patch series [1] and then applied this
> series on top. I also compiled xfsprogs with chattr atomic write support from [2].
> 
> [1]: https://lore.kernel.org/linux-nvme/20240124113841.31824-1-john.g.garry@oracle.com/T/#m4ad28b480a8e12eb51467e17208d98ca50041ff2
> [2]: https://github.com/johnpgarry/xfsprogs-dev/commits/atomicwrites/
> 
> 
> But while setting +W attr, I see an Invalid argument error. Is there
> anything I need to do first?
> 
> root@ubuntu:~# /root/xt/xfsprogs-dev/io/xfs_io -c "chattr +W" /mnt1/test/f1
> xfs_io: cannot set flags on /mnt1/test/f1: Invalid argument
> 
> root@ubuntu:~# /root/xt/xfsprogs-dev/io/xfs_io -c "lsattr -v" /mnt1/test/f1
> [realtime] /mnt1/test/f1

Can you provide your full steps?

I'm doing something like:

# /mkfs.xfs -r rtdev=/dev/sdb,extsize=16k -d rtinherit=1 /dev/sda
meta-data=/dev/sda               isize=512    agcount=4, agsize=22400 blks
          =                       sectsz=512   attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=0
          =                       reflink=0    bigtime=1 inobtcount=1 
nrext64=0
data     =                       bsize=4096   blocks=89600, imaxpct=25
          =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=16384, version=2
          =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/sdb               extsz=16384  blocks=89600, rtextents=22400
# mount /dev/sda mnt -o rtdev=/dev/sdb
[    5.553482] XFS (sda): EXPERIMENTAL atomic writes feature in use. Use 
at your own risk!
[    5.556752] XFS (sda): Mounting V5 Filesystem 
6e0820e6-4d44-4c3e-89f2-21b4d4480f88
[    5.602315] XFS (sda): Ending clean mount
#
# touch mnt/file
# /xfs_io -c "lsattr -v" mnt/file
[realtime] mnt/file
#
#
# /xfs_io -c "chattr +W" mnt/file
# /xfs_io -c "lsattr -v" mnt/file
[realtime, atomic-writes] mnt/file

And then we can check limits:

# /test-statx -a /root/mnt/file
dump_statx results=9fff
   Size: 0               Blocks: 0          IO Block: 16384   regular file
Device: 08:00           Inode: 131         Links: 1
Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
Access: 2024-02-13 08:31:51.962900974+0000
Modify: 2024-02-13 08:31:51.962900974+0000
Change: 2024-02-13 08:31:51.969900974+0000
  Birth: 2024-02-13 08:31:51.962900974+0000
stx_attributes_mask=0x603070
         STATX_ATTR_WRITE_ATOMIC set
         unit min: 4096
         unit max: 16384
         segments max: 1
Attributes: 0000000000400000 (........ ........ ........ ........ 
........ .?-..... ..--.... .---....)
#
#

Does xfs_io have a statx function? If so, I can add support for atomic 
writes for statx there. In the meantime, that test-statx code is also on 
my branch, and can be run on the block device file (to sanity check that 
the rtvol device supports atomic writes).

Thanks,
John


* Re: [PATCH 0/6] block atomic writes for XFS
  2024-02-13  8:41   ` John Garry
@ 2024-02-13  9:10     ` Ritesh Harjani
  2024-02-13 22:49     ` Dave Chinner
  1 sibling, 0 replies; 68+ messages in thread
From: Ritesh Harjani @ 2024-02-13  9:10 UTC (permalink / raw)
  To: John Garry, hch, djwong, viro, brauner, dchinner, jack, chandan.babu
  Cc: martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

John Garry <john.g.garry@oracle.com> writes:

> On 13/02/2024 07:45, Ritesh Harjani (IBM) wrote:
>> John Garry <john.g.garry@oracle.com> writes:
>> 
>>> This series expands atomic write support to filesystems, specifically
>>> XFS. Since XFS rtvol supports extent alignment already, support will
>>> initially be added there. When XFS forcealign feature is merged, then we
>>> can similarly support atomic writes for a non-rtvol filesystem.
>>>
>>> Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.
>>>
>>> For XFS rtvol, support can be enabled through xfs_io command:
>>> $xfs_io -c "chattr +W" filename
>>> $xfs_io -c "lsattr -v" filename
>>> [realtime, atomic-writes] filename
>> 
>> Hi John,
>> 
>> I first took your block atomic write patch series [1] and then applied this
>> series on top. I also compiled xfsprogs with chattr atomic write support from [2].
>> 
>> [1]: https://lore.kernel.org/linux-nvme/20240124113841.31824-1-john.g.garry@oracle.com/T/#m4ad28b480a8e12eb51467e17208d98ca50041ff2
>> [2]: https://github.com/johnpgarry/xfsprogs-dev/commits/atomicwrites/
>> 
>> 
>> But while setting +W attr, I see an Invalid argument error. Is there
>> anything I need to do first?
>> 
>> root@ubuntu:~# /root/xt/xfsprogs-dev/io/xfs_io -c "chattr +W" /mnt1/test/f1
>> xfs_io: cannot set flags on /mnt1/test/f1: Invalid argument
>> 
>> root@ubuntu:~# /root/xt/xfsprogs-dev/io/xfs_io -c "lsattr -v" /mnt1/test/f1
>> [realtime] /mnt1/test/f1
>
> Can you provide your full steps?
>
> I'm doing something like:
>
> # /mkfs.xfs -r rtdev=/dev/sdb,extsize=16k -d rtinherit=1 /dev/sda
> meta-data=/dev/sda               isize=512    agcount=4, agsize=22400 blks
>           =                       sectsz=512   attr=2, projid32bit=1
>           =                       crc=1        finobt=1, sparse=1, rmapbt=0
>           =                       reflink=0    bigtime=1 inobtcount=1 
> nrext64=0
> data     =                       bsize=4096   blocks=89600, imaxpct=25
>           =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=16384, version=2
>           =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =/dev/sdb               extsz=16384  blocks=89600, rtextents=22400
> # mount /dev/sda mnt -o rtdev=/dev/sdb
> [    5.553482] XFS (sda): EXPERIMENTAL atomic writes feature in use. Use 
> at your own risk!

My bad, I missed that your xfsprogs changes involve setting this
feature flag at mkfs time as well. I wasn't using the right
mkfs utility.


> [    5.556752] XFS (sda): Mounting V5 Filesystem 
> 6e0820e6-4d44-4c3e-89f2-21b4d4480f88
> [    5.602315] XFS (sda): Ending clean mount
> #
> # touch mnt/file
> # /xfs_io -c "lsattr -v" mnt/file
> [realtime] mnt/file
> #
> #
> # /xfs_io -c "chattr +W" mnt/file
> # /xfs_io -c "lsattr -v" mnt/file
> [realtime, atomic-writes] mnt/file
>

Yup, this seems to work fine. Thanks!

> And then we can check limits:
>
> # /test-statx -a /root/mnt/file
> dump_statx results=9fff
>    Size: 0               Blocks: 0          IO Block: 16384   regular file
> Device: 08:00           Inode: 131         Links: 1
> Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
> Access: 2024-02-13 08:31:51.962900974+0000
> Modify: 2024-02-13 08:31:51.962900974+0000
> Change: 2024-02-13 08:31:51.969900974+0000
>   Birth: 2024-02-13 08:31:51.962900974+0000
> stx_attributes_mask=0x603070
>          STATX_ATTR_WRITE_ATOMIC set
>          unit min: 4096
>          unit max: 16384
>          segments max: 1
> Attributes: 0000000000400000 (........ ........ ........ ........ 
> ........ .?-..... ..--.... .---....)
> #
> #
>
> Does xfs_io have a statx function? If so, I can add support for atomic 
> writes for statx there. In the meantime, that test-statx code is also on 
> my branch, and can be run on the block device file (to sanity check that 
> the rtvol device supports atomic writes).
>
> Thanks,
> John


* Re: [PATCH 2/6] fs: Add FS_XFLAG_ATOMICWRITES flag
  2024-02-05 12:58     ` John Garry
  2024-02-13  6:56       ` Christoph Hellwig
@ 2024-02-13 17:08       ` Darrick J. Wong
  1 sibling, 0 replies; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-13 17:08 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Mon, Feb 05, 2024 at 12:58:30PM +0000, John Garry wrote:
> On 02/02/2024 17:57, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2024 at 02:26:41PM +0000, John Garry wrote:
> > > Add a flag indicating that a regular file is enabled for atomic writes.
> > 
> > This is a file attribute that mirrors an ondisk inode flag.  Actual
> > support for untorn file writes (for now) depends on both the iflag and
> > the underlying storage devices, which we can only really check at statx
> > and pwrite time.  This is the same story as FS_XFLAG_DAX, which signals
> > to the fs that we should try to enable the fsdax IO path on the file
> > (instead of the regular page cache), but applications have to query
> > STAT_ATTR_DAX to find out if they really got that IO path.
> 
> To be clear, are you suggesting to add this info to the commit message?

That, and an S_ATOMICW flag for the inode that triggers the proposed
STAT_ATTR_ATOMICWRITES flag.

> > "try to enable atomic writes", perhaps? >
> > (and the comment for FS_XFLAG_DAX ought to read "try to use DAX for IO")
> 
> To me that sounds like "try to use DAX for IO, and, if not possible, fall
> back on some other method" - is that reality of what that flag does?

As hch said, yes.

--D

> Thanks,
> John
> 
> > 
> > --D
> > 
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   include/uapi/linux/fs.h | 1 +
> > >   1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > index a0975ae81e64..b5b4e1db9576 100644
> > > --- a/include/uapi/linux/fs.h
> > > +++ b/include/uapi/linux/fs.h
> > > @@ -140,6 +140,7 @@ struct fsxattr {
> > >   #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
> > >   #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
> > >   #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
> > > +#define FS_XFLAG_ATOMICWRITES	0x00020000	/* atomic writes enabled */
> > >   #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
> > >   /* the read-only stuff doesn't really belong here, but any other place is
> > > -- 
> > > 2.31.1
> > > 
> > > 
> 
> 


* Re: [PATCH 3/6] fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol
  2024-02-05 12:51     ` John Garry
@ 2024-02-13 17:22       ` Darrick J. Wong
  2024-02-14 12:19         ` John Garry
  0 siblings, 1 reply; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-13 17:22 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Mon, Feb 05, 2024 at 12:51:07PM +0000, John Garry wrote:
> On 02/02/2024 17:52, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2024 at 02:26:42PM +0000, John Garry wrote:
> > > Add initial support for FS_XFLAG_ATOMICWRITES in rtvol.
> > > 
> > > Current kernel support for atomic writes is based on HW support (for atomic
> > > writes). As such, it is required to ensure extent alignment with
> > > atomic_write_unit_max so that an atomic write can result in a single
> > > HW-compliant IO operation.
> > > 
> > > rtvol already guarantees extent alignment, so initially add support there.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/libxfs/xfs_format.h |  8 ++++++--
> > >   fs/xfs/libxfs/xfs_sb.c     |  2 ++
> > >   fs/xfs/xfs_inode.c         | 22 ++++++++++++++++++++++
> > >   fs/xfs/xfs_inode.h         |  7 +++++++
> > >   fs/xfs/xfs_ioctl.c         | 19 +++++++++++++++++--
> > >   fs/xfs/xfs_mount.h         |  2 ++
> > >   fs/xfs/xfs_super.c         |  4 ++++
> > >   7 files changed, 60 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 382ab1e71c0b..79fb0d4adeda 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -353,11 +353,13 @@ xfs_sb_has_compat_feature(
> > >   #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
> > >   #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
> > >   #define XFS_SB_FEAT_RO_COMPAT_INOBTCNT (1 << 3)		/* inobt block counts */
> > > +#define XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES (1 << 29)	/* aligned file data extents */
> > 
> > I thought FORCEALIGN was going to signal aligned file data extent
> > allocations being mandatory?
> 
> Right, I'll fix that comment
> 
> > 
> > This flag (AFAICT) simply marks the inode as something that gets
> > FMODE_CAN_ATOMIC_WRITES, right?
> 
> Correct
> 
> > 
> > >   #define XFS_SB_FEAT_RO_COMPAT_ALL \
> > >   		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> > >   		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> > >   		 XFS_SB_FEAT_RO_COMPAT_REFLINK| \
> > > -		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
> > > +		 XFS_SB_FEAT_RO_COMPAT_INOBTCNT | \
> > > +		 XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> > >   #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
> > >   static inline bool
> > >   xfs_sb_has_ro_compat_feature(
> > > @@ -1085,16 +1087,18 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> > >   #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
> > >   #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
> > >   #define XFS_DIFLAG2_NREXT64_BIT 4	/* large extent counters */
> > > +#define XFS_DIFLAG2_ATOMICWRITES_BIT 6
> > 
> > Needs a comment here ("files flagged for atomic writes").
> 
> ok
> 
> > Also not sure
> > why you skipped bit 5, though I'm guessing it's because the forcealign
> > series is/was using it?
> 
> Right, I'll fix that
> 
> > 
> > >   #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
> > >   #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
> > >   #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
> > >   #define XFS_DIFLAG2_BIGTIME	(1 << XFS_DIFLAG2_BIGTIME_BIT)
> > >   #define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
> > > +#define XFS_DIFLAG2_ATOMICWRITES	(1 << XFS_DIFLAG2_ATOMICWRITES_BIT)
> > >   #define XFS_DIFLAG2_ANY \
> > >   	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> > > -	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64)
> > > +	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_NREXT64 | XFS_DIFLAG2_ATOMICWRITES)
> > >   static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
> > >   {
> > > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > > index 4a9e8588f4c9..28a98130a56d 100644
> > > --- a/fs/xfs/libxfs/xfs_sb.c
> > > +++ b/fs/xfs/libxfs/xfs_sb.c
> > > @@ -163,6 +163,8 @@ xfs_sb_version_to_features(
> > >   		features |= XFS_FEAT_REFLINK;
> > >   	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_INOBTCNT)
> > >   		features |= XFS_FEAT_INOBTCNT;
> > > +	if (sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES)
> > > +		features |= XFS_FEAT_ATOMICWRITES;
> > >   	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_FTYPE)
> > >   		features |= XFS_FEAT_FTYPE;
> > >   	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_SPINODES)
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 1fd94958aa97..0b0f525fd043 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -65,6 +65,26 @@ xfs_get_extsz_hint(
> > >   	return 0;
> > >   }
> > > +/*
> > > + * helper function to extract extent size
> > 
> > How does that differ from xfs_get_extsz_hint?
> 
> The idea of this function is to return the guaranteed extent alignment, and
> not just the hint
> 
> > 
> > > + */
> > > +xfs_extlen_t
> > > +xfs_get_extsz(
> > > +	struct xfs_inode	*ip)
> > > +{
> > > +	/*
> > > +	 * No point in aligning allocations if we need to COW to actually
> > > +	 * write to them.
> > 
> > What does alwayscow have to do with untorn writes?
> 
> Nothing at the moment, so I'll remove this.
> 
> > 
> > > +	 */
> > > +	if (xfs_is_always_cow_inode(ip))
> > > +		return 0;
> > > +
> > > +	if (XFS_IS_REALTIME_INODE(ip))
> > > +		return ip->i_mount->m_sb.sb_rextsize;
> > > +
> > > +	return 1;
> > > +}
> > 
> > Does this function exist to return the allocation unit for a given file?
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=b8ddcef3df8da02ed2c4aacbed1d811e60372006
> > 
> 
> Yes, something like xfs_inode_alloc_unitsize() there.
> 
> What's the upstream status of that change? I see it mentioned in linux-xfs
> lore and it seems to be part of a mega patchset.

It's stuck in review along with the other ~1400 patches that I've been
grumbling about in our staff meetings for years now.

> > > +
> > >   /*
> > >    * Helper function to extract CoW extent size hint from inode.
> > >    * Between the extent size hint and the CoW extent size hint, we
> > > @@ -629,6 +649,8 @@ xfs_ip2xflags(
> > >   			flags |= FS_XFLAG_DAX;
> > >   		if (ip->i_diflags2 & XFS_DIFLAG2_COWEXTSIZE)
> > >   			flags |= FS_XFLAG_COWEXTSIZE;
> > > +		if (ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES)
> > > +			flags |= FS_XFLAG_ATOMICWRITES;
> > >   	}
> > >   	if (xfs_inode_has_attr_fork(ip))
> > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > index 97f63bacd4c2..0e0a21d9d30f 100644
> > > --- a/fs/xfs/xfs_inode.h
> > > +++ b/fs/xfs/xfs_inode.h
> > > @@ -305,6 +305,11 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip)
> > >   	return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
> > >   }
> > > +static inline bool xfs_inode_atomicwrites(struct xfs_inode *ip)
> > 
> > I think this predicate wants a verb in its name, the rest of them have
> > "is" or "has" somewhere:
> > 
> > "xfs_inode_has_atomicwrites"
> 
> ok, fine.
> 
> Note that I was copying xfs_inode_forcealign() in terms of naming.

Yeah, I could rename that xfs_inode_forces_alignment() or something.

Or just leave the condensed version where the verb and object are
smashed together.

xfs_inode_has_forcealign?

Yeah.  I'll go with that.

> > 
> > > +{
> > > +	return ip->i_diflags2 & XFS_DIFLAG2_ATOMICWRITES;
> > > +}
> > > +
> > >   /*
> > >    * Return the buftarg used for data allocations on a given inode.
> > >    */
> > > @@ -542,7 +547,9 @@ void		xfs_lock_two_inodes(struct xfs_inode *ip0, uint ip0_mode,
> > >   				struct xfs_inode *ip1, uint ip1_mode);
> > >   xfs_extlen_t	xfs_get_extsz_hint(struct xfs_inode *ip);
> > > +xfs_extlen_t	xfs_get_extsz(struct xfs_inode *ip);
> > >   xfs_extlen_t	xfs_get_cowextsz_hint(struct xfs_inode *ip);
> > > +xfs_extlen_t	xfs_get_atomicwrites_size(struct xfs_inode *ip);
> > >   int xfs_init_new_inode(struct mnt_idmap *idmap, struct xfs_trans *tp,
> > >   		struct xfs_inode *pip, xfs_ino_t ino, umode_t mode,
> > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > > index f02b6e558af5..c380a3055be7 100644
> > > --- a/fs/xfs/xfs_ioctl.c
> > > +++ b/fs/xfs/xfs_ioctl.c
> > > @@ -1110,6 +1110,8 @@ xfs_flags2diflags2(
> > >   		di_flags2 |= XFS_DIFLAG2_DAX;
> > >   	if (xflags & FS_XFLAG_COWEXTSIZE)
> > >   		di_flags2 |= XFS_DIFLAG2_COWEXTSIZE;
> > > +	if (xflags & FS_XFLAG_ATOMICWRITES)
> > > +		di_flags2 |= XFS_DIFLAG2_ATOMICWRITES;
> > >   	return di_flags2;
> > >   }
> > > @@ -1122,10 +1124,12 @@ xfs_ioctl_setattr_xflags(
> > >   {
> > >   	struct xfs_mount	*mp = ip->i_mount;
> > >   	bool			rtflag = (fa->fsx_xflags & FS_XFLAG_REALTIME);
> > > +	bool			atomic_writes = fa->fsx_xflags & FS_XFLAG_ATOMICWRITES;
> > >   	uint64_t		i_flags2;
> > > -	if (rtflag != XFS_IS_REALTIME_INODE(ip)) {
> > > -		/* Can't change realtime flag if any extents are allocated. */
> > 
> > Please augment this comment ("Can't change realtime or atomicwrites
> > flags if any extents are allocated") instead of deleting it.
> 
> I wasn't supposed to delete that - will remedy.
> 
> >  This is
> > validation code, the requirements should be spelled out in English.
> > 
> > > +
> > > +	if (rtflag != XFS_IS_REALTIME_INODE(ip) ||
> > > +	    atomic_writes != xfs_inode_atomicwrites(ip)) {
> > >   		if (ip->i_df.if_nextents || ip->i_delayed_blks)
> > >   			return -EINVAL;
> > >   	}
> > > @@ -1146,6 +1150,17 @@ xfs_ioctl_setattr_xflags(
> > >   	if (i_flags2 && !xfs_has_v3inodes(mp))
> > >   		return -EINVAL;
> > > +	if (atomic_writes) {
> > > +		if (!xfs_has_atomicwrites(mp))
> > > +			return -EINVAL;
> > > +
> > > +		if (!rtflag)
> > > +			return -EINVAL;
> > > +
> > > +		if (!is_power_of_2(mp->m_sb.sb_rextsize))
> > > +			return -EINVAL;
> > 
> > Shouldn't we check sb_rextsize w.r.t. the actual block device queue
> > limits here?  I keep seeing similar validation logic open-coded
> > throughout both atomic write patchsets:
> > 
> > 	if (l < queue_atomic_write_unit_min_bytes())
> > 		/* fail */
> > 	if (l > queue_atomic_write_unit_max_bytes())
> > 		/* fail */
> > 	if (!is_power_of_2(l))
> > 		/* fail */
> > 	/* ok */
> > 
> > which really should be a common helper somewhere.
> 
> I think that it is a reasonable comment about duplication of the atomic
> write checks for the bdev and iomap write paths - I can try to improve that.
> 
> But the is_power_of_2(mp->m_sb.sb_rextsize) check is to ensure that the
> extent size is suitable for enabling atomic writes. I don't see a point in
> checking the bdev queue limits here.

Ok, skip the queue limits then.

> > 
> > 		/*
> > 		 * Don't set atomic write if the allocation unit doesn't
> > 		 * align with the device requirements.
> > 		 */
> > 		if (!bdev_validate_atomic_write(<target blockdev>,
> > 				XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize))
> > 			return -EINVAL;
> > 
> > Too bad we have to figure out the target blockdev and file allocation
> > unit based on the ioctl in-params and can't use the xfs_inode helpers
> > here.
> 
> I am not sure what bdev_validate_atomic_write() would even do. If
> sb_rextsize exceeded the bdev atomic write unit max, then we just cap
> reported atomic write unit max in statx to that which the bdev reports and
> vice-versa.
> 
> And didn't we previously have a concern that it is possible to change the
> geometry of the device?

The thing is, I don't want this logic:

	if (!is_power_of_2(mp->m_sb.sb_rextsize))
		/* fail */

to be open-coded inside xfs.  I'd rather have a standard bdev_* helper
that every filesystem can call, so we don't end up with more generic
code copy-pasted all over the codebase.

The awkward part (for me) is the naming, since filesystems usually don't
have to check with the block layer about their units of space allocation.

/*
 * Ensure that a file space allocation unit is congruent with the atomic
 * write unit capabilities of supported block devices.
 */
static inline bool bdev_validate_atomic_write_allocunit(unsigned au)
{
	return is_power_of_2(au);
}

	if (!bdev_validate_atomic_write_allocunit(mp->m_sb.sb_rextsize))
		return -EINVAL;

> If so, not much point in this check.

Yes, that is a disadvantage of me reading patchsets in reverse order. ;)

--D

> Thanks,
> John
> 
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] fs: xfs: Support atomic write for statx
  2024-02-05 13:10     ` John Garry
@ 2024-02-13 17:37       ` Darrick J. Wong
  2024-02-14 12:26         ` John Garry
  0 siblings, 1 reply; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-13 17:37 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Mon, Feb 05, 2024 at 01:10:54PM +0000, John Garry wrote:
> On 02/02/2024 18:05, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2024 at 02:26:43PM +0000, John Garry wrote:
> > > Support providing info on atomic write unit min and max for an inode.
> > > 
> > > For simplicity, currently we limit the min at the FS block size, but a
> > > lower limit could be supported in future.
> > > 
> > > The atomic write unit min and max is limited by the guaranteed extent
> > > alignment for the inode.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/xfs_iops.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> > >   fs/xfs/xfs_iops.h |  4 ++++
> > >   2 files changed, 49 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index a0d77f5f512e..0890d2f70f4d 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -546,6 +546,44 @@ xfs_stat_blksize(
> > >   	return PAGE_SIZE;
> > >   }
> > > +void xfs_get_atomic_write_attr(
> > 
> > static void?
> 
> We use this in the iomap and statx code
> 
> > 
> > > +	struct xfs_inode *ip,
> > > +	unsigned int *unit_min,
> > > +	unsigned int *unit_max)
> > 
> > Weird indenting here.
> 
> hmmm... I thought that this was the XFS style
> 
> Can you show how it should look?

The parameter declarations should line up with the local variables:

void
xfs_get_atomic_write_attr(
	struct xfs_inode	*ip,
	unsigned int		*unit_min,
	unsigned int		*unit_max)
{
	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
	struct block_device	*bdev = target->bt_bdev;
	struct request_queue	*q = bdev->bd_queue;
	struct xfs_mount	*mp = ip->i_mount;
	unsigned int		awu_min, awu_max, align;
	xfs_extlen_t		extsz = xfs_get_extsz(ip);

> > 
> > > +{
> > > +	xfs_extlen_t		extsz = xfs_get_extsz(ip);
> > > +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> > > +	struct block_device	*bdev = target->bt_bdev;
> > > +	unsigned int		awu_min, awu_max, align;
> > > +	struct request_queue	*q = bdev->bd_queue;
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +
> > > +	/*
> > > +	 * Convert to multiples of the BLOCKSIZE (as we support a minimum
> > > +	 * atomic write unit of BLOCKSIZE).
> > > +	 */
> > > +	awu_min = queue_atomic_write_unit_min_bytes(q);
> > > +	awu_max = queue_atomic_write_unit_max_bytes(q);
> > > +
> > > +	awu_min &= ~mp->m_blockmask;
> > 
> > Why do you round /down/ the awu_min value here?
> 
> This is just to ensure that we return *unit_min >= BLOCKSIZE
> 
> For example, if the bdev reports awu_min, max of 1K, 64K, we now have 0 and
> 64K. And below this gives us awu_min, max of 4k, 64k.
> 
> Maybe there is a more logical way of doing this.

	awu_min = roundup(queue_atomic_write_unit_min_bytes(q),
			  mp->m_sb.sb_blocksize);

?

> 
> > 
> > > +	awu_max &= ~mp->m_blockmask;
> > 
> > Actually -- since the atomic write units have to be powers of 2, why is
> > rounding needed here at all?
> 
> Sure, but the bdev can report a awu_min < BLOCKSIZE
> 
> > 
> > > +
> > > +	align = XFS_FSB_TO_B(mp, extsz);
> > > +
> > > +	if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
> > > +	    !is_power_of_2(align)) {
> > 
> > ...and if you take my suggestion to make a common helper to validate the
> > atomic write unit parameters, this can collapse into:
> > 
> > 	alloc_unit_bytes = xfs_inode_alloc_unitsize(ip);
> > 	if (!xfs_inode_has_atomicwrites(ip) ||
> > 	    !bdev_validate_atomic_write(bdev, alloc_unit_bytes)) {
> > 		/* not supported, return zeroes */
> > 		*unit_min = 0;
> > 		*unit_max = 0;
> > 		return;
> > 	}
> > 
> > 	*unit_min = max(alloc_unit_bytes, awu_min);
> > 	*unit_max = min(alloc_unit_bytes, awu_max);
> 
> Again, we need to ensure that *unit_min >= BLOCKSIZE

The file allocation unit and hence the return value of
xfs_inode_alloc_unitsize is always a multiple of sb_blocksize.

--D

> Thanks,
> John
> 
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-05 13:36     ` John Garry
  2024-02-06  1:15       ` Dave Chinner
@ 2024-02-13 17:50       ` Darrick J. Wong
  2024-02-14 12:13         ` John Garry
  1 sibling, 1 reply; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-13 17:50 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Mon, Feb 05, 2024 at 01:36:03PM +0000, John Garry wrote:
> On 02/02/2024 18:47, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2024 at 02:26:44PM +0000, John Garry wrote:
> > > Ensure that when creating a mapping that we adhere to all the atomic
> > > write rules.
> > > 
> > > We check that the mapping covers the complete range of the write to ensure
> > > that we'll be just creating a single mapping.
> > > 
> > > Currently minimum granularity is the FS block size, but it should be
> > > possible to support lower in future.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > > I am setting this as an RFC as I am not sure on the change in
> > > xfs_iomap_write_direct() - it gives the desired result AFAICS.
> > > 
> > >   fs/xfs/xfs_iomap.c | 41 +++++++++++++++++++++++++++++++++++++++++
> > >   1 file changed, 41 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index 18c8f168b153..758dc1c90a42 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -289,6 +289,9 @@ xfs_iomap_write_direct(
> > >   		}
> > >   	}
> > > +	if (xfs_inode_atomicwrites(ip))
> > > +		bmapi_flags = XFS_BMAPI_ZERO;
> > 
> > Why do we want to write zeroes to the disk if we're allocating space
> > even if we're not sending an atomic write?
> > 
> > (This might want an explanation for why we're doing this at all -- it's
> > to avoid unwritten extent conversion, which defeats hardware untorn
> > writes.)
> 
> It's to handle the scenario where we have a partially written extent, and
> then try to issue an atomic write which covers the complete extent. In this
> scenario, the iomap code will issue 2x IOs, which is unacceptable. So we
> ensure that the extent is completely written whenever we allocate it. At
> least that is my idea.
> 
> > 
> > I think we should support IOCB_ATOMIC when the mapping is unwritten --
> > the data will land on disk in an untorn fashion, the unwritten extent
> > conversion on IO completion is itself atomic, and callers still have to
> > set O_DSYNC to persist anything.
> 
> But does this work for the scenario above?
> 
> > Then we can avoid the cost of
> > BMAPI_ZERO, because double-writes aren't free.
> 
> About double-writes not being free, I thought that this was acceptable to
> just have this zeroing when initially allocating the extent, as it should
> not add too much overhead in practice, i.e. it's a one-off.
> 
> > 
> > > +
> > >   	error = xfs_trans_alloc_inode(ip, &M_RES(mp)->tr_write, dblocks,
> > >   			rblocks, force, &tp);
> > >   	if (error)
> > > @@ -812,6 +815,44 @@ xfs_direct_write_iomap_begin(
> > >   	if (error)
> > >   		goto out_unlock;
> > > +	if (flags & IOMAP_ATOMIC) {
> > > +		xfs_filblks_t unit_min_fsb, unit_max_fsb;
> > > +		unsigned int unit_min, unit_max;
> > > +
> > > +		xfs_get_atomic_write_attr(ip, &unit_min, &unit_max);
> > > +		unit_min_fsb = XFS_B_TO_FSBT(mp, unit_min);
> > > +		unit_max_fsb = XFS_B_TO_FSBT(mp, unit_max);
> > > +
> > > +		if (!imap_spans_range(&imap, offset_fsb, end_fsb)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		if ((offset & mp->m_blockmask) ||
> > > +		    (length & mp->m_blockmask)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		if (imap.br_blockcount == unit_min_fsb ||
> > > +		    imap.br_blockcount == unit_max_fsb) {
> > > +			/* ok if exactly min or max */
> > > +		} else if (imap.br_blockcount < unit_min_fsb ||
> > > +			   imap.br_blockcount > unit_max_fsb) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		} else if (!is_power_of_2(imap.br_blockcount)) {
> > > +			error = -EINVAL;
> > > +			goto out_unlock;
> > > +		}
> > > +
> > > +		if (imap.br_startoff &&
> > > +		    imap.br_startoff & (imap.br_blockcount - 1)) {
> > 
> > Not sure why we care about the file position, it's br_startblock that
> > gets passed into the bio, not br_startoff.
> 
> We just want to ensure that the length of the write is valid w.r.t. to the
> offset within the extent, and br_startoff would be the offset within the
> aligned extent.

Yes, I understand what br_startoff is, but this doesn't help me
understand why this code is necessary.  Let's say you have a device that
supports untorn writes of 16k in length provided the LBA of the write
command is also aligned to 16k, and the fs has 4k blocks.

Userspace issues an 16k untorn write at offset 13k in the file, and gets
this mapping:

[startoff: 13k, startblock: 16k, blockcount: 16k]

Why should this IO be rejected?  The physical space extent satisfies the
alignment requirements of the underlying device, and the logical file
space extent does not need aligning at all.

> > I'm also still not convinced that any of this validation is useful here.
> > The block device stack underneath the filesystem can change at any time
> > without any particular notice to the fs, so the only way to find out if
> > the proposed IO would meet the alignment constraints is to submit_bio
> > and see what happens.
> 
> I am not sure what submit_bio() would do differently. If the block device is
> changing underneath the block layer, then there is where these things need
> to be checked.

Agreed.

> > 
> > (The "one bio per untorn write request" thing in the direct-io.c patch
> > sound sane to me though.)
> > 
> 
> ok
> 
> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/6] block atomic writes for XFS
  2024-02-13  7:22 ` Christoph Hellwig
@ 2024-02-13 17:55   ` Darrick J. Wong
  2024-02-14  7:45     ` Christoph Hellwig
  2024-02-13 23:50   ` Dave Chinner
  1 sibling, 1 reply; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-13 17:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: John Garry, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Tue, Feb 13, 2024 at 08:22:37AM +0100, Christoph Hellwig wrote:
> From reading the series and the discussions with Darrick and Dave
> I'm coming more and more back to my initial position that tying this
> user visible feature to hardware limits is wrong and will just keep
> on creating ever more painpoints in the future.
> 
> Based on that I suspect that doing proper software only atomic writes
> using the swapext log item and selective always COW mode

Er, what are you thinking w.r.t. swapext and sometimescow?  swapext
doesn't currently handle COW forks at all, and it can only exchange
between two of the same type of fork (e.g. both data forks or both attr
forks, no mixing).

Or will that be your next suggestion whenever I get back to fiddling
with the online fsck patches? ;)

> and making that
> work should be the first step.  We can then avoid that overhead for
> properly aligned writs if the hardware supports it.  For your Oracle
> DB loads you'll set the alignment hints and maybe even check with
> fiemap that everything is fine and will get the offload, but we also
> provide a nice and useful API for less performance critical applications
> that don't have to care about all these details.

I suspect they might want to fail-fast (back to standard WAL mode or
whatever) if the hardware support isn't available.

--D

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set
  2024-02-05 10:26     ` John Garry
@ 2024-02-13 17:59       ` Darrick J. Wong
  2024-02-14 12:36         ` John Garry
  0 siblings, 1 reply; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-13 17:59 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Mon, Feb 05, 2024 at 10:26:43AM +0000, John Garry wrote:
> On 02/02/2024 18:06, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2024 at 02:26:45PM +0000, John Garry wrote:
> > > For when an inode is enabled for atomic writes, set FMODE_CAN_ATOMIC_WRITE
> > > flag.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/xfs/xfs_file.c | 2 ++
> > >   1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index e33e5e13b95f..1375d0089806 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -1232,6 +1232,8 @@ xfs_file_open(
> > >   		return -EIO;
> > >   	file->f_mode |= FMODE_NOWAIT | FMODE_BUF_RASYNC | FMODE_BUF_WASYNC |
> > >   			FMODE_DIO_PARALLEL_WRITE | FMODE_CAN_ODIRECT;
> > > +	if (xfs_inode_atomicwrites(XFS_I(inode)))
> 
> Note to self: This should also check if O_DIRECT is set
> 
> > 
> > Shouldn't we check that the device supports AWU at all before turning on
> > the FMODE flag?
> 
> Can we easily get this sort of bdev info here?
> 
> Currently if we do try to issue an atomic write and AWU for the bdev is
> zero, then XFS iomap code will reject it.

Hmm.  Well, if we move towards pushing all the hardware checks out of
xfs/iomap and into whatever goes on underneath submit_bio then I guess
we don't need to check device support here at all.

--D

> Thanks,
> John
> 
> > 
> > --D
> > 
> > > +		file->f_mode |= FMODE_CAN_ATOMIC_WRITE;
> > >   	return generic_file_open(inode, file);
> > >   }
> > > -- 
> > > 2.31.1
> > > 
> > > 
> 
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] fs: iomap: Atomic write support
  2024-02-05 11:29     ` John Garry
  2024-02-13  6:55       ` Christoph Hellwig
@ 2024-02-13 18:08       ` Darrick J. Wong
  1 sibling, 0 replies; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-13 18:08 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Mon, Feb 05, 2024 at 11:29:57AM +0000, John Garry wrote:
> On 02/02/2024 17:25, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2024 at 02:26:40PM +0000, John Garry wrote:
> > > Add flag IOMAP_ATOMIC_WRITE to indicate to the FS that an atomic write
> > > bio is being created and all the rules there need to be followed.
> > > 
> > > It is the task of the FS iomap iter callbacks to ensure that the mapping
> > > created adheres to those rules, like size is power-of-2, is at a
> > > naturally-aligned offset, etc. However, checking for a single iovec, i.e.
> > > iter type is ubuf, is done in __iomap_dio_rw().
> > > 
> > > A write should only produce a single bio, so error when it doesn't.
> > > 
> > > Signed-off-by: John Garry <john.g.garry@oracle.com>
> > > ---
> > >   fs/iomap/direct-io.c  | 21 ++++++++++++++++++++-
> > >   fs/iomap/trace.h      |  3 ++-
> > >   include/linux/iomap.h |  1 +
> > >   3 files changed, 23 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index bcd3f8cf5ea4..25736d01b857 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -275,10 +275,12 @@ static inline blk_opf_t iomap_dio_bio_opflags(struct iomap_dio *dio,
> > >   static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > >   		struct iomap_dio *dio)
> > >   {
> > > +	bool atomic_write = iter->flags & IOMAP_ATOMIC;
> > >   	const struct iomap *iomap = &iter->iomap;
> > >   	struct inode *inode = iter->inode;
> > >   	unsigned int fs_block_size = i_blocksize(inode), pad;
> > >   	loff_t length = iomap_length(iter);
> > > +	const size_t iter_len = iter->len;
> > >   	loff_t pos = iter->pos;
> > >   	blk_opf_t bio_opf;
> > >   	struct bio *bio;
> > > @@ -381,6 +383,9 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > >   					  GFP_KERNEL);
> > >   		bio->bi_iter.bi_sector = iomap_sector(iomap, pos);
> > >   		bio->bi_ioprio = dio->iocb->ki_ioprio;
> > > +		if (atomic_write)
> > > +			bio->bi_opf |= REQ_ATOMIC;
> > 
> > This really ought to be in iomap_dio_bio_opflags.  Unless you can't pass
> > REQ_ATOMIC to bio_alloc*, in which case there ought to be a comment
> > about why.
> 
> I think that should be ok
> 
> > 
> > Also, what's the meaning of REQ_OP_READ | REQ_ATOMIC?
> 
> REQ_ATOMIC will be ignored for REQ_OP_READ. I'm following the same policy as
> something like RWF_SYNC for a read.
> 
> However, if FMODE_CAN_ATOMIC_WRITE is unset, then REQ_ATOMIC will be
> rejected for both REQ_OP_READ and REQ_OP_WRITE.
> 
> > Does that
> > actually work?  I don't know what that means, and "block: Add REQ_ATOMIC
> > flag" says that's not a valid combination.  I'll complain about this
> > more below.
> 
> Please note that I do mention that this flag is only meaningful for
> pwritev2(), like RWF_SYNC, here:
> https://lore.kernel.org/linux-api/20240124112731.28579-3-john.g.garry@oracle.com/
> 
> > 
> > > +
> > >   		bio->bi_private = dio;
> > >   		bio->bi_end_io = iomap_dio_bio_end_io;
> > > @@ -397,6 +402,12 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > >   		}
> > >   		n = bio->bi_iter.bi_size;
> > > +		if (atomic_write && n != iter_len) {
> > 
> > s/iter_len/orig_len/ ?
> 
> ok, I can change the name if you prefer
> 
> > 
> > > +			/* This bio should have covered the complete length */
> > > +			ret = -EINVAL;
> > > +			bio_put(bio);
> > > +			goto out;
> > > +		}
> > >   		if (dio->flags & IOMAP_DIO_WRITE) {
> > >   			task_io_account_write(n);
> > >   		} else {
> > > @@ -554,12 +565,17 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >   	struct blk_plug plug;
> > >   	struct iomap_dio *dio;
> > >   	loff_t ret = 0;
> > > +	bool is_read = iov_iter_rw(iter) == READ;
> > > +	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC) && !is_read;
> > 
> > Hrmm.  So if the caller passes in an IOCB_ATOMIC iocb with a READ iter,
> > we'll silently drop IOCB_ATOMIC and do the read anyway?  That seems like
> > a nonsense combination, but is that ok for some reason?
> 
> Please see above
> 
> > 
> > >   	trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
> > >   	if (!iomi.len)
> > >   		return NULL;
> > > +	if (atomic_write && !iter_is_ubuf(iter))
> > > +		return ERR_PTR(-EINVAL);
> > 
> > Does !iter_is_ubuf actually happen?
> 
> Sure, if someone uses iovcnt > 1 for pwritev2
> 
> Please see __import_iovec(), where only if iovcnt == 1 we create iter_type
> == ITER_UBUF, if > 1 then we have iter_type == ITER_IOVEC

Ok.  The iter stuff (especially the macros) confuse the hell out of me
every time I go reading through that.

> > Why don't we support any of the
> > other ITER_ types?  Is it because hardware doesn't want vectored
> > buffers?
> It's related to how we can determine atomic_write_unit_max for the bdev.
> 
> We want to give a definitive max write value which we can guarantee to
> always fit in a BIO, but not mandate any extra special iovec
> length/alignment rules.
> 
> Without any iovec length or alignment rules (apart from direct IO rules that
> an iovec needs to be bdev logical block size and length aligned) , if a user
> provides many iovecs, then we may only be able to only fit bdev LBS of data
> (typically 512B) in each BIO vector, and thus we need to give a
> pessimistically low atomic_write_unit_max value.
> 
> If we say that iovcnt max == 1, then we know that we can fit PAGE size of
> data in each BIO vector (ignoring first/last vectors), and this will give a
> reasonably large atomic_write_unit_max value.
> 
> Note that we do now provide this iovcnt max value via statx, but always
> return 1 for now. This was agreed with Christoph, please see:
> https://lore.kernel.org/linux-nvme/20240117150200.GA30112@lst.de/

Got it.  We can always add ITER_IOVEC support later if we figure out a
sane way to restrain userspace. :)

> > 
> > I really wish there was more commenting on /why/ we do things here:
> > 
> > 	if (iocb->ki_flags & IOCB_ATOMIC) {
> > 		/* atomic reads do not make sense */
> > 		if (iov_iter_rw(iter) == READ)
> > 			return ERR_PTR(-EINVAL);
> > 
> > 		/*
> > 		 * block layer doesn't want to handle vectors of
> > 		 * buffers when performing an atomic write i guess?
> > 		 */
> > 		if (!iter_is_ubuf(iter))
> > 			return ERR_PTR(-EINVAL);
> > 
> > 		iomi.flags |= IOMAP_ATOMIC;
> > 	}
> 
> ok, I can make this more clear.
> 
> Note: It would be nice if we could check this in xfs_iomap_write_direct() or
> a common VFS helper (which xfs_iomap_write_direct() calls), but iter is not
> available there.

No, do not put generic stuff like that in XFS, leave it here in iomap.

> I could just check iter_is_ubuf() on its own in the vfs rw path, but I would
> like to keep the checks as close together as possible.

Yeah, and I want you to put as many of the checks in the VFS as
possible so that we (or really Ojaswin) don't end up copy-pasting all
that validation into ext4 and every other filesystem that wants to
expose untorn writes.

AFAICT the only things XFS really needs to do on its own is check that
the xfs_inode_alloc_unit() is a power of two if untorn writes are
present; and adjusting the awu min/max for statx reporting.

--D

> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/6] block atomic writes for XFS
  2024-02-13  8:41   ` John Garry
  2024-02-13  9:10     ` Ritesh Harjani
@ 2024-02-13 22:49     ` Dave Chinner
  2024-02-14 10:10       ` John Garry
  1 sibling, 1 reply; 68+ messages in thread
From: Dave Chinner @ 2024-02-13 22:49 UTC (permalink / raw)
  To: John Garry
  Cc: Ritesh Harjani (IBM),
	hch, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Tue, Feb 13, 2024 at 08:41:10AM +0000, John Garry wrote:
> On 13/02/2024 07:45, Ritesh Harjani (IBM) wrote:
> > John Garry <john.g.garry@oracle.com> writes:
> 
> Does xfs_io have a statx function?

Yes, it's right there in the man page:

	statx [ -v|-r ][ -m basic | -m all | -m <mask> ][ -FD ]
              Selected statistics from stat(2) and the XFS_IOC_GETXATTR system call on the current file.
                 -v  Show timestamps.
                 -r  Dump raw statx structure values.
                 -m basic
                     Set the field mask for the statx call to STATX_BASIC_STATS.
                 -m all
Set the field mask for the statx call to STATX_ALL (default).
                 -m <mask>
                     Specify a numeric field mask for the statx call.
                 -F  Force the attributes to be synced with the server.
                 -D  Don't sync attributes with the server.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-09 12:47                 ` John Garry
@ 2024-02-13 23:41                   ` Dave Chinner
  2024-02-14 11:06                     ` John Garry
  0 siblings, 1 reply; 68+ messages in thread
From: Dave Chinner @ 2024-02-13 23:41 UTC (permalink / raw)
  To: John Garry
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

On Fri, Feb 09, 2024 at 12:47:38PM +0000, John Garry wrote:
> > > > Why should we jump through crazy hoops to try to make filesystems
> > > > optimised for large IOs with mismatched, overlapping small atomic
> > > > writes?
> > > 
> > > As mentioned, typically the atomic writes will be the same size, but we may
> > > have other writes of smaller size.
> > 
> > Then we need the tiny write to allocate and zero according to the
> > maximum sized atomic write bounds. Then we just don't care about
> > large atomic IO overlapping small IO, because the extent on disk
> > aligned to the large atomic IO is then always guaranteed to be the
> > correct size and shape.
> 
> I think it's worth mentioning that there is currently a separation between
> how we configure the FS extent size for atomic writes and what the bdev can
> actually support in terms of atomic writes.

And that's part of what is causing all the issues here - we're
trying to jump through hoops at the fs level to handle cases that
the device doesn't support, and vice versa.

> Setting the rtvol extsize at mkfs time or enabling atomic writes
> FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do in
> terms of atomic writes.

Which is wrong. mkfs.xfs gets physical information about the volume
from the kernel and makes the filesystem accounting to that
information. That's how we do stripe alignment, sector sizing, etc.
Atomic write support and setting up alignment constraints should be
no different.

Yes, mkfs allows the user to override the hardware configs it
probes, but it also warns when the override is doing something
sub-optimal (like aligning all AG headers to the same disk in a
stripe).

IOWs, mkfs should be pulling this atomic write info from the
hardware and configuring the filesystem around that information.
That's the target we should be aiming the kernel implementation at
and optimising for - a filesystem that is correctly configured
according to published hardware capability.

Everything else is in the "make it behave correctly, but we don't
care if it's slow" category.

> This check is not done as it is not fixed what the bdev can do in terms of
> atomic writes - or, more specifically, what the request_queue reports is
> not fixed. There are things which can change this. For example, a FW
> update could change all the atomic write capabilities of a disk. Or even if
> we swapped a SCSI disk into another host the atomic write limits may change,
> as the atomic write unit max depends on the SCSI HBA DMA limits. Changing
> BIO_MAX_VECS - which could come from a kernel update - could also change
> what we report as atomic write limit in the request queue.

If that sort of thing happens, then that's too bad. We already have
these sorts of "do not do if you care about performance"
constraints. e.g. don't do a RAID restripe that changes the
alignment/size of the RAID device (e.g. add a single disk and make
the stripe width wider) because the physical filesystem structure
will no longer be aligned to the underlying hardware. instead, you
have to grow striped volumes with compatible stripes in compatible
sizes to ensure the filesystem remains aligned to the storage...

We don't try to cater for every single possible permutation of
storage hardware configurations - that way lies madness. Design and
optimise for the common case of correctly configured and well
behaved storage, and everything else we just don't care about beyond
"don't corrupt or lose data".

> > > > And therein lies the problem.
> > > > 
> > > > If you are doing sub-rtextent IO at all, then you are forcing the
> > > > filesystem down the path of explicitly using unwritten extents and
> > > > requiring O_DSYNC direct IO to do journal flushes in IO completion
> > > > context and then performance just goes down hill from them.
> > > > 
> > > > The requirement for unwritten extents to track sub-rtextsize written
> > > > regions is what you're trying to work around with XFS_BMAPI_ZERO so
> > > > that atomic writes will always see "atomic write aligned" allocated
> > > > regions.
> > > > 
> > > > Do you see the problem here? You've explicitly told the filesystem
> > > > that allocation is aligned to 64kB chunks, then because the
> > > > filesystem block size is 4kB, it's allowed to track unwritten
> > > > regions at 4kB boundaries. Then you do 4kB aligned file IO, which
> > > > then changes unwritten extents at 4kB boundaries. Then you do a
> > > > overlapping 16kB IO that *requires* 16kB allocation alignment, and
> > > > things go BOOM.
> > > > 
> > > > Yes, they should go BOOM.
> > > > 
> > > > This is a horrible configuration - it is incompatible with 16kB
> > > > aligned and sized atomic IO.
> > > 
> > > Just because the DB may do 16KB atomic writes most of the time should not
> > > disallow it from any other form of writes.
> > 
> > That's not what I said. I said the using sub-rtextsize atomic writes
> > with single FSB unwritten extent tracking is horrible and
> > incompatible with doing 16kB atomic writes.
> > 
> > This setup will not work at all well with your patches and should go
> > BOOM. Using XFS_BMAPI_ZERO is hacking around the fact that the setup
> > has uncoordinated extent allocation and unwritten conversion
> > granularity.
> > 
> > That's the fundamental design problem with your approach - it allows
> > unwritten conversion at *minimum IO sizes* and that does not work
> > with atomic IOs with larger alignment requirements.
> > 
> > The fundamental design principle is this: for maximally sized atomic
> > writes to always succeed we require every allocation, zeroing and
> > unwritten conversion operation to use alignments and sizes that are
> > compatible with the maximum atomic write sizes being used.
> > 
> 
> That sounds fine.
> 
> My question then is how we determine this max atomic write size granularity.
> 
> We don't explicitly tell the FS what atomic write size we want for a file.
> Rather we mkfs with some extsize value which should match our atomic write
> maximal value and then tell the FS we want to do atomic writes on a file,
> and if this is accepted then we can query the atomic write min and max unit
> size, and this would be [FS block size, min(bdev atomic write limit,
> rtextsize)].
> 
> If rtextsize is 16KB, then we have a good idea that we want 16KB atomic
> writes support. So then we could use rtextsize as this max atomic write
> size.

Maybe, but I think continuing to focus on this as 'atomic writes
requires' is wrong.

The filesystem does not care about atomic writes. What it cares
about is the allocation constraints that need to be implemented.
That constraint is that all BMBT extent operations need to be
aligned to a specific extent size, not filesystem blocks.

The current extent size hint (and rtextsize) only impact the
-allocation of extents-. They are not directly placing constraints
on the BMBT layout, they are placing constraints on the free space
search that the allocator runs on the BNO/CNT btrees to select an
extent that is then inserted into the BMBT.

The problem is that unwritten extent conversion, truncate, hole
punching, etc also all need to be correctly aligned for files that
are configured to support atomic writes. These operations place
constraints on how the BMBT can modify the existing extent list.

These are different constraints to what rtextsize/extszhint apply,
and that's the fundamental behavioural difference between existing
extent size hint behaviour and the behaviour needed by atomic
writes.

> But I am not 100% sure that this is your idea (apologies if I am wrong - I
> am sincerely trying to follow your idea), but rather it would be
> min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and bdev
> atomic write limit is 16KB, then there is not much point in dealing in 1MB
> blocks for this unwritten extent conversion alignment.

Exactly my point - there really is no relationship between rtextsize
and atomic write constraints and that it is a mistake to use
rtextsize as it stands as a placeholder for atomic write
constraints.

> If so, then my
> concern is that the bdev atomic write upper limit is not fixed. This can be
> solved, but I would still like to be clear on this max atomic write size.

Right, we haven't clearly defined how XFS is supposed to align BMBT
operations in a way that is compatible for atomic write operations.

What the patchset does is try to extend and infer things from
existing allocation alignment constraints, but then not apply those
constraints to critical extent state operations (pure BMBT
modifications) that atomic writes also need constrained to work
correctly and efficiently.

> > i.e. atomic writes need to use max write size granularity for all IO
> > operations, not filesystem block granularity.
> > 
> > And that also means things like rtextsize and extsize hints need to
> > match these atomic write requirements, too....
> 
> As above, I am not 100% sure if you mean these to be the atomic write
> maximal value.

Yes, they either need to be the same as the atomic write max value
or, much better, once a hint has been set, then resultant size is
then aligned up to be compatible with the atomic write constraints.

e.g. set an inode extent size hint of 960kB on a device with 128kB
atomic write capability. If the inode has the atomic write flag set,
then allocations need to round the extent size up from 960kB to 1MB
so that the BMBT extent layout and alignment is compatible with 128kB
atomic writes.

At this point, zeroing, punching, unwritten extent conversion, etc
also needs to be done on aligned 128kB ranges to be compatible with
atomic writes, rather than filesystem block boundaries that would
normally be used if just the extent size hint was set.

This is effectively what the original "force align" inode flag bit
did - it told the inode that all BMBT manipulations need to be
extent size hint aligned, not just allocation. It's a generic flag
that implements specific extent manipulation constraints that happen
to be compatible with atomic writes when the right extent size hint
is set.

So at this point, I'm not sure that we should have an "atomic
writes" flag in the inode. We need to tell BMBT modifications
to behave in a particular way - forced alignment - not that atomic
writes are being used in the filesystem....

At this point, the filesystem can do all the extent modification
alignment stuff that atomic writes require without caring if the
block device supports atomic writes or even if the
application is using atomic writes.

This means we can test the BMBT functionality in fstests without
requiring hardware (or emulation) that supports atomic writes - all
we need to do is set the forced align flag, an extent size hint and
go run fsx on it...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/6] block atomic writes for XFS
  2024-02-13  7:22 ` Christoph Hellwig
  2024-02-13 17:55   ` Darrick J. Wong
@ 2024-02-13 23:50   ` Dave Chinner
  2024-02-14  7:38     ` Christoph Hellwig
  1 sibling, 1 reply; 68+ messages in thread
From: Dave Chinner @ 2024-02-13 23:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: John Garry, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Tue, Feb 13, 2024 at 08:22:37AM +0100, Christoph Hellwig wrote:
> From reading the series and the discussions with Darrick and Dave
> I'm coming more and more back to my initial position that tying this
> user visible feature to hardware limits is wrong and will just keep
> on creating ever more painpoints in the future.

Yes, that's pretty much what I've been trying to say from the start.

The functionality atomic writes need from the filesystem is for
extent alignment constraints to be applied to all extent
manipulations, not just allocation.  This is the same functionality
that DAX based XFS filesystems need to guarantee PMD aligned
extents.

IOWs, the required filesystem extent alignment functionality is not
specific to atomic writes and it is not specific to a particular
type of storage hardware.

If we implement the generic extent alignment constraints properly,
everything else from there is just a matter of configuring the
filesystem geometry to match the underlying hardware capability.
mkfs can do that for us, like it already does for RAID storage...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/6] block atomic writes for XFS
  2024-02-13 23:50   ` Dave Chinner
@ 2024-02-14  7:38     ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2024-02-14  7:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, John Garry, djwong, viro, brauner, dchinner,
	jack, chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

On Wed, Feb 14, 2024 at 10:50:22AM +1100, Dave Chinner wrote:
> The functionality atomic writes need from the filesystem is for
> extent alignment constraints to be applied to all extent
> manipulations, not just allocation.  This is the same functionality
> that DAX based XFS filesystems need to guarantee PMD aligned
> extents.
> 
> IOWs, the required filesystem extent alignment functionality is not
> specific to atomic writes and it is not specific to a particular
> type of storage hardware.
> 
> If we implement the generic extent alignment constraints properly,
> everything else from there is just a matter of configuring the
> filesystem geometry to match the underlying hardware capability.
> mkfs can do that for us, like it already does for RAID storage...

Agreed.

But the one thing making atomic writes odd right now is that it
absolutely is required for operation right now, while for other
features it is somewhere between nice and important to have and not
a deal breaker.

So either we need to figure out a somewhat generic and not totally
XFS implementation specific user space interface to do the force
alignment (which this series tries to do).  Or we make atomic
writes like the other features and ensure they still work without
the proper alignment, even if they are slow.  Doing that was my initial
gut feeling, and looking at other approaches just makes me lean
even more strongly towards that.



> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
---end quoted text---

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/6] block atomic writes for XFS
  2024-02-13 17:55   ` Darrick J. Wong
@ 2024-02-14  7:45     ` Christoph Hellwig
  2024-02-21 16:56       ` Darrick J. Wong
  0 siblings, 1 reply; 68+ messages in thread
From: Christoph Hellwig @ 2024-02-14  7:45 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, John Garry, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

On Tue, Feb 13, 2024 at 09:55:49AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 13, 2024 at 08:22:37AM +0100, Christoph Hellwig wrote:
> > From reading the series and the discussions with Darrick and Dave
> > I'm coming more and more back to my initial position that tying this
> > user visible feature to hardware limits is wrong and will just keep
> > on creating ever more painpoints in the future.
> > 
> > Based on that I suspect that doing proper software only atomic writes
> > using the swapext log item and selective always COW mode
> 
> Er, what are you thinking w.r.t. swapext and sometimescow?

What do you mean with sometimescow?  Just normal reflinked inodes?

> swapext
> doesn't currently handle COW forks at all, and it can only exchange
> between two of the same type of fork (e.g. both data forks or both attr
> forks, no mixing).
> 
> Or will that be your next suggestion whenever I get back to fiddling
> with the online fsck patches? ;)

Let's take a step back.  If we want atomic write semantics without
hardware offload, what we need is to allocate new blocks and atomically
swap them into the data fork.  Basically an atomic version of
xfs_reflink_end_cow.  But yes, the details of the current swapext
item might not be an exact fit, maybe it's just shared infrastructure
and concepts.

I'm not planning to make you do it, because such a log item would
generally be pretty useful for always COW mode.

> > and making that
> > work should be the first step.  We can then avoid that overhead for
properly aligned writes if the hardware supports it.  For your Oracle
> > DB loads you'll set the alignment hints and maybe even check with
> > fiemap that everything is fine and will get the offload, but we also
> > provide a nice and useful API for less performance critical applications
> > that don't have to care about all these details.
> 
> I suspect they might want to fail-fast (back to standard WAL mode or
> whatever) if the hardware support isn't available.

Maybe for your particular DB use case.  But there's plenty of
applications that just want atomic writes without building their
own infrastructure, including some that want pretty large chunks.

Also if a file system supports logging data (which I have an
XFS early prototype for that I plan to finish), we can even do
the small double writes more efficiently than the application,
all through the same interface.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/6] block atomic writes for XFS
  2024-02-13 22:49     ` Dave Chinner
@ 2024-02-14 10:10       ` John Garry
  0 siblings, 0 replies; 68+ messages in thread
From: John Garry @ 2024-02-14 10:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ritesh Harjani (IBM),
	hch, djwong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 13/02/2024 22:49, Dave Chinner wrote:
>> Does xfs_io have a statx function?
> Yes, it's right there in the man page:
> 
> 	statx [ -v|-r ][ -m basic | -m all | -m <mask> ][ -FD ]
>                Selected statistics from stat(2) and the XFS_IOC_GETXATTR system call on the current file.
>                   -v  Show timestamps.
>                   -r  Dump raw statx structure values.
>                   -m basic
>                       Set the field mask for the statx call to STATX_BASIC_STATS.
>                   -m all
>                       Set the field mask for the statx call to STATX_ALL (default).
>                   -m <mask>
>                       Specify a numeric field mask for the statx call.
>                   -F  Force the attributes to be synced with the server.
>                   -D  Don't sync attributes with the server.

ok, I can check that out and look to add any support required for atomic 
writes extension.

Thanks,
John

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-13 23:41                   ` Dave Chinner
@ 2024-02-14 11:06                     ` John Garry
  2024-02-14 23:03                       ` Dave Chinner
  0 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-02-14 11:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin


> 
>> Setting the rtvol extsize at mkfs time or enabling atomic writes
>> FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do in
>> terms of atomic writes.
> 
> Which is wrong. mkfs.xfs gets physical information about the volume
> from the kernel and makes the filesystem accounting to that
> information. That's how we do stripe alignment, sector sizing, etc.
> Atomic write support and setting up alignment constraints should be
> no different.

Yes, I was just looking at adding a mkfs option to format for atomic 
writes, which would check physical information about the volume and 
whether it suits rtextsize and then subsequently also set 
XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.

> 
> Yes, mkfs allows the user to override the hardware configs it
> probes, but it also warns when the override is doing something
> sub-optimal (like aligning all AG headers to the same disk in a
> stripe).
> 
> IOWs, mkfs should be pulling this atomic write info from the
> hardware and configuring the filesystem around that information.
> That's the target we should be aiming the kernel implementation at
> and optimising for - a filesystem that is correctly configured
> according to published hardware capability.

Right

So, for example, if the atomic writes option is set and the rtextsize 
set by the user is so much larger than what HW can support in terms of 
atomic writes, then we should let the user know about this.

> 
> Everything else is in the "make it behave correctly, but we don't
> care if it's slow" category.
> 
>> This check is not done as it is not fixed what the bdev can do in terms of
>> atomic writes - or, more specifically, what the request_queue reports is
>> not fixed. There are things which can change this. For example, a FW
>> update could change all the atomic write capabilities of a disk. Or even if
>> we swapped a SCSI disk into another host the atomic write limits may change,
>> as the atomic write unit max depends on the SCSI HBA DMA limits. Changing
>> BIO_MAX_VECS - which could come from a kernel update - could also change
>> what we report as atomic write limit in the request queue.
> 
> If that sort of thing happens, then that's too bad. We already have
> these sorts of "do not do if you care about performance"
> constraints. e.g. don't do a RAID restripe that changes the
> alignment/size of the RAID device (e.g. add a single disk and make
> the stripe width wider) because the physical filesystem structure
> will no longer be aligned to the underlying hardware. Instead, you
> have to grow striped volumes with compatible stripes in compatible
> sizes to ensure the filesystem remains aligned to the storage...
> 
> We don't try to cater for every single possible permutation of
> storage hardware configurations - that way lies madness. Design and
> optimise for the common case of correctly configured and well
> behaved storage, and everything else we just don't care about beyond
> "don't corrupt or lose data".

ok

> 
>>>>> And therein lies the problem.
>>>>>

...

>>
>> That sounds fine.
>>
>> My question then is how we determine this max atomic write size granularity.
>>
>> We don't explicitly tell the FS what atomic write size we want for a file.
>> Rather we mkfs with some extsize value which should match our atomic write
>> maximal value and then tell the FS we want to do atomic writes on a file,
>> and if this is accepted then we can query the atomic write min and max unit
>> size, and this would be [FS block size, min(bdev atomic write limit,
>> rtextsize)].
>>
>> If rtextsize is 16KB, then we have a good idea that we want 16KB atomic
>> writes support. So then we could use rtextsize as this max atomic write
>> size.
> 
> Maybe, but I think continuing to focus on this as 'atomic writes
> requires' is wrong.
> 
> The filesystem does not care about atomic writes. What it cares
> about is the allocation constraints that need to be implemented.
> That constraint is that all BMBT extent operations need to be
> aligned to a specific extent size, not filesystem blocks.
> 
> The current extent size hint (and rtextsize) only impact the
> -allocation of extents-. They are not directly placing constraints
> on the BMBT layout, they are placing constraints on the free space
> search that the allocator runs on the BNO/CNT btrees to select an
> extent that is then inserted into the BMBT.
> 
> The problem is that unwritten extent conversion, truncate, hole
> punching, etc also all need to be correctly aligned for files that
> are configured to support atomic writes. These operations place
> constraints on how the BMBT can modify the existing extent list.
> 
> These are different constraints to what rtextsize/extszhint apply,
> and that's the fundamental behavioural difference between existing
> extent size hint behaviour and the behaviour needed by atomic
> writes.
> 
>> But I am not 100% sure that this is your idea (apologies if I am wrong - I
>> am sincerely trying to follow your idea), but rather it would be
>> min(rtextsize, bdev atomic write limit), e.g. if rtextsize was 1MB and bdev
>> atomic write limit is 16KB, then there is not much point in dealing in 1MB
>> blocks for this unwritten extent conversion alignment.
> 
> Exactly my point - there really is no relationship between rtextsize
> and atomic write constraints and that it is a mistake to use
> rtextsize as it stands as a placeholder for atomic write
> constraints.
> 

ok

>> If so, then my
>> concern is that the bdev atomic write upper limit is not fixed. This can be
>> solved, but I would still like to be clear on this max atomic write size.
> 
> Right, we haven't clearly defined how XFS is supposed to align BMBT
> operations in a way that is compatible for atomic write operations.
> 
> What the patchset does is try to extend and infer things from
> existing allocation alignment constraints, but then not apply those
> constraints to critical extent state operations (pure BMBT
> modifications) that atomic writes also need constrained to work
> correctly and efficiently.

Right. Previously I also did mention that we could explicitly request 
the atomic write size per-inode, but a drawback is that this would 
require an on-disk format change.

> 
>>> i.e. atomic writes need to use max write size granularity for all IO
>>> operations, not filesystem block granularity.
>>>
>>> And that also means things like rtextsize and extsize hints need to
>>> match these atomic write requirements, too....
>>
>> As above, I am not 100% sure if you mean these to be the atomic write
>> maximal value.
> 
> Yes, they either need to be the same as the atomic write max value
> or, much better, once a hint has been set, then resultant size is
> then aligned up to be compatible with the atomic write constraints.
> 
> e.g. set an inode extent size hint of 960kB on a device with 128kB
> atomic write capability. If the inode has the atomic write flag set,
> then allocations need to round the extent size up from 960kB to 1MB
> so that the BMBT extent layout and alignment is compatible with 128kB
> atomic writes.
> 
> At this point, zeroing, punching, unwritten extent conversion, etc
> also needs to be done on aligned 128kB ranges to be compatible with
> atomic writes, rather than filesystem block boundaries that would
> normally be used if just the extent size hint was set.
> 
> This is effectively what the original "force align" inode flag bit
> did - it told the inode that all BMBT manipulations need to be
> extent size hint aligned, not just allocation. It's a generic flag
> that implements specific extent manipulation constraints that happen
> to be compatible with atomic writes when the right extent size hint
> is set.
> 
> So at this point, I'm not sure that we should have an "atomic
> writes" flag in the inode. 

Another motivation for this flag is that we can explicitly enable some 
software-based atomic write support for an inode when the backing device 
does not have HW support.

In addition, in this series setting FS_XFLAG_ATOMICWRITES means 
XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something 
similar for other OSes, and for those other OSes it may also mean some 
other special alignment feature enabled. We want a consistent user 
experience.

> We need to tell BMBT modifications
> to behave in a particular way - forced alignment - not that atomic
> writes are being used in the filesystem....
> 
> At this point, the filesystem can do all the extent modification
> alignment stuff that atomic writes require without caring if the
> block device supports atomic writes or even if the
> application is using atomic writes.
> 
> This means we can test the BMBT functionality in fstests without
> requiring hardware (or emulation) that supports atomic writes - all
> we need to do is set the forced align flag, an extent size hint and
> go run fsx on it...
> 

The current idea was that the forcealign feature would be required for 
the second phase of atomic write support - non-rtvol support. Darrick 
did send that series out separately over the New Year's break.

I think that you wanted to progress the following series first:
https://lore.kernel.org/linux-xfs/20231004001943.349265-1-david@fromorbit.com/

Right?

Thanks,
John




^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-13 17:50       ` Darrick J. Wong
@ 2024-02-14 12:13         ` John Garry
  0 siblings, 0 replies; 68+ messages in thread
From: John Garry @ 2024-02-14 12:13 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin


>>>
>>> Not sure why we care about the file position, it's br_startblock that
>>> gets passed into the bio, not br_startoff.
>>
>> We just want to ensure that the length of the write is valid w.r.t. to the
>> offset within the extent, and br_startoff would be the offset within the
>> aligned extent.
> 
> Yes, I understand what br_startoff is, but this doesn't help me
> understand why this code is necessary.  Let's say you have a device that
> supports untorn writes of 16k in length provided the LBA of the write
> command is also aligned to 16k, and the fs has 4k blocks.
> 
> Userspace issues an 16k untorn write at offset 13k in the file, and gets
> this mapping:
> 
> [startoff: 13k, startblock: 16k, blockcount: 16k]
> 
> Why should this IO be rejected? 

It's rejected as it does not follow the rules.

> The physical space extent satisfies the
> alignment requirements of the underlying device, and the logical file
> space extent does not need aligning at all.

Sure. In this case, we can produce a single BIO and the underlying HW 
may be able to handle this atomically.

The point really is that we want a consistent userspace experience. We 
say that the write 'must' be naturally aligned, not 'should' be.

It's not really useful to the user if sometimes a write passes and 
sometimes it fails, depending on how the extents happen to be laid out.

Furthermore, in this case, what should the user do if this write at 13K 
offset fails as the 16K of data straddles 2x extents? They asked for 16K 
written at offset 13K and they want it done atomically - there is 
nothing which the FS can do to help. If they don't really need 16K 
written atomically, then better just do a regular write, or write 
individual chunks atomically.

Thanks,
John

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 3/6] fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol
  2024-02-13 17:22       ` Darrick J. Wong
@ 2024-02-14 12:19         ` John Garry
  0 siblings, 0 replies; 68+ messages in thread
From: John Garry @ 2024-02-14 12:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 13/02/2024 17:22, Darrick J. Wong wrote:
>> I am not sure what bdev_validate_atomic_write() would even do. If
>> sb_rextsize exceeded the bdev atomic write unit max, then we just cap
>> reported atomic write unit max in statx to that which the bdev reports and
>> vice-versa.
>>
>> And didn't we previously have a concern that it is possible to change the
>> geometry of the device?
> The thing is, I don't want this logic:
> 
> 	if (!is_power_of_2(mp->m_sb.sb_rextsize))
> 		/* fail */

This is really specific to XFS. Let's see where all this alignment stuff 
goes before trying to unify all these checks.

> 
> to be open-coded inside xfs.  I'd rather have a standard bdev_* helper
> that every filesystem can call, so we don't end up with more generic
> code copy-pasted all over the codebase.
> 
> The awkward part (for me) is the naming, since filesystems usually don't
> have to check with the block layer about their units of space allocation.
> 
> /*
>   * Ensure that a file space allocation unit is congruent with the atomic
>   * write unit capabilities of supported block devices.
>   */
> static inline bool bdev_validate_atomic_write_allocunit(unsigned au)
> {
> 	return is_power_of_2(au);
> }
> 
> 	if (!bdev_validate_atomic_write_allocunit(mp->m_sb.sb_rextsize))
> 		return -EINVAL;

As above, I can try to unify, but this alignment stuff is a bit up in 
the air at the moment.

Thanks,
John


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 4/6] fs: xfs: Support atomic write for statx
  2024-02-13 17:37       ` Darrick J. Wong
@ 2024-02-14 12:26         ` John Garry
  0 siblings, 0 replies; 68+ messages in thread
From: John Garry @ 2024-02-14 12:26 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 13/02/2024 17:37, Darrick J. Wong wrote:
>> We use this in the iomap and statx code
>>
>>>> +	struct xfs_inode *ip,
>>>> +	unsigned int *unit_min,
>>>> +	unsigned int *unit_max)
>>> Weird indenting here.
>> hmmm... I thought that this was the XFS style
>>
>> Can you show how it should look?
> The parameter declarations should line up with the local variables:
> 
> void
> xfs_get_atomic_write_attr(
> 	struct xfs_inode	*ip,
> 	unsigned int		*unit_min,
> 	unsigned int		*unit_max)
> {
> 	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
> 	struct block_device	*bdev = target->bt_bdev;
> 	struct request_queue	*q = bdev->bd_queue;
> 	struct xfs_mount	*mp = ip->i_mount;
> 	unsigned int		awu_min, awu_max, align;
> 	xfs_extlen_t		extsz = xfs_get_extsz(ip);
> 
>>>> +{
>>>> +	xfs_extlen_t		extsz = xfs_get_extsz(ip);
>>>> +	struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
>>>> +	struct block_device	*bdev = target->bt_bdev;
>>>> +	unsigned int		awu_min, awu_max, align;
>>>> +	struct request_queue	*q = bdev->bd_queue;
>>>> +	struct xfs_mount	*mp = ip->i_mount;
>>>> +
>>>> +	/*
>>>> +	 * Convert to multiples of the BLOCKSIZE (as we support a minimum
>>>> +	 * atomic write unit of BLOCKSIZE).
>>>> +	 */
>>>> +	awu_min = queue_atomic_write_unit_min_bytes(q);
>>>> +	awu_max = queue_atomic_write_unit_max_bytes(q);
>>>> +
>>>> +	awu_min &= ~mp->m_blockmask;
>>> Why do you round/down/  the awu_min value here?
>> This is just to ensure that we returning *unit_min >= BLOCKSIZE
>>
>> For example, if awu_min, max 1K, 64K from the bdev, we now have 0 and 64K.
>> And below this gives us awu_min, max of 4k, 64k.
>>
>> Maybe there is a more logical way of doing this.
> 	awu_min = roundup(queue_atomic_write_unit_min_bytes(q),
> 			  mp->m_sb.sb_blocksize);
> 
> ?

Yeah, I think that all this can be simplified to be made more obvious.

> 
>>>> +	awu_max &= ~mp->m_blockmask;
>>> Actually -- since the atomic write units have to be powers of 2, why is
>>> rounding needed here at all?
>> Sure, but the bdev can report a awu_min < BLOCKSIZE
>>
>>>> +
>>>> +	align = XFS_FSB_TO_B(mp, extsz);
>>>> +
>>>> +	if (!awu_max || !xfs_inode_atomicwrites(ip) || !align ||
>>>> +	    !is_power_of_2(align)) {
>>> ...and if you take my suggestion to make a common helper to validate the
>>> atomic write unit parameters, this can collapse into:
>>>
>>> 	alloc_unit_bytes = xfs_inode_alloc_unitsize(ip);
>>> 	if (!xfs_inode_has_atomicwrites(ip) ||
>>> 	    !bdev_validate_atomic_write(bdev, alloc_unit_bytes)) {
>>> 		/* not supported, return zeroes */
>>> 		*unit_min = 0;
>>> 		*unit_max = 0;
>>> 		return;
>>> 	}
>>>
>>> 	*unit_min = max(alloc_unit_bytes, awu_min);
>>> 	*unit_max = min(alloc_unit_bytes, awu_max);
>> Again, we need to ensure that *unit_min >= BLOCKSIZE
> The file allocation unit and hence the return value of
> xfs_inode_alloc_unitsize is always a multiple of sb_blocksize.

Right, but this value is coming from HW and we are just ensuring that 
the awu_min which we report is >= BLOCKSIZE. xfs_inode_alloc_unitsize() 
return value will really guide unit_max.

Anyway, again I can make this all more obvious.

Thanks,
John



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set
  2024-02-13 17:59       ` Darrick J. Wong
@ 2024-02-14 12:36         ` John Garry
  2024-02-21 17:00           ` Darrick J. Wong
  0 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-02-14 12:36 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 13/02/2024 17:59, Darrick J. Wong wrote:
>>> Shouldn't we check that the device supports AWU at all before turning on
>>> the FMODE flag?
>> Can we easily get this sort of bdev info here?
>>
>> Currently if we do try to issue an atomic write and AWU for the bdev is
>> zero, then XFS iomap code will reject it.
> Hmm.  Well, if we move towards pushing all the hardware checks out of
> xfs/iomap and into whatever goes on underneath submit_bio then I guess
> we don't need to check device support here at all.

Yeah, I have been thinking about this. But I was still planning on 
putting a "bdev on atomic write" check here, as you mentioned.

But is this a proper method to access the bdev for an xfs inode:

STATIC bool
xfs_file_can_atomic_write(
	struct xfs_inode	*inode)
{
	struct xfs_buftarg *target = xfs_inode_buftarg(inode);
	struct block_device *bdev = target->bt_bdev;

	if (!xfs_inode_atomicwrites(inode))
		return false;

	return bdev_can_atomic_write(bdev);
}

I do notice the dax check in xfs_bmbt_to_iomap() when assigning 
iomap->bdev, which creates some doubt for me.

Thanks,
John

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-14 11:06                     ` John Garry
@ 2024-02-14 23:03                       ` Dave Chinner
  2024-02-15  9:53                         ` John Garry
  0 siblings, 1 reply; 68+ messages in thread
From: Dave Chinner @ 2024-02-14 23:03 UTC (permalink / raw)
  To: John Garry
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

On Wed, Feb 14, 2024 at 11:06:11AM +0000, John Garry wrote:
> 
> > 
> > > Setting the rtvol extsize at mkfs time or enabling atomic writes
> > > FS_XFLAG_ATOMICWRITES doesn't check for what the underlying bdev can do in
> > > terms of atomic writes.
> > 
> > Which is wrong. mkfs.xfs gets physical information about the volume
> > from the kernel and lays the filesystem out according to that
> > information. That's how we do stripe alignment, sector sizing, etc.
> > Atomic write support and setting up alignment constraints should be
> > no different.
> 
> Yes, I was just looking at adding a mkfs option to format for atomic writes,
> which would check physical information about the volume and whether it suits
> rtextsize and then subsequently also set XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.

FWIW, atomic writes need to be implemented in XFS in a way that
isn't specific to the rtdev. There is no reason they cannot be
applied to the data device (via superblock max atomic write size
field or extent size hints and the align flag) so
please don't get hung up on rtextsize as the only thing that atomic
write alignment might apply to.

> > Yes, mkfs allows the user to override the hardware configs it
> > probes, but it also warns when the override is doing something
> > sub-optimal (like aligning all AG headers to the same disk in a
> > stripe).
> > 
> > IOWs, mkfs should be pulling this atomic write info from the
> > hardware and configuring the filesystem around that information.
> > That's the target we should be aiming the kernel implementation at
> > and optimising for - a filesystem that is correctly configured
> > according to published hardware capability.
> 
> Right
> 
> So, for example, if the atomic writes option is set and the rtextsize set by
> the user is so much larger than what HW can support in terms of atomic
> writes, then we should let the user know about this.

Well, this is part of the problem I mention above: you're focussing
entirely on the rtdev setup and not the general "atomic writes
require BMBT extent alignment constraints".

So, maybe, yes, we might want to warn that the rtextsize is much
bigger than the atomic write size of that device, but now there's
something else we need to take into account: the rtdev could have a
> different atomic write size compared to the data device.

What now?

IOWs, focussing on the rtdev misses key considerations for making
the functionality generic, and we most definitely don't want to have
to rev the on disk format a second time to support atomic writes
for the data device. Hence we likely need two variables for atomic
write sizes in the superblock - one for the rtdev, and one for the
data device. And this then feeds through to Darrick's multi-rtdev
stuff - each rtdev will need to have an attribute that tracks this
information.

Actually, now that I think about it, we may require 3 sizes - I'm in
the process of writing a design doc for an new journal format, and
one of the things I want it to be able to do is use atomic writes to
enable journal writes to be decoupled from device sector sizes. If
we are using atomic writes > sector size for the journal, then we
definitely need to record somewhere in the superblock what the
minimum journal IO size is because it isn't going to be the sector
size anymore...

[...]

> > > If so, then my
> > > concern is that the bdev atomic write upper limit is not fixed. This can
> > > solved, but I would still like to be clear on this max atomic write size.
> > 
> > Right, we haven't clearly defined how XFS is supposed to align BMBT
> > operations in a way that is compatible for atomic write operations.
> > 
> > What the patchset does is try to extend and infer things from
> > existing allocation alignment constraints, but then not apply those
> > constraints to critical extent state operations (pure BMBT
> > modifications) that atomic writes also need constrained to work
> > correctly and efficiently.
> 
> Right. Previously I also did mention that we could explicitly request the
> atomic write size per-inode, but a drawback is that this would require an
> on-disk format change.

We're already having to change the on-disk format for this (inode
flag, superblock feature bit), so we really should be trying to make
this generic and properly featured so that it can be used for
anything that requires physical alignment of file data extents, not
just atomic writes...

> > > > i.e. atomic writes need to use max write size granularity for all IO
> > > > operations, not filesystem block granularity.
> > > > 
> > > > And that also means things like rtextsize and extsize hints need to
> > > > match these atomic write requirements, too....
> > > 
> > > As above, I am not 100% sure if you mean these to be the atomic write
> > > maximal value.
> > 
> > Yes, they either need to be the same as the atomic write max value
> > or, much better, once a hint has been set, the resultant size is
> > then aligned up to be compatible with the atomic write constraints.
> > 
> > e.g. set an inode extent size hint of 960kB on a device with 128kB
> > atomic write capability. If the inode has the atomic write flag set,
> > then allocations need to round the extent size up from 960kB to 1MB
> > so that the BMBT extent layout and alignment is compatible with 128kB
> > atomic writes.
> > 
> > At this point, zeroing, punching, unwritten extent conversion, etc
> > also needs to be done on aligned 128kB ranges to be compatible with
> > atomic writes, rather than filesystem block boundaries that would
> > normally be used if just the extent size hint was set.
> > 
> > This is effectively what the original "force align" inode flag bit
> > did - it told the inode that all BMBT manipulations need to be
> > extent size hint aligned, not just allocation. It's a generic flag
> > that implements specific extent manipulation constraints that happen
> > to be compatible with atomic writes when the right extent size hint
> > is set.
> > 
> > So at this point, I'm not sure that we should have an "atomic
> > writes" flag in the inode.
> 
> Another motivation for this flag is that we can explicitly enable some
> software-based atomic write support for an inode when the backing device
> does not have HW support.

That's orthogonal to the alignment issue. If the BMBT extents are
always aligned in a way that is compatible with atomic writes, we
don't need an atomic writes flag to tell the filesystem it should
do an atomic write. That comes from userspace via the IOCB_ATOMIC
flag.

It is the IOCB_ATOMIC that triggers the software fallback when the
hardware doesn't support atomic writes, not an inode flag. All the
filesystem has to do is guarantee all extent manipulations are
correctly aligned and IOCB_ATOMIC can always be executed regardless
of whether it is hardware or software that does it.


> In addition, in this series setting FS_XFLAG_ATOMICWRITES means
> XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something
> similar for other OSes, and for those other OSes it may also mean some other
> special alignment feature enabled. We want a consistent user experience.

I don't care about other OSes - they don't implement the
FS_IOC_FSSETXATTR interfaces, so we just don't care about cross-OS
compatibility for the user API.

Fundamentally, atomic writes are *not a property of the filesystem
on-disk format*. They require extent tracking constraints (i.e.
alignment), and that's the property of the filesystem on-disk format
that we need to manipulate here.

Users are not going to care if the setup ioctl for atomic writes
is to set FS_XFLAG_ALIGN_EXTENTS vs FS_XFLAG_ATOMICWRITES. They
already know they have to align their IO properly for atomic writes,
so it's not like this is something they will be completely
unfamiliar with.

Indeed, FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/ fsx.fsx_extsize
= max_atomic_write_size as a user interface to set up the inode for 
atomic writes is pretty concise and easy to use. It also isn't
specific to atomic writes, and so this functionality can be used for
anything that needs aligned extent manipulation for performant
functionality.

> > to behave in a particular way - forced alignment - not that atomic
> > writes are being used in the filesystem....
> > 
> > At this point, the filesystem can do all the extent modification
> > alignment stuff that atomic writes require without caring if the
> > block device supports atomic writes or even if the
> > application is using atomic writes.
> > 
> > This means we can test the BMBT functionality in fstests without
> > requiring hardware (or emulation) that supports atomic writes - all
> > we need to do is set the forced align flag, an extent size hint and
> > go run fsx on it...
> > 
> 
> The current idea was that the forcealign feature would be required for the
> second phase for atomic write support - non-rtvol support. Darrick did send
> that series out separately over the New Year's break.

This is the wrong approach: this needs to be integrated into the
same patchset so we can review the changes for atomic writes as a
whole, not as two separate, independent on-disk format changes. The
on-disk format change that atomic writes need is aligned BMBT extent
manipulations, regardless of whether we are targeting the rtdev or
the datadev, and trying to separate them is leading you down
entirely the wrong path.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC 5/6] fs: xfs: iomap atomic write support
  2024-02-14 23:03                       ` Dave Chinner
@ 2024-02-15  9:53                         ` John Garry
  0 siblings, 0 replies; 68+ messages in thread
From: John Garry @ 2024-02-15  9:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, hch, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin


>>
>> Yes, I was just looking at adding a mkfs option to format for atomic writes,
>> which would check physical information about the volume and whether it suits
>> rtextsize and then subsequently also set XFS_SB_FEAT_RO_COMPAT_ATOMICWRITES.
> 
> FWIW, atomic writes need to be implemented in XFS in a way that
> isn't specific to the rtdev. There is no reason they cannot be
> applied to the data device (via superblock max atomic write size
> field or extent size hints and the align flag) so
> please don't get hung up on rtextsize as the only thing that atomic
> write alignment might apply to.

Sure

> 
>>> Yes, mkfs allows the user to override the hardware configs it
>>> probes, but it also warns when the override is doing something
>>> sub-optimal (like aligning all AG headers to the same disk in a
>>> stripe).
>>>
>>> IOWs, mkfs should be pulling this atomic write info from the
>>> hardware and configuring the filesystem around that information.
>>> That's the target we should be aiming the kernel implementation at
>>> and optimising for - a filesystem that is correctly configured
>>> according to published hardware capability.
>>
>> Right
>>
>> So, for example, if the atomic writes option is set and the rtextsize set by
>> the user is so much larger than what HW can support in terms of atomic
>> writes, then we should let the user know about this.
> 
> Well, this is part of the problem I mention above: you're focussing
> entirely on the rtdev setup and not the general "atomic writes
> require BMBT extent alignment constraints".

I'm really just saying what I was considering based on this series only.

> 
> So, maybe, yes, we might want to warn that the rtextsize is much
> bigger than the atomic write size of that device, but now there's
> something else we need to take into account: the rtdev could have a
> different atomic write size compared to the data device.
> 
> What now?
> 
> IOWs, focussing on the rtdev misses key considerations for making
> the functionality generic, and we most definitely don't want to have
> to rev the on disk format a second time to support atomic writes
> for the data device. Hence we likely need two variables for atomic
> write sizes in the superblock - one for the rtdev, and one for the
> data device. And this then feeds through to Darrick's multi-rtdev
> stuff - each rtdev will need to have an attribute that tracks this
> information.

ok


>>>
>>> What the patchset does is try to extend and infer things from
>>> existing allocation alignment constraints, but then not apply those
>>> constraints to critical extent state operations (pure BMBT
>>> modifications) that atomic writes also need constrained to work
>>> correctly and efficiently.
>>
>> Right. Previously I also did mention that we could explicitly request the
>> atomic write size per-inode, but a drawback is that this would require an
>> on-disk format change.
> 
> We're already having to change the on-disk format for this (inode
> flag, superblock feature bit), so we really should be trying to make
> this generic and properly featured so that it can be used for
> anything that requires physical alignment of file data extents, not
> just atomic writes...

ok

...

>> Another motivation for this flag is that we can explicitly enable some
>> software-based atomic write support for an inode when the backing device
>> does not have HW support.
> 
> That's orthogonal to the alignment issue. If the BMBT extents are
> always aligned in a way that is compatible with atomic writes, we
> don't need an atomic writes flag to tell the filesystem it should
> do an atomic write. 

Any instruction to do an atomic write should be encoded in the userspace 
write operation. Or maybe the file open operation in future - I still 
get questions about O_ATOMIC.

> That comes from userspace via the IOCB_ATOMIC
> flag.
> 
> It is the IOCB_ATOMIC that triggers the software fallback when the
> hardware doesn't support atomic writes, not an inode flag. 

To me, any such fallback seems like something which we should be 
explicitly enabling.

> All the
> filesystem has to do is guarantee all extent manipulations are
> correctly aligned and IOCB_ATOMIC can always be executed regardless
> of whether it is hardware or software that does it.
> 
> 
>> In addition, in this series setting FS_XFLAG_ATOMICWRITES means
>> XFS_DIFLAG2_ATOMICWRITES gets set, and I would expect it to do something
>> similar for other OSes, and for those other OSes it may also mean some other
>> special alignment feature enabled. We want a consistent user experience.
> 
> I don't care about other OSes - they don't implement the
> FS_IOC_FSSETXATTR interfaces, so we just don't care about cross-OS
> compatibility for the user API.

Other FSes need to support FS_IOC_FSSETXATTR for atomic writes. Or at 
least should support it....

> 
> Fundamentally, atomic writes are *not a property of the filesystem
> on-disk format*. They require extent tracking constraints (i.e.
> alignment), and that's the property of the filesystem on-disk format
> that we need to manipulate here.
> 
> Users are not going to care if the setup ioctl for atomic writes
> is to set FS_XFLAG_ALIGN_EXTENTS vs FS_XFLAG_ATOMICWRITES. They
> already know they have to align their IO properly for atomic writes,
> so it's not like this is something they will be completely
> unfamiliar with.
> 
> Indeed, FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/ fsx.fsx_extsize
> = max_atomic_write_size as a user interface to set up the inode for
> atomic writes is pretty concise and easy to use. It also isn't
> specific to atomic writes, and so this functionality can be used for
> anything that needs aligned extent manipulation for performant
> functionality.

This is where there seems to be a difference between how you would like 
atomic writes to be enabled and how Christoph would, judging by this:
https://lore.kernel.org/linux-fsdevel/20240110091929.GA31003@lst.de/

I think that if the FS and the userspace util can between them figure 
out what to do, then that is ok. This is something like what I proposed 
previously:

xfs_io -c "atomic-writes 64K" mnt/file

where the userspace util would use for the FS_IOC_FSSETXATTR ioctl:

FS_XFLAG_ATOMIC_WRITES | FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE w/ 
fsx.fsx_extsize

There is a question on the purpose of the FS_XFLAG_ATOMIC_WRITES flag, 
but, to me, it does seem useful for future feature support.

We could just have FS_XFLAG_ATOMIC_WRITES | FS_XFLAG_EXTSIZE w/ 
fsx.fsx_extsize, and the kernel auto-enables FS_XFLAG_ALIGN_EXTENTS, but 
the other way seems better.
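As a rough illustration of what that xfs_io command would do under the 
hood - note that FS_XFLAG_ATOMIC_WRITES and FS_XFLAG_ALIGN_EXTENTS are 
still proposals, so their values below are placeholders, and the ioctl 
call itself is only sketched in a comment:

```c
#include <assert.h>
#include <stdint.h>

/* FS_XFLAG_EXTSIZE is the existing value from linux/fs.h; the other
 * two are placeholder values for the proposed flags, not real ABI. */
#define FS_XFLAG_EXTSIZE	0x00000800
#define FS_XFLAG_ATOMIC_WRITES	0x00010000	/* hypothetical */
#define FS_XFLAG_ALIGN_EXTENTS	0x00020000	/* hypothetical */

/* cut-down stand-in for struct fsxattr from linux/fs.h */
struct fsxattr_sketch {
	uint32_t fsx_xflags;
	uint32_t fsx_extsize;	/* bytes */
};

/* build the attrs that "xfs_io -c 'atomic-writes 64K' file" might set */
static struct fsxattr_sketch atomic_writes_attrs(uint32_t extsize)
{
	struct fsxattr_sketch fsx = {
		.fsx_xflags = FS_XFLAG_ATOMIC_WRITES |
			      FS_XFLAG_ALIGN_EXTENTS | FS_XFLAG_EXTSIZE,
		.fsx_extsize = extsize,
	};
	/* a real tool would now do: ioctl(fd, FS_IOC_FSSETXATTR, &fsx) */
	return fsx;
}
```

i.e. the util composes the three flags plus the extent size hint in one 
FS_IOC_FSSETXATTR call, and the kernel validates the combination.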


> 
>>> to behave in a particular way - forced alignment - not that atomic
>>> writes are being used in the filesystem....
>>>
>>> At this point, the filesystem can do all the extent modification
>>> alignment stuff that atomic writes require without caring if the
>>> block device supports atomic writes or even if the
>>> application is using atomic writes.
>>>
>>> This means we can test the BMBT functionality in fstests without
>>> requiring hardware (or emulation) that supports atomic writes - all
>>> we need to do is set the forced align flag, an extent size hint and
>>> go run fsx on it...
>>>
>>
>> The current idea was that the forcealign feature would be required for the
>> second phase for atomic write support - non-rtvol support. Darrick did send
>> that series out separately over the New Year's break.
> 
> This is the wrong approach: this needs to be integrated into the
> same patchset so we can review the changes for atomic writes as a
> whole, not as two separate, independent on-disk format changes. The
> on-disk format change that atomic writes need is aligned BMBT extent
> manipulations, regardless of whether we are targeting the rtdev or
> the datadev, and trying to separate them is leading you down
> entirely the wrong path.
> 

ok, fine.

BTW, going back to the original discussion on the extent zeroing and 
your idea to do this in the iomap dio path, my impression is that we 
require changes like this:

- in iomap_dio_bio_iter(), we need to zero out the extent according to 
extsize (and not just FS block size)
- xfs_dio_write_end_io() -> xfs_iomap_write_unwritten() also needs to 
consider this extent being written, and not assume a FS block
- per-inode locking similar to what is done in 
xfs_file_dio_write_unaligned() - I need to check that further, though, 
as I am not yet sure on how we decide to use this exclusive lock or not.
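The first of those changes could be sketched roughly as follows - pure 
illustration, since the real code would operate on iomap/xfs structures, 
and I'm assuming a power-of-2 extsize:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round a write at [pos, pos + len) out to extsize-aligned boundaries,
 * assuming extsize is a power of 2.  The sub-ranges before and after
 * the write are what would need zeroing (or unwritten-extent handling)
 * so that extent state never changes at sub-extsize granularity.
 */
static void extsize_align_range(uint64_t pos, uint64_t len, uint64_t extsize,
				uint64_t *start, uint64_t *end)
{
	*start = pos & ~(extsize - 1);
	*end = (pos + len + extsize - 1) & ~(extsize - 1);
}
```

So e.g. an 8K write at offset 20K on a 16K-extsize inode would have its 
zeroing/unwritten-conversion range rounded out to [16K, 32K).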

Does this sound sane?

Thanks,
John

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 1/6] fs: iomap: Atomic write support
  2024-02-13  8:20         ` John Garry
@ 2024-02-15 11:08           ` John Garry
  0 siblings, 0 replies; 68+ messages in thread
From: John Garry @ 2024-02-15 11:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 13/02/2024 08:20, John Garry wrote:
> On 13/02/2024 06:55, Christoph Hellwig wrote:
>> On Mon, Feb 05, 2024 at 11:29:57AM +0000, John Garry wrote:
>>>> Also, what's the meaning of REQ_OP_READ | REQ_ATOMIC?
>>> REQ_ATOMIC will be ignored for REQ_OP_READ. I'm following the same 
>>> policy
>>> as something like RWF_SYNC for a read.
>> We've been rather sloppy with these flags in the past, which isn't
>> a good thing.  Let's add proper checking for new interfaces.

How about something like this:

----8<----

-static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
+static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags, int type)
 {
 	int kiocb_flags = 0;

...

+	if (flags & RWF_ATOMIC) {
+		if (type == READ)
+			return -EOPNOTSUPP;
+		if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
+			return -EOPNOTSUPP;
+	}
 	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
 	if (flags & RWF_SYNC)
 		kiocb_flags |= IOCB_DSYNC;

---->8----

I don't see a better place to add this check.

John

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/6] block atomic writes for XFS
  2024-02-14  7:45     ` Christoph Hellwig
@ 2024-02-21 16:56       ` Darrick J. Wong
  2024-02-23  6:57         ` Christoph Hellwig
  0 siblings, 1 reply; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-21 16:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: John Garry, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Wed, Feb 14, 2024 at 08:45:59AM +0100, Christoph Hellwig wrote:
> On Tue, Feb 13, 2024 at 09:55:49AM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 13, 2024 at 08:22:37AM +0100, Christoph Hellwig wrote:
> > > From reading the series and the discussions with Darrick and Dave
> > > I'm coming more and more back to my initial position that tying this
> > > user visible feature to hardware limits is wrong and will just keep
> > > on creating ever more painpoints in the future.
> > > 
> > > Based on that I suspect that doing proper software only atomic writes
> > > using the swapext log item and selective always COW mode
> > 
> > Er, what are you thinking w.r.t. swapext and sometimescow?
> 
> What do you mean with sometimescow?  Just normal reflinked inodes?
> 
> > swapext
> > doesn't currently handle COW forks at all, and it can only exchange
> > between two of the same type of fork (e.g. both data forks or both attr
> > forks, no mixing).
> > 
> > Or will that be your next suggestion whenever I get back to fiddling
> > with the online fsck patches? ;)
> 
> Let's take a step back.  If we want atomic write semantics without
> hardware offload, what we need is to allocate new blocks and atomically
> swap them into the data fork.  Basicall an atomic version of
> xfs_reflink_end_cow.  But yes, the details of the current swapext
> item might not be an exact fit, maybe it's just shared infrastructure
> and concepts.

Hmm.  For rt reflink (whenever I get back to that, ha) I've been
starting to think that yes, we actually /do/ want to have a log item
that tracks the progress of remap and cow operations.  That would solve
the problem of someone wanting to reflink a semi-written rtx.

That said, it might complicate the reflink code quite a bit since right
now it writes zeroes to the unwritten parts of an rt file's rtx so that
there's only one mapping record for the whole rtx, and then it remaps
them.  That's most of why I haven't bothered to implement that solution.

> I'm not planning to make you do it, because such a log item would
> generally be pretty useful for always COW mode.

One other thing -- while I was refactoring the swapext code into
exch{range,maps}, it occurred to me that doing an exchange between the
cow and data forks isn't possible because log recovery won't be able to
do anything.  There's no ondisk metadata to map a cow staging extent
back to the file it came from, which means we can't generally resume an
exchange operation.

However for a small write I guess you could simply queue all the log
intent items for all the changes needed and commit that.

> > > and making that
> > > work should be the first step.  We can then avoid that overhead for
> > > properly aligned writs if the hardware supports it.  For your Oracle
> > > DB loads you'll set the alignment hints and maybe even check with
> > > fiemap that everything is fine and will get the offload, but we also
> > > provide a nice and useful API for less performance critical applications
> > > that don't have to care about all these details.
> > 
> > I suspect they might want to fail-fast (back to standard WAL mode or
> > whatever) if the hardware support isn't available.
> 
> Maybe for your particular DB use case.  But there's plenty of
> applications that just want atomic writes without building their
> own infrastructure, including some that want pretty large chunks.
> 
> Also if a file system supports logging data (which I have an
> XFS early prototype for that I plan to finish), we can even do
> the small double writes more efficiently than the application,
> all through the same interface.

Heh.  Ted's been trying to kill data=journal.  Now we've found a use for
it after all. :)

--D

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set
  2024-02-14 12:36         ` John Garry
@ 2024-02-21 17:00           ` Darrick J. Wong
  2024-02-21 17:38             ` John Garry
  0 siblings, 1 reply; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-21 17:00 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Wed, Feb 14, 2024 at 12:36:40PM +0000, John Garry wrote:
> On 13/02/2024 17:59, Darrick J. Wong wrote:
> > > > Shouldn't we check that the device supports AWU at all before turning on
> > > > the FMODE flag?
> > > Can we easily get this sort of bdev info here?
> > > 
> > > Currently if we do try to issue an atomic write and AWU for the bdev is
> > > zero, then XFS iomap code will reject it.
> > Hmm.  Well, if we move towards pushing all the hardware checks out of
> > xfs/iomap and into whatever goes on underneath submit_bio then I guess
> > we don't need to check device support here at all.
> 
> Yeah, I have been thinking about this. But I was still planning on putting a
> "bdev on atomic write" check here, as you mentioned.
> 
> But is this a proper method to access the bdev for an xfs inode:
> 
> STATIC bool
> xfs_file_can_atomic_write(
> 	struct xfs_inode	*inode)
> {
> 	struct xfs_buftarg *target = xfs_inode_buftarg(inode);
> 	struct block_device *bdev = target->bt_bdev;
> 
> 	if (!xfs_inode_atomicwrites(inode))
> 		return false;
> 
> 	return bdev_can_atomic_write(bdev);
> }

There's still a TOCTOU race problem if the bdev gets reconfigured
between xfs_file_can_atomic_write and submit_bio.

However, if you're only using this to advertise the capability via statx
then I suppose that's fine -- userspace has to have some means of
discovering the ability at all.  Userspace is also inherently racy.

> I do notice the dax check in xfs_bmbt_to_iomap() when assigning iomap->bdev,
> which creates some doubt for me.

Do you mean this?

	if (mapping_flags & IOMAP_DAX)
		iomap->dax_dev = target->bt_daxdev;
	else
		iomap->bdev = target->bt_bdev;

The dax path wants dax_dev set so that it can do the glorified memcpy
operation, and it doesn't need (or want) a block device.

--D

> Thanks,
> John
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set
  2024-02-21 17:00           ` Darrick J. Wong
@ 2024-02-21 17:38             ` John Garry
  2024-02-24  4:18               ` Darrick J. Wong
  0 siblings, 1 reply; 68+ messages in thread
From: John Garry @ 2024-02-21 17:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On 21/02/2024 17:00, Darrick J. Wong wrote:
>>> Hmm.  Well, if we move towards pushing all the hardware checks out of
>>> xfs/iomap and into whatever goes on underneath submit_bio then I guess
>>> we don't need to check device support here at all.
>> Yeah, I have been thinking about this. But I was still planning on putting a
>> "bdev on atomic write" check here, as you mentioned.
>>
>> But is this a proper method to access the bdev for an xfs inode:
>>
>> STATIC bool
>> xfs_file_can_atomic_write(
>> 	struct xfs_inode	*inode)
>> {
>> 	struct xfs_buftarg *target = xfs_inode_buftarg(inode);
>> 	struct block_device *bdev = target->bt_bdev;
>>
>> 	if (!xfs_inode_atomicwrites(inode))
>> 		return false;
>>
>> 	return bdev_can_atomic_write(bdev);
>> }
> There's still a TOCTOU race problem if the bdev gets reconfigured
> between xfs_file_can_atomic_write and submit_bio.

If that is the case then a check in the bio submit path is required to 
catch any such reconfigure problems - and we effectively have that in 
this series.

I am looking at change some of these XFS bdev_can_atomic_write() checks, 
but would still have a check in the bio submit path.

> 
> However, if you're only using this to advertise the capability via statx
> then I suppose that's fine -- userspace has to have some means of
> discovering the ability at all.  Userspace is also inherently racy.
> 
>> I do notice the dax check in xfs_bmbt_to_iomap() when assigning iomap->bdev,
>> which creates some doubt for me.
> Do you mean this?
> 
> 	if (mapping_flags & IOMAP_DAX)
> 		iomap->dax_dev = target->bt_daxdev;
> 	else
> 		iomap->bdev = target->bt_bdev;
> 
> The dax path wants dax_dev set so that it can do the glorified memcpy
> operation, and it doesn't need (or want) a block device.

Yes, so it is proper to use target->bt_bdev when checking bdev atomic 
write capability, right?

Thanks,
John


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH 0/6] block atomic writes for XFS
  2024-02-21 16:56       ` Darrick J. Wong
@ 2024-02-23  6:57         ` Christoph Hellwig
  0 siblings, 0 replies; 68+ messages in thread
From: Christoph Hellwig @ 2024-02-23  6:57 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, John Garry, viro, brauner, dchinner, jack,
	chandan.babu, martin.petersen, linux-kernel, linux-xfs,
	linux-fsdevel, tytso, jbongio, ojaswin

On Wed, Feb 21, 2024 at 08:56:15AM -0800, Darrick J. Wong wrote:
> Hmm.  For rt reflink (whenever I get back to that, ha) I've been
> starting to think that yes, we actually /do/ want to have a log item
> that tracks the progress of remap and cow operations.  That would solve
> the problem of someone wanting to reflink a semi-written rtx.
> 
> That said, it might complicate the reflink code quite a bit since right
> now it writes zeroes to the unwritten parts of an rt file's rtx so that
> there's only one mapping record for the whole rtx, and then it remaps
> them.  That's most of why I haven't bothered to implement that solution.

I'm still not sure that supporting reflinks for rtextsize > 1 is a good
idea..

> > I'm not planning to make you do it, because such a log item would
> > generally be pretty useful for always COW mode.
> 
> One other thing -- while I was refactoring the swapext code into
> exch{range,maps}, it occurred to me that doing an exchange between the
> cow and data forks isn't possible because log recovery won't be able to
> do anything.  There's no ondisk metadata to map a cow staging extent
> back to the file it came from, which means we can't generally resume an
> exchange operation.

Yeah.

> > Also if a file system supports logging data (which I have an
> > XFS early prototype for that I plan to finish), we can even do
> > the small double writes more efficiently than the application,
> > all through the same interface.
> 
> Heh.  Ted's been trying to kill data=journal.  Now we've found a use for
> it after all. :)

Well... unconditional logging of data just seems like a really bad idea.
Using it as an optimization for very small and/or synchronous writes
is a pretty common technique.



* Re: [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set
  2024-02-21 17:38             ` John Garry
@ 2024-02-24  4:18               ` Darrick J. Wong
  0 siblings, 0 replies; 68+ messages in thread
From: Darrick J. Wong @ 2024-02-24  4:18 UTC (permalink / raw)
  To: John Garry
  Cc: hch, viro, brauner, dchinner, jack, chandan.babu,
	martin.petersen, linux-kernel, linux-xfs, linux-fsdevel, tytso,
	jbongio, ojaswin

On Wed, Feb 21, 2024 at 05:38:39PM +0000, John Garry wrote:
> On 21/02/2024 17:00, Darrick J. Wong wrote:
> > > > Hmm.  Well, if we move towards pushing all the hardware checks out of
> > > > xfs/iomap and into whatever goes on underneath submit_bio then I guess
> > > > we don't need to check device support here at all.
> > > Yeah, I have been thinking about this. But I was still planning on putting a
> > > "bdev on atomic write" check here, as you mentioned.
> > > 
> > > But is this a proper method to access the bdev for an xfs inode:
> > > 
> > > STATIC bool
> > > xfs_file_can_atomic_write(
> > > 	struct xfs_inode	*inode)
> > > {
> > > 	struct xfs_buftarg *target = xfs_inode_buftarg(inode);
> > > 	struct block_device *bdev = target->bt_bdev;
> > > 
> > > 	if (!xfs_inode_atomicwrites(inode))
> > > 		return false;
> > > 
> > > 	return bdev_can_atomic_write(bdev);
> > > }
> > There's still a TOCTOU race problem if the bdev gets reconfigured
> > between xfs_file_can_atomic_write and submit_bio.
> 
> If that is the case then a check in the bio submit path is required to catch
> any such reconfigure problems - and we effectively have that in this series.
> 
> I am looking at changing some of these XFS bdev_can_atomic_write() checks,
> but would still have a check in the bio submit path.

<nod> "check in the bio submit path" sounds good to me.  Adding in
redundant checks which are eventually gated on whatever submit_bio does
sounds like excessive overhead and layering violations.

> > 
> > However, if you're only using this to advertise the capability via statx
> > then I suppose that's fine -- userspace has to have some means of
> > discovering the ability at all.  Userspace is also inherently racy.
> > 
> > > I do notice the dax check in xfs_bmbt_to_iomap() when assigning iomap->bdev,
> > > which is creating some doubt?
> > Do you mean this?
> > 
> > 	if (mapping_flags & IOMAP_DAX)
> > 		iomap->dax_dev = target->bt_daxdev;
> > 	else
> > 		iomap->bdev = target->bt_bdev;
> > 
> > The dax path wants dax_dev set so that it can do the glorified memcpy
> > operation, and it doesn't need (or want) a block device.
> 
> Yes, so it is proper to use target->bt_bdev when checking the bdev atomic
> write capability, right?

Right.  fsdax doesn't support atomic memcpy to pmem.

--D

> 
> Thanks,
> John
> 
> 

end of thread, other threads:[~2024-02-24  4:18 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-24 14:26 [PATCH 0/6] block atomic writes for XFS John Garry
2024-01-24 14:26 ` [PATCH 1/6] fs: iomap: Atomic write support John Garry
2024-02-02 17:25   ` Darrick J. Wong
2024-02-05 11:29     ` John Garry
2024-02-13  6:55       ` Christoph Hellwig
2024-02-13  8:20         ` John Garry
2024-02-15 11:08           ` John Garry
2024-02-13 18:08       ` Darrick J. Wong
2024-02-05 15:20   ` Pankaj Raghav (Samsung)
2024-02-05 15:41     ` John Garry
2024-01-24 14:26 ` [PATCH 2/6] fs: Add FS_XFLAG_ATOMICWRITES flag John Garry
2024-02-02 17:57   ` Darrick J. Wong
2024-02-05 12:58     ` John Garry
2024-02-13  6:56       ` Christoph Hellwig
2024-02-13 17:08       ` Darrick J. Wong
2024-01-24 14:26 ` [PATCH 3/6] fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol John Garry
2024-02-02 17:52   ` Darrick J. Wong
2024-02-03  7:40     ` Ojaswin Mujoo
2024-02-05 12:51     ` John Garry
2024-02-13 17:22       ` Darrick J. Wong
2024-02-14 12:19         ` John Garry
2024-01-24 14:26 ` [PATCH 4/6] fs: xfs: Support atomic write for statx John Garry
2024-02-02 18:05   ` Darrick J. Wong
2024-02-05 13:10     ` John Garry
2024-02-13 17:37       ` Darrick J. Wong
2024-02-14 12:26         ` John Garry
2024-02-09  7:00   ` Ojaswin Mujoo
2024-02-09 17:30     ` John Garry
2024-02-12 11:48       ` Ojaswin Mujoo
2024-02-12 12:05       ` Ojaswin Mujoo
2024-01-24 14:26 ` [PATCH RFC 5/6] fs: xfs: iomap atomic write support John Garry
2024-02-02 18:47   ` Darrick J. Wong
2024-02-05 13:36     ` John Garry
2024-02-06  1:15       ` Dave Chinner
2024-02-06  9:53         ` John Garry
2024-02-07  0:06           ` Dave Chinner
2024-02-07 14:13             ` John Garry
2024-02-09  1:40               ` Dave Chinner
2024-02-09 12:47                 ` John Garry
2024-02-13 23:41                   ` Dave Chinner
2024-02-14 11:06                     ` John Garry
2024-02-14 23:03                       ` Dave Chinner
2024-02-15  9:53                         ` John Garry
2024-02-13 17:50       ` Darrick J. Wong
2024-02-14 12:13         ` John Garry
2024-01-24 14:26 ` [PATCH 6/6] fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set John Garry
2024-02-02 18:06   ` Darrick J. Wong
2024-02-05 10:26     ` John Garry
2024-02-13 17:59       ` Darrick J. Wong
2024-02-14 12:36         ` John Garry
2024-02-21 17:00           ` Darrick J. Wong
2024-02-21 17:38             ` John Garry
2024-02-24  4:18               ` Darrick J. Wong
2024-02-09  7:14 ` [PATCH 0/6] block atomic writes for XFS Ojaswin Mujoo
2024-02-09  9:22   ` John Garry
2024-02-12 12:06     ` Ojaswin Mujoo
2024-02-13  7:22 ` Christoph Hellwig
2024-02-13 17:55   ` Darrick J. Wong
2024-02-14  7:45     ` Christoph Hellwig
2024-02-21 16:56       ` Darrick J. Wong
2024-02-23  6:57         ` Christoph Hellwig
2024-02-13 23:50   ` Dave Chinner
2024-02-14  7:38     ` Christoph Hellwig
2024-02-13  7:45 ` Ritesh Harjani
2024-02-13  8:41   ` John Garry
2024-02-13  9:10     ` Ritesh Harjani
2024-02-13 22:49     ` Dave Chinner
2024-02-14 10:10       ` John Garry