linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] Enable per-file/directory DAX operations
@ 2019-10-20 15:59 ira.weiny
  2019-10-20 15:59 ` [PATCH 1/5] fs/stat: Define DAX statx attribute ira.weiny
                   ` (5 more replies)
  0 siblings, 6 replies; 33+ messages in thread
From: ira.weiny @ 2019-10-20 15:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption due to their inability to detect whether the kernel will
instantiate page cache for a file, and cases where a global dax enable via a
mount option is too coarse.

The following patch series enables selecting the use of DAX on individual files
and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
scheme the dax mount option can be omitted to allow the per-file property to
take effect.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
capability switch from an "effective" attribute for the file.

At LSF/MM we discussed the difficulties of switching the mode of a file with
active mappings / page cache. Rather than solve those races the decision was to
just limit mode flips to 0-length files.

Finally, the physical DAX flag inheritance is maintained from previous work on 
XFS but should be added for other file systems for consistence.


[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/

To: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org

Ira Weiny (5):
  fs/stat: Define DAX statx attribute
  fs/xfs: Isolate the physical DAX flag from effective
  fs/xfs: Separate functionality of xfs_inode_supports_dax()
  fs/xfs: Clean up DAX support check
  fs/xfs: Allow toggle of physical DAX flag

 fs/stat.c                 |  3 +++
 fs/xfs/xfs_ioctl.c        | 32 ++++++++++++++------------------
 fs/xfs/xfs_iops.c         | 36 ++++++++++++++++++++++++++++++------
 fs/xfs/xfs_iops.h         |  2 ++
 include/uapi/linux/stat.h |  1 +
 5 files changed, 50 insertions(+), 24 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/5] fs/stat: Define DAX statx attribute
  2019-10-20 15:59 [PATCH 0/5] Enable per-file/directory DAX operations ira.weiny
@ 2019-10-20 15:59 ` ira.weiny
  2019-10-22 11:32   ` Boaz Harrosh
  2019-10-20 15:59 ` [PATCH 2/5] fs/xfs: Isolate the physical DAX flag from effective ira.weiny
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 33+ messages in thread
From: ira.weiny @ 2019-10-20 15:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

In order for users to determine if a file is currently operating in DAX
mode (effective DAX).  Define a statx attribute value and set that
attribute if the effective DAX flag is set.

To go along with this we propose the following addition to the statx man
page:

STATX_ATTR_DAX

	DAX (cpu direct access) is a file mode that attempts to minimize
	software cache effects for both I/O and memory mappings of this
	file.  It requires a capable device, a compatible filesystem
	block size, and filesystem opt-in. It generally assumes all
	accesses are via cpu load / store instructions which can
	minimize overhead for small accesses, but adversely affect cpu
	utilization for large transfers. File I/O is done directly
	to/from user-space buffers. While the DAX property tends to
	result in data being transferred synchronously it does not give
	the guarantees of synchronous I/O that data and necessary
	metadata are transferred. Memory mapped I/O may be performed
	with direct mappings that bypass system memory buffering. Again
	while memory-mapped I/O tends to result in data being
	transferred synchronously it does not guarantee synchronous
	metadata updates. A dax file may optionally support being mapped
	with the MAP_SYNC flag which does allow cpu store operations to
	be considered synchronous modulo cpu cache effects.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/stat.c                 | 3 +++
 include/uapi/linux/stat.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/fs/stat.c b/fs/stat.c
index c38e4c2e1221..59ca360c1ffb 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -77,6 +77,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	if (inode->i_flags & S_DAX)
+		stat->attributes |= STATX_ATTR_DAX;
+
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
 					    query_flags);
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7b35e98d3c58..5b0962121ef7 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -169,6 +169,7 @@ struct statx {
 #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
 
 #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
+#define STATX_ATTR_DAX			0x00002000 /* [I] File is DAX */
 
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 2/5] fs/xfs: Isolate the physical DAX flag from effective
  2019-10-20 15:59 [PATCH 0/5] Enable per-file/directory DAX operations ira.weiny
  2019-10-20 15:59 ` [PATCH 1/5] fs/stat: Define DAX statx attribute ira.weiny
@ 2019-10-20 15:59 ` ira.weiny
  2019-10-21  0:26   ` Dave Chinner
  2019-10-20 15:59 ` [PATCH 3/5] fs/xfs: Separate functionality of xfs_inode_supports_dax() ira.weiny
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 33+ messages in thread
From: ira.weiny @ 2019-10-20 15:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_ioctl_setattr_dax_invalidate() currently checks if the DAX flag is
changing as a quick check.

But the implementation mixes the physical (XFS_DIFLAG2_DAX) and
effective (S_DAX) DAX flags.

Remove the use of the effective flag when determining if a change of the
physical flag is required.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d58f0d6a699e..0ea326290cca 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1319,9 +1319,11 @@ xfs_ioctl_setattr_dax_invalidate(
 	}
 
 	/* If the DAX state is not changing, we have nothing to do here. */
-	if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
+	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		return 0;
-	if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
+	if (!(fa->fsx_xflags & FS_XFLAG_DAX) &&
+	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		return 0;
 
 	if (S_ISDIR(inode->i_mode))
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 3/5] fs/xfs: Separate functionality of xfs_inode_supports_dax()
  2019-10-20 15:59 [PATCH 0/5] Enable per-file/directory DAX operations ira.weiny
  2019-10-20 15:59 ` [PATCH 1/5] fs/stat: Define DAX statx attribute ira.weiny
  2019-10-20 15:59 ` [PATCH 2/5] fs/xfs: Isolate the physical DAX flag from effective ira.weiny
@ 2019-10-20 15:59 ` ira.weiny
  2019-10-20 15:59 ` [PATCH 4/5] fs/xfs: Clean up DAX support check ira.weiny
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: ira.weiny @ 2019-10-20 15:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_inode_supports_dax() should reflect if the inode can support DAX not
that it is enabled for DAX.  Leave that to other helper functions.

Change the caller of xfs_inode_supports_dax() to call
xfs_inode_use_dax() which reflects new logic to override the effective
DAX flag with either the mount option or the physical DAX flag.

To make the logic clear create 2 helper functions for the mount and
physical flag.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_iops.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index fe285d123d69..9e5c79aa1b54 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1206,6 +1206,15 @@ static const struct inode_operations xfs_inline_symlink_inode_operations = {
 	.update_time		= xfs_vn_update_time,
 };
 
+static bool
+xfs_inode_mount_is_dax(
+	struct xfs_inode *ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	return (mp->m_flags & XFS_MOUNT_DAX) == XFS_MOUNT_DAX;
+}
+
 /* Figure out if this file actually supports DAX. */
 static bool
 xfs_inode_supports_dax(
@@ -1217,11 +1226,6 @@ xfs_inode_supports_dax(
 	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
 		return false;
 
-	/* DAX mount option or DAX iflag must be set. */
-	if (!(mp->m_flags & XFS_MOUNT_DAX) &&
-	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
-		return false;
-
 	/* Block size must match page size */
 	if (mp->m_sb.sb_blocksize != PAGE_SIZE)
 		return false;
@@ -1230,6 +1234,22 @@ xfs_inode_supports_dax(
 	return xfs_find_daxdev_for_inode(VFS_I(ip)) != NULL;
 }
 
+static bool
+xfs_inode_is_dax(
+	struct xfs_inode *ip)
+{
+	return (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) == XFS_DIFLAG2_DAX;
+}
+
+static bool
+xfs_inode_use_dax(
+	struct xfs_inode *ip)
+{
+	return xfs_inode_supports_dax(ip) &&
+		(xfs_inode_mount_is_dax(ip) ||
+		 xfs_inode_is_dax(ip));
+}
+
 STATIC void
 xfs_diflags_to_iflags(
 	struct inode		*inode,
@@ -1248,7 +1268,7 @@ xfs_diflags_to_iflags(
 		inode->i_flags |= S_SYNC;
 	if (flags & XFS_DIFLAG_NOATIME)
 		inode->i_flags |= S_NOATIME;
-	if (xfs_inode_supports_dax(ip))
+	if (xfs_inode_use_dax(ip))
 		inode->i_flags |= S_DAX;
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 4/5] fs/xfs: Clean up DAX support check
  2019-10-20 15:59 [PATCH 0/5] Enable per-file/directory DAX operations ira.weiny
                   ` (2 preceding siblings ...)
  2019-10-20 15:59 ` [PATCH 3/5] fs/xfs: Separate functionality of xfs_inode_supports_dax() ira.weiny
@ 2019-10-20 15:59 ` ira.weiny
  2019-10-20 15:59 ` [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag ira.weiny
  2019-10-22 11:21 ` [PATCH 0/5] Enable per-file/directory DAX operations Boaz Harrosh
  5 siblings, 0 replies; 33+ messages in thread
From: ira.weiny @ 2019-10-20 15:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Rather than open coding xfs_inode_supports_dax() in
xfs_ioctl_setattr_dax_invalidate() export xfs_inode_supports_dax() and
call it in preparation for swapping dax flags.

This also means updating xfs_inode_supports_dax() to return true for a
directory.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 17 +++--------------
 fs/xfs/xfs_iops.c  |  8 ++++++--
 fs/xfs/xfs_iops.h  |  2 ++
 3 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 0ea326290cca..d3a7340d3209 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1299,24 +1299,13 @@ xfs_ioctl_setattr_dax_invalidate(
 	int			*join_flags)
 {
 	struct inode		*inode = VFS_I(ip);
-	struct super_block	*sb = inode->i_sb;
 	int			error;
 
 	*join_flags = 0;
 
-	/*
-	 * It is only valid to set the DAX flag on regular files and
-	 * directories on filesystems where the block size is equal to the page
-	 * size. On directories it serves as an inherited hint so we don't
-	 * have to check the device for dax support or flush pagecache.
-	 */
-	if (fa->fsx_xflags & FS_XFLAG_DAX) {
-		if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
-			return -EINVAL;
-		if (!bdev_dax_supported(xfs_find_bdev_for_inode(VFS_I(ip)),
-				sb->s_blocksize))
-			return -EINVAL;
-	}
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
+	    !xfs_inode_supports_dax(ip))
+		return -EINVAL;
 
 	/* If the DAX state is not changing, we have nothing to do here. */
 	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 9e5c79aa1b54..535b87ffc135 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1216,14 +1216,18 @@ xfs_inode_mount_is_dax(
 }
 
 /* Figure out if this file actually supports DAX. */
-static bool
+bool
 xfs_inode_supports_dax(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 
 	/* Only supported on non-reflinked files. */
-	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
+	if (xfs_is_reflink_inode(ip))
+		return false;
+
+	/* Only supported on regular files and directories. */
+	if (!(S_ISREG(VFS_I(ip)->i_mode) || S_ISDIR(VFS_I(ip)->i_mode)))
 		return false;
 
 	/* Block size must match page size */
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 4d24ff309f59..f24fec8de1d6 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -24,4 +24,6 @@ extern int xfs_setattr_nonsize(struct xfs_inode *ip, struct iattr *vap,
 extern int xfs_vn_setattr_nonsize(struct dentry *dentry, struct iattr *vap);
 extern int xfs_vn_setattr_size(struct dentry *dentry, struct iattr *vap);
 
+extern bool xfs_inode_supports_dax(struct xfs_inode *ip);
+
 #endif /* __XFS_IOPS_H__ */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag
  2019-10-20 15:59 [PATCH 0/5] Enable per-file/directory DAX operations ira.weiny
                   ` (3 preceding siblings ...)
  2019-10-20 15:59 ` [PATCH 4/5] fs/xfs: Clean up DAX support check ira.weiny
@ 2019-10-20 15:59 ` ira.weiny
  2019-10-21  0:45   ` Dave Chinner
  2019-10-22 11:21 ` [PATCH 0/5] Enable per-file/directory DAX operations Boaz Harrosh
  5 siblings, 1 reply; 33+ messages in thread
From: ira.weiny @ 2019-10-20 15:59 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Switching between DAX and non-DAX on a file is racy with respect to data
operations.  However, if no data is involved the flag is safe to switch.

Allow toggling the physical flag if a file is empty.  The file length
check is not racy with respect to other operations as it is performed
under XFS_MMAPLOCK_EXCL and XFS_IOLOCK_EXCL locks.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d3a7340d3209..3839684c249b 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -33,6 +33,7 @@
 #include "xfs_sb.h"
 #include "xfs_ag.h"
 #include "xfs_health.h"
+#include "libxfs/xfs_dir2.h"
 
 #include <linux/mount.h>
 #include <linux/namei.h>
@@ -1232,12 +1233,10 @@ xfs_diflags_to_linux(
 		inode->i_flags |= S_NOATIME;
 	else
 		inode->i_flags &= ~S_NOATIME;
-#if 0	/* disabled until the flag switching races are sorted out */
 	if (xflags & FS_XFLAG_DAX)
 		inode->i_flags |= S_DAX;
 	else
 		inode->i_flags &= ~S_DAX;
-#endif
 }
 
 static int
@@ -1320,6 +1319,12 @@ xfs_ioctl_setattr_dax_invalidate(
 
 	/* lock, flush and invalidate mapping in preparation for flag change */
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+
+	if (i_size_read(inode) != 0) {
+		error = -EOPNOTSUPP;
+		goto out_unlock;
+	}
+
 	error = filemap_write_and_wait(inode->i_mapping);
 	if (error)
 		goto out_unlock;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/5] fs/xfs: Isolate the physical DAX flag from effective
  2019-10-20 15:59 ` [PATCH 2/5] fs/xfs: Isolate the physical DAX flag from effective ira.weiny
@ 2019-10-21  0:26   ` Dave Chinner
  2019-10-21 17:40     ` Ira Weiny
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2019-10-21  0:26 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Sun, Oct 20, 2019 at 08:59:32AM -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> xfs_ioctl_setattr_dax_invalidate() currently checks if the DAX flag is
> changing as a quick check.
> 
> But the implementation mixes the physical (XFS_DIFLAG2_DAX) and
> effective (S_DAX) DAX flags.

More nuanced than that.

The idea was that if the mount option was set, clearing the
per-inode flag would override the mount option. i.e. the mount
option sets the S_DAX flag at inode instantiation, so using
FSSETXATTR to ensure the FS_XFLAG_DAX is not set would override the
mount option setting, giving applications a way of guranteeing they
aren't using DAX to access the data.

So if the mount option is going to live on, I suspect that we want
to keep this code as it stands.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag
  2019-10-20 15:59 ` [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag ira.weiny
@ 2019-10-21  0:45   ` Dave Chinner
  2019-10-21 22:49     ` Ira Weiny
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2019-10-21  0:45 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Sun, Oct 20, 2019 at 08:59:35AM -0700, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Switching between DAX and non-DAX on a file is racy with respect to data
> operations.  However, if no data is involved the flag is safe to switch.
> 
> Allow toggling the physical flag if a file is empty.  The file length
> check is not racy with respect to other operations as it is performed
> under XFS_MMAPLOCK_EXCL and XFS_IOLOCK_EXCL locks.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  fs/xfs/xfs_ioctl.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index d3a7340d3209..3839684c249b 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -33,6 +33,7 @@
>  #include "xfs_sb.h"
>  #include "xfs_ag.h"
>  #include "xfs_health.h"
> +#include "libxfs/xfs_dir2.h"
>  
>  #include <linux/mount.h>
>  #include <linux/namei.h>
> @@ -1232,12 +1233,10 @@ xfs_diflags_to_linux(
>  		inode->i_flags |= S_NOATIME;
>  	else
>  		inode->i_flags &= ~S_NOATIME;
> -#if 0	/* disabled until the flag switching races are sorted out */
>  	if (xflags & FS_XFLAG_DAX)
>  		inode->i_flags |= S_DAX;
>  	else
>  		inode->i_flags &= ~S_DAX;
> -#endif

This code has bit-rotted. See xfs_setup_iops(), where we now have a
different inode->i_mapping->a_ops for DAX inodes.

That, fundamentally, is the issue here - it's not setting/clearing
the DAX flag that is the issue, it's doing a swap of the
mapping->a_ops while there may be other code using that ops
structure.

IOWs, if there is any code anywhere in the kernel that
calls an address space op without holding one of the three locks we
hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
of the address space operations.

By limiting the address space swap to file sizes of zero, we rule
out the page fault path (mmap of a zero length file segv's with an
access beyond EOF on the first read/write page fault, right?).
However, other aops callers that might run unlocked and do the wrong
thing if the aops pointer is swapped between check of the aop method
existing and actually calling it even if the file size is zero?

A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible
to such a race condition with the current definitions of the XFS DAX
aops. I'm guessing there will be others, but I haven't looked
further than this...

>  	/* lock, flush and invalidate mapping in preparation for flag change */
>  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
> +
> +	if (i_size_read(inode) != 0) {
> +		error = -EOPNOTSUPP;
> +		goto out_unlock;
> +	}

Wrong error. Should be the same as whatever is returned when we try
to change the extent size hint and can't because the file is
non-zero in length (-EINVAL, I think). Also needs a comment
explainging why this check exists, and probably better written as
i_size_read() > 0 ....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 2/5] fs/xfs: Isolate the physical DAX flag from effective
  2019-10-21  0:26   ` Dave Chinner
@ 2019-10-21 17:40     ` Ira Weiny
  0 siblings, 0 replies; 33+ messages in thread
From: Ira Weiny @ 2019-10-21 17:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Oct 21, 2019 at 11:26:21AM +1100, Dave Chinner wrote:
> On Sun, Oct 20, 2019 at 08:59:32AM -0700, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > xfs_ioctl_setattr_dax_invalidate() currently checks if the DAX flag is
> > changing as a quick check.
> > 
> > But the implementation mixes the physical (XFS_DIFLAG2_DAX) and
> > effective (S_DAX) DAX flags.
> 
> More nuanced than that.
> 
> The idea was that if the mount option was set, clearing the
> per-inode flag would override the mount option. i.e. the mount
> option sets the S_DAX flag at inode instantiation, so using
> FSSETXATTR to ensure the FS_XFLAG_DAX is not set would override the
> mount option setting, giving applications a way of guranteeing they
> aren't using DAX to access the data.

At LSF/MM we discussed keeping the mount option as a global "chicken bit" as
described by Matt Wilcox[1].  This preserves the existing behavior of turning
it on no matter what but offers an alternative with the per-file flag.

To do what you describe above, it was suggested, by Ted I believe, that an
admin can set DAX on the root directory which will enable DAX by default
through inheritance but allow users to turn it off if they desire.

I'm concerned that all users who currently use '-o dax' will expect their
current file systems to be using DAX when those mounts occur.  Their physical
inode flag is going to be 0 which, if we implement the 'turn off DAX' as you
describe will mean they will not get the behavior they expect when booting on a
new kernel.

> 
> So if the mount option is going to live on, I suspect that we want
> to keep this code as it stands.

I don't think we can get rid of it soon but I would be in favor of working
toward deprecating it.  Regardless I think this keeps the semantics simple WRT
the interaction of the mount and per-file flags.

Ira

[1] https://lwn.net/Articles/787973/

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag
  2019-10-21  0:45   ` Dave Chinner
@ 2019-10-21 22:49     ` Ira Weiny
  2019-10-21 23:46       ` Dave Chinner
  2019-11-08 13:12       ` Jan Kara
  0 siblings, 2 replies; 33+ messages in thread
From: Ira Weiny @ 2019-10-21 22:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Oct 21, 2019 at 11:45:36AM +1100, Dave Chinner wrote:
> On Sun, Oct 20, 2019 at 08:59:35AM -0700, ira.weiny@intel.com wrote:
> > @@ -1232,12 +1233,10 @@ xfs_diflags_to_linux(
> >  		inode->i_flags |= S_NOATIME;
> >  	else
> >  		inode->i_flags &= ~S_NOATIME;
> > -#if 0	/* disabled until the flag switching races are sorted out */
> >  	if (xflags & FS_XFLAG_DAX)
> >  		inode->i_flags |= S_DAX;
> >  	else
> >  		inode->i_flags &= ~S_DAX;
> > -#endif
> 
> This code has bit-rotted. See xfs_setup_iops(), where we now have a
> different inode->i_mapping->a_ops for DAX inodes.

:-(

> 
> That, fundamentally, is the issue here - it's not setting/clearing
> the DAX flag that is the issue, it's doing a swap of the
> mapping->a_ops while there may be other code using that ops
> structure.
> 
> IOWs, if there is any code anywhere in the kernel that
> calls an address space op without holding one of the three locks we
> hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
> of the address space operations.
> 
> By limiting the address space swap to file sizes of zero, we rule
> out the page fault path (mmap of a zero length file segv's with an
> access beyond EOF on the first read/write page fault, right?).

Yes I checked that and thought we were safe here...

> However, other aops callers that might run unlocked and do the wrong
> thing if the aops pointer is swapped between check of the aop method
> existing and actually calling it even if the file size is zero?
> 
> A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible
> to such a race condition with the current definitions of the XFS DAX
> aops. I'm guessing there will be others, but I haven't looked
> further than this...

I'll check for others and think on what to do about this.  ext4 will have the
same problem I think.  :-(

I don't suppose using a single a_ops for both DAX and non-DAX is palatable?

> 
> >  	/* lock, flush and invalidate mapping in preparation for flag change */
> >  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
> > +
> > +	if (i_size_read(inode) != 0) {
> > +		error = -EOPNOTSUPP;
> > +		goto out_unlock;
> > +	}
> 
> Wrong error. Should be the same as whatever is returned when we try
> to change the extent size hint and can't because the file is
> non-zero in length (-EINVAL, I think). Also needs a comment
> explainging why this check exists, and probably better written as
> i_size_read() > 0 ....

Done, done, and done.

Thank you,
Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag
  2019-10-21 22:49     ` Ira Weiny
@ 2019-10-21 23:46       ` Dave Chinner
  2019-11-08 13:12       ` Jan Kara
  1 sibling, 0 replies; 33+ messages in thread
From: Dave Chinner @ 2019-10-21 23:46 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Oct 21, 2019 at 03:49:31PM -0700, Ira Weiny wrote:
> On Mon, Oct 21, 2019 at 11:45:36AM +1100, Dave Chinner wrote:
> > On Sun, Oct 20, 2019 at 08:59:35AM -0700, ira.weiny@intel.com wrote:
> > > @@ -1232,12 +1233,10 @@ xfs_diflags_to_linux(
> > >  		inode->i_flags |= S_NOATIME;
> > >  	else
> > >  		inode->i_flags &= ~S_NOATIME;
> > > -#if 0	/* disabled until the flag switching races are sorted out */
> > >  	if (xflags & FS_XFLAG_DAX)
> > >  		inode->i_flags |= S_DAX;
> > >  	else
> > >  		inode->i_flags &= ~S_DAX;
> > > -#endif
> > 
> > This code has bit-rotted. See xfs_setup_iops(), where we now have a
> > different inode->i_mapping->a_ops for DAX inodes.
> 
> :-(
> 
> > 
> > That, fundamentally, is the issue here - it's not setting/clearing
> > the DAX flag that is the issue, it's doing a swap of the
> > mapping->a_ops while there may be other code using that ops
> > structure.
> > 
> > IOWs, if there is any code anywhere in the kernel that
> > calls an address space op without holding one of the three locks we
> > hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
> > of the address space operations.
> > 
> > By limiting the address space swap to file sizes of zero, we rule
> > out the page fault path (mmap of a zero length file segv's with an
> > access beyond EOF on the first read/write page fault, right?).
> 
> Yes I checked that and thought we were safe here...
> 
> > However, other aops callers that might run unlocked and do the wrong
> > thing if the aops pointer is swapped between check of the aop method
> > existing and actually calling it even if the file size is zero?
> > 
> > A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible
> > to such a race condition with the current definitions of the XFS DAX
> > aops. I'm guessing there will be others, but I haven't looked
> > further than this...
> 
> I'll check for others and think on what to do about this.  ext4 will have the
> same problem I think.  :-(
> 
> I don't suppose using a single a_ops for both DAX and non-DAX is palatable?

IMO, no. It means we have to check IS_DAX() in every aops,
and replicate a bunch of callouts to generic code. i.e this sort of
thing:

	if (aops->method)
		return aops->method(...)

	/* do something else */

results in us having to replicate that logic as something like:

	if (!IS_DAX)
		return filesystem_aops_method()

	/* do something else */

Indeed, the calling code may well do the wrong thing if we have
methods defined just to add IS_DAX() checks to avoid using that
functionality because the caller now thinks that functionality is
supported when in fact it isn't.

So it seems to me like an even bigger can of worms to try to use a
single aops structure for vastly different functionality....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-20 15:59 [PATCH 0/5] Enable per-file/directory DAX operations ira.weiny
                   ` (4 preceding siblings ...)
  2019-10-20 15:59 ` [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag ira.weiny
@ 2019-10-22 11:21 ` Boaz Harrosh
  2019-10-23 13:09   ` Boaz Harrosh
  5 siblings, 1 reply; 33+ messages in thread
From: Boaz Harrosh @ 2019-10-22 11:21 UTC (permalink / raw)
  To: ira.weiny, linux-kernel
  Cc: Alexander Viro, Darrick J. Wong, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On 20/10/2019 18:59, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> consumption due to their inability to detect whether the kernel will
> instantiate page cache for a file, and cases where a global dax enable via a
> mount option is too coarse.
> 
> The following patch series enables selecting the use of DAX on individual files
> and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
> scheme the dax mount option can be omitted to allow the per-file property to
> take effect.
> 
> The insight at LSF/MM was to separate the per-mount or per-file "physical"
> capability switch from an "effective" attribute for the file.
> 
> At LSF/MM we discussed the difficulties of switching the mode of a file with
> active mappings / page cache. Rather than solve those races the decision was to
> just limit mode flips to 0-length files.
> 

What I understand above is that only "writers" before writing any bytes may
turn the flag on, which then persists. But as a very long time user of DAX, usually
it is the writers that are least interesting. With lots of DAX technologies and
emulations the write is slower and needs slow "flushing".

The more interesting and performance gains comes from DAX READs actually.
specially cross the VM guest. (IE. All VMs share host memory or pmem)

This fixture as I understand it, that I need to know before I write if I will
want DAX or not and then the write is DAX as well as reads after that, looks
not very interesting for me as a user.

Just my $0.17
Boaz

> Finally, the physical DAX flag inheritance is maintained from previous work on 
> XFS but should be added for other file systems for consistence.
> 
> 
> [1] https://lwn.net/Articles/787973/
> [2] https://lwn.net/Articles/787233/
> 
> To: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
> Cc: Jan Kara <jack@suse.cz>
> Cc: linux-ext4@vger.kernel.org
> Cc: linux-xfs@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> 
> Ira Weiny (5):
>   fs/stat: Define DAX statx attribute
>   fs/xfs: Isolate the physical DAX flag from effective
>   fs/xfs: Separate functionality of xfs_inode_supports_dax()
>   fs/xfs: Clean up DAX support check
>   fs/xfs: Allow toggle of physical DAX flag
> 
>  fs/stat.c                 |  3 +++
>  fs/xfs/xfs_ioctl.c        | 32 ++++++++++++++------------------
>  fs/xfs/xfs_iops.c         | 36 ++++++++++++++++++++++++++++++------
>  fs/xfs/xfs_iops.h         |  2 ++
>  include/uapi/linux/stat.h |  1 +
>  5 files changed, 50 insertions(+), 24 deletions(-)
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/5] fs/stat: Define DAX statx attribute
  2019-10-20 15:59 ` [PATCH 1/5] fs/stat: Define DAX statx attribute ira.weiny
@ 2019-10-22 11:32   ` Boaz Harrosh
  2019-10-22 16:51     ` Ira Weiny
  0 siblings, 1 reply; 33+ messages in thread
From: Boaz Harrosh @ 2019-10-22 11:32 UTC (permalink / raw)
  To: ira.weiny, linux-kernel
  Cc: Alexander Viro, Darrick J. Wong, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On 20/10/2019 18:59, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> In order for users to determine if a file is currently operating in DAX
> mode (effective DAX).  Define a statx attribute value and set that
> attribute if the effective DAX flag is set.
> 
> To go along with this we propose the following addition to the statx man
> page:
> 
> STATX_ATTR_DAX
> 
> 	DAX (cpu direct access) is a file mode that attempts to minimize
> 	software cache effects for both I/O and memory mappings of this
> 	file.  It requires a capable device, a compatible filesystem
> 	block size, and filesystem opt-in. It generally assumes all
> 	accesses are via cpu load / store instructions which can
> 	minimize overhead for small accesses, but adversely affect cpu
> 	utilization for large transfers. File I/O is done directly
> 	to/from user-space buffers. While the DAX property tends to
> 	result in data being transferred synchronously it does not give
> 	the guarantees of synchronous I/O that data and necessary
> 	metadata are transferred. Memory mapped I/O may be performed
> 	with direct mappings that bypass system memory buffering. Again
> 	while memory-mapped I/O tends to result in data being
> 	transferred synchronously it does not guarantee synchronous
> 	metadata updates. A dax file may optionally support being mapped
> 	with the MAP_SYNC flag which does allow cpu store operations to
> 	be considered synchronous modulo cpu cache effects.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  fs/stat.c                 | 3 +++
>  include/uapi/linux/stat.h | 1 +
>  2 files changed, 4 insertions(+)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index c38e4c2e1221..59ca360c1ffb 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -77,6 +77,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>  	if (IS_AUTOMOUNT(inode))
>  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
>  
> +	if (inode->i_flags & S_DAX)

Is there a reason not to use IS_DAX(inode) ?

> +		stat->attributes |= STATX_ATTR_DAX;
> +
>  	if (inode->i_op->getattr)
>  		return inode->i_op->getattr(path, stat, request_mask,
>  					    query_flags);
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index 7b35e98d3c58..5b0962121ef7 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -169,6 +169,7 @@ struct statx {
>  #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
>  
>  #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
> +#define STATX_ATTR_DAX			0x00002000 /* [I] File is DAX */
>  
>  
>  #endif /* _UAPI_LINUX_STAT_H */
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 1/5] fs/stat: Define DAX statx attribute
  2019-10-22 11:32   ` Boaz Harrosh
@ 2019-10-22 16:51     ` Ira Weiny
  0 siblings, 0 replies; 33+ messages in thread
From: Ira Weiny @ 2019-10-22 16:51 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Tue, Oct 22, 2019 at 02:32:04PM +0300, Boaz Harrosh wrote:
> On 20/10/2019 18:59, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > In order for users to determine if a file is currently operating in DAX
> > mode (effective DAX).  Define a statx attribute value and set that
> > attribute if the effective DAX flag is set.
> > 
> > To go along with this we propose the following addition to the statx man
> > page:
> > 
> > STATX_ATTR_DAX
> > 
> > 	DAX (cpu direct access) is a file mode that attempts to minimize
> > 	software cache effects for both I/O and memory mappings of this
> > 	file.  It requires a capable device, a compatible filesystem
> > 	block size, and filesystem opt-in. It generally assumes all
> > 	accesses are via cpu load / store instructions which can
> > 	minimize overhead for small accesses, but adversely affect cpu
> > 	utilization for large transfers. File I/O is done directly
> > 	to/from user-space buffers. While the DAX property tends to
> > 	result in data being transferred synchronously it does not give
> > 	the guarantees of synchronous I/O that data and necessary
> > 	metadata are transferred. Memory mapped I/O may be performed
> > 	with direct mappings that bypass system memory buffering. Again
> > 	while memory-mapped I/O tends to result in data being
> > 	transferred synchronously it does not guarantee synchronous
> > 	metadata updates. A dax file may optionally support being mapped
> > 	with the MAP_SYNC flag which does allow cpu store operations to
> > 	be considered synchronous modulo cpu cache effects.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > ---
> >  fs/stat.c                 | 3 +++
> >  include/uapi/linux/stat.h | 1 +
> >  2 files changed, 4 insertions(+)
> > 
> > diff --git a/fs/stat.c b/fs/stat.c
> > index c38e4c2e1221..59ca360c1ffb 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -77,6 +77,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
> >  	if (IS_AUTOMOUNT(inode))
> >  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
> >  
> > +	if (inode->i_flags & S_DAX)
> 
> Is there a reason not to use IS_DAX(inode) ?

No, just forgot there was a macro when this was written.  Changed.

Thanks,
Ira

> 
> > +		stat->attributes |= STATX_ATTR_DAX;
> > +
> >  	if (inode->i_op->getattr)
> >  		return inode->i_op->getattr(path, stat, request_mask,
> >  					    query_flags);
> > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > index 7b35e98d3c58..5b0962121ef7 100644
> > --- a/include/uapi/linux/stat.h
> > +++ b/include/uapi/linux/stat.h
> > @@ -169,6 +169,7 @@ struct statx {
> >  #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
> >  
> >  #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
> > +#define STATX_ATTR_DAX			0x00002000 /* [I] File is DAX */
> >  
> >  
> >  #endif /* _UAPI_LINUX_STAT_H */
> > 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-22 11:21 ` [PATCH 0/5] Enable per-file/directory DAX operations Boaz Harrosh
@ 2019-10-23 13:09   ` Boaz Harrosh
  2019-10-23 22:13     ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Boaz Harrosh @ 2019-10-23 13:09 UTC (permalink / raw)
  To: ira.weiny, linux-kernel
  Cc: Alexander Viro, Darrick J. Wong, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On 22/10/2019 14:21, Boaz Harrosh wrote:
> On 20/10/2019 18:59, ira.weiny@intel.com wrote:
<>
>> At LSF/MM we discussed the difficulties of switching the mode of a file with
>> active mappings / page cache. Rather than solve those races the decision was to
>> just limit mode flips to 0-length files.
>>
> 
> What I understand above is that only "writers" before writing any bytes may
> turn the flag on, which then persists. But as a very long time user of DAX, usually
> it is the writers that are least interesting. With lots of DAX technologies and
> emulations the write is slower and needs slow "flushing".
> 
> The more interesting and performance gains comes from DAX READs actually.
> specially cross the VM guest. (IE. All VMs share host memory or pmem)
> 
> This fixture as I understand it, that I need to know before I write if I will
> want DAX or not and then the write is DAX as well as reads after that, looks
> not very interesting for me as a user.
> 
> Just my $0.17
> Boaz
> 

I want to say one more thing about this patchset please. I was sleeping on it
and I think it is completely wrong approach with a completely wrong API.

The DAX or not DAX is a matter of transport. and is nothing to do with data
content and persistency. It's like connecting a disk behind, say, a scsi bus and then
take the same DB and putting it behind a multy-path or an NFS server. It is always
the same data.
(Same mistake we did with ndctl which is cry for generations)

For me the DAX or NO-DAX is an application thing and not a data thing.

The right API is perhaps an open O_FLAG where if you are not the first opener
the open returns EBUSY and then the app can decide if to open without the
flag or complain and fail. (Or apply open locks)

If you are a second opener when the file is already opened O_DAX you are
silently running DAX. If you are 2nd open(O_DAX) when already oppened
O_DAX then off course you succeed.
(Also the directory inheritance thing is all mute too)

Please explain the use case behind your model?

Thanks
Boaz

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-23 13:09   ` Boaz Harrosh
@ 2019-10-23 22:13     ` Dave Chinner
  2019-10-24  2:31       ` Boaz Harrosh
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2019-10-23 22:13 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: ira.weiny, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Wed, Oct 23, 2019 at 04:09:50PM +0300, Boaz Harrosh wrote:
> On 22/10/2019 14:21, Boaz Harrosh wrote:
> > On 20/10/2019 18:59, ira.weiny@intel.com wrote:
> Please explain the use case behind your model?

No application changes needed to control whether they use DAX or
not. It allows the admin to control the application behaviour
completely, so they can turn off DAX if necessary. Applications are
unaware of constraints that may prevent DAX from being used, and so
admins need a mechanism to prevent DAX aware application from
actually using DAX if the capability is present.

e.g. given how slow some PMEM devices are when it comes to writing
data, especially under extremely high concurrency, DAX is not
necessarily a performance win for every application. Admins need a
guaranteed method of turning off DAX in these situations - apps may
not provide such a knob, or even be aware of a thing called DAX...

e.g. the data set being accessed by the application is mapped and
modified by RDMA applications, so those files must not be accessed
using DAX by any application because DAX+RDMA are currently
incompatible. Hence you can have RDMA on pmem devices co-exist
within the same filesystem as other applications using DAX to access
the pmem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-23 22:13     ` Dave Chinner
@ 2019-10-24  2:31       ` Boaz Harrosh
  2019-10-24  7:34         ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Boaz Harrosh @ 2019-10-24  2:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: ira.weiny, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On 24/10/2019 01:13, Dave Chinner wrote:
> On Wed, Oct 23, 2019 at 04:09:50PM +0300, Boaz Harrosh wrote:
>> On 22/10/2019 14:21, Boaz Harrosh wrote:
>>> On 20/10/2019 18:59, ira.weiny@intel.com wrote:
>> Please explain the use case behind your model?
> 
> No application changes needed to control whether they use DAX or
> not. It allows the admin to control the application behaviour
> completely, so they can turn off DAX if necessary. Applications are
> unaware of constraints that may prevent DAX from being used, and so
> admins need a mechanism to prevent DAX aware application from
> actually using DAX if the capability is present.
> 
> e.g. given how slow some PMEM devices are when it comes to writing
> data, especially under extremely high concurrency, DAX is not
> necessarily a performance win for every application. Admins need a
> guaranteed method of turning off DAX in these situations - apps may
> not provide such a knob, or even be aware of a thing called DAX...
> 

Thank you Dave for explaining. Forgive my slowness. I now understand
your intention.

But if so please address my first concern. That in the submitted implementation
you must set the flag-bit after the create of the file but before the write.
So exactly the above slow writes must always be DAX if I ever want the file
to be DAX accessed in the future.

In fact I do not see how you do this without changing the application because
most applications create thier own files, so you do not have a chance to set
the DAX-flag before the write happens. So the only effective fixture is the
inheritance from the parent directory.
But then again how do you separate from the slow writes that we would like
none-DAX to the DAX reads that are fast and save so much resources and latency.

What if, say in XFS when setting the DAX-bit we take all the three write-locks
same as a truncate. Then we check that there are no active page-cache mappings
ie. a single opener. Then allow to set the bit. Else return EBUISY. (file is in use)

> e.g. the data set being accessed by the application is mapped and
> modified by RDMA applications, so those files must not be accessed
> using DAX by any application because DAX+RDMA are currently
> incompatible. Hence you can have RDMA on pmem devices co-exist
> within the same filesystem as other applications using DAX to access
> the pmem...
> 

I actually like the lastest patchset that takes a lease on the file.
But yes an outside admin tool to set different needs.

> Cheers,
> Dave.
> 

Yes, thanks
Boaz

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-24  2:31       ` Boaz Harrosh
@ 2019-10-24  7:34         ` Dave Chinner
  2019-10-24 14:05           ` Boaz Harrosh
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2019-10-24  7:34 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: ira.weiny, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Thu, Oct 24, 2019 at 05:31:13AM +0300, Boaz Harrosh wrote:
> On 24/10/2019 01:13, Dave Chinner wrote:
> > On Wed, Oct 23, 2019 at 04:09:50PM +0300, Boaz Harrosh wrote:
> >> On 22/10/2019 14:21, Boaz Harrosh wrote:
> >>> On 20/10/2019 18:59, ira.weiny@intel.com wrote:
> >> Please explain the use case behind your model?
> > 
> > No application changes needed to control whether they use DAX or
> > not. It allows the admin to control the application behaviour
> > completely, so they can turn off DAX if necessary. Applications are
> > unaware of constraints that may prevent DAX from being used, and so
> > admins need a mechanism to prevent DAX aware application from
> > actually using DAX if the capability is present.
> > 
> > e.g. given how slow some PMEM devices are when it comes to writing
> > data, especially under extremely high concurrency, DAX is not
> > necessarily a performance win for every application. Admins need a
> > guaranteed method of turning off DAX in these situations - apps may
> > not provide such a knob, or even be aware of a thing called DAX...
> > 
> 
> Thank you Dave for explaining. Forgive my slowness. I now understand
> your intention.
> 
> But if so please address my first concern. That in the submitted implementation
> you must set the flag-bit after the create of the file but before the write.
> So exactly the above slow writes must always be DAX if I ever want the file
> to be DAX accessed in the future.

The on disk DAX flag is inherited from the parent directory at
create time. Hence an admin only need to set it on the data
directory of the application when first configuring it, and
everything the app creates will be configured for DAX access
automatically.

Or, alternatively, mkfs sets the flag on the root dir so that
everything in the filesystem uses DAX by default (through
inheritance) unless the admin turns off the flag on a directory
before it starts to be used or on a set of files after they have
been created (because DAX causes problems)...

So, yeah, there's another problem with the basic assertion that we
only need to allow the on disk flag to be changed on zero length
files: we actually want to be able to -clear- the DAX flag when the
file has data attached to it, not just when it is an empty file...

> What if, say in XFS when setting the DAX-bit we take all the three write-locks
> same as a truncate. Then we check that there are no active page-cache mappings
> ie. a single opener. Then allow to set the bit. Else return EBUISY. (file is in use)

DAX doesn't have page cache mappings, so anything that relies on
checking page cache state isn't going to work reliably. I also seem
to recall that there was a need to take some vm level lock to really
prevent page fault races, and that we can't safely take that in a
safe combination with all the filesystem locks we need.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-24  7:34         ` Dave Chinner
@ 2019-10-24 14:05           ` Boaz Harrosh
  2019-10-24 21:35             ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Boaz Harrosh @ 2019-10-24 14:05 UTC (permalink / raw)
  To: Dave Chinner, Boaz Harrosh
  Cc: ira.weiny, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On 24/10/2019 10:34, Dave Chinner wrote:
> On Thu, Oct 24, 2019 at 05:31:13AM +0300, Boaz Harrosh wrote:
<>
> 
> The on disk DAX flag is inherited from the parent directory at
> create time. Hence an admin only need to set it on the data
> directory of the application when first configuring it, and
> everything the app creates will be configured for DAX access
> automatically.
> 

Yes I said that as well. But again I am concerned that this is the
opposite of our Intention. As you said the WRITEs are slow and
do not scale so what we like, and why we have the all problem, is
to WRITE *none*-DAX. And if so then how do we turn the bit ON later
for the fast READs.

> Or, alternatively, mkfs sets the flag on the root dir so that
> everything in the filesystem uses DAX by default (through
> inheritance) unless the admin turns off the flag on a directory
> before it starts to be used

> or on a set of files after they have
> been created (because DAX causes problems)...
> 

Yes exactly this can not be done currently.

> So, yeah, there's another problem with the basic assertion that we
> only need to allow the on disk flag to be changed on zero length
> files: we actually want to be able to -clear- the DAX flag when the
> file has data attached to it, not just when it is an empty file...
> 

Exactly, This is my concern. And the case that I see most useful is the
opposite where I want to turn it ON, for DAX fast READs.

>> What if, say in XFS when setting the DAX-bit we take all the three write-locks
>> same as a truncate. Then we check that there are no active page-cache mappings
>> ie. a single opener. Then allow to set the bit. Else return EBUISY. (file is in use)
> 
> DAX doesn't have page cache mappings, so anything that relies on
> checking page cache state isn't going to work reliably. 

I meant on the opposite case, Where the flag was OFF and I want it ON for
fast READs. In that case if I have any users there are pages on the
xarray.
BTW the opposite is also true if we have active DAX users we will have
DAX entries in the xarray. What we want is that there are *no* active
users while we change the file-DAX-mode. Else we fail the change.

> I also seem
> to recall that there was a need to take some vm level lock to really
> prevent page fault races, and that we can't safely take that in a
> safe combination with all the filesystem locks we need.
> 

We do not really care with page fault races in the Kernel as long
as I protect the xarray access and these are protected well if we
take truncate locking. But we have a bigger problem that you pointed
out with the change of the operations vector pointer.

I was thinking about this last night. One way to do this is with
file-exclusive-lock. Correct me if I'm wrong:
file-exclusive-readwrite-lock means any other openers will fail and
if there are openers already the lock will fail. Which is what we want
no? to make sure we are the exclusive user of the file while we change
the op vector.
Now the question is if we force the application to take the lock and
Kernel only check that we are locked. Or Kernel take the lock within
the IOCTL.

Lets touch base. As I understand the protocol we want to establish with
the administration tool is:

- File is created, written. read...

- ALL file handles are closed, there are no active users
- File open by single opener for the purpose of changing the DAX-mode
- lock from all other opens
- change the DAX-mode, op vectors
- unlock-exlusivness

- File activity can resume...

That's easy to say, But how can we enforce this protocol?

> Cheers,
> Dave.
> 

Thanks Dave
Boaz

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-24 14:05           ` Boaz Harrosh
@ 2019-10-24 21:35             ` Dave Chinner
  2019-10-24 23:29               ` Boaz Harrosh
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2019-10-24 21:35 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: ira.weiny, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Thu, Oct 24, 2019 at 05:05:45PM +0300, Boaz Harrosh wrote:
> On 24/10/2019 10:34, Dave Chinner wrote:
> > On Thu, Oct 24, 2019 at 05:31:13AM +0300, Boaz Harrosh wrote:
> <>
> > 
> > The on disk DAX flag is inherited from the parent directory at
> > create time. Hence an admin only need to set it on the data
> > directory of the application when first configuring it, and
> > everything the app creates will be configured for DAX access
> > automatically.
> > 
> 
> Yes I said that as well.

You said "it must be set between creation and first write",
stating the requirement for an on-disk flag to work. I'm
decribing how that requirement is actually implemented. i.e. what
you are stating is something we actually implemented years ago...

> > I also seem
> > to recall that there was a need to take some vm level lock to really
> > prevent page fault races, and that we can't safely take that in a
> > safe combination with all the filesystem locks we need.
> > 
> 
> We do not really care with page fault races in the Kernel as long

Oh yes we do. A write fault is a 2-part operation - a read fault to
populate the pte and mapping, then a write fault (->page_mkwrite) to 
do all the filesystem work needed to dirty the page and pte.

The read fault sets up the state for the write fault, and if we
change the aops between these two operations, then the
->page_mkwrite implementation goes kaboom.

This isn't a theoretical problem - this is exactly the race
condition that lead us to disabling the flag in the first place.
There is no serialisation between the read and write parts of the
page fault iand the filesystem changing the DAX flag and ops vector,
and so fixing this problem requires hold yet more locks in the
filesystem path to completely lock out page fault processing on the
inode's mapping.

> as I protect the xarray access and these are protected well if we
> take truncate locking. But we have a bigger problem that you pointed
> out with the change of the operations vector pointer.
> 
> I was thinking about this last night. One way to do this is with
> file-exclusive-lock. Correct me if I'm wrong:
> file-exclusive-readwrite-lock means any other openers will fail and
> if there are openers already the lock will fail. Which is what we want
> no?

The filesystem ioctls and page faults have no visibility of file
locks.  They don't know and can't find out in a sane manner that an
inode has a single -user- reference.

And it introduces a new problem for any application using the
fssetxattr() ioctl - accidentally not setting the S_DAX flag to be
unmodified will now fail, and that means such a change breaks
existing applications. Sure, you can say they are "buggy
applications", but the fact is this user API change breaks them.

Hence I don't think we can change the user API for setting/clearing
this flag like this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-24 21:35             ` Dave Chinner
@ 2019-10-24 23:29               ` Boaz Harrosh
  2019-10-25  0:36                 ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Boaz Harrosh @ 2019-10-24 23:29 UTC (permalink / raw)
  To: Dave Chinner, Boaz Harrosh
  Cc: ira.weiny, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On 25/10/2019 00:35, Dave Chinner wrote:
> On Thu, Oct 24, 2019 at 05:05:45PM +0300, Boaz Harrosh wrote:
<>
>> Yes I said that as well.
> 
> You said "it must be set between creation and first write",
> stating the requirement for an on-disk flag to work.

Sorry for not being clear. Its not new, I do not explain myself
very well.

The above quote is if you set/clear the flag directly.

> I'm describing how that requirement is actually implemented. i.e. what
> you are stating is something we actually implemented years ago...
> 

What you are talking about is when the flag is inherited from parent.
And I did mention that option as well.

[Which is my concern because my main point is that I want creation+write
 to be none-DAX. Then after writer is done. Switch to DAX-on so READs can
 be fast and not take any memory.
 And you talked about another case when I start DAX-on and then for
 say, use for RDMA turn it off later.
]

This flag is Unique because current RFC has an i_size == 0 check
that other flags do not have

>>> I also seem
>>> to recall that there was a need to take some vm level lock to really
>>> prevent page fault races, and that we can't safely take that in a
>>> safe combination with all the filesystem locks we need.
>>>
>>
>> We do not really care with page fault races in the Kernel as long
> 
> Oh yes we do. A write fault is a 2-part operation - a read fault to
> populate the pte and mapping, then a write fault (->page_mkwrite) to 
> do all the filesystem work needed to dirty the page and pte.
> 
> The read fault sets up the state for the write fault, and if we
> change the aops between these two operations, then the
> ->page_mkwrite implementation goes kaboom.
> 
> This isn't a theoretical problem - this is exactly the race
> condition that lead us to disabling the flag in the first place.
> There is no serialisation between the read and write parts of the
> page fault iand the filesystem changing the DAX flag and ops vector,
> and so fixing this problem requires hold yet more locks in the
> filesystem path to completely lock out page fault processing on the
> inode's mapping.
> 

Again sorry that I do not explain very good.

Already on the read fault we populate the xarray,
My point was that if I want to set the DAX mode I must enforce that
there are no other parallel users on my inode. The check that the
xarray is empty is my convoluted way to check that there are no other
users except me. If xarray is not empty I bail out with EBUISY

I agree this is stupid.

Which is the same stupid as i_size==0 check. Both have the same
intention, to make sure that nothing is going on in parallel to
me.

>> as I protect the xarray access and these are protected well if we
>> take truncate locking. But we have a bigger problem that you pointed
>> out with the change of the operations vector pointer.
>>
>> I was thinking about this last night. One way to do this is with
>> file-exclusive-lock. Correct me if I'm wrong:
>> file-exclusive-readwrite-lock means any other openers will fail and
>> if there are openers already the lock will fail. Which is what we want
>> no?
> 
> The filesystem ioctls and page faults have no visibility of file
> locks.  They don't know and can't find out in a sane manner that an
> inode has a single -user- reference.
> 

This is not true. The FS has full control. It is not too hard a work
to take a file-excl-lock from within the IOCTL implementation. then
unlock. that is option 1. Option 2 of the App taking the lock then
for the check we might need a new export from the lock-subsystem.

> And it introduces a new problem for any application using the
> fssetxattr() ioctl - accidentally not setting the S_DAX flag to be
> unmodified will now fail, and that means such a change breaks
> existing applications. Sure, you can say they are "buggy
> applications", but the fact is this user API change breaks them.
> 

I am not sure I completely understood. let me try ...

The app wants to set some foo flag. It can set the ignore mask to all
1(s) except the flag it wants to set/clear.
And/or get_current_flags(); modify foo_flag; set_flags().

Off course in both the ignore case or when the DAX bit does not change
we do not go on a locking rampage.

So I'm not sure I see the problem

> Hence I don't think we can change the user API for setting/clearing
> this flag like this.
> 

Yes we must not modify behavior of Apps that are modifing other flags.

> Cheers,
> Dave.
> 

Perhaps we always go by the directory. And then do an mv dir_DAX/foo dir_NODAX/foo
to have an effective change. In hard links the first one at iget time before populating
the inode cache takes affect. (And never change the flag on the fly)
(Just brain storming here)

Or perhaps another API?

Thanks Dave
Boaz

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-24 23:29               ` Boaz Harrosh
@ 2019-10-25  0:36                 ` Dave Chinner
  2019-10-25  1:15                   ` Boaz Harrosh
  2019-10-25 20:49                   ` Ira Weiny
  0 siblings, 2 replies; 33+ messages in thread
From: Dave Chinner @ 2019-10-25  0:36 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: ira.weiny, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
> On 25/10/2019 00:35, Dave Chinner wrote:
> > On Thu, Oct 24, 2019 at 05:05:45PM +0300, Boaz Harrosh wrote:
> > This isn't a theoretical problem - this is exactly the race
> > condition that lead us to disabling the flag in the first place.
> > There is no serialisation between the read and write parts of the
> > page fault iand the filesystem changing the DAX flag and ops vector,
> > and so fixing this problem requires hold yet more locks in the
> > filesystem path to completely lock out page fault processing on the
> > inode's mapping.
> > 
> 
> Again sorry that I do not explain very good.
> 
> Already on the read fault we populate the xarray,

On a write fault we can have an empty xarray slot so the write fault
needs to both populate the xarray slot (read fault) and process the
write fault.

> My point was that if I want to set the DAX mode I must enforce that
> there are no other parallel users on my inode. The check that the
> xarray is empty is my convoluted way to check that there are no other
> users except me. If xarray is not empty I bail out with EBUISY

Checking the xarray being empty is racy. The moment you drop the
mapping lock, the page fault can populate a slot in the mapping that
you just checked was empty. And then you swap the aops between the
population and the ->page-mkwrite() call in the page fault
that is running, and things go boom.

Unless there's something new in the page fault path that nobody has
noticed in the past couple of years, this TOCTOU race hasn't been
solved....

> Perhaps we always go by the directory. And then do an mv dir_DAX/foo dir_NODAX/foo

The inode is instatiated before the rename is run, so it's set up
with it's old dir config, not the new one. So this ends up with the
same problem of haivng to change the S_DAX flag and aops vector
dynamically on rename. Same problem, not a solution.

> to have an effective change. In hard links the first one at iget time before populating
> the inode cache takes affect.

If something like a find or backup program brings the inode into
cache, the app may not even get the behaviour it wants, and it can't
change it until the inode is evicted from cache, which may be never.
Nobody wants implicit/random/uncontrollable/unchangeable behaviour
like this.

> (And never change the flag on the fly)
> (Just brain storming here)

We went over all this ground when we disabled the flag in the first
place. We disabled the flag because we couldn't come up with a sane
way to flip the ops vector short of tracking the number of aops
calls in progress at any given time. i.e. reference counting the
aops structure, but that's hard to do with a const ops structure,
and so it got disabled rather than allowing users to crash
kernels....

Cheers,

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-25  0:36                 ` Dave Chinner
@ 2019-10-25  1:15                   ` Boaz Harrosh
  2019-10-25 20:49                   ` Ira Weiny
  1 sibling, 0 replies; 33+ messages in thread
From: Boaz Harrosh @ 2019-10-25  1:15 UTC (permalink / raw)
  To: Dave Chinner, Boaz Harrosh
  Cc: ira.weiny, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On 25/10/2019 03:36, Dave Chinner wrote:
> On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
<>

>> Perhaps we always go by the directory. And then do an mv dir_DAX/foo dir_NODAX/foo
> 
> The inode is instatiated before the rename is run, so it's set up
> with it's old dir config, not the new one. So this ends up with the
> same problem of haivng to change the S_DAX flag and aops vector
> dynamically on rename. Same problem, not a solution.
> 

Yes Admin needs a inode-drop_caches after the mv if she/he wants an effective
change.

>> to have an effective change. In hard links the first one at iget time before populating
>> the inode cache takes affect.
> 
> If something like a find or backup program brings the inode into
> cache, the app may not even get the behaviour it wants, and it can't
> change it until the inode is evicted from cache, which may be never.

inode-drop-caches. (echo 2 > /proc/sys/vm/drop_caches)

> Nobody wants implicit/random/uncontrollable/unchangeable behaviour
> like this.
> 

You mean in the case of hard links between different mode directories?
I agree it is not so good. I do not like it too.

<>
> We went over all this ground when we disabled the flag in the first
> place. We disabled the flag because we couldn't come up with a sane
> way to flip the ops vector short of tracking the number of aops
> calls in progress at any given time. i.e. reference counting the
> aops structure, but that's hard to do with a const ops structure,
> and so it got disabled rather than allowing users to crash
> kernels....
> 

Do you mean dropping this patchset all together? I missed that.

Current patchset with the i_size == 0 thing is really bad I think.
Its the same has dropping the direct change all together and only
supporting inheritance from parent.
[Which again for me is really not interesting]

> Cheers,
> -Dave.

Lets sleep on it. Please remind me if xfs supports clone + DAX

Thanks Dave
Boaz

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-25  0:36                 ` Dave Chinner
  2019-10-25  1:15                   ` Boaz Harrosh
@ 2019-10-25 20:49                   ` Ira Weiny
  2019-10-27 22:10                     ` Dave Chinner
  1 sibling, 1 reply; 33+ messages in thread
From: Ira Weiny @ 2019-10-25 20:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Boaz Harrosh, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Fri, Oct 25, 2019 at 11:36:03AM +1100, Dave Chinner wrote:
> On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
> > On 25/10/2019 00:35, Dave Chinner wrote:
> 
> If something like a find or backup program brings the inode into
> cache, the app may not even get the behaviour it wants, and it can't
> change it until the inode is evicted from cache, which may be never.

Why would this be never?

> Nobody wants implicit/random/uncontrollable/unchangeable behaviour
> like this.

I'm thinking this could work with a bit of effort on the users part.  While the
behavior does have a bit of uncertainty, I feel like there has to be a way to
get the inode to drop from the cache when a final iput() happens on the inode.

Admin programs should not leave files open forever, without the users knowing
about it.  So I don't understand why the inode could not be evicted from the
cache if the FS knew that this change had been made and the inode needs to be
"re-loaded".  See below...

> 
> > (And never change the flag on the fly)
> > (Just brain storming here)
> 
> We went over all this ground when we disabled the flag in the first
> place. We disabled the flag because we couldn't come up with a sane
> way to flip the ops vector short of tracking the number of aops
> calls in progress at any given time. i.e. reference counting the
> aops structure, but that's hard to do with a const ops structure,
> and so it got disabled rather than allowing users to crash
> kernels....

Agreed.  We can't change the a_ops without some guarantee that no one is using
the file.  Which means we need all fds to close and a final iput().  I thought
that would mean an eviction of the inode and a subsequent reload.

Yesterday I coded up the following (applies on top of this series) but I can't
seem to get it to work because I believe xfs is keeping a reference on the
inode.  What am I missing?  I think if I could get xfs to recognize that the
inode needs to be cleared from it's cache this would work, with some caveats.

Currently this works if I remount the fs or if I use <procfs>/drop_caches like
Boaz mentioned.

Isn't there a way to get xfs to do that on it's own?

Ira


From 7762afd95a3e21a782dffd2fd8e13ae4a23b5e4a Mon Sep 17 00:00:00 2001
From: Ira Weiny <ira.weiny@intel.com>
Date: Fri, 25 Oct 2019 13:32:07 -0700
Subject: [PATCH] squash: Allow phys change on any length file

delay the changing of the effective bit to when the inode is re-read
into the cache.

Currently a work in Progress because xfs seems to cache the inodes as
well and I'm not sure how to get xfs to release it's reference.
---
 fs/xfs/xfs_ioctl.c | 18 +++++++-----------
 include/linux/fs.h |  5 ++++-
 2 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 89cf47ef273e..4d730d5781d9 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1233,10 +1233,13 @@ xfs_diflags_to_linux(
 		inode->i_flags |= S_NOATIME;
 	else
 		inode->i_flags &= ~S_NOATIME;
-	if (xflags & FS_XFLAG_DAX)
-		inode->i_flags |= S_DAX;
-	else
-		inode->i_flags &= ~S_DAX;
+	/* NOTE: we do not allow the physical DAX flag to immediately change
+	 * the effective IS_DAX() flag tell the VFS layer to remove the inode
+	 * from the cache on the final iput() to force recreation on the next
+	 * 'fresh' open */
+	if (((xflags & FS_XFLAG_DAX) && !IS_DAX(inode)) ||
+	    (!(xflags & FS_XFLAG_DAX) && IS_DAX(inode)))
+		inode->i_flags |= S_REVALIDATE;
 }
 
 static int
@@ -1320,13 +1323,6 @@ xfs_ioctl_setattr_dax_invalidate(
 	/* lock, flush and invalidate mapping in preparation for flag change */
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
 
-	/* File size must be zero to avoid races with asynchronous page
-	 * faults */
-	if (i_size_read(inode) > 0) {
-		error = -EINVAL;
-		goto out_unlock;
-	}
-
 	error = filemap_write_and_wait(inode->i_mapping);
 	if (error)
 		goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0b4d8fc79e0f..4e9b7bf53c86 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1998,6 +1998,7 @@ struct super_operations {
 #define S_ENCRYPTED	16384	/* Encrypted file (using fs/crypto/) */
 #define S_CASEFOLD	32768	/* Casefolded file */
 #define S_VERITY	65536	/* Verity file (using fs/verity/) */
+#define S_REVALIDATE	131072	/* Drop inode from cache on final put */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -2040,6 +2041,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
 #define IS_ENCRYPTED(inode)	((inode)->i_flags & S_ENCRYPTED)
 #define IS_CASEFOLDED(inode)	((inode)->i_flags & S_CASEFOLD)
 #define IS_VERITY(inode)	((inode)->i_flags & S_VERITY)
+#define IS_REVALIDATE(inode)	((inode)->i_flags & S_REVALIDATE)
 
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
@@ -3027,7 +3029,8 @@ extern int inode_needs_sync(struct inode *inode);
 extern int generic_delete_inode(struct inode *inode);
 static inline int generic_drop_inode(struct inode *inode)
 {
-	return !inode->i_nlink || inode_unhashed(inode);
+	return !inode->i_nlink || inode_unhashed(inode) ||
+		IS_REVALIDATE(inode);
 }
 
 extern struct inode *ilookup5_nowait(struct super_block *sb,
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-25 20:49                   ` Ira Weiny
@ 2019-10-27 22:10                     ` Dave Chinner
  2019-10-31 16:17                       ` Ira Weiny
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2019-10-27 22:10 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Boaz Harrosh, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Fri, Oct 25, 2019 at 01:49:26PM -0700, Ira Weiny wrote:
> On Fri, Oct 25, 2019 at 11:36:03AM +1100, Dave Chinner wrote:
> > On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
> > > On 25/10/2019 00:35, Dave Chinner wrote:
> > 
> > If something like a find or backup program brings the inode into
> > cache, the app may not even get the behaviour it wants, and it can't
> > change it until the inode is evicted from cache, which may be never.
> 
> Why would this be never?

Because only unreferenced inodes can be removed from cache. As long
as something holds a reference or repeatedly accesses the inode such
that reclaim always skips it because it is referenced, it will never
get evicted from the cache.

IOWs, "never" in the practical sense, not "never" in the theoretical
sense.

> > Nobody wants implicit/random/uncontrollable/unchangeable behaviour
> > like this.
> 
> I'm thinking this could work with a bit of effort on the users part.  While the
> behavior does have a bit of uncertainty, I feel like there has to be a way to
> get the inode to drop from the cache when a final iput() happens on the inode.

Keep in mind that the final iput()->evict() process doesn't mean the
inode is going to get removed from all filesystem inode caches, just
the VFS level cache. The filesystem can still have internal
references to the inode, and still be doing work on the inode that
the VFS knows nothing about. XFS definitely fits into this category.

XFS will, however, re-initialise the inode aops structure if the VFS
then does another lookup on the inode while it is in this
"reclaimed" state, so from the VFS perspective it looks like a
newly instantiated inodes on the next lookup. We don't actually need
to do this for large parts of the inode as it is already still in
the valid state from the evict() call. It's an implementation
simplification that means we always re-init the ops vectors attached
to the inode rather than just the fields that need to be
re-initialised.

IOWs, evict/reinit changing the aops vector because the on disk dax
flag changed on XFS works by luck right now, not intent....

> Admin programs should not leave files open forever, without the users knowing
> about it.  So I don't understand why the inode could not be evicted from the
> cache if the FS knew that this change had been made and the inode needs to be
> "re-loaded".  See below...

Doesn't need to be an open file - inodes are pinned in memory by the
reference the dentry holds on it. Hence as long as there are
actively referenced dentries that point at the inode, the inode
cannot be reclaimed. Hard links mean multiple dentries could pin the
inode, too.

> > > (And never change the flag on the fly)
> > > (Just brain storming here)
> > 
> > We went over all this ground when we disabled the flag in the first
> > place. We disabled the flag because we couldn't come up with a sane
> > way to flip the ops vector short of tracking the number of aops
> > calls in progress at any given time. i.e. reference counting the
> > aops structure, but that's hard to do with a const ops structure,
> > and so it got disabled rather than allowing users to crash
> > kernels....
> 
> Agreed.  We can't change the a_ops without some guarantee that no one is using
> the file.  Which means we need all fds to close and a final iput().  I thought
> that would mean an eviction of the inode and a subsequent reload.
> 
> Yesterday I coded up the following (applies on top of this series) but I can't
> seem to get it to work because I believe xfs is keeping a reference on the
> inode.  What am I missing?  I think if I could get xfs to recognize that the
> inode needs to be cleared from it's cache this would work, with some caveats.

You are missing the fact that dentries hold an active reference to
inodes. So a path lookup (access(), stat(), etc) will pin the inode
just as effectively as holding an open file because they instantiate
a dentry that holds a reference to the inode....

> Currently this works if I remount the fs or if I use <procfs>/drop_caches like
> Boaz mentioned.

drop_caches frees all the dentries that don't have an active
references before it iterates over inodes, thereby dropping the
cached reference(s) to the inode that pins it in memory before it
iterates the inode LRU.

> Isn't there a way to get xfs to do that on it's own?

Not reliably. Killing all the dentries doesn't guarantee the inode
will be reclaimed immediately. The ioctl() itself requires an open
file reference to the inode, and there's no telling how many other
references there are to the inode that the filesystem a) can't find,
and b) even if it can find them, it is illegal to release them.

IOWs, if you are relying on being able to force eviction of inode
from the cache for correct operation of a user controlled flag, then
it's just not going to work.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-27 22:10                     ` Dave Chinner
@ 2019-10-31 16:17                       ` Ira Weiny
  2019-11-01 22:47                         ` Dave Chinner
  0 siblings, 1 reply; 33+ messages in thread
From: Ira Weiny @ 2019-10-31 16:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Boaz Harrosh, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Mon, Oct 28, 2019 at 09:10:39AM +1100, Dave Chinner wrote:
> On Fri, Oct 25, 2019 at 01:49:26PM -0700, Ira Weiny wrote:

[snip]

> 
> > Currently this works if I remount the fs or if I use <procfs>/drop_caches like
> > Boaz mentioned.
> 
> drop_caches frees all the dentries that don't have an active
> references before it iterates over inodes, thereby dropping the
> cached reference(s) to the inode that pins it in memory before it
> iterates the inode LRU.
> 
> > Isn't there a way to get xfs to do that on it's own?
> 
> Not reliably. Killing all the dentries doesn't guarantee the inode
> will be reclaimed immediately. The ioctl() itself requires an open
> file reference to the inode, and there's no telling how many other
> references there are to the inode that the filesystem a) can't find,
> and b) even if it can find them, it is illegal to release them.
> 
> IOWs, if you are relying on being able to force eviction of inode
> from the cache for correct operation of a user controlled flag, then
> it's just not going to work.

Agree, I see the difficulty of forcing the effective flag to change in this
path.  However, the only thing I am relying on is that the ioctl will change
the physical flag.

IOW I am proposing that the semantic be that changing the physical flag does
_not_ immediately change the effective flag.  With that clarified up front the
user can adjust accordingly.

After thinking about this more I think there is a strong use case to be able to
change the physical flag on a non-zero length file.  That use case is to be
able to restore files from backups.

Therefore, having the effective flag flip at some later time when the a_ops can
safely be changed (for example a remount/drop_cache event) is beneficial.

I propose the user has no direct control over this event and it is mainly used
to restore files from backups which is mainly an admin operation where a
remount is a reasonable thing to do.

Users direct control of the effective flag is through inheritance.  The user
needs to create the file in a DAX enable dir and they get effective operation
right away.

If in the future we can determine a safe way to trigger the a_ops change we can
add that to the semantic as an alternative for users.

Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-10-31 16:17                       ` Ira Weiny
@ 2019-11-01 22:47                         ` Dave Chinner
  2019-11-02  4:25                           ` Dan Williams
  0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2019-11-01 22:47 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Boaz Harrosh, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Thu, Oct 31, 2019 at 09:17:58AM -0700, Ira Weiny wrote:
> On Mon, Oct 28, 2019 at 09:10:39AM +1100, Dave Chinner wrote:
> > On Fri, Oct 25, 2019 at 01:49:26PM -0700, Ira Weiny wrote:
> 
> [snip]
> 
> > 
> > > Currently this works if I remount the fs or if I use <procfs>/drop_caches like
> > > Boaz mentioned.
> > 
> > drop_caches frees all the dentries that don't have an active
> > references before it iterates over inodes, thereby dropping the
> > cached reference(s) to the inode that pins it in memory before it
> > iterates the inode LRU.
> > 
> > > Isn't there a way to get xfs to do that on it's own?
> > 
> > Not reliably. Killing all the dentries doesn't guarantee the inode
> > will be reclaimed immediately. The ioctl() itself requires an open
> > file reference to the inode, and there's no telling how many other
> > references there are to the inode that the filesystem a) can't find,
> > and b) even if it can find them, it is illegal to release them.
> > 
> > IOWs, if you are relying on being able to force eviction of inode
> > from the cache for correct operation of a user controlled flag, then
> > it's just not going to work.
> 
> Agree, I see the difficulty of forcing the effective flag to change in this
> path.  However, the only thing I am relying on is that the ioctl will change
> the physical flag.
> 
> IOW I am proposing that the semantic be that changing the physical flag does
> _not_ immediately change the effective flag.  With that clarified up front the
> user can adjust accordingly.

Which makes it useless from an admin perspective. i.e. to change the
way the application uses DAX now, admins are going to have to end up
rebooting the machine to guarantee that the kernel has picked up the
change in the on-disk flag.

> After thinking about this more I think there is a strong use case to be able to
> change the physical flag on a non-zero length file.  That use case is to be
> able to restore files from backups.

Why does that matter? Backup programs need to set the flag before
the data is written into the destination file, just like they do
with restoring other flags that influence data placement like the RT
device bit and extent size hints...

Basically, all these issues you keep trying to work around go away
if we can come up with a way of swapping the aops vector safely.
That's the problem we need to solve, anything else results in
largely unacceptible user visible admin warts.

> I propose the user has no direct control over this event and it is mainly used
> to restore files from backups which is mainly an admin operation where a
> remount is a reasonable thing to do.

As soon as users understand that they flag can be changed, they are
going to want to do that and they are going to want it to work
reliably.

> Users direct control of the effective flag is through inheritance.  The user
> needs to create the file in a DAX enable dir and they get effective operation
> right away.

Until they realise the application is slow or broken because it is
using DAX, and they want to turn DAX off for that application. Then
they have *no control*. You cannot have it both ways - being able to
turn something on but not turn it off is not "effective operation"
or user friendly.

> If in the future we can determine a safe way to trigger the a_ops change we can
> add that to the semantic as an alternative for users.

No, the flag does not get turned on until we've solved the problems
that resulted in us turning it off. We've gone over this mutliple
times, and nobody has solved the issues that need solving - everyone
seems to just hack around the issues rather than solving it
properly. If we thought taking some kind of shortcut full of
compromises and gotchas was the right solution, we would have never
turned the flag off in the first place.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/5] Enable per-file/directory DAX operations
  2019-11-01 22:47                         ` Dave Chinner
@ 2019-11-02  4:25                           ` Dan Williams
  0 siblings, 0 replies; 33+ messages in thread
From: Dan Williams @ 2019-11-02  4:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ira Weiny, Boaz Harrosh, Linux Kernel Mailing List,
	Alexander Viro, Darrick J. Wong, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

On Fri, Nov 1, 2019 at 3:47 PM Dave Chinner <david@fromorbit.com> wrote:
[..]
> No, the flag does not get turned on until we've solved the problems
> that resulted in us turning it off. We've gone over this mutliple
> times, and nobody has solved the issues that need solving - everyone
> seems to just hack around the issues rather than solving it
> properly. If we thought taking some kind of shortcut full of
> compromises and gotchas was the right solution, we would have never
> turned the flag off in the first place.

My fault. I thought the effective vs physical distinction was worth
taking an incremental step forward. Ira was continuing to look at the
a_ops issue in the meantime.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag
  2019-10-21 22:49     ` Ira Weiny
  2019-10-21 23:46       ` Dave Chinner
@ 2019-11-08 13:12       ` Jan Kara
  2019-11-08 13:46         ` Jan Kara
  1 sibling, 1 reply; 33+ messages in thread
From: Jan Kara @ 2019-11-08 13:12 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Chinner, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Mon 21-10-19 15:49:31, Ira Weiny wrote:
> On Mon, Oct 21, 2019 at 11:45:36AM +1100, Dave Chinner wrote:
> > On Sun, Oct 20, 2019 at 08:59:35AM -0700, ira.weiny@intel.com wrote:
> > That, fundamentally, is the issue here - it's not setting/clearing
> > the DAX flag that is the issue, it's doing a swap of the
> > mapping->a_ops while there may be other code using that ops
> > structure.
> > 
> > IOWs, if there is any code anywhere in the kernel that
> > calls an address space op without holding one of the three locks we
> > hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
> > of the address space operations.
> > 
> > By limiting the address space swap to file sizes of zero, we rule
> > out the page fault path (mmap of a zero length file segv's with an
> > access beyond EOF on the first read/write page fault, right?).
> 
> Yes I checked that and thought we were safe here...
> 
> > However, other aops callers that might run unlocked and do the wrong
> > thing if the aops pointer is swapped between check of the aop method
> > existing and actually calling it even if the file size is zero?
> > 
> > A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible
> > to such a race condition with the current definitions of the XFS DAX
> > aops. I'm guessing there will be others, but I haven't looked
> > further than this...
> 
> I'll check for others and think on what to do about this.  ext4 will have the
> same problem I think.  :-(

Just as a datapoint, ext4 is bold and sets inode->i_mapping->a_ops on
existing inodes when switching journal data flag and so far it has not
blown up. What we did to deal with issues Dave describes is that we
introduced percpu rw-semaphore guarding switching of aops and then inside
problematic functions redirect callbacks in the right direction under this
semaphore. Somewhat ugly but it seems to work.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag
  2019-11-08 13:12       ` Jan Kara
@ 2019-11-08 13:46         ` Jan Kara
  2019-11-08 19:36           ` Ira Weiny
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2019-11-08 13:46 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Chinner, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Fri 08-11-19 14:12:38, Jan Kara wrote:
> On Mon 21-10-19 15:49:31, Ira Weiny wrote:
> > On Mon, Oct 21, 2019 at 11:45:36AM +1100, Dave Chinner wrote:
> > > On Sun, Oct 20, 2019 at 08:59:35AM -0700, ira.weiny@intel.com wrote:
> > > That, fundamentally, is the issue here - it's not setting/clearing
> > > the DAX flag that is the issue, it's doing a swap of the
> > > mapping->a_ops while there may be other code using that ops
> > > structure.
> > > 
> > > IOWs, if there is any code anywhere in the kernel that
> > > calls an address space op without holding one of the three locks we
> > > hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
> > > of the address space operations.
> > > 
> > > By limiting the address space swap to file sizes of zero, we rule
> > > out the page fault path (mmap of a zero length file segv's with an
> > > access beyond EOF on the first read/write page fault, right?).
> > 
> > Yes I checked that and thought we were safe here...
> > 
> > > However, other aops callers that might run unlocked and do the wrong
> > > thing if the aops pointer is swapped between check of the aop method
> > > existing and actually calling it even if the file size is zero?
> > > 
> > > A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible
> > > to such a race condition with the current definitions of the XFS DAX
> > > aops. I'm guessing there will be others, but I haven't looked
> > > further than this...
> > 
> > I'll check for others and think on what to do about this.  ext4 will have the
> > same problem I think.  :-(
> 
> Just as a datapoint, ext4 is bold and sets inode->i_mapping->a_ops on
> existing inodes when switching journal data flag and so far it has not
> blown up. What we did to deal with issues Dave describes is that we
> introduced percpu rw-semaphore guarding switching of aops and then inside
> problematic functions redirect callbacks in the right direction under this
> semaphore. Somewhat ugly but it seems to work.

Thinking about this some more, perhaps this scheme could be actually
transformed in something workable. We could have a global (or maybe per-sb
but I'm not sure it's worth it) percpu rwsem and we could transform aops
calls into:

percpu_down_read(aops_rwsem);
do_call();
percpu_up_read(aops_rwsem);

With some macro magic it needn't be even that ugly.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag
  2019-11-08 13:46         ` Jan Kara
@ 2019-11-08 19:36           ` Ira Weiny
  2019-11-11 16:07             ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: Ira Weiny @ 2019-11-08 19:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Fri, Nov 08, 2019 at 02:46:06PM +0100, Jan Kara wrote:
> On Fri 08-11-19 14:12:38, Jan Kara wrote:
> > On Mon 21-10-19 15:49:31, Ira Weiny wrote:
> > > On Mon, Oct 21, 2019 at 11:45:36AM +1100, Dave Chinner wrote:
> > > > On Sun, Oct 20, 2019 at 08:59:35AM -0700, ira.weiny@intel.com wrote:
> > > > That, fundamentally, is the issue here - it's not setting/clearing
> > > > the DAX flag that is the issue, it's doing a swap of the
> > > > mapping->a_ops while there may be other code using that ops
> > > > structure.
> > > > 
> > > > IOWs, if there is any code anywhere in the kernel that
> > > > calls an address space op without holding one of the three locks we
> > > > hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
> > > > of the address space operations.
> > > > 
> > > > By limiting the address space swap to file sizes of zero, we rule
> > > > out the page fault path (mmap of a zero length file segv's with an
> > > > access beyond EOF on the first read/write page fault, right?).
> > > 
> > > Yes I checked that and thought we were safe here...
> > > 
> > > > However, other aops callers that might run unlocked and do the wrong
> > > > thing if the aops pointer is swapped between check of the aop method
> > > > existing and actually calling it even if the file size is zero?
> > > > 
> > > > A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible
> > > > to such a race condition with the current definitions of the XFS DAX
> > > > aops. I'm guessing there will be others, but I haven't looked
> > > > further than this...
> > > 
> > > I'll check for others and think on what to do about this.  ext4 will have the
> > > same problem I think.  :-(
> > 
> > Just as a datapoint, ext4 is bold and sets inode->i_mapping->a_ops on
> > existing inodes when switching journal data flag and so far it has not
> > blown up. What we did to deal with issues Dave describes is that we
> > introduced percpu rw-semaphore guarding switching of aops and then inside
> > problematic functions redirect callbacks in the right direction under this
> > semaphore. Somewhat ugly but it seems to work.

Ah I am glad you brought this up.  I had not seen this before.

Is that s_journal_flag_rwsem?

In the general case I don't think that correctly protects against:

	if (a_ops->call)
		a_ops->call();

Because not all operations are defined in both ext4_aops and
ext4_journalled_aops.  Specifically migratepage.

move_to_new_page() specifically follows the pattern above with migratepage.  So
is there a bug here?

> 
> Thinking about this some more, perhaps this scheme could be actually
> transformed in something workable. We could have a global (or maybe per-sb
> but I'm not sure it's worth it) percpu rwsem and we could transform aops
> calls into:
> 
> percpu_down_read(aops_rwsem);
> do_call();
> percpu_up_read(aops_rwsem);
> 
> With some macro magic it needn't be even that ugly.

I think this is safer.  And what I have been investigating/coding up.  Because
that also would protect against the above with:

percpu_down_read(aops_rwsem);
	if (a_ops->call)
		a_ops->call();
percpu_up_read(aops_rwsem);

However I have been looking at SRCU because we also have patterns like:


	generic_file_buffered_read
		if (a_ops->is_partially_uptodate)
			a_ops->is_partially_uptodate()
		page_cache_sync_readahead
			force_page_cache_readahead
				if (!a_ops->readpage && !a_ops->readpages)
					return;
				__do_page_cache_readahead
					read_pages
						if (a_ops->readpages)
							a_ops->readpages()
						a_ops->readpage


So we would have to pass the a_ops through to use a rwsem.  Where SRCU I think
would be fine to just take the SRCU read lock multiple times.  Am I wrong?


We also have a 3rd (2nd?) issue.  There are callers who check for the presence
of an operation to be used later.  For example do_dentry_open():

do_dentry_open()
{
...
	if (<flags> & O_DIRECT)
		if (!<a_ops> || !<a_ops>->direct_IO)
			return -EINVAL;
...
}

After this open direct_IO better be there AFAICT so changing the a_ops later
would not be good.  For ext4 direct_IO is defined for all the a_ops...  so I
guess that is not a big deal.  However, is the user really getting the behavior
they expect in this case?

I'm afraid of requiring FSs to have to follow rules in defining their a_ops.
Because I'm afraid maintaining those rules would be hard and would eventually
lead to crashes when someone did it wrong.

:-/

So for this 3rd (2nd) case I think we should simply take a reference to the
a_ops and fail changing the mode.  For the DAX case that means the user is best
served by taking a write lease on the file to ensure there are no other opens
which could cause issues.

Would that work for changing the journaling mode?

And I _think_ this is the only issue we have with this right now. But if other
callers of a_ops needed the pattern of using the a_ops at a time across context
changes they would need to ensure this reference was taken.

What I have come up with thus far is an interface like:

/*
 * as_get_a_ops() -- safely get the a_ops from the address_space specified
 *
 * @as: address space to get a_ops from
 * @ref: used to indicate if a reference is required on this a_ops
 * @tok: srcu token to be returned in as_put_a_ops()
 *
 * The a_ops returned is protected from changing until as_put_a_ops().
 *
 * If ref is specified then ref must also be specified in as_put_a_ops() to
 * release this reference.  In this case a reference is taken on the a_ops
 * which will prevent it from changing until the reference is released.
 *
 * References should _ONLY_ be taken when the a_ops needs to be constant
 * across a user context switch because doing so will block changing the a_ops
 * until that reference is released.
 *
 * Examples of using a reference are checks for specific a_ops pointers which
 * are expected to support functionality at a later date (example direct_IO)
 */
static inline const struct address_space_operations *
as_get_a_ops(struct address_space *as, bool ref, int *tok)
{
	...
}

static inline void
as_assign_a_ops(struct address_space *as,
                const struct address_space_operations *a_ops)
{
	...
}

static inline void as_put_a_ops(struct address_space *as, int tok, bool ref)
{
	...
}


I'm still working out the details of using SRCU and a ref count.  I have made
at least 1 complete pass of all the a_ops users and I think this would cover
them all.

Thoughts?
Ira

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag
  2019-11-08 19:36           ` Ira Weiny
@ 2019-11-11 16:07             ` Jan Kara
  2019-11-11 23:54               ` Ira Weiny
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2019-11-11 16:07 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jan Kara, Dave Chinner, linux-kernel, Alexander Viro,
	Darrick J. Wong, Dan Williams, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

On Fri 08-11-19 11:36:13, Ira Weiny wrote:
> On Fri, Nov 08, 2019 at 02:46:06PM +0100, Jan Kara wrote:
> > On Fri 08-11-19 14:12:38, Jan Kara wrote:
> > > On Mon 21-10-19 15:49:31, Ira Weiny wrote:
> > > > On Mon, Oct 21, 2019 at 11:45:36AM +1100, Dave Chinner wrote:
> > > > > On Sun, Oct 20, 2019 at 08:59:35AM -0700, ira.weiny@intel.com wrote:
> > > > > That, fundamentally, is the issue here - it's not setting/clearing
> > > > > the DAX flag that is the issue, it's doing a swap of the
> > > > > mapping->a_ops while there may be other code using that ops
> > > > > structure.
> > > > > 
> > > > > IOWs, if there is any code anywhere in the kernel that
> > > > > calls an address space op without holding one of the three locks we
> > > > > hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
> > > > > of the address space operations.
> > > > > 
> > > > > By limiting the address space swap to file sizes of zero, we rule
> > > > > out the page fault path (mmap of a zero length file segv's with an
> > > > > access beyond EOF on the first read/write page fault, right?).
> > > > 
> > > > Yes I checked that and thought we were safe here...
> > > > 
> > > > > However, other aops callers that might run unlocked and do the wrong
> > > > > thing if the aops pointer is swapped between check of the aop method
> > > > > existing and actually calling it even if the file size is zero?
> > > > > 
> > > > > A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible
> > > > > to such a race condition with the current definitions of the XFS DAX
> > > > > aops. I'm guessing there will be others, but I haven't looked
> > > > > further than this...
> > > > 
> > > > I'll check for others and think on what to do about this.  ext4 will have the
> > > > same problem I think.  :-(
> > > 
> > > Just as a datapoint, ext4 is bold and sets inode->i_mapping->a_ops on
> > > existing inodes when switching journal data flag and so far it has not
> > > blown up. What we did to deal with issues Dave describes is that we
> > > introduced percpu rw-semaphore guarding switching of aops and then inside
> > > problematic functions redirect callbacks in the right direction under this
> > > semaphore. Somewhat ugly but it seems to work.
> 
> Ah I am glad you brought this up.  I had not seen this before.
> 
> Is that s_journal_flag_rwsem?

Yes.

> In the general case I don't think that correctly protects against:
> 
> 	if (a_ops->call)
> 		a_ops->call();
> 
> Because not all operations are defined in both ext4_aops and
> ext4_journalled_aops.  Specifically migratepage.
> 
> move_to_new_page() specifically follows the pattern above with migratepage.  So
> is there a bug here?

Looks like there could be.

> > Thinking about this some more, perhaps this scheme could be actually
> > transformed in something workable. We could have a global (or maybe per-sb
> > but I'm not sure it's worth it) percpu rwsem and we could transform aops
> > calls into:
> > 
> > percpu_down_read(aops_rwsem);
> > do_call();
> > percpu_up_read(aops_rwsem);
> > 
> > With some macro magic it needn't be even that ugly.
> 
> I think this is safer.  And what I have been investigating/coding up.
> Because that also would protect against the above with:
> 
> percpu_down_read(aops_rwsem);
> 	if (a_ops->call)
> 		a_ops->call();
> percpu_up_read(aops_rwsem);
> 
> However I have been looking at SRCU because we also have patterns like:
> 
> 
> 	generic_file_buffered_read
> 		if (a_ops->is_partially_uptodate)
> 			a_ops->is_partially_uptodate()
> 		page_cache_sync_readahead
> 			force_page_cache_readahead
> 				if (!a_ops->readpage && !a_ops->readpages)
> 					return;
> 				__do_page_cache_readahead
> 					read_pages
> 						if (a_ops->readpages)
> 							a_ops->readpages()
> 						a_ops->readpage
> 
> 
> So we would have to pass the a_ops through to use a rwsem.  Where SRCU I
> think would be fine to just take the SRCU read lock multiple times.  Am I
> wrong?

So the idea I had would not solve this issue because we'd release the rwsem
once we return from ->is_partially_uptodate(). This example shows that we
actually expect consistency among different aops as they are called in
sequence and that's much more difficult to achieve than just a consistency
within single aop call.

> We also have a 3rd (2nd?) issue.  There are callers who check for the
> presence of an operation to be used later.  For example do_dentry_open():
> 
> do_dentry_open()
> {
> ...
> 	if (<flags> & O_DIRECT)
> 		if (!<a_ops> || !<a_ops>->direct_IO)
> 			return -EINVAL;
> ...
> }
> 
> After this open direct_IO better be there AFAICT so changing the a_ops
> later would not be good.  For ext4 direct_IO is defined for all the
> a_ops...  so I guess that is not a big deal.  However, is the user really
> getting the behavior they expect in this case?

In this particular case I don't think there's any practical harm for any
filesystem but in general this is another instance where consistency of
aops over time is assumed.

> I'm afraid of requiring FSs to have to follow rules in defining their a_ops.
> Because I'm afraid maintaining those rules would be hard and would eventually
> lead to crashes when someone did it wrong.

I guess this very much depends on the rules. But yes, anything non-obvious
or hard to check would quickly lead to bugs, I agree. But IMHO fully
general solution to above problems would clutter the generic code in rather
ugly way as well because usage of aops is pretty widespread in mm and fs
code. It isn't just a few places that call them...

But I think we could significantly reduce the problem by looking at what's
in aops. We have lots of operations there that operate on pages. If we
mandate that before and during switching of aops, you must make sure
there's nothing in page cache for the inode, you've already dealt with 90%
of the problems.

Beside these we have:
* write_begin - that creates page in page cache so above rule should stop
  it as well
* bmap - honestly I'd be inclined to just move this to inode_operations
  just like fiemap. There's nothing about address_space in its functionality.
* swap_activate / swap_deactivate - Either I'd move these to
  file_operations (what's there about address_space, right), or since all
  instances of this only care about the inode, we can as well just pass
  only inode to the function and move it to inode_operations.

And then the really problematic ones:
* direct_IO - Logically with how the IO path is structured, it belongs in
  aops so I wouldn't move it. With the advance of iomap it is on its way to
  being removed altogether but that will take a long time to happen
  completely. So for now I'd mandate that direct_IO path must be locked out
  while switching aops.
* readpages - these should be locked out by the rule that page creation is
  forbidden.
* writepages - these need to be locked out when switching aops.

And that should be it. So I don't think there's a need for reference-counting
of aops in the generic code, especially since I don't think it can be done
in an elegant way (but feel free to correct me). I think that just
providing a way to lock-out above three calls would be enough.

> So for this 3rd (2nd) case I think we should simply take a reference to the
> a_ops and fail changing the mode.  For the DAX case that means the user is best
> served by taking a write lease on the file to ensure there are no other opens
> which could cause issues.
> 
> Would that work for changing the journaling mode?
> 
> And I _think_ this is the only issue we have with this right now. But if other
> callers of a_ops needed the pattern of using the a_ops at a time across context
> changes they would need to ensure this reference was taken.
> 
> What I have come up with thus far is an interface like:
> 
> /*
>  * as_get_a_ops() -- safely get the a_ops from the address_space specified
>  *
>  * @as: address space to get a_ops from
>  * @ref: used to indicate if a reference is required on this a_ops
>  * @tok: srcu token to be returned in as_put_a_ops()
>  *
>  * The a_ops returned is protected from changing until as_put_a_ops().
>  *
>  * If ref is specified then ref must also be specified in as_put_a_ops() to
>  * release this reference.  In this case a reference is taken on the a_ops
>  * which will prevent it from changing until the reference is released.
>  *
>  * References should _ONLY_ be taken when the a_ops needs to be constant
>  * across a user context switch because doing so will block changing the a_ops
>  * until that reference is released.
>  *
>  * Examples of using a reference are checks for specific a_ops pointers which
>  * are expected to support functionality at a later date (example direct_IO)
>  */
> static inline const struct address_space_operations *
> as_get_a_ops(struct address_space *as, bool ref, int *tok)
> {
> 	...
> }
> 
> static inline void
> as_assign_a_ops(struct address_space *as,
>                 const struct address_space_operations *a_ops)
> {
> 	...
> }
> 
> static inline void as_put_a_ops(struct address_space *as, int tok, bool ref)
> {
> 	...
> }
> 
> 
> I'm still working out the details of using SRCU and a ref count.  I have made
> at least 1 complete pass of all the a_ops users and I think this would cover
> them all.

Well, my concern with the use of interface like this is:

a) The clutter in the generic code
b) It's difficult to make this work with SRCU because presumably you want
   to use synchronize_srcu() while switching aops. But then you have three
   operations to do:
   1) switch aops
   2) set inode flag
   3) synchronize_srcu

   and depending on the order in which you do these either "old aops"
   operations will see inode with a flag or "new aops" will see the inode
   without a flag and either can confuse those functions...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag
  2019-11-11 16:07             ` Jan Kara
@ 2019-11-11 23:54               ` Ira Weiny
  0 siblings, 0 replies; 33+ messages in thread
From: Ira Weiny @ 2019-11-11 23:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

> 
> > In the general case I don't think that correctly protects against:
> > 
> > 	if (a_ops->call)
> > 		a_ops->call();
> > 
> > Because not all operations are defined in both ext4_aops and
> > ext4_journalled_aops.  Specifically migratepage.
> > 
> > move_to_new_page() specifically follows the pattern above with migratepage.  So
> > is there a bug here?
> 
> Looks like there could be.

Ok I'm going to leave this alone and whatever I come up with try and make sure
that the ext4 journal a_ops fits.

[snip]

> > 
> > However I have been looking at SRCU because we also have patterns like:
> > 
> > 
> > 	generic_file_buffered_read
> > 		if (a_ops->is_partially_uptodate)
> > 			a_ops->is_partially_uptodate()
> > 		page_cache_sync_readahead
> > 			force_page_cache_readahead
> > 				if (!a_ops->readpage && !a_ops->readpages)
> > 					return;
> > 				__do_page_cache_readahead
> > 					read_pages
> > 						if (a_ops->readpages)
> > 							a_ops->readpages()
> > 						a_ops->readpage
> > 
> > 
> > So we would have to pass the a_ops through to use a rwsem.  Where SRCU I
> > think would be fine to just take the SRCU read lock multiple times.  Am I
> > wrong?
> 
> So the idea I had would not solve this issue because we'd release the rwsem
> once we return from ->is_partially_uptodate(). This example shows that we
> actually expect consistency among different aops as they are called in
> sequence and that's much more difficult to achieve than just a consistency
> within single aop call.

I can't be sure but is seems like consistency for an operation is somewhat
important.  OR we have to develop rules around when filesystems can, or can not
change a_ops so that the consistency does not matter.

I think what scares me the most is that I'm not really sure what those rules
are...

> 
> > We also have a 3rd (2nd?) issue.  There are callers who check for the
> > presence of an operation to be used later.  For example do_dentry_open():
> > 
> > do_dentry_open()
> > {
> > ...
> > 	if (<flags> & O_DIRECT)
> > 		if (!<a_ops> || !<a_ops>->direct_IO)
> > 			return -EINVAL;
> > ...
> > }
> > 
> > After this open direct_IO better be there AFAICT so changing the a_ops
> > later would not be good.  For ext4 direct_IO is defined for all the
> > a_ops...  so I guess that is not a big deal.  However, is the user really
> > getting the behavior they expect in this case?
> 
> In this particular case I don't think there's any practical harm for any
> filesystem but in general this is another instance where consistency of
> aops over time is assumed.

Yes...  But this is just a more complex situation to maintain consistency.
Again I don't have any idea if consistency is required.

But the current definition/use of a_ops is very static (with the exception of
the ext4 case you gave), so I'm thinking that much of the code is written with
the assumption that the vectors do not change.

> 
> > I'm afraid of requiring FSs to have to follow rules in defining their a_ops.
> > Because I'm afraid maintaining those rules would be hard and would eventually
> > lead to crashes when someone did it wrong.
> 
> I guess this very much depends on the rules. But yes, anything non-obvious
> or hard to check would quickly lead to bugs, I agree. But IMHO fully
> general solution to above problems would clutter the generic code in rather
> ugly way as well because usage of aops is pretty widespread in mm and fs
> code. It isn't just a few places that call them...

I agree it does clutter the code.  and the patches I have are not pretty and
I'm not even sure they are not broken...  So yea I'm open to ideas!

> 
> But I think we could significantly reduce the problem by looking at what's
> in aops. We have lots of operations there that operate on pages. If we
> mandate that before and during switching of aops, you must make sure
> there's nothing in page cache for the inode, you've already dealt with 90%
> of the problems.

Sounds promising...

But digging into this it looks like we need a similar rule for the DAX side to
have no mappings outstanding.  Perhaps you meant that when you said page cache?

> 
> Beside these we have:
> * write_begin - that creates page in page cache so above rule should stop
>   it as well

Just to make sure I understand, do you propose that we put a check in
pagecache_write_[begin|end]() to protect from these calls while changing?

I just want to be sure because I've wondered if we can get away with minimal
checks, or checks on individual functions, in the generic code.  But that
seemed kind of ugly as well.

> * bmap - honestly I'd be inclined to just move this to inode_operations
>   just like fiemap. There's nothing about address_space in its functionality.

Hmmm...  What about these call paths?

ext4_bmap()
	filemap_write_and_wait()
		filemap_fdatawrite()
			__filemap_fdatawrite()
				__filemap_fdatawrite_range()
					do_writepages()
						a_ops->writepages()

or

xfs_vm_bmap()
	iomap_bmap()
		filemap_write_and_wait()
			...
:-/

maybe we should leave it and have it covered under the page cache rule you
propose?


> * swap_activate / swap_deactivate - Either I'd move these to
>   file_operations (what's there about address_space, right), or since all
>   instances of this only care about the inode, we can as well just pass
>   only inode to the function and move it to inode_operations.

XFS calls iomap_swapfile_activate() which calls vfs_fsync() which needs the
file.  Seems like file_operations would be better.

I like the idea of cleaning this up a lot.  I've gone ahead with a couple of
patches to do this.  At least this simplifies things a little bit...

> 
> And then the really problematic ones:
> * direct_IO - Logically with how the IO path is structured, it belongs in
>   aops so I wouldn't move it. With the advance of iomap it is on its way to
>   being removed altogether but that will take a long time to happen
>   completely. So for now I'd mandate that direct_IO path must be locked out
>   while switching aops.

How do we lock this out between checking for this support on open and using it
later?

I think if we go down this path we have to make a rule that says that direct_IO
must be defined for both a_ops.  Right now for our 2 use case we are lucky that
direct_IO is also defined as the same function.  So there is little danger of
odd behavior.

Let me explore more.

> * readpages - these should be locked out by the rule that page creation is
>   forbidden.

Agreed.  Same question applies here as for pagecache_write_[begin|end]().
Should I special case a check for this?

> * writepages - these need to be locked out when switching aops.

If nothing is in the pagecache could we ignore this as well?

> 
> And that should be it. So I don't think there's a need for reference-counting
> of aops in the generic code, especially since I don't think it can be done
> in an elegant way (but feel free to correct me). I think that just
> providing a way to lock-out above three calls would be enough.

Ok, I've been thinking about this whole email more today.  And what if we add a
couple of FS specific lockout callbacks in struct address_space itself.

If they are defined the FS has dynamic a_ops capability and these 3 functions
will need to be locked out by a rw_sem controlled by the FS.

We can then document the "rules" for dynamic a_ops better for FS's to support
them by referencing the special cases and the fact that dynamic a_ops requires
these callbacks to be defined.

It clutters the generic code a bit, but not as much as my idea.  At the same
time it helps to self document the special cases in both the FS's and the core
code.

[snip]

> > 
> > 
> > I'm still working out the details of using SRCU and a ref count.  I have made
> > at least 1 complete pass of all the a_ops users and I think this would cover
> > them all.
> 
> Well, my concern with the use of interface like this is:
> 
> a) The clutter in the generic code

There is a bit of clutter.  What I'm most concerned about is the amount of
special casing.  I still think it would be cleaner in the long run...  And
force some structure on the use of a_ops.  But after looking at it more there
may be a middle ground.

> b) It's difficult to make this work with SRCU because presumably you want
>    to use synchronize_srcu() while switching aops. But then you have three
>    operations to do:
>    1) switch aops
>    2) set inode flag
>    3) synchronize_srcu
> 
>    and depending on the order in which you do these either "old aops"
>    operations will see inode with a flag or "new aops" will see the inode
>    without a flag and either can confuse those functions...

Yes that might be a challenge.

Ira


^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2019-11-11 23:54 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-20 15:59 [PATCH 0/5] Enable per-file/directory DAX operations ira.weiny
2019-10-20 15:59 ` [PATCH 1/5] fs/stat: Define DAX statx attribute ira.weiny
2019-10-22 11:32   ` Boaz Harrosh
2019-10-22 16:51     ` Ira Weiny
2019-10-20 15:59 ` [PATCH 2/5] fs/xfs: Isolate the physical DAX flag from effective ira.weiny
2019-10-21  0:26   ` Dave Chinner
2019-10-21 17:40     ` Ira Weiny
2019-10-20 15:59 ` [PATCH 3/5] fs/xfs: Separate functionality of xfs_inode_supports_dax() ira.weiny
2019-10-20 15:59 ` [PATCH 4/5] fs/xfs: Clean up DAX support check ira.weiny
2019-10-20 15:59 ` [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag ira.weiny
2019-10-21  0:45   ` Dave Chinner
2019-10-21 22:49     ` Ira Weiny
2019-10-21 23:46       ` Dave Chinner
2019-11-08 13:12       ` Jan Kara
2019-11-08 13:46         ` Jan Kara
2019-11-08 19:36           ` Ira Weiny
2019-11-11 16:07             ` Jan Kara
2019-11-11 23:54               ` Ira Weiny
2019-10-22 11:21 ` [PATCH 0/5] Enable per-file/directory DAX operations Boaz Harrosh
2019-10-23 13:09   ` Boaz Harrosh
2019-10-23 22:13     ` Dave Chinner
2019-10-24  2:31       ` Boaz Harrosh
2019-10-24  7:34         ` Dave Chinner
2019-10-24 14:05           ` Boaz Harrosh
2019-10-24 21:35             ` Dave Chinner
2019-10-24 23:29               ` Boaz Harrosh
2019-10-25  0:36                 ` Dave Chinner
2019-10-25  1:15                   ` Boaz Harrosh
2019-10-25 20:49                   ` Ira Weiny
2019-10-27 22:10                     ` Dave Chinner
2019-10-31 16:17                       ` Ira Weiny
2019-11-01 22:47                         ` Dave Chinner
2019-11-02  4:25                           ` Dan Williams

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).