Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2
@ 2020-01-10 19:29 ira.weiny
  2020-01-10 19:29 ` [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute ira.weiny
                   ` (11 more replies)
  0 siblings, 12 replies; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption due to their inability to detect whether the kernel will
instantiate page cache for a file, and cases where a global dax enable via a
mount option is too coarse.

The following patch series enables selecting the use of DAX on individual files
and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
scheme the dax mount option can be omitted to allow the per-file property to
take effect.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
capability switch from an "effective" attribute for the file.

At LSF/MM we discussed the difficulties of switching the mode of a file with
active mappings / page cache.  It was thought the races could be avoided by
limiting mode flips to 0-length files.

However, this turns out to not be true.[3] This is because address space
operations (a_ops) may be in use at any time the inode is referenced and users
have expressed a desire to be able to change the mode on a file with data in
it.  For those reasons this patch set allows changing the mode flag on a file
as long as it is not current mapped.

Furthermore, DAX is a property of the inode and as such, many operations other
than address space operations need to be protected during a mode change.

Therefore callbacks are placed within the inode operations and used to lock the
inode as appropriate.

As in V1, Users are able to query the effective and physical flags separately
at any time.  Specifically the addition of the statx attribute bit allows them
to ensure the file is operating in the mode they intend.  This 'effective flag'
and physical flags could differ when the filesystem is mounted with the dax
flag for example.

It should be noted that the physical DAX flag inheritance is not shown in this
patch set as it was maintained from previous work on XFS.  The physical DAX flag
and it's inheritance will need to be added to other file systems for user
control. 

Finally, extensive testing was performed which resulted in a couple of bug fix
and clean up patches.  Specifically:

	fs: remove unneeded IS_DAX() check
	fs/xfs: Fix truncate up

'Fix truncate up' deserves specific attention because I'm not 100% sure it is
the correct fix.  Without that patch fsx testing failed within a few minutes
with this error.

	Mapped Write: non-zero data past EOF (0x3b0da) page offset 0xdb is 0x3711

With 'Fix truncate up' running fsx while changing modes can run for hours but I
have seen 2 other errors in the same genre after many hours of continuous
testing.

They are:

	READ BAD DATA: offset = 0x22dc, size = 0xcc7e, fname = /mnt/pmem/dax-file

	Mapped Read: non-zero data past EOF (0x3309e) page offset 0x9f is 0x6ab4

After seeing the patches to fix stale data exposure problems[4] I'm more
confident now that all 3 of these errors are a latent bug rather than a bug in
this series itself.

However, because of these failures I'm only submitting this set RFC.


[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/
[3] https://lkml.org/lkml/2019/10/20/96
[4] https://patchwork.kernel.org/patch/11310511/


To: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
@ 2020-01-10 19:29 ` ira.weiny
  2020-01-15 11:37   ` Jan Kara
  2020-01-10 19:29 ` [RFC PATCH V2 02/12] fs/xfs: Isolate the physical DAX flag from effective ira.weiny
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

In order for users to determine if a file is currently operating in DAX
mode (effective DAX).  Define a statx attribute value and set that
attribute if the effective DAX flag is set.

To go along with this we propose the following addition to the statx man
page:

STATX_ATTR_DAX

	DAX (cpu direct access) is a file mode that attempts to minimize
	software cache effects for both I/O and memory mappings of this
	file.  It requires a capable device, a compatible filesystem
	block size, and filesystem opt-in. It generally assumes all
	accesses are via cpu load / store instructions which can
	minimize overhead for small accesses, but adversely affect cpu
	utilization for large transfers. File I/O is done directly
	to/from user-space buffers. While the DAX property tends to
	result in data being transferred synchronously it does not give
	the guarantees of synchronous I/O that data and necessary
	metadata are transferred. Memory mapped I/O may be performed
	with direct mappings that bypass system memory buffering. Again
	while memory-mapped I/O tends to result in data being
	transferred synchronously it does not guarantee synchronous
	metadata updates. A dax file may optionally support being mapped
	with the MAP_SYNC flag which does allow cpu store operations to
	be considered synchronous modulo cpu cache effects.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/stat.c                 | 3 +++
 include/uapi/linux/stat.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/fs/stat.c b/fs/stat.c
index 030008796479..894699c74dde 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	if (IS_DAX(inode))
+		stat->attributes |= STATX_ATTR_DAX;
+
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
 					    query_flags);
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index ad80a5c885d5..e5f9d5517f6b 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -169,6 +169,7 @@ struct statx {
 #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
 #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
+#define STATX_ATTR_DAX			0x00002000 /* [I] File is DAX */
 
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.21.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH V2 02/12] fs/xfs: Isolate the physical DAX flag from effective
  2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
  2020-01-10 19:29 ` [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute ira.weiny
@ 2020-01-10 19:29 ` ira.weiny
  2020-01-10 19:29 ` [RFC PATCH V2 03/12] fs/xfs: Separate functionality of xfs_inode_supports_dax() ira.weiny
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_ioctl_setattr_dax_invalidate() currently checks if the DAX flag is
changing as a quick check.

But the implementation mixes the physical (XFS_DIFLAG2_DAX) and
effective (S_DAX) DAX flags.

Remove the use of the effective flag when determining if a change of the
physical flag is required.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 7b35d62ede9f..fe37708cea8f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1195,9 +1195,11 @@ xfs_ioctl_setattr_dax_invalidate(
 	}
 
 	/* If the DAX state is not changing, we have nothing to do here. */
-	if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
+	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		return 0;
-	if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
+	if (!(fa->fsx_xflags & FS_XFLAG_DAX) &&
+	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		return 0;
 
 	if (S_ISDIR(inode->i_mode))
-- 
2.21.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH V2 03/12] fs/xfs: Separate functionality of xfs_inode_supports_dax()
  2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
  2020-01-10 19:29 ` [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute ira.weiny
  2020-01-10 19:29 ` [RFC PATCH V2 02/12] fs/xfs: Isolate the physical DAX flag from effective ira.weiny
@ 2020-01-10 19:29 ` ira.weiny
  2020-01-10 19:29 ` [RFC PATCH V2 04/12] fs/xfs: Clean up DAX support check ira.weiny
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_inode_supports_dax() should reflect if the inode can support DAX not
that it is enabled for DAX.  Leave that to other helper functions.

Change the caller of xfs_inode_supports_dax() to call
xfs_inode_use_dax() which reflects new logic to override the effective
DAX flag with either the mount option or the physical DAX flag.

To make the logic clear create 2 helper functions for the mount and
physical flag.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_iops.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8afe69ca188b..0a0ea90259e9 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1234,6 +1234,15 @@ static const struct inode_operations xfs_inline_symlink_inode_operations = {
 	.update_time		= xfs_vn_update_time,
 };
 
+static bool
+xfs_inode_mount_is_dax(
+	struct xfs_inode *ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	return (mp->m_flags & XFS_MOUNT_DAX) == XFS_MOUNT_DAX;
+}
+
 /* Figure out if this file actually supports DAX. */
 static bool
 xfs_inode_supports_dax(
@@ -1245,11 +1254,6 @@ xfs_inode_supports_dax(
 	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
 		return false;
 
-	/* DAX mount option or DAX iflag must be set. */
-	if (!(mp->m_flags & XFS_MOUNT_DAX) &&
-	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
-		return false;
-
 	/* Block size must match page size */
 	if (mp->m_sb.sb_blocksize != PAGE_SIZE)
 		return false;
@@ -1258,6 +1262,22 @@ xfs_inode_supports_dax(
 	return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
 }
 
+static bool
+xfs_inode_is_dax(
+	struct xfs_inode *ip)
+{
+	return (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) == XFS_DIFLAG2_DAX;
+}
+
+static bool
+xfs_inode_use_dax(
+	struct xfs_inode *ip)
+{
+	return xfs_inode_supports_dax(ip) &&
+		(xfs_inode_mount_is_dax(ip) ||
+		 xfs_inode_is_dax(ip));
+}
+
 STATIC void
 xfs_diflags_to_iflags(
 	struct inode		*inode,
@@ -1276,7 +1296,7 @@ xfs_diflags_to_iflags(
 		inode->i_flags |= S_SYNC;
 	if (flags & XFS_DIFLAG_NOATIME)
 		inode->i_flags |= S_NOATIME;
-	if (xfs_inode_supports_dax(ip))
+	if (xfs_inode_use_dax(ip))
 		inode->i_flags |= S_DAX;
 }
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH V2 04/12] fs/xfs: Clean up DAX support check
  2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
                   ` (2 preceding siblings ...)
  2020-01-10 19:29 ` [RFC PATCH V2 03/12] fs/xfs: Separate functionality of xfs_inode_supports_dax() ira.weiny
@ 2020-01-10 19:29 ` ira.weiny
  2020-01-10 19:29 ` [RFC PATCH V2 05/12] fs: remove unneeded IS_DAX() check ira.weiny
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Rather than open coding xfs_inode_supports_dax() in
xfs_ioctl_setattr_dax_invalidate() export xfs_inode_supports_dax() and
call it in preparation for swapping dax flags.

This also means updating xfs_inode_supports_dax() to return true for a
directory.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 16 +++-------------
 fs/xfs/xfs_iops.c  |  8 ++++++--
 fs/xfs/xfs_iops.h  |  2 ++
 3 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index fe37708cea8f..b5e00b67c297 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1176,23 +1176,13 @@ xfs_ioctl_setattr_dax_invalidate(
 	int			*join_flags)
 {
 	struct inode		*inode = VFS_I(ip);
-	struct super_block	*sb = inode->i_sb;
 	int			error;
 
 	*join_flags = 0;
 
-	/*
-	 * It is only valid to set the DAX flag on regular files and
-	 * directories on filesystems where the block size is equal to the page
-	 * size. On directories it serves as an inherited hint so we don't
-	 * have to check the device for dax support or flush pagecache.
-	 */
-	if (fa->fsx_xflags & FS_XFLAG_DAX) {
-		struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
-
-		if (!bdev_dax_supported(target->bt_bdev, sb->s_blocksize))
-			return -EINVAL;
-	}
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
+	    !xfs_inode_supports_dax(ip))
+		return -EINVAL;
 
 	/* If the DAX state is not changing, we have nothing to do here. */
 	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 0a0ea90259e9..d6843cdb51d0 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1244,14 +1244,18 @@ xfs_inode_mount_is_dax(
 }
 
 /* Figure out if this file actually supports DAX. */
-static bool
+bool
 xfs_inode_supports_dax(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 
 	/* Only supported on non-reflinked files. */
-	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
+	if (xfs_is_reflink_inode(ip))
+		return false;
+
+	/* Only supported on regular files and directories. */
+	if (!(S_ISREG(VFS_I(ip)->i_mode) || S_ISDIR(VFS_I(ip)->i_mode)))
 		return false;
 
 	/* Block size must match page size */
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 4d24ff309f59..f24fec8de1d6 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -24,4 +24,6 @@ extern int xfs_setattr_nonsize(struct xfs_inode *ip, struct iattr *vap,
 extern int xfs_vn_setattr_nonsize(struct dentry *dentry, struct iattr *vap);
 extern int xfs_vn_setattr_size(struct dentry *dentry, struct iattr *vap);
 
+extern bool xfs_inode_supports_dax(struct xfs_inode *ip);
+
 #endif /* __XFS_IOPS_H__ */
-- 
2.21.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH V2 05/12] fs: remove unneeded IS_DAX() check
  2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
                   ` (3 preceding siblings ...)
  2020-01-10 19:29 ` [RFC PATCH V2 04/12] fs/xfs: Clean up DAX support check ira.weiny
@ 2020-01-10 19:29 ` ira.weiny
  2020-01-16  9:38   ` Jan Kara
  2020-01-10 19:29 ` [RFC PATCH V2 06/12] fs/xfs: Check if the inode supports DAX under lock ira.weiny
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

The IS_DAX() check in io_is_direct() causes a race between changing the
DAX mode and creating the iocb flags.

Remove the check because DAX now emulates the page cache API and
therefore it does not matter if the file mode is DAX or not when the
iocb flags are created.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index d7584bcef5d3..e11989502eac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3365,7 +3365,7 @@ extern int file_update_time(struct file *file);
 
 static inline bool io_is_direct(struct file *filp)
 {
-	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
+	return (filp->f_flags & O_DIRECT);
 }
 
 static inline bool vma_is_dax(struct vm_area_struct *vma)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH V2 06/12] fs/xfs: Check if the inode supports DAX under lock
  2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
                   ` (4 preceding siblings ...)
  2020-01-10 19:29 ` [RFC PATCH V2 05/12] fs: remove unneeded IS_DAX() check ira.weiny
@ 2020-01-10 19:29 ` ira.weiny
  2020-01-10 19:29 ` [RFC PATCH V2 07/12] fs: Add locking for a dynamic inode 'mode' ira.weiny
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

One of the checks for an inode supporting DAX is if the inode is
reflinked.  During a non-DAX to DAX mode change we could race with
the file being reflinked and end up with a reflinked file being in DAX
mode.

Prevent this race by checking for DAX support under the MMAP_LOCK.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index b5e00b67c297..bc3654fe3b5d 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1180,10 +1180,6 @@ xfs_ioctl_setattr_dax_invalidate(
 
 	*join_flags = 0;
 
-	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
-	    !xfs_inode_supports_dax(ip))
-		return -EINVAL;
-
 	/* If the DAX state is not changing, we have nothing to do here. */
 	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
 	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
@@ -1197,6 +1193,13 @@ xfs_ioctl_setattr_dax_invalidate(
 
 	/* lock, flush and invalidate mapping in preparation for flag change */
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
+	    !xfs_inode_supports_dax(ip)) {
+		error = -EINVAL;
+		goto out_unlock;
+	}
+
 	error = filemap_write_and_wait(inode->i_mapping);
 	if (error)
 		goto out_unlock;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH V2 07/12] fs: Add locking for a dynamic inode 'mode'
  2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
                   ` (5 preceding siblings ...)
  2020-01-10 19:29 ` [RFC PATCH V2 06/12] fs/xfs: Check if the inode supports DAX under lock ira.weiny
@ 2020-01-10 19:29 ` ira.weiny
  2020-01-13 22:12   ` Darrick J. Wong
  2020-01-10 19:29 ` [RFC PATCH V2 08/12] fs/xfs: Add lock/unlock mode to xfs ira.weiny
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

DAX requires special address space operations but many other functions
check the IS_DAX() mode.

DAX is a property of the inode thus we define an inode mode lock as an
inode operation which file systems can optionally define.

This patch defines the core function callbacks as well as puts the
locking calls in place.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 Documentation/filesystems/vfs.rst | 30 ++++++++++++++++
 fs/ioctl.c                        | 23 +++++++++----
 fs/open.c                         |  4 +++
 include/linux/fs.h                | 57 +++++++++++++++++++++++++++++--
 mm/fadvise.c                      | 10 ++++--
 mm/khugepaged.c                   |  2 ++
 mm/mmap.c                         |  7 ++++
 7 files changed, 123 insertions(+), 10 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..b945aa95f15a 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -59,6 +59,28 @@ like open(2) the file, or stat(2) it to peek at the inode data.  The
 stat(2) operation is fairly simple: once the VFS has the dentry, it
 peeks at the inode data and passes some of it back to userspace.
 
+Changing inode 'modes' dynamically
+----------------------------------
+
+Some file systems may have different modes for their inodes which require
+dyanic changing.  A specific example of this is DAX enabled files in XFS and
+ext4.  To switch the mode safely we lock the inode mode in all "normal" file
+system operations and restrict mode changes to those operations.  The specific
+rules are.
+
+To do this a file system must follow the following rules.
+
+        1) the direct_IO address_space_operation must be supported in all
+           potential a_ops vectors for any mode suported by the inode.
+	2) Filesystems must define the lock_mode() and unlock_mode() operations
+           in struct inode_operations.  These functions are used by the core
+           vfs layers to ensure that the mode is stable before allowing the
+           core operations to proceed.
+        3) Mode changes shall not be allowed while the file is mmap'ed
+        4) While changing modes filesystems should take exclusive locks which
+           prevent the core vfs layer from proceeding.
+
+
 
 The File Object
 ---------------
@@ -437,6 +459,8 @@ As of kernel 2.6.22, the following members are defined:
 		int (*atomic_open)(struct inode *, struct dentry *, struct file *,
 				   unsigned open_flag, umode_t create_mode);
 		int (*tmpfile) (struct inode *, struct dentry *, umode_t);
+		void (*lock_mode)(struct inode *);
+		void (*unlock_mode)(struct inode *);
 	};
 
 Again, all methods are called without any locks being held, unless
@@ -584,6 +608,12 @@ otherwise noted.
 	atomically creating, opening and unlinking a file in given
 	directory.
 
+``lock_mode``
+	called to prevent operations which depend on the inode's mode from
+        proceeding should a mode change be in progress
+
+``unlock_mode``
+	called when critical mode dependent operation is complete
 
 The Address Space Object
 ========================
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 7c9a5df5a597..ed6ab5303a24 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -55,18 +55,29 @@ EXPORT_SYMBOL(vfs_ioctl);
 static int ioctl_fibmap(struct file *filp, int __user *p)
 {
 	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = filp->f_inode;
 	int res, block;
 
+	lock_inode_mode(inode);
+
 	/* do we support this mess? */
-	if (!mapping->a_ops->bmap)
-		return -EINVAL;
-	if (!capable(CAP_SYS_RAWIO))
-		return -EPERM;
+	if (!mapping->a_ops->bmap) {
+		res = -EINVAL;
+		goto out;
+	}
+	if (!capable(CAP_SYS_RAWIO)) {
+		res = -EPERM;
+		goto out;
+	}
 	res = get_user(block, p);
 	if (res)
-		return res;
+		goto out;
 	res = mapping->a_ops->bmap(mapping, block);
-	return put_user(res, p);
+	res = put_user(res, p);
+
+out:
+	unlock_inode_mode(inode);
+	return res;
 }
 
 /**
diff --git a/fs/open.c b/fs/open.c
index b0be77ea8f1b..c62428bbc525 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 	if (ret)
 		newattrs.ia_valid |= ret | ATTR_FORCE;
 
+	lock_inode_mode(dentry->d_inode);
 	inode_lock(dentry->d_inode);
 	/* Note any delegations or leases have already been broken: */
 	ret = notify_change(dentry, &newattrs, NULL);
 	inode_unlock(dentry->d_inode);
+	unlock_inode_mode(dentry->d_inode);
 	return ret;
 }
 
@@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EOPNOTSUPP;
 
 	file_start_write(file);
+	lock_inode_mode(inode);
 	ret = file->f_op->fallocate(file, mode, offset, len);
+	unlock_inode_mode(inode);
 
 	/*
 	 * Create inotify and fanotify events.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e11989502eac..631f11d6246e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -359,6 +359,11 @@ typedef struct {
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
 		unsigned long, unsigned long);
 
+/**
+ * NOTE: DO NOT define new functions in address_space_operations without first
+ * considering how dynamic inode modes can be supported.  See the comment in
+ * struct inode_operations for the lock_mode() and unlock_mode() callbacks.
+ */
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
@@ -1817,6 +1822,11 @@ struct block_device_operations;
 
 struct iov_iter;
 
+/**
+ * NOTE: DO NOT define new functions in file_operations without first
+ * considering how dynamic inode modes can be supported.  See the comment in
+ * struct inode_operations for the lock_mode() and unlock_mode() callbacks.
+ */
 struct file_operations {
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
@@ -1859,6 +1869,20 @@ struct file_operations {
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 } __randomize_layout;
 
+/*
+ * Filesystems wishing to support dynamic inode modes must do the following.
+ *
+ * 1) the direct_IO address_space_operation must be supported in all
+ *    potential a_ops vectors for any mode suported by the inode.
+ * 2) Filesystems must define the lock_mode() and unlock_mode() operations
+ *    in struct inode_operations.  These functions are used by the core
+ *    vfs layers to ensure that the mode is stable before allowing the
+ *    core operations to proceed.
+ * 3) Mode changes shall not be allowed while the file is mmap'ed
+ * 4) While changing modes filesystems should take exclusive locks which
+ *    prevent the core vfs layer from proceeding.
+ *
+ */
 struct inode_operations {
 	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
 	const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
@@ -1887,18 +1911,47 @@ struct inode_operations {
 			   umode_t create_mode);
 	int (*tmpfile) (struct inode *, struct dentry *, umode_t);
 	int (*set_acl)(struct inode *, struct posix_acl *, int);
+	void (*lock_mode)(struct inode*);
+	void (*unlock_mode)(struct inode*);
 } ____cacheline_aligned;
 
+static inline void lock_inode_mode(struct inode *inode)
+{
+	WARN_ON_ONCE(inode->i_op->lock_mode &&
+		     !inode->i_op->unlock_mode);
+	if (inode->i_op->lock_mode)
+		inode->i_op->lock_mode(inode);
+}
+static inline void unlock_inode_mode(struct inode *inode)
+{
+	WARN_ON_ONCE(inode->i_op->unlock_mode &&
+		     !inode->i_op->lock_mode);
+	if (inode->i_op->unlock_mode)
+		inode->i_op->unlock_mode(inode);
+}
+
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 				     struct iov_iter *iter)
 {
-	return file->f_op->read_iter(kio, iter);
+	struct inode		*inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	lock_inode_mode(inode);
+	ret = file->f_op->read_iter(kio, iter);
+	unlock_inode_mode(inode);
+	return ret;
 }
 
 static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
 				      struct iov_iter *iter)
 {
-	return file->f_op->write_iter(kio, iter);
+	struct inode		*inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	lock_inode_mode(inode);
+	ret = file->f_op->write_iter(kio, iter);
+	unlock_inode_mode(inode);
+	return ret;
 }
 
 static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 4f17c83db575..a4095a5deac8 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -47,7 +47,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 
 	bdi = inode_to_bdi(mapping->host);
 
+	lock_inode_mode(inode);
 	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
+		int ret = 0;
+
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
@@ -58,10 +61,13 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 			/* no bad return value, but ignore advice */
 			break;
 		default:
-			return -EINVAL;
+			ret = -EINVAL;
 		}
-		return 0;
+
+		unlock_inode_mode(inode);
+		return ret;
 	}
+	unlock_inode_mode(inode);
 
 	/*
 	 * Careful about overflows. Len == 0 means "as much as possible".  Use
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908743cb..ff49da065db0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
 		} else {	/* !is_shmem */
 			if (!page || xa_is_value(page)) {
 				xas_unlock_irq(&xas);
+				lock_inode_mode(file->f_inode);
 				page_cache_sync_readahead(mapping, &file->f_ra,
 							  file, index,
 							  PAGE_SIZE);
+				unlock_inode_mode(file->f_inode);
 				/* drain pagevecs to help isolate_lru_page() */
 				lru_add_drain();
 				page = find_lock_page(mapping, index);
diff --git a/mm/mmap.c b/mm/mmap.c
index 70f67c4515aa..dfaf1130e706 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1542,11 +1542,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 	}
 
+	if (file)
+		lock_inode_mode(file_inode(file));
+
 	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
 	if (!IS_ERR_VALUE(addr) &&
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
 		*populate = len;
+
+	if (file)
+		unlock_inode_mode(file_inode(file));
+
 	return addr;
 }
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH V2 08/12] fs/xfs: Add lock/unlock mode to xfs
  2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
                   ` (6 preceding siblings ...)
  2020-01-10 19:29 ` [RFC PATCH V2 07/12] fs: Add locking for a dynamic inode 'mode' ira.weiny
@ 2020-01-10 19:29 ` ira.weiny
  2020-01-13 22:19   ` Darrick J. Wong
  2020-01-16  9:24   ` Jan Kara
  2020-01-10 19:29 ` [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed ira.weiny
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

XFS requires regular files to be locked while changing to/from DAX mode.

Define a new DAX lock type and implement the [un]lock_mode() inode
operation callbacks.

We define a new XFS_DAX_* lock type to carry the lock through the
transaction because we don't want to use IOLOCK as that would cause
performance issues with locking of the inode itself.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_icache.c |  2 ++
 fs/xfs/xfs_inode.c  | 37 +++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_inode.h  | 12 ++++++++++--
 fs/xfs/xfs_iops.c   | 24 +++++++++++++++++++++++-
 4 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8dc2e5414276..0288672e8902 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -74,6 +74,8 @@ xfs_inode_alloc(
 	INIT_LIST_HEAD(&ip->i_ioend_list);
 	spin_lock_init(&ip->i_ioend_lock);
 
+	percpu_init_rwsem(&ip->i_dax_sem);
+
 	return ip;
 }
 
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 401da197f012..e8fd95b75e5b 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
  *
  * Basic locking order:
  *
- * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
+ * i_rwsem -> i_dax_sem -> i_mmap_lock -> page_lock -> i_ilock
  *
  * mmap_sem locking order:
  *
  * i_rwsem -> page lock -> mmap_sem
- * mmap_sem -> i_mmap_lock -> page_lock
+ * mmap_sem -> i_dax_sem -> i_mmap_lock -> page_lock
  *
  * The difference in mmap_sem locking order mean that we cannot hold the
  * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
@@ -181,6 +181,13 @@ xfs_ilock(
 	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
+	ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) !=
+	       (XFS_DAX_SHARED | XFS_DAX_EXCL));
+
+	if (lock_flags & XFS_DAX_EXCL)
+		percpu_down_write(&ip->i_dax_sem);
+	else if (lock_flags & XFS_DAX_SHARED)
+		percpu_down_read(&ip->i_dax_sem);
 
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		down_write_nested(&VFS_I(ip)->i_rwsem,
@@ -224,6 +231,8 @@ xfs_ilock_nowait(
 	 * You can't set both SHARED and EXCL for the same lock,
 	 * and only XFS_IOLOCK_SHARED, XFS_IOLOCK_EXCL, XFS_ILOCK_SHARED,
 	 * and XFS_ILOCK_EXCL are valid values to set in lock_flags.
+	 *
+	 * XFS_DAX_* is not allowed
 	 */
 	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
 	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
@@ -232,6 +241,7 @@ xfs_ilock_nowait(
 	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
+	ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
 
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		if (!down_write_trylock(&VFS_I(ip)->i_rwsem))
@@ -302,6 +312,8 @@ xfs_iunlock(
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
 	ASSERT(lock_flags != 0);
+	ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) !=
+	       (XFS_DAX_SHARED | XFS_DAX_EXCL));
 
 	if (lock_flags & XFS_IOLOCK_EXCL)
 		up_write(&VFS_I(ip)->i_rwsem);
@@ -318,6 +330,11 @@ xfs_iunlock(
 	else if (lock_flags & XFS_ILOCK_SHARED)
 		mrunlock_shared(&ip->i_lock);
 
+	if (lock_flags & XFS_DAX_EXCL)
+		percpu_up_write(&ip->i_dax_sem);
+	else if (lock_flags & XFS_DAX_SHARED)
+		percpu_up_read(&ip->i_dax_sem);
+
 	trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
 }
 
@@ -333,6 +350,8 @@ xfs_ilock_demote(
 	ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL));
 	ASSERT((lock_flags &
 		~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
+	/* XFS_DAX_* is not allowed */
+	ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
 
 	if (lock_flags & XFS_ILOCK_EXCL)
 		mrdemote(&ip->i_lock);
@@ -369,6 +388,13 @@ xfs_isilocked(
 		return rwsem_is_locked(&VFS_I(ip)->i_rwsem);
 	}
 
+	if (lock_flags & (XFS_DAX_EXCL|XFS_DAX_SHARED)) {
+		if (!(lock_flags & XFS_DAX_SHARED))
+			return !debug_locks ||
+				percpu_rwsem_is_held(&ip->i_dax_sem, 0);
+		return rwsem_is_locked(&ip->i_dax_sem);
+	}
+
 	ASSERT(0);
 	return 0;
 }
@@ -465,6 +491,9 @@ xfs_lock_inodes(
 	ASSERT(!(lock_mode & XFS_ILOCK_EXCL) ||
 		inodes <= XFS_ILOCK_MAX_SUBCLASS + 1);
 
+	/* XFS_DAX_* is not allowed */
+	ASSERT((lock_mode & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
+
 	if (lock_mode & XFS_IOLOCK_EXCL) {
 		ASSERT(!(lock_mode & (XFS_MMAPLOCK_EXCL | XFS_ILOCK_EXCL)));
 	} else if (lock_mode & XFS_MMAPLOCK_EXCL)
@@ -566,6 +595,10 @@ xfs_lock_two_inodes(
 	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
 	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
 
+	/* XFS_DAX_* is not allowed */
+	ASSERT((ip0_mode & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
+	ASSERT((ip1_mode & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
+
 	ASSERT(ip0->i_ino != ip1->i_ino);
 
 	if (ip0->i_ino > ip1->i_ino) {
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 492e53992fa9..693ca66bd89b 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -67,6 +67,9 @@ typedef struct xfs_inode {
 	spinlock_t		i_ioend_lock;
 	struct work_struct	i_ioend_work;
 	struct list_head	i_ioend_list;
+
+	/* protect changing the mode to/from DAX */
+	struct percpu_rw_semaphore i_dax_sem;
 } xfs_inode_t;
 
 /* Convert from vfs inode to xfs inode */
@@ -278,10 +281,13 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 #define	XFS_ILOCK_SHARED	(1<<3)
 #define	XFS_MMAPLOCK_EXCL	(1<<4)
 #define	XFS_MMAPLOCK_SHARED	(1<<5)
+#define	XFS_DAX_EXCL		(1<<6)
+#define	XFS_DAX_SHARED		(1<<7)
 
 #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
 				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
-				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
+				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
+				| XFS_DAX_EXCL | XFS_DAX_SHARED)
 
 #define XFS_LOCK_FLAGS \
 	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
@@ -289,7 +295,9 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
 	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
 	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
-	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
+	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
+	{ XFS_DAX_EXCL,   	"DAX_EXCL" }, \
+	{ XFS_DAX_SHARED,	"DAX_SHARED" }
 
 
 /*
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index d6843cdb51d0..a2f2604c3187 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1158,6 +1158,16 @@ xfs_vn_tmpfile(
 	return xfs_generic_create(dir, dentry, mode, 0, true);
 }
 
+static void xfs_lock_mode(struct inode *inode)
+{
+	xfs_ilock(XFS_I(inode), XFS_DAX_SHARED);
+}
+
+static void xfs_unlock_mode(struct inode *inode)
+{
+	xfs_iunlock(XFS_I(inode), XFS_DAX_SHARED);
+}
+
 static const struct inode_operations xfs_inode_operations = {
 	.get_acl		= xfs_get_acl,
 	.set_acl		= xfs_set_acl,
@@ -1168,6 +1178,18 @@ static const struct inode_operations xfs_inode_operations = {
 	.update_time		= xfs_vn_update_time,
 };
 
+static const struct inode_operations xfs_reg_inode_operations = {
+	.get_acl		= xfs_get_acl,
+	.set_acl		= xfs_set_acl,
+	.getattr		= xfs_vn_getattr,
+	.setattr		= xfs_vn_setattr,
+	.listxattr		= xfs_vn_listxattr,
+	.fiemap			= xfs_vn_fiemap,
+	.update_time		= xfs_vn_update_time,
+	.lock_mode              = xfs_lock_mode,
+	.unlock_mode            = xfs_unlock_mode,
+};
+
 static const struct inode_operations xfs_dir_inode_operations = {
 	.create			= xfs_vn_create,
 	.lookup			= xfs_vn_lookup,
@@ -1372,7 +1394,7 @@ xfs_setup_iops(
 
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFREG:
-		inode->i_op = &xfs_inode_operations;
+		inode->i_op = &xfs_reg_inode_operations;
 		inode->i_fop = &xfs_file_operations;
 		if (IS_DAX(inode))
 			inode->i_mapping->a_ops = &xfs_dax_aops;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed
  2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
                   ` (7 preceding siblings ...)
  2020-01-10 19:29 ` [RFC PATCH V2 08/12] fs/xfs: Add lock/unlock mode to xfs ira.weiny
@ 2020-01-10 19:29 ` ira.weiny
  2020-01-13 22:22   ` Darrick J. Wong
  2020-01-15 10:21   ` David Laight
  2020-01-10 19:29 ` [RFC PATCH V2 10/12] fs/xfs: Fix truncate up ira.weiny
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Page faults need to ensure the inode mode is correct and consistent with
the vmf information at the time of the fault.  There is no easy way to
ensure the vmf information is correct if a mode change is in progress.
Furthermore, there is no good use case to require a mode change while
the file is mmap'ed.

Track mmap's of the file and fail the mode change if the file is
mmap'ed.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/inode.c         |  2 ++
 fs/xfs/xfs_ioctl.c |  8 ++++++++
 include/linux/fs.h |  1 +
 mm/mmap.c          | 19 +++++++++++++++++--
 4 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 2b0f51161918..944711aed6f8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -245,6 +245,8 @@ static struct inode *alloc_inode(struct super_block *sb)
 		return NULL;
 	}
 
+	atomic64_set(&inode->i_mapped, 0);
+
 	return inode;
 }
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index bc3654fe3b5d..1ab0906c6c7f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1200,6 +1200,14 @@ xfs_ioctl_setattr_dax_invalidate(
 		goto out_unlock;
 	}
 
+	/*
+	 * If there is a mapping in place we must remain in our current mode.
+	 */
+	if (atomic64_read(&inode->i_mapped)) {
+		error = -EBUSY;
+		goto out_unlock;
+	}
+
 	error = filemap_write_and_wait(inode->i_mapping);
 	if (error)
 		goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 631f11d6246e..6e7dc626b657 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -740,6 +740,7 @@ struct inode {
 #endif
 
 	void			*i_private; /* fs or device private pointer */
+	atomic64_t               i_mapped;
 } __randomize_layout;
 
 struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
diff --git a/mm/mmap.c b/mm/mmap.c
index dfaf1130e706..e6b68924b7ca 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -171,12 +171,17 @@ void unlink_file_vma(struct vm_area_struct *vma)
 static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *next = vma->vm_next;
+	struct file *f = vma->vm_file;
 
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
+	if (f) {
+		struct inode *inode = file_inode(f);
+		if (inode)
+			atomic64_dec(&inode->i_mapped);
+		fput(f);
+	}
 	mpol_put(vma_policy(vma));
 	vm_area_free(vma);
 	return next;
@@ -1837,6 +1842,16 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vma_set_page_prot(vma);
 
+	/*
+	 * Track if there is mapping in place such that a mode change
+	 * does not occur on a file which is mapped
+	 */
+	if (file) {
+		struct inode		*inode = file_inode(file);
+
+		atomic64_inc(&inode->i_mapped);
+	}
+
 	return addr;
 
 unmap_and_free_vma:
-- 
2.21.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH V2 10/12] fs/xfs: Fix truncate up
  2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
                   ` (8 preceding siblings ...)
  2020-01-10 19:29 ` [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed ira.weiny
@ 2020-01-10 19:29 ` ira.weiny
  2020-01-13 22:27   ` Darrick J. Wong
  2020-01-10 19:29 ` [RFC PATCH V2 11/12] fs/xfs: Clean up locking in dax invalidate ira.weiny
  2020-01-10 19:29 ` [RFC PATCH V2 12/12] fs/xfs: Allow toggle of effective DAX flag ira.weiny
  11 siblings, 1 reply; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

When zeroing the end of a file we must account for bytes contained in
the final page which are past EOF.

Extend the range passed to iomap_zero_range() to reach LLONG_MAX which
will include all bytes of the final page.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_iops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a2f2604c3187..a34b04e8ac9c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -910,7 +910,7 @@ xfs_setattr_size(
 	 */
 	if (newsize > oldsize) {
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
-		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
+		error = iomap_zero_range(inode, oldsize, LLONG_MAX - oldsize,
 				&did_zeroing, &xfs_buffered_write_iomap_ops);
 	} else {
 		error = iomap_truncate_page(inode, newsize, &did_zeroing,
-- 
2.21.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH V2 11/12] fs/xfs: Clean up locking in dax invalidate
  2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
                   ` (9 preceding siblings ...)
  2020-01-10 19:29 ` [RFC PATCH V2 10/12] fs/xfs: Fix truncate up ira.weiny
@ 2020-01-10 19:29 ` ira.weiny
  2020-01-10 19:29 ` [RFC PATCH V2 12/12] fs/xfs: Allow toggle of effective DAX flag ira.weiny
  11 siblings, 0 replies; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Define a variable to hold the lock flags to ensure that the correct
locks are returned or released on error.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 1ab0906c6c7f..9a35bf83eaa1 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1176,7 +1176,7 @@ xfs_ioctl_setattr_dax_invalidate(
 	int			*join_flags)
 {
 	struct inode		*inode = VFS_I(ip);
-	int			error;
+	int			error, flags;
 
 	*join_flags = 0;
 
@@ -1191,8 +1191,10 @@ xfs_ioctl_setattr_dax_invalidate(
 	if (S_ISDIR(inode->i_mode))
 		return 0;
 
+	flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+
 	/* lock, flush and invalidate mapping in preparation for flag change */
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+	xfs_ilock(ip, flags);
 
 	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
 	    !xfs_inode_supports_dax(ip)) {
@@ -1215,11 +1217,11 @@ xfs_ioctl_setattr_dax_invalidate(
 	if (error)
 		goto out_unlock;
 
-	*join_flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+	*join_flags = flags;
 	return 0;
 
 out_unlock:
-	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+	xfs_iunlock(ip, flags);
 	return error;
 
 }
-- 
2.21.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [RFC PATCH V2 12/12] fs/xfs: Allow toggle of effective DAX flag
  2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
                   ` (10 preceding siblings ...)
  2020-01-10 19:29 ` [RFC PATCH V2 11/12] fs/xfs: Clean up locking in dax invalidate ira.weiny
@ 2020-01-10 19:29 ` ira.weiny
  11 siblings, 0 replies; 55+ messages in thread
From: ira.weiny @ 2020-01-10 19:29 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Now that locking of the inode is in place we can allow a mode change
while under the new lock.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_inode.h |  1 +
 fs/xfs/xfs_ioctl.c |  9 ++++++---
 fs/xfs/xfs_iops.c  | 15 +++++++++++----
 3 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 693ca66bd89b..b0d2e7da88c6 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -474,6 +474,7 @@ int	xfs_break_layouts(struct inode *inode, uint *iolock,
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
 extern void xfs_setup_iops(struct xfs_inode *ip);
+extern void xfs_setup_a_ops(struct xfs_inode *ip);
 
 /*
  * When setting up a newly allocated inode, we need to call
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 9a35bf83eaa1..e07559bf70a9 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1109,12 +1109,11 @@ xfs_diflags_to_linux(
 		inode->i_flags |= S_NOATIME;
 	else
 		inode->i_flags &= ~S_NOATIME;
-#if 0	/* disabled until the flag switching races are sorted out */
 	if (xflags & FS_XFLAG_DAX)
 		inode->i_flags |= S_DAX;
 	else
 		inode->i_flags &= ~S_DAX;
-#endif
+
 }
 
 static int
@@ -1191,7 +1190,7 @@ xfs_ioctl_setattr_dax_invalidate(
 	if (S_ISDIR(inode->i_mode))
 		return 0;
 
-	flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+	flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL | XFS_DAX_EXCL;
 
 	/* lock, flush and invalidate mapping in preparation for flag change */
 	xfs_ilock(ip, flags);
@@ -1512,6 +1511,8 @@ xfs_ioctl_setattr(
 	else
 		ip->i_d.di_cowextsize = 0;
 
+	xfs_setup_a_ops(ip);
+
 	code = xfs_trans_commit(tp);
 
 	/*
@@ -1621,6 +1622,8 @@ xfs_ioc_setxflags(
 		goto out_drop_write;
 	}
 
+	xfs_setup_a_ops(ip);
+
 	error = xfs_trans_commit(tp);
 out_drop_write:
 	mnt_drop_write_file(filp);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a34b04e8ac9c..c70164a0df97 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1386,6 +1386,16 @@ xfs_setup_inode(
 	}
 }
 
+void xfs_setup_a_ops(struct xfs_inode *ip)
+{
+	struct inode		*inode = &ip->i_vnode;
+
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &xfs_dax_aops;
+	else
+		inode->i_mapping->a_ops = &xfs_address_space_operations;
+}
+
 void
 xfs_setup_iops(
 	struct xfs_inode	*ip)
@@ -1396,10 +1406,7 @@ xfs_setup_iops(
 	case S_IFREG:
 		inode->i_op = &xfs_reg_inode_operations;
 		inode->i_fop = &xfs_file_operations;
-		if (IS_DAX(inode))
-			inode->i_mapping->a_ops = &xfs_dax_aops;
-		else
-			inode->i_mapping->a_ops = &xfs_address_space_operations;
+		xfs_setup_a_ops(ip);
 		break;
 	case S_IFDIR:
 		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
-- 
2.21.0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 07/12] fs: Add locking for a dynamic inode 'mode'
  2020-01-10 19:29 ` [RFC PATCH V2 07/12] fs: Add locking for a dynamic inode 'mode' ira.weiny
@ 2020-01-13 22:12   ` Darrick J. Wong
  2020-01-14  0:20     ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-13 22:12 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Fri, Jan 10, 2020 at 11:29:37AM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> DAX requires special address space operations but many other functions
> check the IS_DAX() mode.
> 
> DAX is a property of the inode thus we define an inode mode lock as an
> inode operation which file systems can optionally define.
> 
> This patch defines the core function callbacks as well as puts the
> locking calls in place.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  Documentation/filesystems/vfs.rst | 30 ++++++++++++++++
>  fs/ioctl.c                        | 23 +++++++++----
>  fs/open.c                         |  4 +++
>  include/linux/fs.h                | 57 +++++++++++++++++++++++++++++--
>  mm/fadvise.c                      | 10 ++++--
>  mm/khugepaged.c                   |  2 ++
>  mm/mmap.c                         |  7 ++++
>  7 files changed, 123 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 7d4d09dd5e6d..b945aa95f15a 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -59,6 +59,28 @@ like open(2) the file, or stat(2) it to peek at the inode data.  The
>  stat(2) operation is fairly simple: once the VFS has the dentry, it
>  peeks at the inode data and passes some of it back to userspace.
>  
> +Changing inode 'modes' dynamically
> +----------------------------------
> +
> +Some file systems may have different modes for their inodes which require
> +dyanic changing.  A specific example of this is DAX enabled files in XFS and
> +ext4.  To switch the mode safely we lock the inode mode in all "normal" file
> +system operations and restrict mode changes to those operations.  The specific
> +rules are.
> +
> +To do this a file system must follow the following rules.
> +
> +        1) the direct_IO address_space_operation must be supported in all
> +           potential a_ops vectors for any mode suported by the inode.
> +	2) Filesystems must define the lock_mode() and unlock_mode() operations
> +           in struct inode_operations.  These functions are used by the core
> +           vfs layers to ensure that the mode is stable before allowing the
> +           core operations to proceed.
> +        3) Mode changes shall not be allowed while the file is mmap'ed
> +        4) While changing modes filesystems should take exclusive locks which
> +           prevent the core vfs layer from proceeding.
> +
> +
>  
>  The File Object
>  ---------------
> @@ -437,6 +459,8 @@ As of kernel 2.6.22, the following members are defined:
>  		int (*atomic_open)(struct inode *, struct dentry *, struct file *,
>  				   unsigned open_flag, umode_t create_mode);
>  		int (*tmpfile) (struct inode *, struct dentry *, umode_t);
> +		void (*lock_mode)(struct inode *);
> +		void (*unlock_mode)(struct inode *);

Yikes.  "mode" has a specific meaning for inodes, and this lock isn't
related to i_mode.  This lock protects aops from changing while an
address space operation is in use.

>  	};
>  
>  Again, all methods are called without any locks being held, unless
> @@ -584,6 +608,12 @@ otherwise noted.
>  	atomically creating, opening and unlinking a file in given
>  	directory.
>  
> +``lock_mode``
> +	called to prevent operations which depend on the inode's mode from
> +        proceeding should a mode change be in progress

"Inodes can't change mode, because files do not suddenly become
directories". ;)

Oh, you meant "lock_XXXX is called to prevent a change in the pagecache
mode from proceeding while there are address space operations in
progress".  So these are really more aops get and put functions...

> +``unlock_mode``
> +	called when critical mode dependent operation is complete
>  
>  The Address Space Object
>  ========================
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 7c9a5df5a597..ed6ab5303a24 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -55,18 +55,29 @@ EXPORT_SYMBOL(vfs_ioctl);
>  static int ioctl_fibmap(struct file *filp, int __user *p)
>  {
>  	struct address_space *mapping = filp->f_mapping;
> +	struct inode *inode = filp->f_inode;
>  	int res, block;
>  
> +	lock_inode_mode(inode);
> +
>  	/* do we support this mess? */
> -	if (!mapping->a_ops->bmap)
> -		return -EINVAL;
> -	if (!capable(CAP_SYS_RAWIO))
> -		return -EPERM;
> +	if (!mapping->a_ops->bmap) {
> +		res = -EINVAL;
> +		goto out;
> +	}
> +	if (!capable(CAP_SYS_RAWIO)) {
> +		res = -EPERM;
> +		goto out;

Why does the order of these checks change here?

> +	}
>  	res = get_user(block, p);
>  	if (res)
> -		return res;
> +		goto out;
>  	res = mapping->a_ops->bmap(mapping, block);
> -	return put_user(res, p);
> +	res = put_user(res, p);
> +
> +out:
> +	unlock_inode_mode(inode);
> +	return res;
>  }
>  
>  /**
> diff --git a/fs/open.c b/fs/open.c
> index b0be77ea8f1b..c62428bbc525 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
>  	if (ret)
>  		newattrs.ia_valid |= ret | ATTR_FORCE;
>  
> +	lock_inode_mode(dentry->d_inode);
>  	inode_lock(dentry->d_inode);
>  	/* Note any delegations or leases have already been broken: */
>  	ret = notify_change(dentry, &newattrs, NULL);
>  	inode_unlock(dentry->d_inode);
> +	unlock_inode_mode(dentry->d_inode);
>  	return ret;
>  }
>  
> @@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -EOPNOTSUPP;
>  
>  	file_start_write(file);
> +	lock_inode_mode(inode);
>  	ret = file->f_op->fallocate(file, mode, offset, len);
> +	unlock_inode_mode(inode);
>  
>  	/*
>  	 * Create inotify and fanotify events.
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e11989502eac..631f11d6246e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -359,6 +359,11 @@ typedef struct {
>  typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
>  		unsigned long, unsigned long);
>  
> +/**
> + * NOTE: DO NOT define new functions in address_space_operations without first
> + * considering how dynamic inode modes can be supported.  See the comment in
> + * struct inode_operations for the lock_mode() and unlock_mode() callbacks.
> + */
>  struct address_space_operations {
>  	int (*writepage)(struct page *page, struct writeback_control *wbc);
>  	int (*readpage)(struct file *, struct page *);
> @@ -1817,6 +1822,11 @@ struct block_device_operations;
>  
>  struct iov_iter;
>  
> +/**
> + * NOTE: DO NOT define new functions in file_operations without first
> + * considering how dynamic inode modes can be supported.  See the comment in
> + * struct inode_operations for the lock_mode() and unlock_mode() callbacks.
> + */
>  struct file_operations {
>  	struct module *owner;
>  	loff_t (*llseek) (struct file *, loff_t, int);
> @@ -1859,6 +1869,20 @@ struct file_operations {
>  	int (*fadvise)(struct file *, loff_t, loff_t, int);
>  } __randomize_layout;
>  
> +/*
> + * Filesystems wishing to support dynamic inode modes must do the following.
> + *
> + * 1) the direct_IO address_space_operation must be supported in all
> + *    potential a_ops vectors for any mode suported by the inode.
> + * 2) Filesystems must define the lock_mode() and unlock_mode() operations
> + *    in struct inode_operations.  These functions are used by the core
> + *    vfs layers to ensure that the mode is stable before allowing the
> + *    core operations to proceed.
> + * 3) Mode changes shall not be allowed while the file is mmap'ed
> + * 4) While changing modes filesystems should take exclusive locks which
> + *    prevent the core vfs layer from proceeding.
> + *
> + */
>  struct inode_operations {
>  	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
>  	const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
> @@ -1887,18 +1911,47 @@ struct inode_operations {
>  			   umode_t create_mode);
>  	int (*tmpfile) (struct inode *, struct dentry *, umode_t);
>  	int (*set_acl)(struct inode *, struct posix_acl *, int);
> +	void (*lock_mode)(struct inode*);
> +	void (*unlock_mode)(struct inode*);
>  } ____cacheline_aligned;
>  
> +static inline void lock_inode_mode(struct inode *inode)

inode_aops_get()?

> +{
> +	WARN_ON_ONCE(inode->i_op->lock_mode &&
> +		     !inode->i_op->unlock_mode);
> +	if (inode->i_op->lock_mode)
> +		inode->i_op->lock_mode(inode);
> +}
> +static inline void unlock_inode_mode(struct inode *inode)
> +{
> +	WARN_ON_ONCE(inode->i_op->unlock_mode &&
> +		     !inode->i_op->lock_mode);
> +	if (inode->i_op->unlock_mode)
> +		inode->i_op->unlock_mode(inode);
> +}
> +
>  static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
>  				     struct iov_iter *iter)

inode_aops_put()?

--D

>  {
> -	return file->f_op->read_iter(kio, iter);
> +	struct inode		*inode = file_inode(kio->ki_filp);
> +	ssize_t ret;
> +
> +	lock_inode_mode(inode);
> +	ret = file->f_op->read_iter(kio, iter);
> +	unlock_inode_mode(inode);
> +	return ret;
>  }
>  
>  static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
>  				      struct iov_iter *iter)
>  {
> -	return file->f_op->write_iter(kio, iter);
> +	struct inode		*inode = file_inode(kio->ki_filp);
> +	ssize_t ret;
> +
> +	lock_inode_mode(inode);
> +	ret = file->f_op->write_iter(kio, iter);
> +	unlock_inode_mode(inode);
> +	return ret;
>  }
>  
>  static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index 4f17c83db575..a4095a5deac8 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -47,7 +47,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>  
>  	bdi = inode_to_bdi(mapping->host);
>  
> +	lock_inode_mode(inode);
>  	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
> +		int ret = 0;
> +
>  		switch (advice) {
>  		case POSIX_FADV_NORMAL:
>  		case POSIX_FADV_RANDOM:
> @@ -58,10 +61,13 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>  			/* no bad return value, but ignore advice */
>  			break;
>  		default:
> -			return -EINVAL;
> +			ret = -EINVAL;
>  		}
> -		return 0;
> +
> +		unlock_inode_mode(inode);
> +		return ret;
>  	}
> +	unlock_inode_mode(inode);
>  
>  	/*
>  	 * Careful about overflows. Len == 0 means "as much as possible".  Use
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b679908743cb..ff49da065db0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
>  		} else {	/* !is_shmem */
>  			if (!page || xa_is_value(page)) {
>  				xas_unlock_irq(&xas);
> +				lock_inode_mode(file->f_inode);
>  				page_cache_sync_readahead(mapping, &file->f_ra,
>  							  file, index,
>  							  PAGE_SIZE);
> +				unlock_inode_mode(file->f_inode);
>  				/* drain pagevecs to help isolate_lru_page() */
>  				lru_add_drain();
>  				page = find_lock_page(mapping, index);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 70f67c4515aa..dfaf1130e706 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1542,11 +1542,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  			vm_flags |= VM_NORESERVE;
>  	}
>  
> +	if (file)
> +		lock_inode_mode(file_inode(file));
> +
>  	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
>  	if (!IS_ERR_VALUE(addr) &&
>  	    ((vm_flags & VM_LOCKED) ||
>  	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
>  		*populate = len;
> +
> +	if (file)
> +		unlock_inode_mode(file_inode(file));
> +
>  	return addr;
>  }
>  
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 08/12] fs/xfs: Add lock/unlock mode to xfs
  2020-01-10 19:29 ` [RFC PATCH V2 08/12] fs/xfs: Add lock/unlock mode to xfs ira.weiny
@ 2020-01-13 22:19   ` Darrick J. Wong
  2020-01-14  0:35     ` Ira Weiny
  2020-01-15 23:52     ` Ira Weiny
  2020-01-16  9:24   ` Jan Kara
  1 sibling, 2 replies; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-13 22:19 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Fri, Jan 10, 2020 at 11:29:38AM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> XFS requires regular files to be locked while changing to/from DAX mode.
> 
> Define a new DAX lock type and implement the [un]lock_mode() inode
> operation callbacks.
> 
> We define a new XFS_DAX_* lock type to carry the lock through the
> transaction because we don't want to use IOLOCK as that would cause
> performance issues with locking of the inode itself.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  fs/xfs/xfs_icache.c |  2 ++
>  fs/xfs/xfs_inode.c  | 37 +++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_inode.h  | 12 ++++++++++--
>  fs/xfs/xfs_iops.c   | 24 +++++++++++++++++++++++-
>  4 files changed, 70 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 8dc2e5414276..0288672e8902 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -74,6 +74,8 @@ xfs_inode_alloc(
>  	INIT_LIST_HEAD(&ip->i_ioend_list);
>  	spin_lock_init(&ip->i_ioend_lock);
>  
> +	percpu_init_rwsem(&ip->i_dax_sem);
> +
>  	return ip;
>  }
>  
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 401da197f012..e8fd95b75e5b 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
>   *
>   * Basic locking order:
>   *
> - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> + * i_rwsem -> i_dax_sem -> i_mmap_lock -> page_lock -> i_ilock

Mmmmmm, more locks.  Can we skip the extra lock if CONFIG_FSDAX=n or if
the filesystem devices don't support DAX at all?

Also, I don't think we're actually following the i_rwsem -> i_daxsem
order in fallocate, and possibly elsewhere too?

Does the vfs have to take the i_dax_sem to do remapping things like
reflink?  (Pretend that reflink and dax are compatible for the moment)

>   * mmap_sem locking order:
>   *
>   * i_rwsem -> page lock -> mmap_sem
> - * mmap_sem -> i_mmap_lock -> page_lock
> + * mmap_sem -> i_dax_sem -> i_mmap_lock -> page_lock
>   *
>   * The difference in mmap_sem locking order mean that we cannot hold the
>   * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> @@ -181,6 +181,13 @@ xfs_ilock(
>  	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
>  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
>  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> +	ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) !=
> +	       (XFS_DAX_SHARED | XFS_DAX_EXCL));
> +
> +	if (lock_flags & XFS_DAX_EXCL)
> +		percpu_down_write(&ip->i_dax_sem);
> +	else if (lock_flags & XFS_DAX_SHARED)
> +		percpu_down_read(&ip->i_dax_sem);
>  
>  	if (lock_flags & XFS_IOLOCK_EXCL) {
>  		down_write_nested(&VFS_I(ip)->i_rwsem,
> @@ -224,6 +231,8 @@ xfs_ilock_nowait(
>  	 * You can't set both SHARED and EXCL for the same lock,
>  	 * and only XFS_IOLOCK_SHARED, XFS_IOLOCK_EXCL, XFS_ILOCK_SHARED,
>  	 * and XFS_ILOCK_EXCL are valid values to set in lock_flags.
> +	 *
> +	 * XFS_DAX_* is not allowed
>  	 */
>  	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
>  	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
> @@ -232,6 +241,7 @@ xfs_ilock_nowait(
>  	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
>  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
>  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> +	ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
>  
>  	if (lock_flags & XFS_IOLOCK_EXCL) {
>  		if (!down_write_trylock(&VFS_I(ip)->i_rwsem))
> @@ -302,6 +312,8 @@ xfs_iunlock(
>  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
>  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
>  	ASSERT(lock_flags != 0);
> +	ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) !=
> +	       (XFS_DAX_SHARED | XFS_DAX_EXCL));
>  
>  	if (lock_flags & XFS_IOLOCK_EXCL)
>  		up_write(&VFS_I(ip)->i_rwsem);
> @@ -318,6 +330,11 @@ xfs_iunlock(
>  	else if (lock_flags & XFS_ILOCK_SHARED)
>  		mrunlock_shared(&ip->i_lock);
>  
> +	if (lock_flags & XFS_DAX_EXCL)
> +		percpu_up_write(&ip->i_dax_sem);
> +	else if (lock_flags & XFS_DAX_SHARED)
> +		percpu_up_read(&ip->i_dax_sem);
> +
>  	trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
>  }
>  
> @@ -333,6 +350,8 @@ xfs_ilock_demote(
>  	ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL));
>  	ASSERT((lock_flags &
>  		~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
> +	/* XFS_DAX_* is not allowed */
> +	ASSERT((lock_flags & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
>  
>  	if (lock_flags & XFS_ILOCK_EXCL)
>  		mrdemote(&ip->i_lock);
> @@ -369,6 +388,13 @@ xfs_isilocked(
>  		return rwsem_is_locked(&VFS_I(ip)->i_rwsem);
>  	}
>  
> +	if (lock_flags & (XFS_DAX_EXCL|XFS_DAX_SHARED)) {
> +		if (!(lock_flags & XFS_DAX_SHARED))
> +			return !debug_locks ||
> +				percpu_rwsem_is_held(&ip->i_dax_sem, 0);
> +		return rwsem_is_locked(&ip->i_dax_sem);
> +	}
> +
>  	ASSERT(0);
>  	return 0;
>  }
> @@ -465,6 +491,9 @@ xfs_lock_inodes(
>  	ASSERT(!(lock_mode & XFS_ILOCK_EXCL) ||
>  		inodes <= XFS_ILOCK_MAX_SUBCLASS + 1);
>  
> +	/* XFS_DAX_* is not allowed */
> +	ASSERT((lock_mode & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
> +
>  	if (lock_mode & XFS_IOLOCK_EXCL) {
>  		ASSERT(!(lock_mode & (XFS_MMAPLOCK_EXCL | XFS_ILOCK_EXCL)));
>  	} else if (lock_mode & XFS_MMAPLOCK_EXCL)
> @@ -566,6 +595,10 @@ xfs_lock_two_inodes(
>  	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
>  	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
>  
> +	/* XFS_DAX_* is not allowed */
> +	ASSERT((ip0_mode & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
> +	ASSERT((ip1_mode & (XFS_DAX_SHARED | XFS_DAX_EXCL)) == 0);
> +
>  	ASSERT(ip0->i_ino != ip1->i_ino);
>  
>  	if (ip0->i_ino > ip1->i_ino) {
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 492e53992fa9..693ca66bd89b 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -67,6 +67,9 @@ typedef struct xfs_inode {
>  	spinlock_t		i_ioend_lock;
>  	struct work_struct	i_ioend_work;
>  	struct list_head	i_ioend_list;
> +
> +	/* protect changing the mode to/from DAX */
> +	struct percpu_rw_semaphore i_dax_sem;
>  } xfs_inode_t;
>  
>  /* Convert from vfs inode to xfs inode */
> @@ -278,10 +281,13 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
>  #define	XFS_ILOCK_SHARED	(1<<3)
>  #define	XFS_MMAPLOCK_EXCL	(1<<4)
>  #define	XFS_MMAPLOCK_SHARED	(1<<5)
> +#define	XFS_DAX_EXCL		(1<<6)
> +#define	XFS_DAX_SHARED		(1<<7)
>  
>  #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
>  				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
> -				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
> +				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
> +				| XFS_DAX_EXCL | XFS_DAX_SHARED)
>  
>  #define XFS_LOCK_FLAGS \
>  	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
> @@ -289,7 +295,9 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
>  	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
>  	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
>  	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
> -	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
> +	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
> +	{ XFS_DAX_EXCL,   	"DAX_EXCL" }, \

Whitespace between the comma & string.

> +	{ XFS_DAX_SHARED,	"DAX_SHARED" }
>  
>  
>  /*
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index d6843cdb51d0..a2f2604c3187 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1158,6 +1158,16 @@ xfs_vn_tmpfile(
>  	return xfs_generic_create(dir, dentry, mode, 0, true);
>  }
>  
> +static void xfs_lock_mode(struct inode *inode)
> +{
> +	xfs_ilock(XFS_I(inode), XFS_DAX_SHARED);
> +}
> +
> +static void xfs_unlock_mode(struct inode *inode)
> +{
> +	xfs_iunlock(XFS_I(inode), XFS_DAX_SHARED);
> +}
> +
>  static const struct inode_operations xfs_inode_operations = {
>  	.get_acl		= xfs_get_acl,
>  	.set_acl		= xfs_set_acl,
> @@ -1168,6 +1178,18 @@ static const struct inode_operations xfs_inode_operations = {
>  	.update_time		= xfs_vn_update_time,
>  };
>  
> +static const struct inode_operations xfs_reg_inode_operations = {
> +	.get_acl		= xfs_get_acl,
> +	.set_acl		= xfs_set_acl,
> +	.getattr		= xfs_vn_getattr,
> +	.setattr		= xfs_vn_setattr,
> +	.listxattr		= xfs_vn_listxattr,
> +	.fiemap			= xfs_vn_fiemap,
> +	.update_time		= xfs_vn_update_time,
> +	.lock_mode              = xfs_lock_mode,
> +	.unlock_mode            = xfs_unlock_mode,
> +};
> +
>  static const struct inode_operations xfs_dir_inode_operations = {
>  	.create			= xfs_vn_create,
>  	.lookup			= xfs_vn_lookup,
> @@ -1372,7 +1394,7 @@ xfs_setup_iops(
>  
>  	switch (inode->i_mode & S_IFMT) {
>  	case S_IFREG:
> -		inode->i_op = &xfs_inode_operations;
> +		inode->i_op = &xfs_reg_inode_operations;

xfs_file_inode_operations?

--D

>  		inode->i_fop = &xfs_file_operations;
>  		if (IS_DAX(inode))
>  			inode->i_mapping->a_ops = &xfs_dax_aops;
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed
  2020-01-10 19:29 ` [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed ira.weiny
@ 2020-01-13 22:22   ` Darrick J. Wong
  2020-01-14  0:46     ` Ira Weiny
  2020-01-15 10:21   ` David Laight
  1 sibling, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-13 22:22 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Fri, Jan 10, 2020 at 11:29:39AM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Page faults need to ensure the inode mode is correct and consistent with
> the vmf information at the time of the fault.  There is no easy way to
> ensure the vmf information is correct if a mode change is in progress.
> Furthermore, there is no good use case to require a mode change while
> the file is mmap'ed.
> 
> Track mmap's of the file and fail the mode change if the file is
> mmap'ed.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  fs/inode.c         |  2 ++
>  fs/xfs/xfs_ioctl.c |  8 ++++++++
>  include/linux/fs.h |  1 +
>  mm/mmap.c          | 19 +++++++++++++++++--
>  4 files changed, 28 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 2b0f51161918..944711aed6f8 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -245,6 +245,8 @@ static struct inode *alloc_inode(struct super_block *sb)
>  		return NULL;
>  	}
>  
> +	atomic64_set(&inode->i_mapped, 0);
> +
>  	return inode;
>  }
>  
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index bc3654fe3b5d..1ab0906c6c7f 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1200,6 +1200,14 @@ xfs_ioctl_setattr_dax_invalidate(
>  		goto out_unlock;
>  	}
>  
> +	/*
> +	 * If there is a mapping in place we must remain in our current mode.
> +	 */
> +	if (atomic64_read(&inode->i_mapped)) {

Urk, should we really be messing around with the address space
internals?

> +		error = -EBUSY;
> +		goto out_unlock;
> +	}
> +
>  	error = filemap_write_and_wait(inode->i_mapping);
>  	if (error)
>  		goto out_unlock;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 631f11d6246e..6e7dc626b657 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -740,6 +740,7 @@ struct inode {
>  #endif
>  
>  	void			*i_private; /* fs or device private pointer */
> +	atomic64_t               i_mapped;

I would have expected to find this in struct address_space since the
mapping count is a function of the address space, right?

--D

>  } __randomize_layout;
>  
>  struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index dfaf1130e706..e6b68924b7ca 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -171,12 +171,17 @@ void unlink_file_vma(struct vm_area_struct *vma)
>  static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>  {
>  	struct vm_area_struct *next = vma->vm_next;
> +	struct file *f = vma->vm_file;
>  
>  	might_sleep();
>  	if (vma->vm_ops && vma->vm_ops->close)
>  		vma->vm_ops->close(vma);
> -	if (vma->vm_file)
> -		fput(vma->vm_file);
> +	if (f) {
> +		struct inode *inode = file_inode(f);
> +		if (inode)
> +			atomic64_dec(&inode->i_mapped);
> +		fput(f);
> +	}
>  	mpol_put(vma_policy(vma));
>  	vm_area_free(vma);
>  	return next;
> @@ -1837,6 +1842,16 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  
>  	vma_set_page_prot(vma);
>  
> +	/*
> +	 * Track if there is mapping in place such that a mode change
> +	 * does not occur on a file which is mapped
> +	 */
> +	if (file) {
> +		struct inode		*inode = file_inode(file);
> +
> +		atomic64_inc(&inode->i_mapped);
> +	}
> +
>  	return addr;
>  
>  unmap_and_free_vma:
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 10/12] fs/xfs: Fix truncate up
  2020-01-10 19:29 ` [RFC PATCH V2 10/12] fs/xfs: Fix truncate up ira.weiny
@ 2020-01-13 22:27   ` Darrick J. Wong
  2020-01-14  0:40     ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-13 22:27 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Fri, Jan 10, 2020 at 11:29:40AM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> When zeroing the end of a file we must account for bytes contained in
> the final page which are past EOF.
> 
> Extend the range passed to iomap_zero_range() to reach LLONG_MAX which
> will include all bytes of the final page.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  fs/xfs/xfs_iops.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a2f2604c3187..a34b04e8ac9c 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -910,7 +910,7 @@ xfs_setattr_size(
>  	 */
>  	if (newsize > oldsize) {
>  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> +		error = iomap_zero_range(inode, oldsize, LLONG_MAX - oldsize,

Huh?  Won't this cause the file size to be set to LLONG_MAX?

--D

>  				&did_zeroing, &xfs_buffered_write_iomap_ops);
>  	} else {
>  		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 07/12] fs: Add locking for a dynamic inode 'mode'
  2020-01-13 22:12   ` Darrick J. Wong
@ 2020-01-14  0:20     ` Ira Weiny
  2020-01-14  1:03       ` Darrick J. Wong
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2020-01-14  0:20 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Jan 13, 2020 at 02:12:18PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 10, 2020 at 11:29:37AM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>

[snip]

> >  
> >  The File Object
> >  ---------------
> > @@ -437,6 +459,8 @@ As of kernel 2.6.22, the following members are defined:
> >  		int (*atomic_open)(struct inode *, struct dentry *, struct file *,
> >  				   unsigned open_flag, umode_t create_mode);
> >  		int (*tmpfile) (struct inode *, struct dentry *, umode_t);
> > +		void (*lock_mode)(struct inode *);
> > +		void (*unlock_mode)(struct inode *);
> 
> Yikes.  "mode" has a specific meaning for inodes, and this lock isn't
> related to i_mode.  This lock protects aops from changing while an
> address space operation is in use.

Ah...  yea ok mode is a bad name.

> 
> >  	};
> >  
> >  Again, all methods are called without any locks being held, unless
> > @@ -584,6 +608,12 @@ otherwise noted.
> >  	atomically creating, opening and unlinking a file in given
> >  	directory.
> >  
> > +``lock_mode``
> > +	called to prevent operations which depend on the inode's mode from
> > +        proceeding should a mode change be in progress
> 
> "Inodes can't change mode, because files do not suddenly become
> directories". ;)

Yea sorry.

> 
> Oh, you meant "lock_XXXX is called to prevent a change in the pagecache
> mode from proceeding while there are address space operations in
> progress".  So these are really more aops get and put functions...

At first I actually did have aops get/put functions but this is really
protecting more than the aops vector because as Christoph said there are file
operations which need to be protected not just address space operations.

But I agree "mode" is a bad name...  Sorry...

> 
> > +``unlock_mode``
> > +	called when critical mode dependent operation is complete
> >  
> >  The Address Space Object
> >  ========================
> > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > index 7c9a5df5a597..ed6ab5303a24 100644
> > --- a/fs/ioctl.c
> > +++ b/fs/ioctl.c
> > @@ -55,18 +55,29 @@ EXPORT_SYMBOL(vfs_ioctl);
> >  static int ioctl_fibmap(struct file *filp, int __user *p)
> >  {
> >  	struct address_space *mapping = filp->f_mapping;
> > +	struct inode *inode = filp->f_inode;
> >  	int res, block;
> >  
> > +	lock_inode_mode(inode);
> > +
> >  	/* do we support this mess? */
> > -	if (!mapping->a_ops->bmap)
> > -		return -EINVAL;
> > -	if (!capable(CAP_SYS_RAWIO))
> > -		return -EPERM;
> > +	if (!mapping->a_ops->bmap) {
> > +		res = -EINVAL;
> > +		goto out;
> > +	}
> > +	if (!capable(CAP_SYS_RAWIO)) {
> > +		res = -EPERM;
> > +		goto out;
> 
> Why does the order of these checks change here?

I don't understand?  The order does not change we just can't return without
releasing the lock.  And to protect against bmap changing the lock needs to be
taken first.

[snip]

> >  
> > +static inline void lock_inode_mode(struct inode *inode)
> 
> inode_aops_get()?

Let me think on this.  This is not just getting a reference to the aops vector.
It is more than that...  and inode_get is not right either!  ;-P

> 
> > +{
> > +	WARN_ON_ONCE(inode->i_op->lock_mode &&
> > +		     !inode->i_op->unlock_mode);
> > +	if (inode->i_op->lock_mode)
> > +		inode->i_op->lock_mode(inode);
> > +}
> > +static inline void unlock_inode_mode(struct inode *inode)
> > +{
> > +	WARN_ON_ONCE(inode->i_op->unlock_mode &&
> > +		     !inode->i_op->lock_mode);
> > +	if (inode->i_op->unlock_mode)
> > +		inode->i_op->unlock_mode(inode);
> > +}
> > +
> >  static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
> >  				     struct iov_iter *iter)
> 
> inode_aops_put()?

...  something like that but not 'aops'...

Ira

> 
> --D
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 08/12] fs/xfs: Add lock/unlock mode to xfs
  2020-01-13 22:19   ` Darrick J. Wong
@ 2020-01-14  0:35     ` Ira Weiny
  2020-01-15  0:57       ` Ira Weiny
  2020-01-15 23:52     ` Ira Weiny
  1 sibling, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2020-01-14  0:35 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Jan 13, 2020 at 02:19:57PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 10, 2020 at 11:29:38AM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>

[snip]

> >  
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 401da197f012..e8fd95b75e5b 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> >   *
> >   * Basic locking order:
> >   *
> > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > + * i_rwsem -> i_dax_sem -> i_mmap_lock -> page_lock -> i_ilock
> 
> Mmmmmm, more locks.  Can we skip the extra lock if CONFIG_FSDAX=n or if
> the filesystem devices don't support DAX at all?

I'll look into it.

> 
> Also, I don't think we're actually following the i_rwsem -> i_daxsem
> order in fallocate, and possibly elsewhere too?

I'll have to verify.  It took a lot of iterations to get the order working so
I'm not going to claim perfection.

> 
> Does the vfs have to take the i_dax_sem to do remapping things like
> reflink?  (Pretend that reflink and dax are compatible for the moment)

Honestly I can't say for sure.  For this series I was careful to exclude
reflink from the locking requirement.

[snip]

> >  
> >  #define XFS_LOCK_FLAGS \
> >  	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
> > @@ -289,7 +295,9 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
> >  	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
> >  	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
> >  	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
> > -	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
> > +	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
> > +	{ XFS_DAX_EXCL,   	"DAX_EXCL" }, \
> 
> Whitespace between the comma & string.

Fixed.

[snip]

> > +
> >  static const struct inode_operations xfs_dir_inode_operations = {
> >  	.create			= xfs_vn_create,
> >  	.lookup			= xfs_vn_lookup,
> > @@ -1372,7 +1394,7 @@ xfs_setup_iops(
> >  
> >  	switch (inode->i_mode & S_IFMT) {
> >  	case S_IFREG:
> > -		inode->i_op = &xfs_inode_operations;
> > +		inode->i_op = &xfs_reg_inode_operations;
> 
> xfs_file_inode_operations?

Sounds better.  Fixed.

Ira


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 10/12] fs/xfs: Fix truncate up
  2020-01-13 22:27   ` Darrick J. Wong
@ 2020-01-14  0:40     ` Ira Weiny
  2020-01-14  1:14       ` Darrick J. Wong
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2020-01-14  0:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Jan 13, 2020 at 02:27:55PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 10, 2020 at 11:29:40AM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > When zeroing the end of a file we must account for bytes contained in
> > the final page which are past EOF.
> > 
> > Extend the range passed to iomap_zero_range() to reach LLONG_MAX which
> > will include all bytes of the final page.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > ---
> >  fs/xfs/xfs_iops.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index a2f2604c3187..a34b04e8ac9c 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -910,7 +910,7 @@ xfs_setattr_size(
> >  	 */
> >  	if (newsize > oldsize) {
> >  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> > -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> > +		error = iomap_zero_range(inode, oldsize, LLONG_MAX - oldsize,
> 
> Huh?  Won't this cause the file size to be set to LLONG_MAX?

Not as I understand the code.  But as I said in the cover I am not 100% sure of
this fix.

From what I can tell xfs_ioctl_setattr_dax_invalidate() should invalidate the
mappings and the page cache and the traces I have indicate that the DAX mode
is not changing or was properly held off.

Any other suggestions as to the problem are welcome.

Ira


> 
> --D
> 
> >  				&did_zeroing, &xfs_buffered_write_iomap_ops);
> >  	} else {
> >  		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> > -- 
> > 2.21.0
> > 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed
  2020-01-13 22:22   ` Darrick J. Wong
@ 2020-01-14  0:46     ` Ira Weiny
  2020-01-14  1:30       ` Darrick J. Wong
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2020-01-14  0:46 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Jan 13, 2020 at 02:22:12PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 10, 2020 at 11:29:39AM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 

[snip]

> >  
> > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > index bc3654fe3b5d..1ab0906c6c7f 100644
> > --- a/fs/xfs/xfs_ioctl.c
> > +++ b/fs/xfs/xfs_ioctl.c
> > @@ -1200,6 +1200,14 @@ xfs_ioctl_setattr_dax_invalidate(
> >  		goto out_unlock;
> >  	}
> >  
> > +	/*
> > +	 * If there is a mapping in place we must remain in our current mode.
> > +	 */
> > +	if (atomic64_read(&inode->i_mapped)) {
> 
> Urk, should we really be messing around with the address space
> internals?

I contemplated a function call instead of checking i_mapped directly?  Is that
what you mean?


> 
> > +		error = -EBUSY;
> > +		goto out_unlock;
> > +	}
> > +
> >  	error = filemap_write_and_wait(inode->i_mapping);
> >  	if (error)
> >  		goto out_unlock;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 631f11d6246e..6e7dc626b657 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -740,6 +740,7 @@ struct inode {
> >  #endif
> >  
> >  	void			*i_private; /* fs or device private pointer */
> > +	atomic64_t               i_mapped;
> 
> I would have expected to find this in struct address_space since the
> mapping count is a function of the address space, right?

I suppose but the only external call (above) would be passing an inode.  So to
me it seemed better here.

Ira

> 
> --D
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 07/12] fs: Add locking for a dynamic inode 'mode'
  2020-01-14  0:20     ` Ira Weiny
@ 2020-01-14  1:03       ` Darrick J. Wong
  2020-01-15 19:08         ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-14  1:03 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Jan 13, 2020 at 04:20:05PM -0800, Ira Weiny wrote:
> On Mon, Jan 13, 2020 at 02:12:18PM -0800, Darrick J. Wong wrote:
> > On Fri, Jan 10, 2020 at 11:29:37AM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> 
> [snip]
> 
> > >  
> > >  The File Object
> > >  ---------------
> > > @@ -437,6 +459,8 @@ As of kernel 2.6.22, the following members are defined:
> > >  		int (*atomic_open)(struct inode *, struct dentry *, struct file *,
> > >  				   unsigned open_flag, umode_t create_mode);
> > >  		int (*tmpfile) (struct inode *, struct dentry *, umode_t);
> > > +		void (*lock_mode)(struct inode *);
> > > +		void (*unlock_mode)(struct inode *);
> > 
> > Yikes.  "mode" has a specific meaning for inodes, and this lock isn't
> > related to i_mode.  This lock protects aops from changing while an
> > address space operation is in use.
> 
> Ah...  yea ok mode is a bad name.
> 
> > 
> > >  	};
> > >  
> > >  Again, all methods are called without any locks being held, unless
> > > @@ -584,6 +608,12 @@ otherwise noted.
> > >  	atomically creating, opening and unlinking a file in given
> > >  	directory.
> > >  
> > > +``lock_mode``
> > > +	called to prevent operations which depend on the inode's mode from
> > > +        proceeding should a mode change be in progress
> > 
> > "Inodes can't change mode, because files do not suddenly become
> > directories". ;)
> 
> Yea sorry.
> 
> > 
> > Oh, you meant "lock_XXXX is called to prevent a change in the pagecache
> > mode from proceeding while there are address space operations in
> > progress".  So these are really more aops get and put functions...
> 
> At first I actually did have aops get/put functions but this is really
> protecting more than the aops vector because as Christoph said there are file
> operations which need to be protected not just address space operations.
> 
> But I agree "mode" is a bad name...  Sorry...

inode_fops_{get,set}(), then?

inode_start_fileop()
inode_end_fileop() ?

Trying to avoid sounding foppish <COUGH>

> > 
> > > +``unlock_mode``
> > > +	called when critical mode dependent operation is complete
> > >  
> > >  The Address Space Object
> > >  ========================
> > > diff --git a/fs/ioctl.c b/fs/ioctl.c
> > > index 7c9a5df5a597..ed6ab5303a24 100644
> > > --- a/fs/ioctl.c
> > > +++ b/fs/ioctl.c
> > > @@ -55,18 +55,29 @@ EXPORT_SYMBOL(vfs_ioctl);
> > >  static int ioctl_fibmap(struct file *filp, int __user *p)
> > >  {
> > >  	struct address_space *mapping = filp->f_mapping;
> > > +	struct inode *inode = filp->f_inode;
> > >  	int res, block;
> > >  
> > > +	lock_inode_mode(inode);
> > > +
> > >  	/* do we support this mess? */
> > > -	if (!mapping->a_ops->bmap)
> > > -		return -EINVAL;
> > > -	if (!capable(CAP_SYS_RAWIO))
> > > -		return -EPERM;
> > > +	if (!mapping->a_ops->bmap) {
> > > +		res = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +	if (!capable(CAP_SYS_RAWIO)) {
> > > +		res = -EPERM;
> > > +		goto out;
> > 
> > Why does the order of these checks change here?
> 
> I don't understand?  The order does not change we just can't return without
> releasing the lock.  And to protect against bmap changing the lock needs to be
> taken first.

Doh.  -ENOCOFFEE, I plead.

--D

> [snip]
> 
> > >  
> > > +static inline void lock_inode_mode(struct inode *inode)
> > 
> > inode_aops_get()?
> 
> Let me think on this.  This is not just getting a reference to the aops vector.
> It is more than that...  and inode_get is not right either!  ;-P
> 
> > 
> > > +{
> > > +	WARN_ON_ONCE(inode->i_op->lock_mode &&
> > > +		     !inode->i_op->unlock_mode);
> > > +	if (inode->i_op->lock_mode)
> > > +		inode->i_op->lock_mode(inode);
> > > +}
> > > +static inline void unlock_inode_mode(struct inode *inode)
> > > +{
> > > +	WARN_ON_ONCE(inode->i_op->unlock_mode &&
> > > +		     !inode->i_op->lock_mode);
> > > +	if (inode->i_op->unlock_mode)
> > > +		inode->i_op->unlock_mode(inode);
> > > +}
> > > +
> > >  static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
> > >  				     struct iov_iter *iter)
> > 
> > inode_aops_put()?
> 
> ...  something like that but not 'aops'...
> 
> Ira
> 
> > 
> > --D
> > 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 10/12] fs/xfs: Fix truncate up
  2020-01-14  0:40     ` Ira Weiny
@ 2020-01-14  1:14       ` Darrick J. Wong
  2020-01-14 19:00         ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-14  1:14 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Jan 13, 2020 at 04:40:47PM -0800, Ira Weiny wrote:
> On Mon, Jan 13, 2020 at 02:27:55PM -0800, Darrick J. Wong wrote:
> > On Fri, Jan 10, 2020 at 11:29:40AM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > When zeroing the end of a file we must account for bytes contained in
> > > the final page which are past EOF.
> > > 
> > > Extend the range passed to iomap_zero_range() to reach LLONG_MAX which
> > > will include all bytes of the final page.
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > ---
> > >  fs/xfs/xfs_iops.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index a2f2604c3187..a34b04e8ac9c 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -910,7 +910,7 @@ xfs_setattr_size(
> > >  	 */
> > >  	if (newsize > oldsize) {
> > >  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> > > -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> > > +		error = iomap_zero_range(inode, oldsize, LLONG_MAX - oldsize,
> > 
> > Huh?  Won't this cause the file size to be set to LLONG_MAX?
> 
> Not as I understand the code.

iomap_zero_range uses the standard iomap_write_{begin,end} functions,
which means that if you pass it an (offset, length) that extend beyond
EOF it will change isize to offset+length.

> But as I said in the cover I am not 100% sure of
> this fix.

> From what I can tell xfs_ioctl_setattr_dax_invalidate() should invalidate the
> mappings and the page cache and the traces I have indicate that the DAX mode
> is not changing or was properly held off.

Hmm, that implies the invalidation didn't work.  Can you find a way to
report the contents of the page cache after the dax mode change
invalidation fails?  I wonder if this is something dorky like rounding
down such that the EOF page doesn't actually get invalidated?

Hmm, no, xfs_ioctl_setattr_dax_invalidate should be nuking all the
pages... do you have a quick reproducer?

--D

> Any other suggestions as to the problem are welcome.
> 
> Ira
> 
> 
> > 
> > --D
> > 
> > >  				&did_zeroing, &xfs_buffered_write_iomap_ops);
> > >  	} else {
> > >  		error = iomap_truncate_page(inode, newsize, &did_zeroing,
> > > -- 
> > > 2.21.0
> > > 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed
  2020-01-14  0:46     ` Ira Weiny
@ 2020-01-14  1:30       ` Darrick J. Wong
  2020-01-14 17:53         ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-14  1:30 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Jan 13, 2020 at 04:46:10PM -0800, Ira Weiny wrote:
> On Mon, Jan 13, 2020 at 02:22:12PM -0800, Darrick J. Wong wrote:
> > On Fri, Jan 10, 2020 at 11:29:39AM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> 
> [snip]
> 
> > >  
> > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > > index bc3654fe3b5d..1ab0906c6c7f 100644
> > > --- a/fs/xfs/xfs_ioctl.c
> > > +++ b/fs/xfs/xfs_ioctl.c
> > > @@ -1200,6 +1200,14 @@ xfs_ioctl_setattr_dax_invalidate(
> > >  		goto out_unlock;
> > >  	}
> > >  
> > > +	/*
> > > +	 * If there is a mapping in place we must remain in our current mode.
> > > +	 */
> > > +	if (atomic64_read(&inode->i_mapped)) {
> > 
> > Urk, should we really be messing around with the address space
> > internals?
> 
> I contemplated a function call instead of checking i_mapped directly?  Is that
> what you mean?

Yeah.  Abstracting the details just enough that filesystems don't have
to know that i_mapped is atomic64 etc.

> 
> > 
> > > +		error = -EBUSY;
> > > +		goto out_unlock;
> > > +	}
> > > +
> > >  	error = filemap_write_and_wait(inode->i_mapping);
> > >  	if (error)
> > >  		goto out_unlock;
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 631f11d6246e..6e7dc626b657 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -740,6 +740,7 @@ struct inode {
> > >  #endif
> > >  
> > >  	void			*i_private; /* fs or device private pointer */
> > > +	atomic64_t               i_mapped;
> > 
> > I would have expected to find this in struct address_space since the
> > mapping count is a function of the address space, right?
> 
> I suppose but the only external call (above) would be passing an inode.  So to
> me it seemed better here.

But the number of memory mappings reflects the state of the address
space, not the inode.  Or maybe put another way, if I were an mm
developer I would not expect to look in struct inode for mm state.

static inline bool inode_has_mappings(struct inode *inode)
{
	return atomic64_read(&inode->i_mapping->mapcount) > 0;
}

OTOH if there exist other mm developers who /do/ find that storing the
mmap count in struct inode is more logical, please let me know. :)

--D

> Ira
> 
> > 
> > --D
> > 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed
  2020-01-14  1:30       ` Darrick J. Wong
@ 2020-01-14 17:53         ` Ira Weiny
  2020-01-15 11:34           ` Jan Kara
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2020-01-14 17:53 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Jan 13, 2020 at 05:30:04PM -0800, Darrick J. Wong wrote:
> On Mon, Jan 13, 2020 at 04:46:10PM -0800, Ira Weiny wrote:
> > On Mon, Jan 13, 2020 at 02:22:12PM -0800, Darrick J. Wong wrote:
> > > On Fri, Jan 10, 2020 at 11:29:39AM -0800, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > 
> > 
> > [snip]
> > 
> > > >  
> > > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > > > index bc3654fe3b5d..1ab0906c6c7f 100644
> > > > --- a/fs/xfs/xfs_ioctl.c
> > > > +++ b/fs/xfs/xfs_ioctl.c
> > > > @@ -1200,6 +1200,14 @@ xfs_ioctl_setattr_dax_invalidate(
> > > >  		goto out_unlock;
> > > >  	}
> > > >  
> > > > +	/*
> > > > +	 * If there is a mapping in place we must remain in our current mode.
> > > > +	 */
> > > > +	if (atomic64_read(&inode->i_mapped)) {
> > > 
> > > Urk, should we really be messing around with the address space
> > > internals?
> > 
> > I contemplated a function call instead of checking i_mapped directly?  Is that
> > what you mean?
> 
> Yeah.  Abstracting the details just enough that filesystems don't have
> to know that i_mapped is atomic64 etc.

Done.

> 
> > 
> > > 
> > > > +		error = -EBUSY;
> > > > +		goto out_unlock;
> > > > +	}
> > > > +
> > > >  	error = filemap_write_and_wait(inode->i_mapping);
> > > >  	if (error)
> > > >  		goto out_unlock;
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index 631f11d6246e..6e7dc626b657 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -740,6 +740,7 @@ struct inode {
> > > >  #endif
> > > >  
> > > >  	void			*i_private; /* fs or device private pointer */
> > > > +	atomic64_t               i_mapped;
> > > 
> > > I would have expected to find this in struct address_space since the
> > > mapping count is a function of the address space, right?
> > 
> > I suppose but the only external call (above) would be passing an inode.  So to
> > me it seemed better here.
> 
> But the number of memory mappings reflects the state of the address
> space, not the inode.  Or maybe put another way, if I were an mm
> developer I would not expect to look in struct inode for mm state.

This is a good point...

> 
> static inline bool inode_has_mappings(struct inode *inode)
> {
> 	return atomic64_read(&inode->i_mapping->mapcount) > 0;
> }
> 
> OTOH if there exist other mm developers who /do/ find that storing the
> mmap count in struct inode is more logical, please let me know. :)

...  My thinking was that the number of mappings does not matters to the mm
system...  However, I'm starting to think you are correct...  ;-)

I've made a note of it and we will see what others think.

Ira

> 
> --D
> 
> > Ira
> > 
> > > 
> > > --D
> > > 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 10/12] fs/xfs: Fix truncate up
  2020-01-14  1:14       ` Darrick J. Wong
@ 2020-01-14 19:00         ` Ira Weiny
  2020-01-14 19:39           ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2020-01-14 19:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Jan 13, 2020 at 05:14:07PM -0800, Darrick J. Wong wrote:
> On Mon, Jan 13, 2020 at 04:40:47PM -0800, Ira Weiny wrote:
> > On Mon, Jan 13, 2020 at 02:27:55PM -0800, Darrick J. Wong wrote:
> > > On Fri, Jan 10, 2020 at 11:29:40AM -0800, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > 
> > > > When zeroing the end of a file we must account for bytes contained in
> > > > the final page which are past EOF.
> > > > 
> > > > Extend the range passed to iomap_zero_range() to reach LLONG_MAX which
> > > > will include all bytes of the final page.
> > > > 
> > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > > ---
> > > >  fs/xfs/xfs_iops.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > > index a2f2604c3187..a34b04e8ac9c 100644
> > > > --- a/fs/xfs/xfs_iops.c
> > > > +++ b/fs/xfs/xfs_iops.c
> > > > @@ -910,7 +910,7 @@ xfs_setattr_size(
> > > >  	 */
> > > >  	if (newsize > oldsize) {
> > > >  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> > > > -		error = iomap_zero_range(inode, oldsize, newsize - oldsize,
> > > > +		error = iomap_zero_range(inode, oldsize, LLONG_MAX - oldsize,
> > > 
> > > Huh?  Won't this cause the file size to be set to LLONG_MAX?
> > 
> > Not as I understand the code.
> 
> iomap_zero_range uses the standard iomap_write_{begin,end} functions,
> which means that if you pass it an (offset, length) that extend beyond
> EOF it will change isize to offset+length.

I don't see that but I'll take your word for it...  That is unfortunate because
I would have thought that the full page would have been zero'ed by something
already.

I found code in xfs_free_file_space() with this comment:

        /*
         * If we zeroed right up to EOF and EOF straddles a page boundary we
         * must make sure that the post-EOF area is also zeroed because the
         * page could be mmap'd and iomap_zero_range doesn't do that for us.
         * Writeback of the eof page will do this, albeit clumsily.
         */

But that just calls filemap_write_and_wait_range()...  :-/

> 
> > But as I said in the cover I am not 100% sure of
> > this fix.
> 
> > From what I can tell xfs_ioctl_setattr_dax_invalidate() should invalidate the
> > mappings and the page cache and the traces I have indicate that the DAX mode
> > is not changing or was properly held off.
> 
> Hmm, that implies the invalidation didn't work.  Can you find a way to
> report the contents of the page cache after the dax mode change
> invalidation fails?  I wonder if this is something dorky like rounding
> down such that the EOF page doesn't actually get invalidated?
> 
> Hmm, no, xfs_ioctl_setattr_dax_invalidate should be nuking all the
> pages... do you have a quick reproducer?

I thought I did...

What I have done is take this patch:

https://www.spinics.net/lists/fstests/msg13313.html

and put [run_fsx ""] in a loop... (diff below) And without this truncate fix
patch it was failing in about 5 - 10 iterations.  But I'm running it right now
and it has gone for 30+...  :-(

I am 90% confident that this series works for 100% of the use cases we have.  I
think this is an existing bug which I've just managed to find.  And again I'm
not comfortable with this patch either.  So I'm not trying to argue for it but
I just don't know what could be wrong...

Ira

diff --git a/tests/generic/999 b/tests/generic/999
index 6dd5529dbc65..929c20c6db04 100755
--- a/tests/generic/999
+++ b/tests/generic/999
@@ -274,7 +274,9 @@ function run_fsx {
        pid=""
 }
 
-run_fsx ""
+while [ 1 ]; do
+       run_fsx ""
+done
 run_fsx "-A"
 run_fsx "-Z -r 4096 -w 4096"


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 10/12] fs/xfs: Fix truncate up
  2020-01-14 19:00         ` Ira Weiny
@ 2020-01-14 19:39           ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2020-01-14 19:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 3036 bytes --]

On Tue, Jan 14, 2020 at 11:00:57AM -0800, 'Ira Weiny' wrote:
> On Mon, Jan 13, 2020 at 05:14:07PM -0800, Darrick J. Wong wrote:
> > On Mon, Jan 13, 2020 at 04:40:47PM -0800, Ira Weiny wrote:
> > > On Mon, Jan 13, 2020 at 02:27:55PM -0800, Darrick J. Wong wrote:
> > > > On Fri, Jan 10, 2020 at 11:29:40AM -0800, ira.weiny@intel.com wrote:
> > > > > From: Ira Weiny <ira.weiny@intel.com>

[snip]

> 
> > 
> > > But as I said in the cover I am not 100% sure of
> > > this fix.
> > 
> > > From what I can tell xfs_ioctl_setattr_dax_invalidate() should invalidate the
> > > mappings and the page cache and the traces I have indicate that the DAX mode
> > > is not changing or was properly held off.
> > 
> > Hmm, that implies the invalidation didn't work.  Can you find a way to
> > report the contents of the page cache after the dax mode change
> > invalidation fails?  I wonder if this is something dorky like rounding
> > down such that the EOF page doesn't actually get invalidated?
> > 
> > Hmm, no, xfs_ioctl_setattr_dax_invalidate should be nuking all the
> > pages... do you have a quick reproducer?
> 
> I thought I did...
> 
> What I have done is take this patch:
> 
> https://www.spinics.net/lists/fstests/msg13313.html
> 
> and put [run_fsx ""] in a loop... (diff below) And without this truncate fix
> patch it was failing in about 5 - 10 iterations.  But I'm running it right now
> and it has gone for 30+...  :-(

Ok...  perhaps this is a qemu problem?  In qemu I had a slightly different
script; 'test-suite.sh' (attached).  This was copied to create the xfstest I
sent.  Without this 'fix truncate patch I get something like the following
after in just a few iterations.

...

   *** run 'fsx ' racing with setting/clearing the DAX flag
   *** run 'fsx ' racing with setting/clearing the DAX flag
FAILED: fsx exited with status : 205
        see trace_output
zero_range to largest ever: 0x19867
truncating to largest ever: 0x3fb6d
zero_range to largest ever: 0x40000
Mapped Write: non-zero data past EOF (0x3ab88) page offset 0xb89 is 0xe71c
LOG DUMP (15817 total operations):
15818(202 mod 256): SKIPPED (no operation)
15819(203 mod 256): ZERO     0x2ae9f thru 0x3346c       (0x85ce bytes)
15820(204 mod 256): READ     0x3637 thru 0x91e6 (0x5bb0 bytes)
15821(205 mod 256): MAPREAD  0x38a80 thru 0x3d014       (0x4595 bytes)
15822(206 mod 256): INSERT 0x28000 thru 0x29fff (0x2000 bytes)

Ira

> 
> I am 90% confident that this series works for 100% of the use cases we have.  I
> think this is an existing bug which I've just managed to find.  And again I'm
> not comfortable with this patch either.  So I'm not trying to argue for it but
> I just don't know what could be wrong...
> 
> Ira
> 
> diff --git a/tests/generic/999 b/tests/generic/999
> index 6dd5529dbc65..929c20c6db04 100755
> --- a/tests/generic/999
> +++ b/tests/generic/999
> @@ -274,7 +274,9 @@ function run_fsx {
>         pid=""
>  }
>  
> -run_fsx ""
> +while [ 1 ]; do
> +       run_fsx ""
> +done
>  run_fsx "-A"
>  run_fsx "-Z -r 4096 -w 4096"
> 

[-- Attachment #2: test-suite.sh --]
[-- Type: text/plain, Size: 7344 bytes --]

#!/bin/bash

#
# Run suite of tests for feature
#

mnt_pt=/mnt/pmem

function mount_no_dax {
	umount $mnt_pt
	mount /dev/pmem0 $mnt_pt
}

function mount_dax {
	umount $mnt_pt
	mount -o dax /dev/pmem0 $mnt_pt
}

# test set up
if [ ! -d /mnt/pmem ]; then
	echo ""
	echo "     Setting up FS DAX"
	echo ""

	ndctl create-namespace -e namespace0.0 -f --mode=fsdax
	#mkfs.ext4 /dev/pmem0
	mkfs.xfs -f /dev/pmem0
	mkdir -p /mnt/pmem
	mount_no_dax
fi

#if [ ! -c /dev/dax1.0 ]; then
#	echo ""
#	echo "     Setting up DEV DAX"
#	echo ""
#
#	ndctl create-namespace -f --mode=devdax --align=4096 -e namespace1.0
#fi

if [ "$1" == "--create-fs-only" ]; then
	exit 0
fi


test_file=/mnt/pmem/foo

function start_test {
	rm -f $test_file
	echo ""
	echo "     ***** START ${FUNCNAME[1]} *****"
}

function end_test {
	echo "     ***** END ${FUNCNAME[1]} *****"
	echo ""
	killall -9 rdma-fsdax
}

function expect_pass {
	echo -n "     ***** $2 : "
	if [ "$1" == "0" ]; then
		echo "PASSED"
	else
		echo "FAILED"
		exit 255
	fi
}

function expect_fail {
	echo -n "     ***** $2 : "
	if [ "$1" == "0" ]; then
		echo "FAILED"
		exit 255
	else
		echo "PASSED"
	fi
}


# test below here are not fully written yet.  Just ideas of what we should test
# for.

function check_phys_dax {
	xfs_io -c 'lsattr' $1 | awk -e '{ print $1 }' | grep 'x' &> /dev/null
	if [ "$?" != "0" ]; then
		echo "FAILED: Did NOT find DAX flag on $1"
		exit 255
	fi
}

function check_effective_dax {
	attr=`xfs_io -c 'statx -r' $1 | grep 'stat.attributes' | awk -e '{ print $3 }'`
	masked=$(( $attr & 0x2000 ))
	if [ "$masked" != "8192" ]; then
		echo "FAILED: Did NOT find VFS DAX flag on $1"
		exit 255
	fi
}

function check_phys_no_dax {
	xfs_io -c 'lsattr' $1 | awk -e '{ print $1 }' | grep 'x' &> /dev/null
	if [ "$?" == "0" ]; then
		echo "FAILED: Found DAX flag on $1"
		exit 255
	fi
}

function check_effective_no_dax {
	attr=`xfs_io -c 'statx -r' $1 | grep 'stat.attributes' | awk -e '{ print $3 }'`
	masked=$(( $attr & 0x2000 ))
	if [ "$masked" == "8192" ]; then
		echo "FAILED: Found VFS DAX flag on $1"
		exit 255
	fi
}

XFS_IO_PROG=xfs_io

TEST_DIR=/mnt/pmem

dax_dir=$TEST_DIR/dax-dir
dax_sub_dir=$TEST_DIR/dax-dir/dax-sub-dir
dax_inh_file=$dax_dir/dax-inh-file
dax_non_inh_file=$dax_dir/dax-non-inh-file
non_dax=$TEST_DIR/non-dax
dax_file=$TEST_DIR/dax-file
dax_file_copy=$TEST_DIR/dax-file-copy
dax_file_move=$TEST_DIR/dax-file-move
data_file=$TEST_DIR/data-file

function clean_up {
	echo "cleaning up..."
	rm -rf $TEST_DIR/*
}

clean_up

# Do double mount to have these special options.
mount_no_dax
touch $dax_file

if [ "$1" == "--chattr-only" ]; then
	for i in `seq 1 10000`; do
		xfs_io -c 'chattr +x' $dax_file
		xfs_io -c 'chattr -x' $dax_file
	done
	exit 0
fi

if [ "$1" == "--fsx-only" ]; then
	./fsx -R -W -N 100000 $dax_file
	head $dax_file.fsxlog
	exit 0
fi

if [ "$1" == "--fsx-aio-only" ]; then
	./fsx -R -W -A -N 100000 $dax_file
	head $dax_file.fsxlog
	exit 0
fi

# The below should be kept the same as xfstests

echo "   *** mount w/o dax flag."
mount_no_dax

echo "   *** mark dax-dir as dax enabled"
mkdir $dax_dir
xfs_io -c 'chattr +x' $dax_dir
check_phys_dax $dax_dir
check_effective_dax $dax_dir

echo "   *** check file inheritance"
touch $dax_inh_file
check_phys_dax $dax_inh_file
check_effective_dax $dax_inh_file

echo "   *** check directory inheritance"
mkdir $dax_sub_dir
check_phys_dax $dax_sub_dir
check_effective_dax $dax_sub_dir

echo "   *** check changing directory"
xfs_io -c 'chattr -x' $dax_dir
check_phys_no_dax $dax_dir
check_effective_no_dax $dax_dir

echo "   *** check non file inheritance"
touch $dax_non_inh_file
check_phys_no_dax $dax_non_inh_file
check_effective_no_dax $dax_non_inh_file

echo "   *** check that previous file stays enabled"
check_phys_dax $dax_inh_file
check_effective_dax $dax_inh_file

echo "   *** Reset the directory"
xfs_io -c 'chattr +x' $dax_dir
check_phys_dax $dax_dir
check_effective_dax $dax_dir


# check mount override
# ====================

echo "   *** Remount fs with mount flag"
mount_dax
touch $non_dax
check_phys_no_dax $non_dax
check_effective_dax $non_dax

echo "   *** Check for non-dax files to be dax with mount option"
check_effective_dax $dax_non_inh_file

echo "   *** check for file dax flag 'sticky-ness' after remount"
touch $dax_file
xfs_io -c 'chattr +x' $dax_file
check_phys_dax $dax_file
check_effective_dax $dax_file

echo "   *** remount w/o mount flag"
mount_no_dax
check_phys_dax $dax_file
check_effective_dax $dax_file

check_phys_no_dax $non_dax
check_effective_no_dax $non_dax


# Check non-zero file operations
# ==============================

echo "   *** file should change effective but page cache should be empty"
$XFS_IO_PROG -f -c "pwrite 0 10000" $data_file > /dev/null
xfs_io -c 'chattr +x' $data_file
check_phys_dax $data_file
check_effective_dax $data_file


# Check inheritance on cp, mv
# ===========================

echo "   *** check inheritance on cp, mv"
cp $non_dax $dax_dir/conv-dax
check_phys_dax $dax_dir/conv-dax
check_effective_dax $dax_dir/conv-dax

echo "   *** Moved files 'don't inherit'"
mv $non_dax $dax_dir/move-dax
check_phys_no_dax $dax_dir/move-dax
check_effective_no_dax $dax_dir/move-dax

# Check preservation of phys on cp, mv
# ====================================

mv $dax_file $dax_file_move
check_phys_dax $dax_file_move
check_effective_dax $dax_file_move

cp $dax_file_move $dax_file_copy
check_phys_no_dax $dax_file_copy
check_effective_no_dax $dax_file_copy


# Verify no mode changes on mmap
# ==============================

echo "   *** check no mode change when mmaped"

dd if=/dev/zero of=$dax_file bs=4096 count=10 > $tmp.log 2>&1

# set known state.
xfs_io -c 'chattr -x' $dax_file
check_phys_no_dax $dax_file
check_effective_no_dax $dax_file

python3 - << EOF > $tmp.log 2>&1 &
import mmap
import time
print ('mmaping "$dax_file"')
f=open("$dax_file", "r+b")
mm = mmap.mmap(f.fileno(), 0)
print ('mmaped "$dax_file"')
while True:
	time.sleep(1)
EOF
pid=$!

sleep 1

# attempt to should fail
xfs_io -c 'chattr +x' $dax_file > /dev/null 2>&1
check_phys_no_dax $dax_file
check_effective_no_dax $dax_file

kill -TERM $pid > /dev/null 2>&1
wait $pid > /dev/null 2>&1

# after mmap released should work
xfs_io -c 'chattr +x' $dax_file
check_phys_dax $dax_file
check_effective_dax $dax_file


# Finally run the test stolen from Christoph Hellwig to test changing the mode
# while performing a series of operations
# =============================================================================

function run_fsx {
	options=$1

	echo "   *** run 'fsx $options' racing with setting/clearing the DAX flag"
	./fsx $options -N 20000 $dax_file > $tmp.log 2>&1 &
	pid=$!

	if [ ! -n "$pid" ]; then
		echo "FAILED to start fsx"
		exit 255
	fi

	# NOTE: for some
	for i in `seq 1 500`; do
		xfs_io -c 'chattr +x' $dax_file > /dev/null 2>&1
		xfs_io -c 'chattr -x' $dax_file > /dev/null 2>&1
	done

	wait $pid
	status=$?
	if [ "$status" != "0" ]; then
		cat /sys/kernel/debug/tracing/trace > trace_output
		echo "FAILED: fsx exited with status : $status"
		echo "        see trace_output"
		head $dax_file.fsxlog
		exit $status
	fi
	pid=""
}

while [ 1 ]; do
	run_fsx ""
done
	run_fsx "-A"
	run_fsx "-Z -r 4096 -w 4096"

echo "   *** Check 'dump' and 'xfsdump'"
echo "      TBD"

echo "   *** PASSED!  All tests PASSED ***"
exit 0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 08/12] fs/xfs: Add lock/unlock mode to xfs
  2020-01-14  0:35     ` Ira Weiny
@ 2020-01-15  0:57       ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2020-01-15  0:57 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Jan 13, 2020 at 04:35:21PM -0800, 'Ira Weiny' wrote:
> On Mon, Jan 13, 2020 at 02:19:57PM -0800, Darrick J. Wong wrote:
> > On Fri, Jan 10, 2020 at 11:29:38AM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> 
> [snip]
> 
> > >  
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 401da197f012..e8fd95b75e5b 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> > >   *
> > >   * Basic locking order:
> > >   *
> > > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > + * i_rwsem -> i_dax_sem -> i_mmap_lock -> page_lock -> i_ilock
> > 
> > Mmmmmm, more locks.  Can we skip the extra lock if CONFIG_FSDAX=n or if
> > the filesystem devices don't support DAX at all?
> 
> I'll look into it.
> 
> > 
> > Also, I don't think we're actually following the i_rwsem -> i_daxsem
> > order in fallocate, and possibly elsewhere too?
> 
> I'll have to verify.  It took a lot of iterations to get the order working so
> I'm not going to claim perfection.

Yes this was inconsistent.  The code was right WRT i_rwsem.

mmap_sem may have issues:

What about this?

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c5d11b70d067..8808782a085e 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
  *
  * Basic locking order:
  *
- * i_rwsem -> i_dax_sem -> i_mmap_lock -> page_lock -> i_ilock
+ * i_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
  *
  * mmap_sem locking order:
  *
  * i_rwsem -> page lock -> mmap_sem
- * mmap_sem -> i_dax_sem -> i_mmap_lock -> page_lock
+ * i_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
  *
  * The difference in mmap_sem locking order mean that we cannot hold the
  * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
diff --git a/mm/mmap.c b/mm/mmap.c
index e6b68924b7ca..b500aef30b27 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1547,18 +1547,12 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
                        vm_flags |= VM_NORESERVE;
        }
 
-       if (file)
-               lock_inode_mode(file_inode(file));
-
        addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
        if (!IS_ERR_VALUE(addr) &&
            ((vm_flags & VM_LOCKED) ||
             (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
                *populate = len;
 
-       if (file)
-               unlock_inode_mode(file_inode(file));
-
        return addr;
 }
 
diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..1cfead8cd1ce 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 
        ret = security_mmap_file(file, prot, flag);
        if (!ret) {
-               if (down_write_killable(&mm->mmap_sem))
+               if (file)
+                       lock_inode_mode(file_inode(file));
+               if (down_write_killable(&mm->mmap_sem)) {
+                       if (file)
+                               unlock_inode_mode(file_inode(file));
                        return -EINTR;
+               }
                ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
                                    &populate, &uf);
                up_write(&mm->mmap_sem);
+               if (file)
+                       unlock_inode_mode(file_inode(file));
                userfaultfd_unmap_complete(mm, &uf);
                if (populate)
                        mm_populate(ret, populate);


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed
  2020-01-10 19:29 ` [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed ira.weiny
  2020-01-13 22:22   ` Darrick J. Wong
@ 2020-01-15 10:21   ` David Laight
  2020-01-15 17:53     ` Ira Weiny
  1 sibling, 1 reply; 55+ messages in thread
From: David Laight @ 2020-01-15 10:21 UTC (permalink / raw)
  To: ira.weiny, linux-kernel
  Cc: Alexander Viro, Darrick J. Wong, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

From ira.weiny@intel.com
> Sent: 10 January 2020 19:30
> 
> Page faults need to ensure the inode mode is correct and consistent with
> the vmf information at the time of the fault.  There is no easy way to
> ensure the vmf information is correct if a mode change is in progress.
> Furthermore, there is no good use case to require a mode change while
> the file is mmap'ed.
> 
> Track mmap's of the file and fail the mode change if the file is
> mmap'ed.

This seems wrong to me.
I presume the 'mode changes' are from things like 'chmod -w ...'.
mmap() should be no different to open().
Only the permissions set when the file is opened count.

Next you'll be stopping unlink() when a file is open :-)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed
  2020-01-14 17:53         ` Ira Weiny
@ 2020-01-15 11:34           ` Jan Kara
  2020-01-15 18:24             ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Kara @ 2020-01-15 11:34 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Darrick J. Wong, linux-kernel, Alexander Viro, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Tue 14-01-20 09:53:54, Ira Weiny wrote:
> On Mon, Jan 13, 2020 at 05:30:04PM -0800, Darrick J. Wong wrote:
> > > > > +		error = -EBUSY;
> > > > > +		goto out_unlock;
> > > > > +	}
> > > > > +
> > > > >  	error = filemap_write_and_wait(inode->i_mapping);
> > > > >  	if (error)
> > > > >  		goto out_unlock;
> > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > index 631f11d6246e..6e7dc626b657 100644
> > > > > --- a/include/linux/fs.h
> > > > > +++ b/include/linux/fs.h
> > > > > @@ -740,6 +740,7 @@ struct inode {
> > > > >  #endif
> > > > >  
> > > > >  	void			*i_private; /* fs or device private pointer */
> > > > > +	atomic64_t               i_mapped;
> > > > 
> > > > I would have expected to find this in struct address_space since the
> > > > mapping count is a function of the address space, right?
> > > 
> > > I suppose but the only external call (above) would be passing an inode.  So to
> > > me it seemed better here.
> > 
> > But the number of memory mappings reflects the state of the address
> > space, not the inode.  Or maybe put another way, if I were an mm
> > developer I would not expect to look in struct inode for mm state.
> 
> This is a good point...
> 
> > 
> > static inline bool inode_has_mappings(struct inode *inode)
> > {
> > 	return atomic64_read(&inode->i_mapping->mapcount) > 0;
> > }
> > 
> > OTOH if there exist other mm developers who /do/ find that storing the
> > mmap count in struct inode is more logical, please let me know. :)
> 
> ...  My thinking was that the number of mappings does not matters to the mm
> system...  However, I'm starting to think you are correct...  ;-)
> 
> I've made a note of it and we will see what others think.

Well, more importantly mapping != inode. There can be multiple inodes
pointing to the same mapping (struct address_space) as is the case for
example for block devices. So this counter definitely belongs into struct
address_space.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-10 19:29 ` [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute ira.weiny
@ 2020-01-15 11:37   ` Jan Kara
  2020-01-15 17:38     ` Darrick J. Wong
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Kara @ 2020-01-15 11:37 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> In order for users to determine if a file is currently operating in DAX
> mode (effective DAX).  Define a statx attribute value and set that
> attribute if the effective DAX flag is set.
> 
> To go along with this we propose the following addition to the statx man
> page:
> 
> STATX_ATTR_DAX
> 
> 	DAX (cpu direct access) is a file mode that attempts to minimize
> 	software cache effects for both I/O and memory mappings of this
> 	file.  It requires a capable device, a compatible filesystem
> 	block size, and filesystem opt-in. It generally assumes all
> 	accesses are via cpu load / store instructions which can
> 	minimize overhead for small accesses, but adversely affect cpu
> 	utilization for large transfers. File I/O is done directly
> 	to/from user-space buffers. While the DAX property tends to
> 	result in data being transferred synchronously it does not give
> 	the guarantees of synchronous I/O that data and necessary
> 	metadata are transferred. Memory mapped I/O may be performed
> 	with direct mappings that bypass system memory buffering. Again
> 	while memory-mapped I/O tends to result in data being
> 	transferred synchronously it does not guarantee synchronous
> 	metadata updates. A dax file may optionally support being mapped
> 	with the MAP_SYNC flag which does allow cpu store operations to
> 	be considered synchronous modulo cpu cache effects.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

This looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/stat.c                 | 3 +++
>  include/uapi/linux/stat.h | 1 +
>  2 files changed, 4 insertions(+)
> 
> diff --git a/fs/stat.c b/fs/stat.c
> index 030008796479..894699c74dde 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>  	if (IS_AUTOMOUNT(inode))
>  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
>  
> +	if (IS_DAX(inode))
> +		stat->attributes |= STATX_ATTR_DAX;
> +
>  	if (inode->i_op->getattr)
>  		return inode->i_op->getattr(path, stat, request_mask,
>  					    query_flags);
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index ad80a5c885d5..e5f9d5517f6b 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -169,6 +169,7 @@ struct statx {
>  #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
>  #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
>  #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
> +#define STATX_ATTR_DAX			0x00002000 /* [I] File is DAX */
>  
>  
>  #endif /* _UAPI_LINUX_STAT_H */
> -- 
> 2.21.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-15 11:37   ` Jan Kara
@ 2020-01-15 17:38     ` Darrick J. Wong
  2020-01-15 19:45       ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-15 17:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: ira.weiny, linux-kernel, Alexander Viro, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > In order for users to determine if a file is currently operating in DAX
> > mode (effective DAX).  Define a statx attribute value and set that
> > attribute if the effective DAX flag is set.
> > 
> > To go along with this we propose the following addition to the statx man
> > page:
> > 
> > STATX_ATTR_DAX
> > 
> > 	DAX (cpu direct access) is a file mode that attempts to minimize

"..is a file I/O mode"?

> > 	software cache effects for both I/O and memory mappings of this
> > 	file.  It requires a capable device, a compatible filesystem
> > 	block size, and filesystem opt-in.

"...a capable storage device..."

What does "compatible fs block size" mean?  How does the user figure out
if their fs blocksize is compatible?  Do we tell users to refer their
filesystem's documentation here?

> > It generally assumes all
> > 	accesses are via cpu load / store instructions which can
> > 	minimize overhead for small accesses, but adversely affect cpu
> > 	utilization for large transfers.

Will this always be true for persistent memory?

I wasn't even aware that large transfers adversely affected CPU
utilization. ;)

> >  File I/O is done directly
> > 	to/from user-space buffers. While the DAX property tends to
> > 	result in data being transferred synchronously it does not give

"...transferred synchronously, it does not..."

> > 	the guarantees of synchronous I/O that data and necessary

"...it does not guarantee that I/O or file metadata have been flushed to
the storage device."

> > 	metadata are transferred. Memory mapped I/O may be performed
> > 	with direct mappings that bypass system memory buffering.

"...with direct memory mappings that bypass kernel page cache."

> > Again
> > 	while memory-mapped I/O tends to result in data being

I would move the sentence about "Memory mapped I/O..." to directly after
the sentence about file I/O being done directly to and from userspace so
that you don't need to repeat this statement.

> > 	transferred synchronously it does not guarantee synchronous
> > 	metadata updates. A dax file may optionally support being mapped
> > 	with the MAP_SYNC flag which does allow cpu store operations to
> > 	be considered synchronous modulo cpu cache effects.

How does one detect or work around or deal with "cpu cache effects"?  I
assume some sort of CPU cache flush instruction is what is meant here,
but I think we could mention the basics of what has to be done here:

"A DAX file may support being mapped with the MAP_SYNC flag, which
enables a program to use CPU cache flush operations to persist CPU store
operations without an explicit fsync(2).  See mmap(2) for more
information."?

Oof, a paragraph break would be nice. :)

--D

> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> This looks good to me. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> 
> 								Honza
> 
> > ---
> >  fs/stat.c                 | 3 +++
> >  include/uapi/linux/stat.h | 1 +
> >  2 files changed, 4 insertions(+)
> > 
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 030008796479..894699c74dde 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
> >  	if (IS_AUTOMOUNT(inode))
> >  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
> >  
> > +	if (IS_DAX(inode))
> > +		stat->attributes |= STATX_ATTR_DAX;
> > +
> >  	if (inode->i_op->getattr)
> >  		return inode->i_op->getattr(path, stat, request_mask,
> >  					    query_flags);
> > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > index ad80a5c885d5..e5f9d5517f6b 100644
> > --- a/include/uapi/linux/stat.h
> > +++ b/include/uapi/linux/stat.h
> > @@ -169,6 +169,7 @@ struct statx {
> >  #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
> >  #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
> >  #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
> > +#define STATX_ATTR_DAX			0x00002000 /* [I] File is DAX */
> >  
> >  
> >  #endif /* _UAPI_LINUX_STAT_H */
> > -- 
> > 2.21.0
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed
  2020-01-15 10:21   ` David Laight
@ 2020-01-15 17:53     ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2020-01-15 17:53 UTC (permalink / raw)
  To: David Laight
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 10:21:45AM +0000, David Laight wrote:
> From ira.weiny@intel.com
> > Sent: 10 January 2020 19:30
> > 
> > Page faults need to ensure the inode mode is correct and consistent with
> > the vmf information at the time of the fault.  There is no easy way to
> > ensure the vmf information is correct if a mode change is in progress.
> > Furthermore, there is no good use case to require a mode change while
> > the file is mmap'ed.
> > 
> > Track mmap's of the file and fail the mode change if the file is
> > mmap'ed.
> 
> This seems wrong to me.
> I presume the 'mode changes' are from things like 'chmod -w ...'.

No...  Sorry...  "mode" was a _very_ bad name.  In this context "mode" was the
"DAX mode" not the file mode.

> mmap() should be no different to open().
> Only the permissions set when the file is opened count.
> 
> Next you'll be stopping unlink() when a file is open :-)

hehehe   :-D  no ... sorry that was not the meaning.

To be clear what this is preventing is a change from non-DAX to DAX or vice
versa while a file is mmap'ed.

I'm looking at a better name for this.  For this commit message is this more
clear?

<commit>
fs: Prevent DAX change if file is mmap'ed
  
Page faults need to ensure the inode DAX configuration is correct and 
consistent with the vmf information at the time of the fault.  There is
no easy way to ensure the vmf information is correct if a DAX change is
in progress.  Furthermore, there is no good use case to require changing
DAX configs while the file is mmap'ed.

Track mmap's of the file and fail the DAX change if the file is mmap'ed.
</commit>

Sorry for the confusion,
Ira

> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed
  2020-01-15 11:34           ` Jan Kara
@ 2020-01-15 18:24             ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2020-01-15 18:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Darrick J. Wong, linux-kernel, Alexander Viro, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 12:34:55PM +0100, Jan Kara wrote:
> On Tue 14-01-20 09:53:54, Ira Weiny wrote:
> > On Mon, Jan 13, 2020 at 05:30:04PM -0800, Darrick J. Wong wrote:
> > > > > > +		error = -EBUSY;
> > > > > > +		goto out_unlock;
> > > > > > +	}
> > > > > > +
> > > > > >  	error = filemap_write_and_wait(inode->i_mapping);
> > > > > >  	if (error)
> > > > > >  		goto out_unlock;
> > > > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > > > index 631f11d6246e..6e7dc626b657 100644
> > > > > > --- a/include/linux/fs.h
> > > > > > +++ b/include/linux/fs.h
> > > > > > @@ -740,6 +740,7 @@ struct inode {
> > > > > >  #endif
> > > > > >  
> > > > > >  	void			*i_private; /* fs or device private pointer */
> > > > > > +	atomic64_t               i_mapped;
> > > > > 
> > > > > I would have expected to find this in struct address_space since the
> > > > > mapping count is a function of the address space, right?
> > > > 
> > > > I suppose but the only external call (above) would be passing an inode.  So to
> > > > me it seemed better here.
> > > 
> > > But the number of memory mappings reflects the state of the address
> > > space, not the inode.  Or maybe put another way, if I were an mm
> > > developer I would not expect to look in struct inode for mm state.
> > 
> > This is a good point...
> > 
> > > 
> > > static inline bool inode_has_mappings(struct inode *inode)
> > > {
> > > 	return atomic64_read(&inode->i_mapping->mapcount) > 0;
> > > }
> > > 
> > > OTOH if there exist other mm developers who /do/ find that storing the
> > > mmap count in struct inode is more logical, please let me know. :)
> > 
> > ...  My thinking was that the number of mappings does not matters to the mm
> > system...  However, I'm starting to think you are correct...  ;-)
> > 
> > I've made a note of it and we will see what others think.
> 
> Well, more importantly mapping != inode. There can be multiple inodes
> pointing to the same mapping (struct address_space) as is the case for
> example for block devices. So this counter definitely belongs into struct
> address_space.

Ah Yes, great point.  Done.

Ira


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 07/12] fs: Add locking for a dynamic inode 'mode'
  2020-01-14  1:03       ` Darrick J. Wong
@ 2020-01-15 19:08         ` Ira Weiny
  2020-01-16  5:40           ` Darrick J. Wong
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2020-01-15 19:08 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Jan 13, 2020 at 05:03:22PM -0800, Darrick J. Wong wrote:
> On Mon, Jan 13, 2020 at 04:20:05PM -0800, Ira Weiny wrote:
> > On Mon, Jan 13, 2020 at 02:12:18PM -0800, Darrick J. Wong wrote:
> > > On Fri, Jan 10, 2020 at 11:29:37AM -0800, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>

[snip]

> > > > +``lock_mode``
> > > > +	called to prevent operations which depend on the inode's mode from
> > > > +        proceeding should a mode change be in progress
> > > 
> > > "Inodes can't change mode, because files do not suddenly become
> > > directories". ;)
> > 
> > Yea sorry.
> > 
> > > 
> > > Oh, you meant "lock_XXXX is called to prevent a change in the pagecache
> > > mode from proceeding while there are address space operations in
> > > progress".  So these are really more aops get and put functions...
> > 
> > At first I actually did have aops get/put functions but this is really
> > protecting more than the aops vector because as Christoph said there are file
> > operations which need to be protected not just address space operations.
> > 
> > But I agree "mode" is a bad name...  Sorry...
> 
> inode_fops_{get,set}(), then?
> 
> inode_start_fileop()
> inode_end_fileop() ?
> 
> Trying to avoid sounding foppish <COUGH>

What about?

inode_[un]lock_state()?

Ira


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-15 17:38     ` Darrick J. Wong
@ 2020-01-15 19:45       ` Ira Weiny
  2020-01-15 20:10         ` Dan Williams
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2020-01-15 19:45 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-kernel, Alexander Viro, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote:
> On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > In order for users to determine if a file is currently operating in DAX
> > > mode (effective DAX).  Define a statx attribute value and set that
> > > attribute if the effective DAX flag is set.
> > > 
> > > To go along with this we propose the following addition to the statx man
> > > page:
> > > 
> > > STATX_ATTR_DAX
> > > 
> > > 	DAX (cpu direct access) is a file mode that attempts to minimize
> 
> "..is a file I/O mode"?

or  "... is a file state ..."?
 
> > > 	software cache effects for both I/O and memory mappings of this
> > > 	file.  It requires a capable device, a compatible filesystem
> > > 	block size, and filesystem opt-in.
> 
> "...a capable storage device..."

Done

> 
> What does "compatible fs block size" mean?  How does the user figure out
> if their fs blocksize is compatible?  Do we tell users to refer their
> filesystem's documentation here?

Perhaps it is wrong for this to be in the man page at all?  Would it be better
to assume the file system and block device are already configured properly by
the admin?

For which the blocksize restrictions are already well documented.  ie:

https://www.kernel.org/doc/Documentation/filesystems/dax.txt

?

How about changing the text to:

	It requires a block device and file system which have been configured
	to support DAX.

?

> 
> > > It generally assumes all
> > > 	accesses are via cpu load / store instructions which can
> > > 	minimize overhead for small accesses, but adversely affect cpu
> > > 	utilization for large transfers.
> 
> Will this always be true for persistent memory?

I'm not clear.  Did you mean; "this" == adverse utilization for large transfers?

> 
> I wasn't even aware that large transfers adversely affected CPU
> utilization. ;)

Sure vs using a DMA engine for example.

> 
> > >  File I/O is done directly
> > > 	to/from user-space buffers. While the DAX property tends to
> > > 	result in data being transferred synchronously it does not give
> 
> "...transferred synchronously, it does not..."

done.

> 
> > > 	the guarantees of synchronous I/O that data and necessary
> 
> "...it does not guarantee that I/O or file metadata have been flushed to
> the storage device."

The lack of guarantee here is mainly regarding metadata.

How about:

        While the DAX property tends to result in data being transferred
        synchronously, it does not give the same guarantees of 
	synchronous I/O where data and the necessary metadata are 
	transferred together.

> 
> > > 	metadata are transferred. Memory mapped I/O may be performed
> > > 	with direct mappings that bypass system memory buffering.
> 
> "...with direct memory mappings that bypass kernel page cache."

Done.

> 
> > > Again
> > > 	while memory-mapped I/O tends to result in data being
> 
> I would move the sentence about "Memory mapped I/O..." to directly after
> the sentence about file I/O being done directly to and from userspace so
> that you don't need to repeat this statement.

Done.

> 
> > > 	transferred synchronously it does not guarantee synchronous
> > > 	metadata updates. A dax file may optionally support being mapped
> > > 	with the MAP_SYNC flag which does allow cpu store operations to
> > > 	be considered synchronous modulo cpu cache effects.
> 
> How does one detect or work around or deal with "cpu cache effects"?  I
> assume some sort of CPU cache flush instruction is what is meant here,
> but I think we could mention the basics of what has to be done here:
> 
> "A DAX file may support being mapped with the MAP_SYNC flag, which
> enables a program to use CPU cache flush operations to persist CPU store
> operations without an explicit fsync(2).  See mmap(2) for more
> information."?

That sounds better.  I like the reference to mmap as well.

Ok I changed a couple of things as well.  How does this sound?


STATX_ATTR_DAX 

        DAX (cpu direct access) is a file mode that attempts to minimize
        software cache effects for both I/O and memory mappings of this
        file.  It requires a block device and file system which have
        been configured to support DAX.

        DAX generally assumes all accesses are via cpu load / store
        instructions which can minimize overhead for small accesses, but
        may adversely affect cpu utilization for large transfers.

        File I/O is done directly to/from user-space buffers and memory
        mapped I/O may be performed with direct memory mappings that
        bypass kernel page cache.

        While the DAX property tends to result in data being transferred
        synchronously, it does not give the same guarantees of
        synchronous I/O where data and the necessary metadata are
        transferred together.

        A DAX file may support being mapped with the MAP_SYNC flag,
        which enables a program to use CPU cache flush operations to
        persist CPU store operations without an explicit fsync(2).  See
        mmap(2) for more information.


Ira

> 
> Oof, a paragraph break would be nice. :)
> 
> --D
> 
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > This looks good to me. You can add:
> > 
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > 								Honza
> > 
> > > ---
> > >  fs/stat.c                 | 3 +++
> > >  include/uapi/linux/stat.h | 1 +
> > >  2 files changed, 4 insertions(+)
> > > 
> > > diff --git a/fs/stat.c b/fs/stat.c
> > > index 030008796479..894699c74dde 100644
> > > --- a/fs/stat.c
> > > +++ b/fs/stat.c
> > > @@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
> > >  	if (IS_AUTOMOUNT(inode))
> > >  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
> > >  
> > > +	if (IS_DAX(inode))
> > > +		stat->attributes |= STATX_ATTR_DAX;
> > > +
> > >  	if (inode->i_op->getattr)
> > >  		return inode->i_op->getattr(path, stat, request_mask,
> > >  					    query_flags);
> > > diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> > > index ad80a5c885d5..e5f9d5517f6b 100644
> > > --- a/include/uapi/linux/stat.h
> > > +++ b/include/uapi/linux/stat.h
> > > @@ -169,6 +169,7 @@ struct statx {
> > >  #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
> > >  #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
> > >  #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
> > > +#define STATX_ATTR_DAX			0x00002000 /* [I] File is DAX */
> > >  
> > >  
> > >  #endif /* _UAPI_LINUX_STAT_H */
> > > -- 
> > > 2.21.0
> > > 
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-15 19:45       ` Ira Weiny
@ 2020-01-15 20:10         ` Dan Williams
  2020-01-15 22:38           ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Dan Williams @ 2020-01-15 20:10 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Darrick J. Wong, Jan Kara, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote:
> > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > >
> > > > In order for users to determine if a file is currently operating in DAX
> > > > mode (effective DAX).  Define a statx attribute value and set that
> > > > attribute if the effective DAX flag is set.
> > > >
> > > > To go along with this we propose the following addition to the statx man
> > > > page:
> > > >
> > > > STATX_ATTR_DAX
> > > >
> > > >   DAX (cpu direct access) is a file mode that attempts to minimize
> >
> > "..is a file I/O mode"?
>
> or  "... is a file state ..."?
>
> > > >   software cache effects for both I/O and memory mappings of this
> > > >   file.  It requires a capable device, a compatible filesystem
> > > >   block size, and filesystem opt-in.
> >
> > "...a capable storage device..."
>
> Done
>
> >
> > What does "compatible fs block size" mean?  How does the user figure out
> > if their fs blocksize is compatible?  Do we tell users to refer their
> > filesystem's documentation here?
>
> Perhaps it is wrong for this to be in the man page at all?  Would it be better
> to assume the file system and block device are already configured properly by
> the admin?
>
> For which the blocksize restrictions are already well documented.  ie:
>
> https://www.kernel.org/doc/Documentation/filesystems/dax.txt
>
> ?
>
> How about changing the text to:
>
>         It requires a block device and file system which have been configured
>         to support DAX.
>
> ?

The goal was to document the gauntlet of checks that
__generic_fsdax_supported() performs so someone could debug "why am I
not able to get dax operation?"

>
> >
> > > > It generally assumes all
> > > >   accesses are via cpu load / store instructions which can
> > > >   minimize overhead for small accesses, but adversely affect cpu
> > > >   utilization for large transfers.
> >
> > Will this always be true for persistent memory?

For direct-mapped pmem there is no opportunity to do dma offload so it
will always be true that application dax access consumes cpu to do I/O
where something like NVMe does not. There has been unfruitful to date
experiments with the driver using an offload engine for kernel
internal I/O, but if you're use case is kernel internal I/O bound then
you don't need dax.

>
> I'm not clear.  Did you mean; "this" == adverse utilization for large transfers?
>
> >
> > I wasn't even aware that large transfers adversely affected CPU
> > utilization. ;)
>
> Sure vs using a DMA engine for example.

Right, this is purely a statement about cpu memcpy vs device-dma.

>
> >
> > > >  File I/O is done directly
> > > >   to/from user-space buffers. While the DAX property tends to
> > > >   result in data being transferred synchronously it does not give
> >
> > "...transferred synchronously, it does not..."
>
> done.
>
> >
> > > >   the guarantees of synchronous I/O that data and necessary
> >
> > "...it does not guarantee that I/O or file metadata have been flushed to
> > the storage device."
>
> The lack of guarantee here is mainly regarding metadata.
>
> How about:
>
>         While the DAX property tends to result in data being transferred
>         synchronously, it does not give the same guarantees of
>         synchronous I/O where data and the necessary metadata are
>         transferred together.
>
> >
> > > >   metadata are transferred. Memory mapped I/O may be performed
> > > >   with direct mappings that bypass system memory buffering.
> >
> > "...with direct memory mappings that bypass kernel page cache."
>
> Done.
>
> >
> > > > Again
> > > >   while memory-mapped I/O tends to result in data being
> >
> > I would move the sentence about "Memory mapped I/O..." to directly after
> > the sentence about file I/O being done directly to and from userspace so
> > that you don't need to repeat this statement.
>
> Done.
>
> >
> > > >   transferred synchronously it does not guarantee synchronous
> > > >   metadata updates. A dax file may optionally support being mapped
> > > >   with the MAP_SYNC flag which does allow cpu store operations to
> > > >   be considered synchronous modulo cpu cache effects.
> >
> > How does one detect or work around or deal with "cpu cache effects"?  I
> > assume some sort of CPU cache flush instruction is what is meant here,
> > but I think we could mention the basics of what has to be done here:
> >
> > "A DAX file may support being mapped with the MAP_SYNC flag, which
> > enables a program to use CPU cache flush operations to persist CPU store
> > operations without an explicit fsync(2).  See mmap(2) for more
> > information."?
>
> That sounds better.  I like the reference to mmap as well.
>
> Ok I changed a couple of things as well.  How does this sound?
>
>
> STATX_ATTR_DAX
>
>         DAX (cpu direct access) is a file mode that attempts to minimize

s/mode/state/?

>         software cache effects for both I/O and memory mappings of this
>         file.  It requires a block device and file system which have
>         been configured to support DAX.

It may not require a block device in the future.

>
>         DAX generally assumes all accesses are via cpu load / store
>         instructions which can minimize overhead for small accesses, but
>         may adversely affect cpu utilization for large transfers.
>
>         File I/O is done directly to/from user-space buffers and memory
>         mapped I/O may be performed with direct memory mappings that
>         bypass kernel page cache.
>
>         While the DAX property tends to result in data being transferred
>         synchronously, it does not give the same guarantees of
>         synchronous I/O where data and the necessary metadata are

Maybe use "O_SYNC I/O" explicitly to further differentiate the 2
meanings of "synchronous" in this sentence?

>         transferred together.
>
>         A DAX file may support being mapped with the MAP_SYNC flag,
>         which enables a program to use CPU cache flush operations to

s/operations/instructions/

>         persist CPU store operations without an explicit fsync(2).  See
>         mmap(2) for more information.

I think this also wants a reference to the Linux interpretation of
platform "persistence domains" we were discussing that here [1], but
maybe it should be part of a "pmem" manpage that can be referenced
from this man page.

[1]: http://lore.kernel.org/r/20200108064905.170394-1-aneesh.kumar@linux.ibm.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-15 20:10         ` Dan Williams
@ 2020-01-15 22:38           ` Ira Weiny
  2020-01-16  5:39             ` Darrick J. Wong
  0 siblings, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2020-01-15 22:38 UTC (permalink / raw)
  To: Dan Williams
  Cc: Darrick J. Wong, Jan Kara, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 12:10:50PM -0800, Dan Williams wrote:
> On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote:
> >
> > On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote:
> > > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> > > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > >

[snip]

> > Ok I changed a couple of things as well.  How does this sound?
> >
> >
> > STATX_ATTR_DAX
> >
> >         DAX (cpu direct access) is a file mode that attempts to minimize
> 
> s/mode/state/?

DOH!  yes state...  ;-)

> 
> >         software cache effects for both I/O and memory mappings of this
> >         file.  It requires a block device and file system which have
> >         been configured to support DAX.
> 
> It may not require a block device in the future.

Ok:

"It requires a file system which has been configured to support DAX." ?

I'm trying to separate the user of the individual STATX DAX flag from the Admin
details of configuring the file system and/or devices which supports it.

Also, I just realized that we should follow the format of the other STATX_*
attributes.  They all read something like "the file is..."

So I'm adding that text as well.

> 
> >
> >         DAX generally assumes all accesses are via cpu load / store
> >         instructions which can minimize overhead for small accesses, but
> >         may adversely affect cpu utilization for large transfers.
> >
> >         File I/O is done directly to/from user-space buffers and memory
> >         mapped I/O may be performed with direct memory mappings that
> >         bypass kernel page cache.
> >
> >         While the DAX property tends to result in data being transferred
> >         synchronously, it does not give the same guarantees of
> >         synchronous I/O where data and the necessary metadata are
> 
> Maybe use "O_SYNC I/O" explicitly to further differentiate the 2
> meanings of "synchronous" in this sentence?

Done.

> 
> >         transferred together.
> >
> >         A DAX file may support being mapped with the MAP_SYNC flag,
> >         which enables a program to use CPU cache flush operations to
> 
> s/operations/instructions/

Done.

> 
> >         persist CPU store operations without an explicit fsync(2).  See
> >         mmap(2) for more information.
> 
> I think this also wants a reference to the Linux interpretation of
> platform "persistence domains" we were discussing that here [1], but
> maybe it should be part of a "pmem" manpage that can be referenced
> from this man page.

Sure, but for now I think referencing mmap for details on MAP_SYNC works.

I suspect that we may have some word smithing once I get this series in and we
submit a change to the statx man page itself.  Can I move forward with the
following for this patch?

<quote>
STATX_ATTR_DAX

        The file is in the DAX (cpu direct access) state.  DAX state
        attempts to minimize software cache effects for both I/O and
        memory mappings of this file.  It requires a file system which
        has been configured to support DAX.

        DAX generally assumes all accesses are via cpu load / store
        instructions which can minimize overhead for small accesses, but
        may adversely affect cpu utilization for large transfers.

        File I/O is done directly to/from user-space buffers and memory
        mapped I/O may be performed with direct memory mappings that
        bypass kernel page cache.

        While the DAX property tends to result in data being transferred
        synchronously, it does not give the same guarantees of
        synchronous I/O where data and the necessary metadata are
        transferred together.

        A DAX file may support being mapped with the MAP_SYNC flag,
        which enables a program to use CPU cache flush instructions to
        persist CPU store operations without an explicit fsync(2).  See
        mmap(2) for more information.
</quote>

Ira

> 
> [1]: http://lore.kernel.org/r/20200108064905.170394-1-aneesh.kumar@linux.ibm.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 08/12] fs/xfs: Add lock/unlock mode to xfs
  2020-01-13 22:19   ` Darrick J. Wong
  2020-01-14  0:35     ` Ira Weiny
@ 2020-01-15 23:52     ` Ira Weiny
  1 sibling, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2020-01-15 23:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Jan 13, 2020 at 02:19:57PM -0800, Darrick J. Wong wrote:
> On Fri, Jan 10, 2020 at 11:29:38AM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 

[snip]

> >  
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 401da197f012..e8fd95b75e5b 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> >   *
> >   * Basic locking order:
> >   *
> > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > + * i_rwsem -> i_dax_sem -> i_mmap_lock -> page_lock -> i_ilock
> 
> Mmmmmm, more locks.  Can we skip the extra lock if CONFIG_FSDAX=n or if
> the filesystem devices don't support DAX at all?

I've looked into this a bit more and I think skipping on CONFIG_FSDAX=n is ok
but doing so on individual devices is not going to be possible because we don't
have that information at the vfs layer.

I'll continue to think about how to mitigate this more while I add CONFIG_FSDAX
checks before rolling out a new patch set.

> 
> Also, I don't think we're actually following the i_rwsem -> i_daxsem
> order in fallocate, and possibly elsewhere too?
> 
> Does the vfs have to take the i_dax_sem to do remapping things like
> reflink?  (Pretend that reflink and dax are compatible for the moment)
> 

I spoke with Dan today about this and we believe the answer is going to be yes.
However, not until the code is added to support DAX in reflink.  Because
currently this is only required for places which use "IS_DAX()" to see the
state of the inode.

Currently the vfs layer does not have any IS_DAX() checks in the reflink path.
But when they are added they will be required to take this sem.

Ira


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-15 22:38           ` Ira Weiny
@ 2020-01-16  5:39             ` Darrick J. Wong
  2020-01-16  6:05               ` Dan Williams
  2020-01-16 17:55               ` Ira Weiny
  0 siblings, 2 replies; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-16  5:39 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Jan Kara, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 02:38:21PM -0800, Ira Weiny wrote:
> On Wed, Jan 15, 2020 at 12:10:50PM -0800, Dan Williams wrote:
> > On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote:
> > >
> > > On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote:
> > > > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> > > > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > > >
> 
> [snip]
> 
> > > Ok I changed a couple of things as well.  How does this sound?
> > >
> > >
> > > STATX_ATTR_DAX
> > >
> > >         DAX (cpu direct access) is a file mode that attempts to minimize
> > 
> > s/mode/state/?
> 
> DOH!  yes state...  ;-)
> 
> > 
> > >         software cache effects for both I/O and memory mappings of this
> > >         file.  It requires a block device and file system which have
> > >         been configured to support DAX.
> > 
> > It may not require a block device in the future.
> 
> Ok:
> 
> "It requires a file system which has been configured to support DAX." ?
> 
> I'm trying to separate the user of the individual STATX DAX flag from the Admin
> details of configuring the file system and/or devices which supports it.
> 
> Also, I just realized that we should follow the format of the other STATX_*
> attributes.  They all read something like "the file is..."
> 
> So I'm adding that text as well.
> 
> > 
> > >
> > >         DAX generally assumes all accesses are via cpu load / store
> > >         instructions which can minimize overhead for small accesses, but
> > >         may adversely affect cpu utilization for large transfers.
> > >
> > >         File I/O is done directly to/from user-space buffers and memory
> > >         mapped I/O may be performed with direct memory mappings that
> > >         bypass kernel page cache.
> > >
> > >         While the DAX property tends to result in data being transferred
> > >         synchronously, it does not give the same guarantees of
> > >         synchronous I/O where data and the necessary metadata are
> > 
> > Maybe use "O_SYNC I/O" explicitly to further differentiate the 2
> > meanings of "synchronous" in this sentence?
> 
> Done.
> 
> > 
> > >         transferred together.
> > >
> > >         A DAX file may support being mapped with the MAP_SYNC flag,
> > >         which enables a program to use CPU cache flush operations to
> > 
> > s/operations/instructions/
> 
> Done.
> 
> > 
> > >         persist CPU store operations without an explicit fsync(2).  See
> > >         mmap(2) for more information.
> > 
> > I think this also wants a reference to the Linux interpretation of
> > platform "persistence domains" we were discussing that here [1], but
> > maybe it should be part of a "pmem" manpage that can be referenced
> > from this man page.
> 
> Sure, but for now I think referencing mmap for details on MAP_SYNC works.
> 
> I suspect that we may have some word smithing once I get this series in and we
> submit a change to the statx man page itself.  Can I move forward with the
> following for this patch?
> 
> <quote>
> STATX_ATTR_DAX
> 
>         The file is in the DAX (cpu direct access) state.  DAX state

Hmm, now that I see it written out, I <cough> kind of like "DAX mode"
better now. :/

"The file is in DAX (CPU direct access) mode.  DAX mode attempts..."

>         attempts to minimize software cache effects for both I/O and
>         memory mappings of this file.  It requires a file system which
>         has been configured to support DAX.
> 
>         DAX generally assumes all accesses are via cpu load / store
>         instructions which can minimize overhead for small accesses, but
>         may adversely affect cpu utilization for large transfers.
> 
>         File I/O is done directly to/from user-space buffers and memory
>         mapped I/O may be performed with direct memory mappings that
>         bypass kernel page cache.
> 
>         While the DAX property tends to result in data being transferred
>         synchronously, it does not give the same guarantees of
>         synchronous I/O where data and the necessary metadata are
>         transferred together.

(I'm frankly not sure that synchronous I/O actually guarantees that the
metadata has hit stable storage...)

--D

>         A DAX file may support being mapped with the MAP_SYNC flag,
>         which enables a program to use CPU cache flush instructions to
>         persist CPU store operations without an explicit fsync(2).  See
>         mmap(2) for more information.
> </quote>
> 
> Ira
> 
> > 
> > [1]: http://lore.kernel.org/r/20200108064905.170394-1-aneesh.kumar@linux.ibm.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 07/12] fs: Add locking for a dynamic inode 'mode'
  2020-01-15 19:08         ` Ira Weiny
@ 2020-01-16  5:40           ` Darrick J. Wong
  2020-01-16 18:54             ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-16  5:40 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 11:08:46AM -0800, Ira Weiny wrote:
> On Mon, Jan 13, 2020 at 05:03:22PM -0800, Darrick J. Wong wrote:
> > On Mon, Jan 13, 2020 at 04:20:05PM -0800, Ira Weiny wrote:
> > > On Mon, Jan 13, 2020 at 02:12:18PM -0800, Darrick J. Wong wrote:
> > > > On Fri, Jan 10, 2020 at 11:29:37AM -0800, ira.weiny@intel.com wrote:
> > > > > From: Ira Weiny <ira.weiny@intel.com>
> 
> [snip]
> 
> > > > > +``lock_mode``
> > > > > +	called to prevent operations which depend on the inode's mode from
> > > > > +        proceeding should a mode change be in progress
> > > > 
> > > > "Inodes can't change mode, because files do not suddenly become
> > > > directories". ;)
> > > 
> > > Yea sorry.
> > > 
> > > > 
> > > > Oh, you meant "lock_XXXX is called to prevent a change in the pagecache
> > > > mode from proceeding while there are address space operations in
> > > > progress".  So these are really more aops get and put functions...
> > > 
> > > At first I actually did have aops get/put functions but this is really
> > > protecting more than the aops vector because as Christoph said there are file
> > > operations which need to be protected not just address space operations.
> > > 
> > > But I agree "mode" is a bad name...  Sorry...
> > 
> > inode_fops_{get,set}(), then?
> > 
> > inode_start_fileop()
> > inode_end_fileop() ?
> > 
> > Trying to avoid sounding foppish <COUGH>
> 
> What about?
> 
> inode_[un]lock_state()?

Kinda vague -- which state?  inodes retain a lot of different state.

This locking primitive ensures that file operations pointers can't
change while any operations are ongoing, right?

--D

> Ira
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-16  5:39             ` Darrick J. Wong
@ 2020-01-16  6:05               ` Dan Williams
  2020-01-16  6:18                 ` Darrick J. Wong
  2020-01-18  9:11                 ` Dave Chinner
  2020-01-16 17:55               ` Ira Weiny
  1 sibling, 2 replies; 55+ messages in thread
From: Dan Williams @ 2020-01-16  6:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ira Weiny, Jan Kara, Linux Kernel Mailing List, Alexander Viro,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 9:39 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
[..]
> >         attempts to minimize software cache effects for both I/O and
> >         memory mappings of this file.  It requires a file system which
> >         has been configured to support DAX.
> >
> >         DAX generally assumes all accesses are via cpu load / store
> >         instructions which can minimize overhead for small accesses, but
> >         may adversely affect cpu utilization for large transfers.
> >
> >         File I/O is done directly to/from user-space buffers and memory
> >         mapped I/O may be performed with direct memory mappings that
> >         bypass kernel page cache.
> >
> >         While the DAX property tends to result in data being transferred
> >         synchronously, it does not give the same guarantees of
> >         synchronous I/O where data and the necessary metadata are
> >         transferred together.
>
> (I'm frankly not sure that synchronous I/O actually guarantees that the
> metadata has hit stable storage...)

Oh? That text was motivated by the open(2) man page description of O_SYNC.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-16  6:05               ` Dan Williams
@ 2020-01-16  6:18                 ` Darrick J. Wong
  2020-01-16  6:25                   ` Dan Williams
  2020-01-18  9:11                 ` Dave Chinner
  1 sibling, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-16  6:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ira Weiny, Jan Kara, Linux Kernel Mailing List, Alexander Viro,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 10:05:00PM -0800, Dan Williams wrote:
> On Wed, Jan 15, 2020 at 9:39 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> [..]
> > >         attempts to minimize software cache effects for both I/O and
> > >         memory mappings of this file.  It requires a file system which
> > >         has been configured to support DAX.
> > >
> > >         DAX generally assumes all accesses are via cpu load / store
> > >         instructions which can minimize overhead for small accesses, but
> > >         may adversely affect cpu utilization for large transfers.
> > >
> > >         File I/O is done directly to/from user-space buffers and memory
> > >         mapped I/O may be performed with direct memory mappings that
> > >         bypass kernel page cache.
> > >
> > >         While the DAX property tends to result in data being transferred
> > >         synchronously, it does not give the same guarantees of
> > >         synchronous I/O where data and the necessary metadata are
> > >         transferred together.
> >
> > (I'm frankly not sure that synchronous I/O actually guarantees that the
> > metadata has hit stable storage...)
> 
> Oh? That text was motivated by the open(2) man page description of O_SYNC.

Eh, that's just me being cynical about software.  Yes, the O_SYNC docs
say that data+metadata are supposed to happen; that's good enough for
another section in the man pages. :)

--D

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-16  6:18                 ` Darrick J. Wong
@ 2020-01-16  6:25                   ` Dan Williams
  0 siblings, 0 replies; 55+ messages in thread
From: Dan Williams @ 2020-01-16  6:25 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ira Weiny, Jan Kara, Linux Kernel Mailing List, Alexander Viro,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 10:18 PM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>
> On Wed, Jan 15, 2020 at 10:05:00PM -0800, Dan Williams wrote:
> > On Wed, Jan 15, 2020 at 9:39 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > [..]
> > > >         attempts to minimize software cache effects for both I/O and
> > > >         memory mappings of this file.  It requires a file system which
> > > >         has been configured to support DAX.
> > > >
> > > >         DAX generally assumes all accesses are via cpu load / store
> > > >         instructions which can minimize overhead for small accesses, but
> > > >         may adversely affect cpu utilization for large transfers.
> > > >
> > > >         File I/O is done directly to/from user-space buffers and memory
> > > >         mapped I/O may be performed with direct memory mappings that
> > > >         bypass kernel page cache.
> > > >
> > > >         While the DAX property tends to result in data being transferred
> > > >         synchronously, it does not give the same guarantees of
> > > >         synchronous I/O where data and the necessary metadata are
> > > >         transferred together.
> > >
> > > (I'm frankly not sure that synchronous I/O actually guarantees that the
> > > metadata has hit stable storage...)
> >
> > Oh? That text was motivated by the open(2) man page description of O_SYNC.
>
> Eh, that's just me being cynical about software.  Yes, the O_SYNC docs
> say that data+metadata are supposed to happen; that's good enough for
> another section in the man pages. :)
>

Ah ok, yes, "all storage is a lie".

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 08/12] fs/xfs: Add lock/unlock mode to xfs
  2020-01-10 19:29 ` [RFC PATCH V2 08/12] fs/xfs: Add lock/unlock mode to xfs ira.weiny
  2020-01-13 22:19   ` Darrick J. Wong
@ 2020-01-16  9:24   ` Jan Kara
  2020-01-16 19:12     ` Ira Weiny
  1 sibling, 1 reply; 55+ messages in thread
From: Jan Kara @ 2020-01-16  9:24 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Fri 10-01-20 11:29:38, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> XFS requires regular files to be locked while changing to/from DAX mode.
> 
> Define a new DAX lock type and implement the [un]lock_mode() inode
> operation callbacks.
> 
> We define a new XFS_DAX_* lock type to carry the lock through the
> transaction because we don't want to use IOLOCK as that would cause
> performance issues with locking of the inode itself.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
...
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 492e53992fa9..693ca66bd89b 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -67,6 +67,9 @@ typedef struct xfs_inode {
>  	spinlock_t		i_ioend_lock;
>  	struct work_struct	i_ioend_work;
>  	struct list_head	i_ioend_list;
> +
> +	/* protect changing the mode to/from DAX */
> +	struct percpu_rw_semaphore i_dax_sem;
>  } xfs_inode_t;

This adds overhead of ~32k per inode for typical distro kernel. That's not
going to fly. That's why ext4 has similar kind of lock in the superblock
shared by all inodes. For read side it does not matter because that's
per-cpu and shared lock. For write side we don't care as changing inode
access mode should be rare.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 05/12] fs: remove unneeded IS_DAX() check
  2020-01-10 19:29 ` [RFC PATCH V2 05/12] fs: remove unneeded IS_DAX() check ira.weiny
@ 2020-01-16  9:38   ` Jan Kara
  2020-01-16 18:47     ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Jan Kara @ 2020-01-16  9:38 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Fri 10-01-20 11:29:35, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The IS_DAX() check in io_is_direct() causes a race between changing the
> DAX mode and creating the iocb flags.
> 
> Remove the check because DAX now emulates the page cache API and
> therefore it does not matter if the file mode is DAX or not when the
> iocb flags are created.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/fs.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index d7584bcef5d3..e11989502eac 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3365,7 +3365,7 @@ extern int file_update_time(struct file *file);
>  
>  static inline bool io_is_direct(struct file *filp)
>  {
> -	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
> +	return (filp->f_flags & O_DIRECT);
>  }
>  
>  static inline bool vma_is_dax(struct vm_area_struct *vma)
> -- 
> 2.21.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-16  5:39             ` Darrick J. Wong
  2020-01-16  6:05               ` Dan Williams
@ 2020-01-16 17:55               ` Ira Weiny
  2020-01-16 18:04                 ` Darrick J. Wong
  1 sibling, 1 reply; 55+ messages in thread
From: Ira Weiny @ 2020-01-16 17:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, Jan Kara, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 09:39:35PM -0800, Darrick J. Wong wrote:
> On Wed, Jan 15, 2020 at 02:38:21PM -0800, Ira Weiny wrote:
> > On Wed, Jan 15, 2020 at 12:10:50PM -0800, Dan Williams wrote:
> > > On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote:
> > > >
> > > > On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote:
> > > > > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> > > > > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > > > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > > > >
> > 

[snip]

> > 
> > Sure, but for now I think referencing mmap for details on MAP_SYNC works.
> > 
> > I suspect that we may have some word smithing once I get this series in and we
> > submit a change to the statx man page itself.  Can I move forward with the
> > following for this patch?
> > 
> > <quote>
> > STATX_ATTR_DAX
> > 
> >         The file is in the DAX (cpu direct access) state.  DAX state
> 
> Hmm, now that I see it written out, I <cough> kind of like "DAX mode"
> better now. :/
> 
> "The file is in DAX (CPU direct access) mode.  DAX mode attempts..."

Sure...  now you tell me...  ;-)

Seriously, we could use mode here in the man page as this is less confusing to
say "DAX mode".

But I think the code should still use 'state' because mode is just too
overloaded.  You were not the only one who was thrown by my use of mode and I
don't want that confusion when we look at this code 2 weeks from now...

https://www.reddit.com/r/ProgrammerHumor/comments/852og2/only_god_knows/

;-)

> 
> >         attempts to minimize software cache effects for both I/O and
> >         memory mappings of this file.  It requires a file system which
> >         has been configured to support DAX.
> > 
> >         DAX generally assumes all accesses are via cpu load / store
> >         instructions which can minimize overhead for small accesses, but
> >         may adversely affect cpu utilization for large transfers.
> > 
> >         File I/O is done directly to/from user-space buffers and memory
> >         mapped I/O may be performed with direct memory mappings that
> >         bypass kernel page cache.
> > 
> >         While the DAX property tends to result in data being transferred
> >         synchronously, it does not give the same guarantees of
> >         synchronous I/O where data and the necessary metadata are
> >         transferred together.
> 
> (I'm frankly not sure that synchronous I/O actually guarantees that the
> metadata has hit stable storage...)

I'll let you and Dan work this one out...  ;-)

Ira


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-16 17:55               ` Ira Weiny
@ 2020-01-16 18:04                 ` Darrick J. Wong
  2020-01-16 18:52                   ` Ira Weiny
  0 siblings, 1 reply; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-16 18:04 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Jan Kara, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

On Thu, Jan 16, 2020 at 09:55:02AM -0800, Ira Weiny wrote:
> On Wed, Jan 15, 2020 at 09:39:35PM -0800, Darrick J. Wong wrote:
> > On Wed, Jan 15, 2020 at 02:38:21PM -0800, Ira Weiny wrote:
> > > On Wed, Jan 15, 2020 at 12:10:50PM -0800, Dan Williams wrote:
> > > > On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote:
> > > > >
> > > > > On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote:
> > > > > > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> > > > > > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > > > > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > > > > >
> > > 
> 
> [snip]
> 
> > > 
> > > Sure, but for now I think referencing mmap for details on MAP_SYNC works.
> > > 
> > > I suspect that we may have some word smithing once I get this series in and we
> > > submit a change to the statx man page itself.  Can I move forward with the
> > > following for this patch?
> > > 
> > > <quote>
> > > STATX_ATTR_DAX
> > > 
> > >         The file is in the DAX (cpu direct access) state.  DAX state
> > 
> > Hmm, now that I see it written out, I <cough> kind of like "DAX mode"
> > better now. :/
> > 
> > "The file is in DAX (CPU direct access) mode.  DAX mode attempts..."
> 
> Sure...  now you tell me...  ;-)
> 
> Seriously, we could use mode here in the man page as this is less confusing to
> say "DAX mode".
> 
> But I think the code should still use 'state' because mode is just too
> overloaded.  You were not the only one who was thrown by my use of mode and I
> don't want that confusion when we look at this code 2 weeks from now...
> 
> https://www.reddit.com/r/ProgrammerHumor/comments/852og2/only_god_knows/
> 
> ;-)

Ok, let's leave it alone for now then.

I'm not even sure what 'DAX' stands for.  Direct Access to ...
Professor Xavier? 8-)

> > 
> > >         attempts to minimize software cache effects for both I/O and
> > >         memory mappings of this file.  It requires a file system which
> > >         has been configured to support DAX.
> > > 
> > >         DAX generally assumes all accesses are via cpu load / store
> > >         instructions which can minimize overhead for small accesses, but
> > >         may adversely affect cpu utilization for large transfers.
> > > 
> > >         File I/O is done directly to/from user-space buffers and memory
> > >         mapped I/O may be performed with direct memory mappings that
> > >         bypass kernel page cache.
> > > 
> > >         While the DAX property tends to result in data being transferred
> > >         synchronously, it does not give the same guarantees of
> > >         synchronous I/O where data and the necessary metadata are
> > >         transferred together.
> > 
> > (I'm frankly not sure that synchronous I/O actually guarantees that the
> > metadata has hit stable storage...)
> 
> I'll let you and Dan work this one out...  ;-)

Hehe.  I think the wording here is fine.

--D

> Ira
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 05/12] fs: remove unneeded IS_DAX() check
  2020-01-16  9:38   ` Jan Kara
@ 2020-01-16 18:47     ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2020-01-16 18:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Thu, Jan 16, 2020 at 10:38:07AM +0100, Jan Kara wrote:
> On Fri 10-01-20 11:29:35, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > The IS_DAX() check in io_is_direct() causes a race between changing the
> > DAX mode and creating the iocb flags.
> > 
> > Remove the check because DAX now emulates the page cache API and
> > therefore it does not matter if the file mode is DAX or not when the
> > iocb flags are created.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> The patch looks good to me. You can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks,
Ira

> 
> 								Honza
> 
> > ---
> >  include/linux/fs.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index d7584bcef5d3..e11989502eac 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3365,7 +3365,7 @@ extern int file_update_time(struct file *file);
> >  
> >  static inline bool io_is_direct(struct file *filp)
> >  {
> > -	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
> > +	return (filp->f_flags & O_DIRECT);
> >  }
> >  
> >  static inline bool vma_is_dax(struct vm_area_struct *vma)
> > -- 
> > 2.21.0
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-16 18:04                 ` Darrick J. Wong
@ 2020-01-16 18:52                   ` Ira Weiny
  2020-01-16 22:19                     ` Darrick J. Wong
  2020-01-17 11:58                     ` Jan Kara
  0 siblings, 2 replies; 55+ messages in thread
From: Ira Weiny @ 2020-01-16 18:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, Jan Kara, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

On Thu, Jan 16, 2020 at 10:04:21AM -0800, Darrick J. Wong wrote:
> On Thu, Jan 16, 2020 at 09:55:02AM -0800, Ira Weiny wrote:
> > On Wed, Jan 15, 2020 at 09:39:35PM -0800, Darrick J. Wong wrote:
> > > On Wed, Jan 15, 2020 at 02:38:21PM -0800, Ira Weiny wrote:
> > > > On Wed, Jan 15, 2020 at 12:10:50PM -0800, Dan Williams wrote:
> > > > > On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote:
> > > > > >
> > > > > > On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote:
> > > > > > > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> > > > > > > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > > > > > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > > > > > >
> > > > 
> > 
> > [snip]
> > 
> > > > 
> > > > Sure, but for now I think referencing mmap for details on MAP_SYNC works.
> > > > 
> > > > I suspect that we may have some word smithing once I get this series in and we
> > > > submit a change to the statx man page itself.  Can I move forward with the
> > > > following for this patch?
> > > > 
> > > > <quote>
> > > > STATX_ATTR_DAX
> > > > 
> > > >         The file is in the DAX (cpu direct access) state.  DAX state
> > > 
> > > Hmm, now that I see it written out, I <cough> kind of like "DAX mode"
> > > better now. :/
> > > 
> > > "The file is in DAX (CPU direct access) mode.  DAX mode attempts..."
> > 
> > Sure...  now you tell me...  ;-)
> > 
> > Seriously, we could use mode here in the man page as this is less confusing to
> > say "DAX mode".
> > 
> > But I think the code should still use 'state' because mode is just too
> > overloaded.  You were not the only one who was thrown by my use of mode and I
> > don't want that confusion when we look at this code 2 weeks from now...
> > 
> > https://www.reddit.com/r/ProgrammerHumor/comments/852og2/only_god_knows/
> > 
> > ;-)
> 
> Ok, let's leave it alone for now then.

Cool could I get a reviewed by?

And Jan is this reword of the man page/commit ok to keep your reviewed by?

> 
> I'm not even sure what 'DAX' stands for.  Direct Access to ...
> Professor Xavier? 8-)

That is pronounced 'Direct A'Xes'  you know, for chopping wood!

Thanks everyone,
Ira

> 
> > > 
> > > >         attempts to minimize software cache effects for both I/O and
> > > >         memory mappings of this file.  It requires a file system which
> > > >         has been configured to support DAX.
> > > > 
> > > >         DAX generally assumes all accesses are via cpu load / store
> > > >         instructions which can minimize overhead for small accesses, but
> > > >         may adversely affect cpu utilization for large transfers.
> > > > 
> > > >         File I/O is done directly to/from user-space buffers and memory
> > > >         mapped I/O may be performed with direct memory mappings that
> > > >         bypass kernel page cache.
> > > > 
> > > >         While the DAX property tends to result in data being transferred
> > > >         synchronously, it does not give the same guarantees of
> > > >         synchronous I/O where data and the necessary metadata are
> > > >         transferred together.
> > > 
> > > (I'm frankly not sure that synchronous I/O actually guarantees that the
> > > metadata has hit stable storage...)
> > 
> > I'll let you and Dan work this one out...  ;-)
> 
> Hehe.  I think the wording here is fine.
> 
> --D
> 
> > Ira
> > 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 07/12] fs: Add locking for a dynamic inode 'mode'
  2020-01-16  5:40           ` Darrick J. Wong
@ 2020-01-16 18:54             ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2020-01-16 18:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 09:40:40PM -0800, Darrick J. Wong wrote:
> On Wed, Jan 15, 2020 at 11:08:46AM -0800, Ira Weiny wrote:
> > On Mon, Jan 13, 2020 at 05:03:22PM -0800, Darrick J. Wong wrote:
> > > On Mon, Jan 13, 2020 at 04:20:05PM -0800, Ira Weiny wrote:
> > > > On Mon, Jan 13, 2020 at 02:12:18PM -0800, Darrick J. Wong wrote:
> > > > > On Fri, Jan 10, 2020 at 11:29:37AM -0800, ira.weiny@intel.com wrote:
> > > > > > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > [snip]
> > 
> > > > > > +``lock_mode``
> > > > > > +	called to prevent operations which depend on the inode's mode from
> > > > > > +        proceeding should a mode change be in progress
> > > > > 
> > > > > "Inodes can't change mode, because files do not suddenly become
> > > > > directories". ;)
> > > > 
> > > > Yea sorry.
> > > > 
> > > > > 
> > > > > Oh, you meant "lock_XXXX is called to prevent a change in the pagecache
> > > > > mode from proceeding while there are address space operations in
> > > > > progress".  So these are really more aops get and put functions...
> > > > 
> > > > At first I actually did have aops get/put functions but this is really
> > > > protecting more than the aops vector because as Christoph said there are file
> > > > operations which need to be protected not just address space operations.
> > > > 
> > > > But I agree "mode" is a bad name...  Sorry...
> > > 
> > > inode_fops_{get,set}(), then?
> > > 
> > > inode_start_fileop()
> > > inode_end_fileop() ?
> > > 
> > > Trying to avoid sounding foppish <COUGH>
> > 
> > What about?
> > 
> > inode_[un]lock_state()?
> 
> Kinda vague -- which state?  inodes retain a lot of different state.
> 
> This locking primitive ensures that file operations pointers can't
> change while any operations are ongoing, right?

file ops
Address space ops
...

Whatever it needs...

With the current patch set we could be more specific and call it.

inode_[un]lock_dax_state()

Ira

> 
> --D
> 
> > Ira
> > 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 08/12] fs/xfs: Add lock/unlock mode to xfs
  2020-01-16  9:24   ` Jan Kara
@ 2020-01-16 19:12     ` Ira Weiny
  0 siblings, 0 replies; 55+ messages in thread
From: Ira Weiny @ 2020-01-16 19:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Thu, Jan 16, 2020 at 10:24:46AM +0100, Jan Kara wrote:
> On Fri 10-01-20 11:29:38, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > XFS requires regular files to be locked while changing to/from DAX mode.
> > 
> > Define a new DAX lock type and implement the [un]lock_mode() inode
> > operation callbacks.
> > 
> > We define a new XFS_DAX_* lock type to carry the lock through the
> > transaction because we don't want to use IOLOCK as that would cause
> > performance issues with locking of the inode itself.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ...
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index 492e53992fa9..693ca66bd89b 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -67,6 +67,9 @@ typedef struct xfs_inode {
> >  	spinlock_t		i_ioend_lock;
> >  	struct work_struct	i_ioend_work;
> >  	struct list_head	i_ioend_list;
> > +
> > +	/* protect changing the mode to/from DAX */
> > +	struct percpu_rw_semaphore i_dax_sem;
> >  } xfs_inode_t;
> 
> This adds overhead of ~32k per inode for typical distro kernel.

Wow!

> That's not
> going to fly.

Probably not...

> That's why ext4 has similar kind of lock in the superblock
> shared by all inodes. For read side it does not matter because that's
> per-cpu and shared lock. For write side we don't care as changing inode
> access mode should be rare.

Sounds reasonable to me.  I'll convert it.

Thanks for pointing this out, that would have been bad indeed.
Ira


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-16 18:52                   ` Ira Weiny
@ 2020-01-16 22:19                     ` Darrick J. Wong
  2020-01-17 11:58                     ` Jan Kara
  1 sibling, 0 replies; 55+ messages in thread
From: Darrick J. Wong @ 2020-01-16 22:19 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Jan Kara, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

On Thu, Jan 16, 2020 at 10:52:36AM -0800, Ira Weiny wrote:
> On Thu, Jan 16, 2020 at 10:04:21AM -0800, Darrick J. Wong wrote:
> > On Thu, Jan 16, 2020 at 09:55:02AM -0800, Ira Weiny wrote:
> > > On Wed, Jan 15, 2020 at 09:39:35PM -0800, Darrick J. Wong wrote:
> > > > On Wed, Jan 15, 2020 at 02:38:21PM -0800, Ira Weiny wrote:
> > > > > On Wed, Jan 15, 2020 at 12:10:50PM -0800, Dan Williams wrote:
> > > > > > On Wed, Jan 15, 2020 at 11:45 AM Ira Weiny <ira.weiny@intel.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jan 15, 2020 at 09:38:34AM -0800, Darrick J. Wong wrote:
> > > > > > > > On Wed, Jan 15, 2020 at 12:37:15PM +0100, Jan Kara wrote:
> > > > > > > > > On Fri 10-01-20 11:29:31, ira.weiny@intel.com wrote:
> > > > > > > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > > > > > > >
> > > > > 
> > > 
> > > [snip]
> > > 
> > > > > 
> > > > > Sure, but for now I think referencing mmap for details on MAP_SYNC works.
> > > > > 
> > > > > I suspect that we may have some word smithing once I get this series in and we
> > > > > submit a change to the statx man page itself.  Can I move forward with the
> > > > > following for this patch?
> > > > > 
> > > > > <quote>
> > > > > STATX_ATTR_DAX
> > > > > 
> > > > >         The file is in the DAX (cpu direct access) state.  DAX state
> > > > 
> > > > Hmm, now that I see it written out, I <cough> kind of like "DAX mode"
> > > > better now. :/
> > > > 
> > > > "The file is in DAX (CPU direct access) mode.  DAX mode attempts..."
> > > 
> > > Sure...  now you tell me...  ;-)
> > > 
> > > Seriously, we could use mode here in the man page as this is less confusing to
> > > say "DAX mode".
> > > 
> > > But I think the code should still use 'state' because mode is just too
> > > overloaded.  You were not the only one who was thrown by my use of mode and I
> > > don't want that confusion when we look at this code 2 weeks from now...
> > > 
> > > https://www.reddit.com/r/ProgrammerHumor/comments/852og2/only_god_knows/
> > > 
> > > ;-)
> > 
> > Ok, let's leave it alone for now then.
> 
> Cool could I get a reviewed by?

My bike shed is painted green with purple polka dots,

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> And Jan is this reword of the man page/commit ok to keep your reviewed by?
> 
> > 
> > I'm not even sure what 'DAX' stands for.  Direct Access to ...
> > Professor Xavier? 8-)
> 
> That is pronounced 'Direct A'Xes'  you know, for chopping wood!
> 
> Thanks everyone,
> Ira
> 
> > 
> > > > 
> > > > >         attempts to minimize software cache effects for both I/O and
> > > > >         memory mappings of this file.  It requires a file system which
> > > > >         has been configured to support DAX.
> > > > > 
> > > > >         DAX generally assumes all accesses are via cpu load / store
> > > > >         instructions which can minimize overhead for small accesses, but
> > > > >         may adversely affect cpu utilization for large transfers.
> > > > > 
> > > > >         File I/O is done directly to/from user-space buffers and memory
> > > > >         mapped I/O may be performed with direct memory mappings that
> > > > >         bypass kernel page cache.
> > > > > 
> > > > >         While the DAX property tends to result in data being transferred
> > > > >         synchronously, it does not give the same guarantees of
> > > > >         synchronous I/O where data and the necessary metadata are
> > > > >         transferred together.
> > > > 
> > > > (I'm frankly not sure that synchronous I/O actually guarantees that the
> > > > metadata has hit stable storage...)
> > > 
> > > I'll let you and Dan work this one out...  ;-)
> > 
> > Hehe.  I think the wording here is fine.
> > 
> > --D
> > 
> > > Ira
> > > 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-16 18:52                   ` Ira Weiny
  2020-01-16 22:19                     ` Darrick J. Wong
@ 2020-01-17 11:58                     ` Jan Kara
  1 sibling, 0 replies; 55+ messages in thread
From: Jan Kara @ 2020-01-17 11:58 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Darrick J. Wong, Dan Williams, Jan Kara,
	Linux Kernel Mailing List, Alexander Viro, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, linux-ext4, linux-xfs,
	linux-fsdevel

On Thu 16-01-20 10:52:36, Ira Weiny wrote:
> And Jan is this reword of the man page/commit ok to keep your reviewed by?

Yes.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute
  2020-01-16  6:05               ` Dan Williams
  2020-01-16  6:18                 ` Darrick J. Wong
@ 2020-01-18  9:11                 ` Dave Chinner
  1 sibling, 0 replies; 55+ messages in thread
From: Dave Chinner @ 2020-01-18  9:11 UTC (permalink / raw)
  To: Dan Williams
  Cc: Darrick J. Wong, Ira Weiny, Jan Kara, Linux Kernel Mailing List,
	Alexander Viro, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Wed, Jan 15, 2020 at 10:05:00PM -0800, Dan Williams wrote:
> On Wed, Jan 15, 2020 at 9:39 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> [..]
> > >         attempts to minimize software cache effects for both I/O and
> > >         memory mappings of this file.  It requires a file system which
> > >         has been configured to support DAX.
> > >
> > >         DAX generally assumes all accesses are via cpu load / store
> > >         instructions which can minimize overhead for small accesses, but
> > >         may adversely affect cpu utilization for large transfers.
> > >
> > >         File I/O is done directly to/from user-space buffers and memory
> > >         mapped I/O may be performed with direct memory mappings that
> > >         bypass kernel page cache.
> > >
> > >         While the DAX property tends to result in data being transferred
> > >         synchronously, it does not give the same guarantees of
> > >         synchronous I/O where data and the necessary metadata are
> > >         transferred together.
> >
> > (I'm frankly not sure that synchronous I/O actually guarantees that the
> > metadata has hit stable storage...)
> 
> Oh? That text was motivated by the open(2) man page description of O_SYNC.

Ugh. "synchronous I/O" means two different things, depending on
context. In the AIO context, it means "process context waits for operation
completion direct", but in the O_SYNC context, it means "we guarantee
data integrity for each I/O submitted".

Indeed, O_SYNC AIO is a thing. i.e. we can do an "async sync
write" to guarantee data integrity without directly waiting for
it. Now try describing that only using the words "synchronous
write" to describe both behaviours. :)

IOWs, if you are talking about data integrity, you need to
explicitly say "O_SYNC semantics", not "synchronous write", because
"synchronous write" is totally ambiguous without the O_SYNC context
of the open(2) man page...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, back to index

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-10 19:29 [RFC PATCH V2 00/12] Enable per-file/directory DAX operations V2 ira.weiny
2020-01-10 19:29 ` [RFC PATCH V2 01/12] fs/stat: Define DAX statx attribute ira.weiny
2020-01-15 11:37   ` Jan Kara
2020-01-15 17:38     ` Darrick J. Wong
2020-01-15 19:45       ` Ira Weiny
2020-01-15 20:10         ` Dan Williams
2020-01-15 22:38           ` Ira Weiny
2020-01-16  5:39             ` Darrick J. Wong
2020-01-16  6:05               ` Dan Williams
2020-01-16  6:18                 ` Darrick J. Wong
2020-01-16  6:25                   ` Dan Williams
2020-01-18  9:11                 ` Dave Chinner
2020-01-16 17:55               ` Ira Weiny
2020-01-16 18:04                 ` Darrick J. Wong
2020-01-16 18:52                   ` Ira Weiny
2020-01-16 22:19                     ` Darrick J. Wong
2020-01-17 11:58                     ` Jan Kara
2020-01-10 19:29 ` [RFC PATCH V2 02/12] fs/xfs: Isolate the physical DAX flag from effective ira.weiny
2020-01-10 19:29 ` [RFC PATCH V2 03/12] fs/xfs: Separate functionality of xfs_inode_supports_dax() ira.weiny
2020-01-10 19:29 ` [RFC PATCH V2 04/12] fs/xfs: Clean up DAX support check ira.weiny
2020-01-10 19:29 ` [RFC PATCH V2 05/12] fs: remove unneeded IS_DAX() check ira.weiny
2020-01-16  9:38   ` Jan Kara
2020-01-16 18:47     ` Ira Weiny
2020-01-10 19:29 ` [RFC PATCH V2 06/12] fs/xfs: Check if the inode supports DAX under lock ira.weiny
2020-01-10 19:29 ` [RFC PATCH V2 07/12] fs: Add locking for a dynamic inode 'mode' ira.weiny
2020-01-13 22:12   ` Darrick J. Wong
2020-01-14  0:20     ` Ira Weiny
2020-01-14  1:03       ` Darrick J. Wong
2020-01-15 19:08         ` Ira Weiny
2020-01-16  5:40           ` Darrick J. Wong
2020-01-16 18:54             ` Ira Weiny
2020-01-10 19:29 ` [RFC PATCH V2 08/12] fs/xfs: Add lock/unlock mode to xfs ira.weiny
2020-01-13 22:19   ` Darrick J. Wong
2020-01-14  0:35     ` Ira Weiny
2020-01-15  0:57       ` Ira Weiny
2020-01-15 23:52     ` Ira Weiny
2020-01-16  9:24   ` Jan Kara
2020-01-16 19:12     ` Ira Weiny
2020-01-10 19:29 ` [RFC PATCH V2 09/12] fs: Prevent mode change if file is mmap'ed ira.weiny
2020-01-13 22:22   ` Darrick J. Wong
2020-01-14  0:46     ` Ira Weiny
2020-01-14  1:30       ` Darrick J. Wong
2020-01-14 17:53         ` Ira Weiny
2020-01-15 11:34           ` Jan Kara
2020-01-15 18:24             ` Ira Weiny
2020-01-15 10:21   ` David Laight
2020-01-15 17:53     ` Ira Weiny
2020-01-10 19:29 ` [RFC PATCH V2 10/12] fs/xfs: Fix truncate up ira.weiny
2020-01-13 22:27   ` Darrick J. Wong
2020-01-14  0:40     ` Ira Weiny
2020-01-14  1:14       ` Darrick J. Wong
2020-01-14 19:00         ` Ira Weiny
2020-01-14 19:39           ` Ira Weiny
2020-01-10 19:29 ` [RFC PATCH V2 11/12] fs/xfs: Clean up locking in dax invalidate ira.weiny
2020-01-10 19:29 ` [RFC PATCH V2 12/12] fs/xfs: Allow toggle of effective DAX flag ira.weiny

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git