Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v3 00/12] Enable per-file/directory DAX operations V3
@ 2020-02-08 19:34 ira.weiny
  2020-02-08 19:34 ` [PATCH v3 01/12] fs/stat: Define DAX statx attribute ira.weiny
                   ` (12 more replies)
  0 siblings, 13 replies; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Changes from V2:

	* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
		Internal discussions with Dan determined this would be easier,
		just as performant, and slightly less overhead that having it
		in the SB as suggested by Jan
	* Fix locking order in comments and throughout code
	* Change "mode" to "state" throughout commits
	* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
		configured
	* Add static branch for which is activated by a device which supports
		DAX in XFS
	* Change "lock/unlock" to up/down read/write as appropriate
		Previous names were over simplified
	* Update comments/documentation

	* Remove the xfs specific lock to the vfs (global) layer.
	* Fix i_dax_sem locking order and comments

	* Move 'i_mapped' count from struct inode to struct address_space and
		rename it to mmap_count
	* Add inode_has_mappings() call

	* Fix build issues
	* Clean up syntax spacing and minor issues
	* Update man page text for STATX_ATTR_DAX
	* Add reviewed-by's
	* Rebase to latest linux-next

	Rename patch:
		from: fs/xfs: Add lock/unlock state to xfs
		to: fs/xfs: Add write DAX lock to xfs layer
	Add patch:
		fs/xfs: Clarify lockdep dependency for xfs_isilocked()
	Drop patch:
		fs/xfs: Fix truncate up

	https://github.com/weiny2/linux-kernel/tree/dax-file-state-change-v3

At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption due to their inability to detect whether the kernel will
instantiate page cache for a file, and cases where a global dax enable via a
mount option is too coarse.

The following patch series enables selecting the use of DAX on individual files
and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
scheme the dax mount option can be omitted to allow the per-file property to
take effect.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
capability switch from an "effective" attribute for the file.

At LSF/MM we discussed the difficulties of switching the DAX state of a file
with active mappings / page cache.  It was thought the races could be avoided
by limiting DAX state flips to 0-length files.

However, this turns out to not be true.[3] This is because address space
operations (a_ops) may be in use at any time the inode is referenced and users
have expressed a desire to be able to change the DAX state on a file with data
in it.  For those reasons this patch set allows changing the DAX state flag on
a file as long as it is not current mapped.

Furthermore, DAX is a property of the inode and as such, many operations other
than address space operations need to be protected during a DAX state change.

Therefore callbacks are placed within the inode operations and used to lock the
inode as appropriate.

As in V1, Users are able to query the effective and physical flags separately
at any time.  Specifically the addition of the statx attribute bit allows them
to ensure the file is operating in the DAX state they intend.  This 'effective
flag' and physical flags could differ when the filesystem is mounted with the
dax flag for example.

It should be noted that the physical DAX flag inheritance is not shown in this
patch set as it was maintained from previous work on XFS.  The physical DAX
flag and it's inheritance will need to be added to other file systems for user
control. 


[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/
[3] https://lkml.org/lkml/2019/10/20/96
[4] https://patchwork.kernel.org/patch/11310511/


To: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org


Ira Weiny (12):
  fs/stat: Define DAX statx attribute
  fs/xfs: Isolate the physical DAX flag from effective
  fs/xfs: Separate functionality of xfs_inode_supports_dax()
  fs/xfs: Clean up DAX support check
  fs: remove unneeded IS_DAX() check
  fs/xfs: Check if the inode supports DAX under lock
  fs: Add locking for a dynamic DAX state
  fs/xfs: Clarify lockdep dependency for xfs_isilocked()
  fs/xfs: Add write DAX lock to xfs layer
  fs: Prevent DAX state change if file is mmap'ed
  fs/xfs: Clean up locking in dax invalidate
  fs/xfs: Allow toggle of effective DAX flag

 Documentation/filesystems/vfs.rst | 17 ++++++
 fs/attr.c                         |  1 +
 fs/dax.c                          |  3 ++
 fs/inode.c                        | 15 ++++--
 fs/iomap/buffered-io.c            |  1 +
 fs/open.c                         |  4 ++
 fs/stat.c                         |  5 ++
 fs/super.c                        |  3 ++
 fs/xfs/xfs_inode.c                | 24 +++++++--
 fs/xfs/xfs_inode.h                |  8 ++-
 fs/xfs/xfs_ioctl.c                | 56 ++++++++++++--------
 fs/xfs/xfs_iops.c                 | 51 ++++++++++++++----
 fs/xfs/xfs_iops.h                 |  2 +
 fs/xfs/xfs_super.c                | 16 +++---
 include/linux/fs.h                | 86 +++++++++++++++++++++++++++++--
 include/uapi/linux/stat.h         |  1 +
 mm/fadvise.c                      | 10 +++-
 mm/filemap.c                      |  4 ++
 mm/huge_memory.c                  |  1 +
 mm/khugepaged.c                   |  2 +
 mm/madvise.c                      |  3 ++
 mm/mmap.c                         | 19 ++++++-
 mm/util.c                         |  9 +++-
 23 files changed, 287 insertions(+), 54 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 01/12] fs/stat: Define DAX statx attribute
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
@ 2020-02-08 19:34 ` ira.weiny
  2020-02-08 19:34 ` [PATCH v3 02/12] fs/xfs: Isolate the physical DAX flag from effective ira.weiny
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Jan Kara, Darrick J . Wong, Alexander Viro,
	Dan Williams, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

In order for users to determine if a file is currently operating in DAX
state (effective DAX).  Define a statx attribute value and set that
attribute if the effective DAX flag is set.

To go along with this we propose the following addition to the statx man
page:

STATX_ATTR_DAX

	The file is in the DAX (cpu direct access) state.  DAX state
	attempts to minimize software cache effects for both I/O and
	memory mappings of this file.  It requires a file system which
	has been configured to support DAX.

	DAX generally assumes all accesses are via cpu load / store
	instructions which can minimize overhead for small accesses, but
	may adversely affect cpu utilization for large transfers.

	File I/O is done directly to/from user-space buffers and memory
	mapped I/O may be performed with direct memory mappings that
	bypass kernel page cache.

	While the DAX property tends to result in data being transferred
	synchronously, it does not give the same guarantees of O_SYNC
	where data and the necessary metadata are transferred together.

	A DAX file may support being mapped with the MAP_SYNC flag,
	which enables a program to use CPU cache flush instructions to
	persist CPU store operations without an explicit fsync(2).  See
	mmap(2) for more information.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V2:
	Update man page text with comments from Darrick, Jan, Dan, and
	Dave.

 fs/stat.c                 | 3 +++
 include/uapi/linux/stat.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/fs/stat.c b/fs/stat.c
index 030008796479..894699c74dde 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	if (IS_DAX(inode))
+		stat->attributes |= STATX_ATTR_DAX;
+
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
 					    query_flags);
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index ad80a5c885d5..e5f9d5517f6b 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -169,6 +169,7 @@ struct statx {
 #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
 #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
+#define STATX_ATTR_DAX			0x00002000 /* [I] File is DAX */
 
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 02/12] fs/xfs: Isolate the physical DAX flag from effective
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
  2020-02-08 19:34 ` [PATCH v3 01/12] fs/stat: Define DAX statx attribute ira.weiny
@ 2020-02-08 19:34 ` ira.weiny
  2020-02-08 19:34 ` [PATCH v3 03/12] fs/xfs: Separate functionality of xfs_inode_supports_dax() ira.weiny
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_ioctl_setattr_dax_invalidate() currently checks if the DAX flag is
changing as a quick check.

But the implementation mixes the physical (XFS_DIFLAG2_DAX) and
effective (S_DAX) DAX flags.

Remove the use of the effective flag when determining if a change of the
physical flag is required.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d42de92cb283..1a57be696810 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1209,9 +1209,11 @@ xfs_ioctl_setattr_dax_invalidate(
 	}
 
 	/* If the DAX state is not changing, we have nothing to do here. */
-	if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
+	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		return 0;
-	if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
+	if (!(fa->fsx_xflags & FS_XFLAG_DAX) &&
+	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		return 0;
 
 	if (S_ISDIR(inode->i_mode))
-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 03/12] fs/xfs: Separate functionality of xfs_inode_supports_dax()
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
  2020-02-08 19:34 ` [PATCH v3 01/12] fs/stat: Define DAX statx attribute ira.weiny
  2020-02-08 19:34 ` [PATCH v3 02/12] fs/xfs: Isolate the physical DAX flag from effective ira.weiny
@ 2020-02-08 19:34 ` ira.weiny
  2020-02-11  5:47   ` Dave Chinner
  2020-02-08 19:34 ` [PATCH v3 04/12] fs/xfs: Clean up DAX support check ira.weiny
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_inode_supports_dax() should reflect if the inode can support DAX not
that it is enabled for DAX.  Leave that to other helper functions.

Change the caller of xfs_inode_supports_dax() to call
xfs_inode_use_dax() which reflects new logic to override the effective
DAX flag with either the mount option or the physical DAX flag.

To make the logic clear create 2 helper functions for the mount and
physical flag.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_iops.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 81f2f93caec0..a7db50d923d4 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1236,6 +1236,15 @@ static const struct inode_operations xfs_inline_symlink_inode_operations = {
 	.update_time		= xfs_vn_update_time,
 };
 
+static bool
+xfs_inode_mount_is_dax(
+	struct xfs_inode *ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+
+	return (mp->m_flags & XFS_MOUNT_DAX) == XFS_MOUNT_DAX;
+}
+
 /* Figure out if this file actually supports DAX. */
 static bool
 xfs_inode_supports_dax(
@@ -1247,11 +1256,6 @@ xfs_inode_supports_dax(
 	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
 		return false;
 
-	/* DAX mount option or DAX iflag must be set. */
-	if (!(mp->m_flags & XFS_MOUNT_DAX) &&
-	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
-		return false;
-
 	/* Block size must match page size */
 	if (mp->m_sb.sb_blocksize != PAGE_SIZE)
 		return false;
@@ -1260,6 +1264,22 @@ xfs_inode_supports_dax(
 	return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
 }
 
+static bool
+xfs_inode_is_dax(
+	struct xfs_inode *ip)
+{
+	return (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) == XFS_DIFLAG2_DAX;
+}
+
+static bool
+xfs_inode_use_dax(
+	struct xfs_inode *ip)
+{
+	return xfs_inode_supports_dax(ip) &&
+		(xfs_inode_mount_is_dax(ip) ||
+		 xfs_inode_is_dax(ip));
+}
+
 STATIC void
 xfs_diflags_to_iflags(
 	struct inode		*inode,
@@ -1278,7 +1298,7 @@ xfs_diflags_to_iflags(
 		inode->i_flags |= S_SYNC;
 	if (flags & XFS_DIFLAG_NOATIME)
 		inode->i_flags |= S_NOATIME;
-	if (xfs_inode_supports_dax(ip))
+	if (xfs_inode_use_dax(ip))
 		inode->i_flags |= S_DAX;
 }
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 04/12] fs/xfs: Clean up DAX support check
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
                   ` (2 preceding siblings ...)
  2020-02-08 19:34 ` [PATCH v3 03/12] fs/xfs: Separate functionality of xfs_inode_supports_dax() ira.weiny
@ 2020-02-08 19:34 ` ira.weiny
  2020-02-11  5:57   ` Dave Chinner
  2020-02-08 19:34 ` [PATCH v3 05/12] fs: remove unneeded IS_DAX() check ira.weiny
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Rather than open coding xfs_inode_supports_dax() in
xfs_ioctl_setattr_dax_invalidate() export xfs_inode_supports_dax() and
call it in preparation for swapping dax flags.

This also means updating xfs_inode_supports_dax() to return true for a
directory.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 16 +++-------------
 fs/xfs/xfs_iops.c  |  8 ++++++--
 fs/xfs/xfs_iops.h  |  2 ++
 3 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 1a57be696810..da1eb2bdb386 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1190,23 +1190,13 @@ xfs_ioctl_setattr_dax_invalidate(
 	int			*join_flags)
 {
 	struct inode		*inode = VFS_I(ip);
-	struct super_block	*sb = inode->i_sb;
 	int			error;
 
 	*join_flags = 0;
 
-	/*
-	 * It is only valid to set the DAX flag on regular files and
-	 * directories on filesystems where the block size is equal to the page
-	 * size. On directories it serves as an inherited hint so we don't
-	 * have to check the device for dax support or flush pagecache.
-	 */
-	if (fa->fsx_xflags & FS_XFLAG_DAX) {
-		struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
-
-		if (!bdev_dax_supported(target->bt_bdev, sb->s_blocksize))
-			return -EINVAL;
-	}
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
+	    !xfs_inode_supports_dax(ip))
+		return -EINVAL;
 
 	/* If the DAX state is not changing, we have nothing to do here. */
 	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a7db50d923d4..eebec159d873 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1246,14 +1246,18 @@ xfs_inode_mount_is_dax(
 }
 
 /* Figure out if this file actually supports DAX. */
-static bool
+bool
 xfs_inode_supports_dax(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 
 	/* Only supported on non-reflinked files. */
-	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
+	if (xfs_is_reflink_inode(ip))
+		return false;
+
+	/* Only supported on regular files and directories. */
+	if (!(S_ISREG(VFS_I(ip)->i_mode) || S_ISDIR(VFS_I(ip)->i_mode)))
 		return false;
 
 	/* Block size must match page size */
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 4d24ff309f59..f24fec8de1d6 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -24,4 +24,6 @@ extern int xfs_setattr_nonsize(struct xfs_inode *ip, struct iattr *vap,
 extern int xfs_vn_setattr_nonsize(struct dentry *dentry, struct iattr *vap);
 extern int xfs_vn_setattr_size(struct dentry *dentry, struct iattr *vap);
 
+extern bool xfs_inode_supports_dax(struct xfs_inode *ip);
+
 #endif /* __XFS_IOPS_H__ */
-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 05/12] fs: remove unneeded IS_DAX() check
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
                   ` (3 preceding siblings ...)
  2020-02-08 19:34 ` [PATCH v3 04/12] fs/xfs: Clean up DAX support check ira.weiny
@ 2020-02-08 19:34 ` ira.weiny
  2020-02-11  5:34   ` Dave Chinner
  2020-02-08 19:34 ` [PATCH v3 06/12] fs/xfs: Check if the inode supports DAX under lock ira.weiny
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Jan Kara, Alexander Viro, Darrick J. Wong,
	Dan Williams, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

The IS_DAX() check in io_is_direct() causes a race between changing the
DAX state and creating the iocb flags.

Remove the check because DAX now emulates the page cache API and
therefore it does not matter if the file state is DAX or not when the
iocb flags are created.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 include/linux/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..63d1e533a07d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3388,7 +3388,7 @@ extern int file_update_time(struct file *file);
 
 static inline bool io_is_direct(struct file *filp)
 {
-	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
+	return (filp->f_flags & O_DIRECT);
 }
 
 static inline bool vma_is_dax(struct vm_area_struct *vma)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 06/12] fs/xfs: Check if the inode supports DAX under lock
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
                   ` (4 preceding siblings ...)
  2020-02-08 19:34 ` [PATCH v3 05/12] fs: remove unneeded IS_DAX() check ira.weiny
@ 2020-02-08 19:34 ` ira.weiny
  2020-02-11  6:16   ` Dave Chinner
  2020-02-08 19:34 ` [PATCH v3 07/12] fs: Add locking for a dynamic DAX state ira.weiny
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

One of the checks for an inode supporting DAX is if the inode is
reflinked.  During a non-DAX to DAX state change we could race with
the file being reflinked and end up with a reflinked file being in DAX
state.

Prevent this race by checking for DAX support under the MMAP_LOCK.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index da1eb2bdb386..4ff402fd6636 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1194,10 +1194,6 @@ xfs_ioctl_setattr_dax_invalidate(
 
 	*join_flags = 0;
 
-	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
-	    !xfs_inode_supports_dax(ip))
-		return -EINVAL;
-
 	/* If the DAX state is not changing, we have nothing to do here. */
 	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
 	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
@@ -1211,6 +1207,13 @@ xfs_ioctl_setattr_dax_invalidate(
 
 	/* lock, flush and invalidate mapping in preparation for flag change */
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
+	    !xfs_inode_supports_dax(ip)) {
+		error = -EINVAL;
+		goto out_unlock;
+	}
+
 	error = filemap_write_and_wait(inode->i_mapping);
 	if (error)
 		goto out_unlock;
-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 07/12] fs: Add locking for a dynamic DAX state
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
                   ` (5 preceding siblings ...)
  2020-02-08 19:34 ` [PATCH v3 06/12] fs/xfs: Check if the inode supports DAX under lock ira.weiny
@ 2020-02-08 19:34 ` ira.weiny
  2020-02-11  8:00   ` Dave Chinner
  2020-02-08 19:34 ` [PATCH v3 08/12] fs/xfs: Clarify lockdep dependency for xfs_isilocked() ira.weiny
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

DAX requires special address space operations but many other functions
check the IS_DAX() state.

While DAX is a property of the inode we perfer a lock at the super block
level because of the overhead of a rwsem within the inode.

Define a vfs per superblock percpu rs semaphore to lock the DAX state
while performing various VFS layer operations.  Write lock calls are
provided here but are used in subsequent patches by the file systems
themselves.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V2

	Rebase on linux-next-08-02-2020

	Fix locking order
	Change all references from mode to state where appropriate
	add CONFIG_FS_DAX requirement for state change
	Use a static branch to enable locking only when a dax capable
		device has been seen.

	Move the lock to a global vfs lock

		this does a few things
			1) preps us better for ext4 support
			2) removes funky callbacks from inode ops
			3) remove complexity from XFS and probably from
			   ext4 later

		We can do this because
			1) the locking order is required to be at the
			   highest level anyway, so why complicate xfs
			2) We had to move the sem to the super_block
			   because it is too heavy for the inode.
			3) After internal discussions with Dan we
			   decided that this would be easier, just as
			   performant, and with slightly less overhead
			   than in the VFS SB.

		We also change the functions names to up/down;
		read/write as appropriate.  Previous names were over
		simplified.

	Update comments and documentation

 Documentation/filesystems/vfs.rst | 17 +++++++
 fs/attr.c                         |  1 +
 fs/dax.c                          |  3 ++
 fs/inode.c                        | 14 ++++--
 fs/iomap/buffered-io.c            |  1 +
 fs/open.c                         |  4 ++
 fs/stat.c                         |  2 +
 fs/super.c                        |  3 ++
 include/linux/fs.h                | 78 ++++++++++++++++++++++++++++++-
 mm/fadvise.c                      | 10 +++-
 mm/filemap.c                      |  4 ++
 mm/huge_memory.c                  |  1 +
 mm/khugepaged.c                   |  2 +
 mm/madvise.c                      |  3 ++
 mm/util.c                         |  9 +++-
 15 files changed, 144 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..cd011ceb4b72 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -934,6 +934,23 @@ cache in your filesystem.  The following members are defined:
 	Called during swapoff on files where swap_activate was
 	successful.
 
+Changing DAX 'state' dynamically
+----------------------------------
+
+Some file systems which support DAX want to be able to change the DAX state
+dyanically.  To switch the state safely we lock the inode state in all "normal"
+file system operations and restrict state changes to those operations.  The
+specific rules are.
+
+        1) the direct_IO address_space_operation must be supported in all
+           potential a_ops vectors for any state suported by the inode.
+        2) FS's should enable the static branch lock_dax_state_static_key when a DAX
+           capable device is detected.
+        3) DAX state changes shall not be allowed while the file is mmap'ed
+        4) For non-mmaped operations the VFS layer must take the read lock for any
+           use of IS_DAX()
+        5) Filesystems take the write lock when changing DAX states.
+
 
 The File Object
 ===============
diff --git a/fs/attr.c b/fs/attr.c
index b4bbdbd4c8ca..9b15f73d1079 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -332,6 +332,7 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	if (error)
 		return error;
 
+	/* DAX read state should already be held here */
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
diff --git a/fs/dax.c b/fs/dax.c
index 1f1f0201cad1..96136866f151 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -30,6 +30,9 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/fs_dax.h>
 
+DEFINE_STATIC_KEY_FALSE(lock_dax_state_static_key);
+EXPORT_SYMBOL(lock_dax_state_static_key);
+
 static inline unsigned int pe_order(enum page_entry_size pe_size)
 {
 	if (pe_size == PE_SIZE_PTE)
diff --git a/fs/inode.c b/fs/inode.c
index 7d57068b6b7a..7d0227f9e3e8 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1616,11 +1616,19 @@ EXPORT_SYMBOL(iput);
  */
 int bmap(struct inode *inode, sector_t *block)
 {
-	if (!inode->i_mapping->a_ops->bmap)
-		return -EINVAL;
+	int ret = 0;
+
+	inode_dax_state_down_read(inode);
+	if (!inode->i_mapping->a_ops->bmap) {
+		ret = -EINVAL;
+		goto err;
+	}
 
 	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
-	return 0;
+
+err:
+	inode_dax_state_up_read(inode);
+	return ret;
 }
 EXPORT_SYMBOL(bmap);
 #endif
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7c84c4c027c4..e313a34d5fa6 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -999,6 +999,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		offset = offset_in_page(pos);
 		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
 
+		/* DAX state read should already be held here */
 		if (IS_DAX(inode))
 			status = iomap_dax_zero(pos, offset, bytes, iomap);
 		else
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..148980e30611 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 	if (ret)
 		newattrs.ia_valid |= ret | ATTR_FORCE;
 
+	inode_dax_state_down_read(dentry->d_inode);
 	inode_lock(dentry->d_inode);
 	/* Note any delegations or leases have already been broken: */
 	ret = notify_change(dentry, &newattrs, NULL);
 	inode_unlock(dentry->d_inode);
+	inode_dax_state_up_read(dentry->d_inode);
 	return ret;
 }
 
@@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EOPNOTSUPP;
 
 	file_start_write(file);
+	inode_dax_state_down_read(inode);
 	ret = file->f_op->fallocate(file, mode, offset, len);
+	inode_dax_state_up_read(inode);
 
 	/*
 	 * Create inotify and fanotify events.
diff --git a/fs/stat.c b/fs/stat.c
index 894699c74dde..bf8841314c08 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -79,8 +79,10 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	inode_dax_state_down_read(inode);
 	if (IS_DAX(inode))
 		stat->attributes |= STATX_ATTR_DAX;
+	inode_dax_state_up_read(inode);
 
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
diff --git a/fs/super.c b/fs/super.c
index cd352530eca9..3e26e3a1d860 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -51,6 +51,9 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
 	"sb_internal",
 };
 
+DEFINE_PERCPU_RWSEM(sb_dax_rwsem);
+EXPORT_SYMBOL(sb_dax_rwsem);
+
 /*
  * One thing we have to be careful of with a per-sb shrinker is that we don't
  * drop the last active reference to the superblock from within the shrinker.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63d1e533a07d..1a22cd94c4ab 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -40,6 +40,7 @@
 #include <linux/fs_types.h>
 #include <linux/build_bug.h>
 #include <linux/stddef.h>
+#include <linux/percpu-rwsem.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -359,6 +360,11 @@ typedef struct {
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
 		unsigned long, unsigned long);
 
+/**
+ * NOTE: DO NOT define new functions in address_space_operations without first
+ * considering how dynamic DAX states are to be supported.  See the
+ * inode_dax_state_*_read() functions
+ */
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
@@ -1817,6 +1823,11 @@ struct block_device_operations;
 
 struct iov_iter;
 
+/**
+ * NOTE: DO NOT define new functions in file_operations without first
+ * considering how dynamic DAX states are to be supported.  See the
+ * inode_dax_state_*_read() functions
+ */
 struct file_operations {
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
@@ -1889,16 +1900,79 @@ struct inode_operations {
 	int (*set_acl)(struct inode *, struct posix_acl *, int);
 } ____cacheline_aligned;
 
+#if defined(CONFIG_FS_DAX)
+/*
+ * Filesystems wishing to support dynamic DAX states must do the following.
+ *
+ * 1) the direct_IO address_space_operation must be supported in all
+ *    potential a_ops vectors for any state suported by the inode.
+ * 2) FS's should enable the static branch lock_dax_state_static_key when a DAX
+ *    capable device is detected.
+ * 3) DAX state changes shall not be allowed while the file is mmap'ed
+ * 4) For non-mmaped operations the VFS layer must take the read lock for any
+ *    use of IS_DAX()
+ * 5) Filesystems take the write lock when changing DAX states.
+ */
+DECLARE_STATIC_KEY_FALSE(lock_dax_state_static_key);
+extern struct percpu_rw_semaphore sb_dax_rwsem;
+static inline void inode_dax_state_down_read(struct inode *inode)
+{
+	if (!static_branch_unlikely(&lock_dax_state_static_key))
+		return;
+	percpu_down_read(&sb_dax_rwsem);
+}
+static inline void inode_dax_state_up_read(struct inode *inode)
+{
+	if (!static_branch_unlikely(&lock_dax_state_static_key))
+		return;
+	percpu_up_read(&sb_dax_rwsem);
+}
+static inline void inode_dax_state_down_write(struct inode *inode)
+{
+	if (!static_branch_unlikely(&lock_dax_state_static_key))
+		return;
+	percpu_down_write(&sb_dax_rwsem);
+}
+static inline void inode_dax_state_up_write(struct inode *inode)
+{
+	if (!static_branch_unlikely(&lock_dax_state_static_key))
+		return;
+	percpu_up_write(&sb_dax_rwsem);
+}
+static inline void enable_dax_state_static_branch(void)
+{
+	static_branch_enable(&lock_dax_state_static_key);
+}
+#else /* !CONFIG_FS_DAX */
+#define inode_dax_state_down_read(inode) do { (void)(inode); } while (0)
+#define inode_dax_state_up_read(inode) do { (void)(inode); } while (0)
+#define inode_dax_state_down_write(inode) do { (void)(inode); } while (0)
+#define inode_dax_state_up_write(inode) do { (void)(inode); } while (0)
+#define enable_dax_state_static_branch()
+#endif /* CONFIG_FS_DAX */
+
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 				     struct iov_iter *iter)
 {
-	return file->f_op->read_iter(kio, iter);
+	struct inode		*inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	inode_dax_state_down_read(inode);
+	ret = file->f_op->read_iter(kio, iter);
+	inode_dax_state_up_read(inode);
+	return ret;
 }
 
 static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
 				      struct iov_iter *iter)
 {
-	return file->f_op->write_iter(kio, iter);
+	struct inode		*inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	inode_dax_state_down_read(inode);
+	ret = file->f_op->write_iter(kio, iter);
+	inode_dax_state_up_read(inode);
+	return ret;
 }
 
 static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 4f17c83db575..ac85eb778c74 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -47,7 +47,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 
 	bdi = inode_to_bdi(mapping->host);
 
+	inode_dax_state_down_read(inode);
 	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
+		int ret = 0;
+
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
@@ -58,10 +61,13 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 			/* no bad return value, but ignore advice */
 			break;
 		default:
-			return -EINVAL;
+			ret = -EINVAL;
 		}
-		return 0;
+
+		inode_dax_state_up_read(inode);
+		return ret;
 	}
+	inode_dax_state_up_read(inode);
 
 	/*
 	 * Careful about overflows. Len == 0 means "as much as possible".  Use
diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..3a7863ba51b9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		 * and return.  Otherwise fallthrough to buffered io for
 		 * the rest of the read.  Buffered reads will not work for
 		 * DAX files, so don't bother trying.
+		 *
+		 * IS_DAX is protected under ->read_iter lock
 		 */
 		if (retval < 0 || !count || iocb->ki_pos >= size ||
 		    IS_DAX(inode))
@@ -3377,6 +3379,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		 * holes, for example.  For DAX files, a buffered write will
 		 * not succeed (even if it did, DAX does not handle dirty
 		 * page-cache pages correctly).
+		 *
+		 * IS_DAX is protected under ->write_iter lock
 		 */
 		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
 			goto out;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b08b199f9a11..3d05bd10d83e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -572,6 +572,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 	unsigned long ret;
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
+	/* Should not need locking here because mmap is not allowed */
 	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
 		goto out;
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908743cb..3bec46277886 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
 		} else {	/* !is_shmem */
 			if (!page || xa_is_value(page)) {
 				xas_unlock_irq(&xas);
+				inode_dax_state_down_read(file->f_inode);
 				page_cache_sync_readahead(mapping, &file->f_ra,
 							  file, index,
 							  PAGE_SIZE);
+				inode_dax_state_up_read(file->f_inode);
 				/* drain pagevecs to help isolate_lru_page() */
 				lru_add_drain();
 				page = find_lock_page(mapping, index);
diff --git a/mm/madvise.c b/mm/madvise.c
index 43b47d3fae02..419b7c26216b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -275,10 +275,13 @@ static long madvise_willneed(struct vm_area_struct *vma,
 		return -EBADF;
 #endif
 
+	inode_dax_state_down_read(file_inode(file));
 	if (IS_DAX(file_inode(file))) {
+		inode_dax_state_up_read(file_inode(file));
 		/* no bad return value, but ignore advice */
 		return 0;
 	}
+	inode_dax_state_up_read(file_inode(file));
 
 	/*
 	 * Filesystem's fadvise may need to take various locks.  We need to
diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..8dfb9958f2a6 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 
 	ret = security_mmap_file(file, prot, flag);
 	if (!ret) {
-		if (down_write_killable(&mm->mmap_sem))
+		if (file)
+			inode_dax_state_down_read(file_inode(file));
+		if (down_write_killable(&mm->mmap_sem)) {
+			if (file)
+				inode_dax_state_up_read(file_inode(file));
 			return -EINTR;
+		}
 		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
 				    &populate, &uf);
 		up_write(&mm->mmap_sem);
+		if (file)
+			inode_dax_state_up_read(file_inode(file));
 		userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
 			mm_populate(ret, populate);
-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 08/12] fs/xfs: Clarify lockdep dependency for xfs_isilocked()
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
                   ` (6 preceding siblings ...)
  2020-02-08 19:34 ` [PATCH v3 07/12] fs: Add locking for a dynamic DAX state ira.weiny
@ 2020-02-08 19:34 ` ira.weiny
  2020-02-08 19:34 ` [PATCH v3 09/12] fs/xfs: Add write DAX lock to xfs layer ira.weiny
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_isilocked() can't work fully without CONFIG_LOCKDEP.  However,
making xfs_isilocked() dependant on CONFIG_LOCKDEP is not feasible
because it is used for more than the i_rwsem.  Therefore a short-circuit
was provided via debug_locks.  However, this caused confusion while
working through the xfs locking.

Rather than use debug_locks as a flag specify this clearly using
IS_ENABLED(CONFIG_LOCKDEP).

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V2:
	This patch is new for V3

 fs/xfs/xfs_inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c5077e6326c7..35df324875db 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -364,7 +364,7 @@ xfs_isilocked(
 
 	if (lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) {
 		if (!(lock_flags & XFS_IOLOCK_SHARED))
-			return !debug_locks ||
+			return !IS_ENABLED(CONFIG_LOCKDEP) ||
 				lockdep_is_held_type(&VFS_I(ip)->i_rwsem, 0);
 		return rwsem_is_locked(&VFS_I(ip)->i_rwsem);
 	}
-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 09/12] fs/xfs: Add write DAX lock to xfs layer
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
                   ` (7 preceding siblings ...)
  2020-02-08 19:34 ` [PATCH v3 08/12] fs/xfs: Clarify lockdep dependency for xfs_isilocked() ira.weiny
@ 2020-02-08 19:34 ` ira.weiny
  2020-02-08 19:34 ` [PATCH v3 10/12] fs: Prevent DAX state change if file is mmap'ed ira.weiny
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

XFS requires regular files to be locked for write while changing to/from
DAX state.

Take the DAX write lock while changing DAX state.

We define a new XFS_DAX_EXCL lock type to carry the lock through to
transaction completion.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V2:
	Change name of patch (WAS: fs/xfs: Add lock/unlock state to xfs)
	Remove the xfs specific lock and move to the vfs layer.
		We still use XFS_LOCK_DAX_EXCL to be able to pass this
		flag through to the transaction code.  But we no longer
		have a lock specific to xfs.  This removes a lot of code
		from the XFS layer, preps us for using this in ext4, and
		is actually more straight forward now that all the
		locking requirements are better known.

	Fix locking order comment
	Rework for new 'state' names
	(Other comments on the previous patch are not applicable with
	new patch as much of the code was removed in favor of the vfs
	level lock)

 fs/xfs/xfs_inode.c | 22 ++++++++++++++++++++--
 fs/xfs/xfs_inode.h |  7 +++++--
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 35df324875db..0c7b1855e0c8 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
  *
  * Basic locking order:
  *
- * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
+ * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
  *
  * mmap_sem locking order:
  *
  * i_rwsem -> page lock -> mmap_sem
- * mmap_sem -> i_mmap_lock -> page_lock
+ * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
  *
  * The difference in mmap_sem locking order mean that we cannot hold the
  * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
@@ -182,6 +182,9 @@ xfs_ilock(
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
 
+	if (lock_flags & XFS_DAX_EXCL)
+		inode_dax_state_down_write(VFS_I(ip));
+
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		down_write_nested(&VFS_I(ip)->i_rwsem,
 				  XFS_IOLOCK_DEP(lock_flags));
@@ -224,6 +227,8 @@ xfs_ilock_nowait(
 	 * You can't set both SHARED and EXCL for the same lock,
 	 * and only XFS_IOLOCK_SHARED, XFS_IOLOCK_EXCL, XFS_ILOCK_SHARED,
 	 * and XFS_ILOCK_EXCL are valid values to set in lock_flags.
+	 *
+	 * XFS_DAX_* is not allowed
 	 */
 	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
 	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
@@ -232,6 +237,7 @@ xfs_ilock_nowait(
 	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
+	ASSERT((lock_flags & XFS_DAX_EXCL) == 0);
 
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		if (!down_write_trylock(&VFS_I(ip)->i_rwsem))
@@ -318,6 +324,9 @@ xfs_iunlock(
 	else if (lock_flags & XFS_ILOCK_SHARED)
 		mrunlock_shared(&ip->i_lock);
 
+	if (lock_flags & XFS_DAX_EXCL)
+		inode_dax_state_up_write(VFS_I(ip));
+
 	trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
 }
 
@@ -333,6 +342,8 @@ xfs_ilock_demote(
 	ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL));
 	ASSERT((lock_flags &
 		~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
+	/* XFS_DAX_* is not allowed */
+	ASSERT((lock_flags & XFS_DAX_EXCL) == 0);
 
 	if (lock_flags & XFS_ILOCK_EXCL)
 		mrdemote(&ip->i_lock);
@@ -465,6 +476,9 @@ xfs_lock_inodes(
 	ASSERT(!(lock_mode & XFS_ILOCK_EXCL) ||
 		inodes <= XFS_ILOCK_MAX_SUBCLASS + 1);
 
+	/* XFS_DAX_* is not allowed */
+	ASSERT((lock_mode & XFS_DAX_EXCL) == 0);
+
 	if (lock_mode & XFS_IOLOCK_EXCL) {
 		ASSERT(!(lock_mode & (XFS_MMAPLOCK_EXCL | XFS_ILOCK_EXCL)));
 	} else if (lock_mode & XFS_MMAPLOCK_EXCL)
@@ -566,6 +580,10 @@ xfs_lock_two_inodes(
 	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
 	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
 
+	/* XFS_DAX_* is not allowed */
+	ASSERT((ip0_mode & XFS_DAX_EXCL) == 0);
+	ASSERT((ip1_mode & XFS_DAX_EXCL) == 0);
+
 	ASSERT(ip0->i_ino != ip1->i_ino);
 
 	if (ip0->i_ino > ip1->i_ino) {
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 492e53992fa9..25fe20740bf7 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -278,10 +278,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 #define	XFS_ILOCK_SHARED	(1<<3)
 #define	XFS_MMAPLOCK_EXCL	(1<<4)
 #define	XFS_MMAPLOCK_SHARED	(1<<5)
+#define	XFS_DAX_EXCL		(1<<6)
 
 #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
 				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
-				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
+				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
+				| XFS_DAX_EXCL)
 
 #define XFS_LOCK_FLAGS \
 	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
@@ -289,7 +291,8 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
 	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
 	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
-	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
+	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
+	{ XFS_DAX_EXCL,		"DAX_EXCL" }
 
 
 /*
-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 10/12] fs: Prevent DAX state change if file is mmap'ed
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
                   ` (8 preceding siblings ...)
  2020-02-08 19:34 ` [PATCH v3 09/12] fs/xfs: Add write DAX lock to xfs layer ira.weiny
@ 2020-02-08 19:34 ` ira.weiny
  2020-02-08 19:34 ` [PATCH v3 11/12] fs/xfs: Clean up locking in dax invalidate ira.weiny
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Page faults need to ensure the inode DAX configuration is correct and
consistent with the vmf information at the time of the fault.  There is
no easy way to ensure the vmf information is correct if a DAX change is
in progress.  Furthermore, there is no good use case to require changing
DAX configs while the file is mmap'ed.

Track mmap's of the file and fail the DAX change if the file is mmap'ed.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V2:

	move 'i_mapped' to struct address_space and rename mmap_count
	Add inode_has_mappings() helper for FS's
	Change reference to "mode" to "state"

 fs/inode.c         |  1 +
 fs/xfs/xfs_ioctl.c |  8 ++++++++
 include/linux/fs.h |  6 ++++++
 mm/mmap.c          | 19 +++++++++++++++++--
 4 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 7d0227f9e3e8..bca5c9093542 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -371,6 +371,7 @@ static void __address_space_init_once(struct address_space *mapping)
 	INIT_LIST_HEAD(&mapping->private_list);
 	spin_lock_init(&mapping->private_lock);
 	mapping->i_mmap = RB_ROOT_CACHED;
+	atomic64_set(&mapping->mmap_count, 0);
 }
 
 void address_space_init_once(struct address_space *mapping)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 4ff402fd6636..faba232b1f31 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1214,6 +1214,14 @@ xfs_ioctl_setattr_dax_invalidate(
 		goto out_unlock;
 	}
 
+	/*
+	 * If there is a mapping in place we must remain in our current state.
+	 */
+	if (inode_has_mappings(inode)) {
+		error = -EBUSY;
+		goto out_unlock;
+	}
+
 	error = filemap_write_and_wait(inode->i_mapping);
 	if (error)
 		goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1a22cd94c4ab..3e0121626d94 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -459,6 +459,7 @@ struct address_space {
 #endif
 	struct rb_root_cached	i_mmap;
 	struct rw_semaphore	i_mmap_rwsem;
+	atomic64_t              mmap_count;
 	unsigned long		nrpages;
 	unsigned long		nrexceptional;
 	pgoff_t			writeback_index;
@@ -1951,6 +1952,11 @@ static inline void enable_dax_state_static_branch(void)
 #define enable_dax_state_static_branch()
 #endif /* CONFIG_FS_DAX */
 
+static inline bool inode_has_mappings(struct inode *inode)
+{
+	return (atomic64_read(&inode->i_mapping->mmap_count) != 0);
+}
+
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 				     struct iov_iter *iter)
 {
diff --git a/mm/mmap.c b/mm/mmap.c
index 7cc2562b99fd..6bb16a0996b5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -171,12 +171,17 @@ void unlink_file_vma(struct vm_area_struct *vma)
 static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *next = vma->vm_next;
+	struct file *f = vma->vm_file;
 
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
+	if (f) {
+		struct inode *inode = file_inode(f);
+		if (inode)
+			atomic64_dec(&inode->i_mapping->mmap_count);
+		fput(f);
+	}
 	mpol_put(vma_policy(vma));
 	vm_area_free(vma);
 	return next;
@@ -1830,6 +1835,16 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vma_set_page_prot(vma);
 
+	/*
+	 * Track if there is mapping in place such that a state change
+	 * does not occur on a file which is mapped
+	 */
+	if (file) {
+		struct inode		*inode = file_inode(file);
+
+		atomic64_inc(&inode->i_mapping->mmap_count);
+	}
+
 	return addr;
 
 unmap_and_free_vma:
-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 11/12] fs/xfs: Clean up locking in dax invalidate
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
                   ` (9 preceding siblings ...)
  2020-02-08 19:34 ` [PATCH v3 10/12] fs: Prevent DAX state change if file is mmap'ed ira.weiny
@ 2020-02-08 19:34 ` ira.weiny
  2020-02-08 19:34 ` [PATCH v3 12/12] fs/xfs: Allow toggle of effective DAX flag ira.weiny
  2020-02-10 15:15 ` [PATCH v3 00/12] Enable per-file/directory DAX operations V3 Jeff Moyer
  12 siblings, 0 replies; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Define a variable to hold the lock flags to ensure that the correct
locks are returned or released on error.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index faba232b1f31..0134c9e65efb 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1190,7 +1190,7 @@ xfs_ioctl_setattr_dax_invalidate(
 	int			*join_flags)
 {
 	struct inode		*inode = VFS_I(ip);
-	int			error;
+	int			error, flags;
 
 	*join_flags = 0;
 
@@ -1205,8 +1205,10 @@ xfs_ioctl_setattr_dax_invalidate(
 	if (S_ISDIR(inode->i_mode))
 		return 0;
 
+	flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+
 	/* lock, flush and invalidate mapping in preparation for flag change */
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+	xfs_ilock(ip, flags);
 
 	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
 	    !xfs_inode_supports_dax(ip)) {
@@ -1229,11 +1231,11 @@ xfs_ioctl_setattr_dax_invalidate(
 	if (error)
 		goto out_unlock;
 
-	*join_flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+	*join_flags = flags;
 	return 0;
 
 out_unlock:
-	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+	xfs_iunlock(ip, flags);
 	return error;
 
 }
-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 12/12] fs/xfs: Allow toggle of effective DAX flag
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
                   ` (10 preceding siblings ...)
  2020-02-08 19:34 ` [PATCH v3 11/12] fs/xfs: Clean up locking in dax invalidate ira.weiny
@ 2020-02-08 19:34 ` ira.weiny
  2020-02-10 15:15 ` [PATCH v3 00/12] Enable per-file/directory DAX operations V3 Jeff Moyer
  12 siblings, 0 replies; 52+ messages in thread
From: ira.weiny @ 2020-02-08 19:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Now that locking of the inode is in place we can allow a DAX state
change while under the new lock.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V2:
	Add in lock_dax_state_static_key static branch enabling.

 fs/xfs/xfs_inode.h |  1 +
 fs/xfs/xfs_ioctl.c | 15 ++++++++++++---
 fs/xfs/xfs_iops.c  | 15 +++++++++++----
 fs/xfs/xfs_super.c | 16 +++++++++-------
 4 files changed, 33 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 25fe20740bf7..0064f8eef41d 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -469,6 +469,7 @@ int	xfs_break_layouts(struct inode *inode, uint *iolock,
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
 extern void xfs_setup_iops(struct xfs_inode *ip);
+extern void xfs_setup_a_ops(struct xfs_inode *ip);
 
 /*
  * When setting up a newly allocated inode, we need to call
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 0134c9e65efb..0736cf45223d 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1123,12 +1123,11 @@ xfs_diflags_to_linux(
 		inode->i_flags |= S_NOATIME;
 	else
 		inode->i_flags &= ~S_NOATIME;
-#if 0	/* disabled until the flag switching races are sorted out */
 	if (xflags & FS_XFLAG_DAX)
 		inode->i_flags |= S_DAX;
 	else
 		inode->i_flags &= ~S_DAX;
-#endif
+
 }
 
 static int
@@ -1194,6 +1193,10 @@ xfs_ioctl_setattr_dax_invalidate(
 
 	*join_flags = 0;
 
+#if !defined(CONFIG_FS_DAX)
+	return -EINVAL;
+#endif
+
 	/* If the DAX state is not changing, we have nothing to do here. */
 	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
 	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
@@ -1205,7 +1208,7 @@ xfs_ioctl_setattr_dax_invalidate(
 	if (S_ISDIR(inode->i_mode))
 		return 0;
 
-	flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+	flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL | XFS_DAX_EXCL;
 
 	/* lock, flush and invalidate mapping in preparation for flag change */
 	xfs_ilock(ip, flags);
@@ -1526,6 +1529,9 @@ xfs_ioctl_setattr(
 	else
 		ip->i_d.di_cowextsize = 0;
 
+	if (join_flags & XFS_DAX_EXCL)
+		xfs_setup_a_ops(ip);
+
 	code = xfs_trans_commit(tp);
 
 	/*
@@ -1635,6 +1641,9 @@ xfs_ioc_setxflags(
 		goto out_drop_write;
 	}
 
+	if (join_flags & XFS_DAX_EXCL)
+		xfs_setup_a_ops(ip);
+
 	error = xfs_trans_commit(tp);
 out_drop_write:
 	mnt_drop_write_file(filp);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index eebec159d873..15ef51c7f0b3 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1366,6 +1366,16 @@ xfs_setup_inode(
 	}
 }
 
+void xfs_setup_a_ops(struct xfs_inode *ip)
+{
+	struct inode		*inode = &ip->i_vnode;
+
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &xfs_dax_aops;
+	else
+		inode->i_mapping->a_ops = &xfs_address_space_operations;
+}
+
 void
 xfs_setup_iops(
 	struct xfs_inode	*ip)
@@ -1376,10 +1386,7 @@ xfs_setup_iops(
 	case S_IFREG:
 		inode->i_op = &xfs_inode_operations;
 		inode->i_fop = &xfs_file_operations;
-		if (IS_DAX(inode))
-			inode->i_mapping->a_ops = &xfs_dax_aops;
-		else
-			inode->i_mapping->a_ops = &xfs_address_space_operations;
+		xfs_setup_a_ops(ip);
 		break;
 	case S_IFDIR:
 		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2094386af8ac..af57fe07b56d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1332,6 +1332,7 @@ xfs_fc_fill_super(
 	struct xfs_mount	*mp = sb->s_fs_info;
 	struct inode		*root;
 	int			flags = 0, error;
+	bool                    rtdev_is_dax = false, datadev_is_dax;
 
 	mp->m_super = sb;
 
@@ -1437,17 +1438,18 @@ xfs_fc_fill_super(
 	if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
 		sb->s_flags |= SB_I_VERSION;
 
-	if (mp->m_flags & XFS_MOUNT_DAX) {
-		bool rtdev_is_dax = false, datadev_is_dax;
+	datadev_is_dax = bdev_dax_supported(mp->m_ddev_targp->bt_bdev,
+					    sb->s_blocksize);
+	if (mp->m_rtdev_targp)
+		rtdev_is_dax = bdev_dax_supported(mp->m_rtdev_targp->bt_bdev,
+						  sb->s_blocksize);
+	if (rtdev_is_dax || datadev_is_dax)
+		enable_dax_state_static_branch();
 
+	if (mp->m_flags & XFS_MOUNT_DAX) {
 		xfs_warn(mp,
 		"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
 
-		datadev_is_dax = bdev_dax_supported(mp->m_ddev_targp->bt_bdev,
-			sb->s_blocksize);
-		if (mp->m_rtdev_targp)
-			rtdev_is_dax = bdev_dax_supported(
-				mp->m_rtdev_targp->bt_bdev, sb->s_blocksize);
 		if (!rtdev_is_dax && !datadev_is_dax) {
 			xfs_alert(mp,
 			"DAX unsupported by block device. Turning off DAX.");
-- 
2.21.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
                   ` (11 preceding siblings ...)
  2020-02-08 19:34 ` [PATCH v3 12/12] fs/xfs: Allow toggle of effective DAX flag ira.weiny
@ 2020-02-10 15:15 ` Jeff Moyer
  2020-02-11 20:17   ` Ira Weiny
  12 siblings, 1 reply; 52+ messages in thread
From: Jeff Moyer @ 2020-02-10 15:15 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

Hi, Ira,

Could you please include documentation patches as part of this series?

Thanks,
Jeff


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 05/12] fs: remove unneeded IS_DAX() check
  2020-02-08 19:34 ` [PATCH v3 05/12] fs: remove unneeded IS_DAX() check ira.weiny
@ 2020-02-11  5:34   ` Dave Chinner
  2020-02-11 16:38     ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2020-02-11  5:34 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Jan Kara, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Sat, Feb 08, 2020 at 11:34:38AM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The IS_DAX() check in io_is_direct() causes a race between changing the
> DAX state and creating the iocb flags.
> 
> Remove the check because DAX now emulates the page cache API and
> therefore it does not matter if the file state is DAX or not when the
> iocb flags are created.

This statement is ... weird.

DAX doesn't "emulate" the page cache API at all - it has it's own
read/write methods that filesystems call based on the iomap
infrastructure (dax_iomap_rw()). i.e. there are 3 different IO paths
through the filesystems: the DAX IO path, the direct IO path, and
the buffered IO path.

Indeed, it seems like this works a bit by luck: Ext4 and XFS always
check IS_DAX(inode) in the read/write_iter methods before checking
for IOCB_DIRECT, and hence the IOCB_DIRECT flag is ignored by the
filesystems. i.e. when we got rid of the O_DIRECT paths from DAX, we
forgot to clean up io_is_direct() and it's only due to the ordering
of checks that we went down the DAX path correctly....

That said, the code change is good, but the commit message needs a
rewrite.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/12] fs/xfs: Separate functionality of xfs_inode_supports_dax()
  2020-02-08 19:34 ` [PATCH v3 03/12] fs/xfs: Separate functionality of xfs_inode_supports_dax() ira.weiny
@ 2020-02-11  5:47   ` Dave Chinner
  2020-02-11 16:13     ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2020-02-11  5:47 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Sat, Feb 08, 2020 at 11:34:36AM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> xfs_inode_supports_dax() should reflect if the inode can support DAX not
> that it is enabled for DAX.  Leave that to other helper functions.
> 
> Change the caller of xfs_inode_supports_dax() to call
> xfs_inode_use_dax() which reflects new logic to override the effective
> DAX flag with either the mount option or the physical DAX flag.
> 
> To make the logic clear create 2 helper functions for the mount and
> physical flag.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  fs/xfs/xfs_iops.c | 32 ++++++++++++++++++++++++++------
>  1 file changed, 26 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 81f2f93caec0..a7db50d923d4 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1236,6 +1236,15 @@ static const struct inode_operations xfs_inline_symlink_inode_operations = {
>  	.update_time		= xfs_vn_update_time,
>  };
>  
> +static bool
> +xfs_inode_mount_is_dax(
> +	struct xfs_inode *ip)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +
> +	return (mp->m_flags & XFS_MOUNT_DAX) == XFS_MOUNT_DAX;
> +}
> +
>  /* Figure out if this file actually supports DAX. */
>  static bool
>  xfs_inode_supports_dax(
> @@ -1247,11 +1256,6 @@ xfs_inode_supports_dax(
>  	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
>  		return false;
>  
> -	/* DAX mount option or DAX iflag must be set. */
> -	if (!(mp->m_flags & XFS_MOUNT_DAX) &&
> -	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> -		return false;
> -
>  	/* Block size must match page size */
>  	if (mp->m_sb.sb_blocksize != PAGE_SIZE)
>  		return false;
> @@ -1260,6 +1264,22 @@ xfs_inode_supports_dax(
>  	return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
>  }
>  
> +static bool
> +xfs_inode_is_dax(
> +	struct xfs_inode *ip)
> +{
> +	return (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) == XFS_DIFLAG2_DAX;
> +}

I don't think these wrappers add any value at all - the naming of
them is entirely confusing, too. e.g. "inode is dax" doesn't tell me
that it is checking the on disk flags - it doesn't tell me how it is
different to IS_DAX, or why I'd use one versus the other. And then
xfs_inode_mount_is_dax() is just... worse.

Naming is hard. :)

> +
> +static bool
> +xfs_inode_use_dax(
> +	struct xfs_inode *ip)
> +{
> +	return xfs_inode_supports_dax(ip) &&
> +		(xfs_inode_mount_is_dax(ip) ||
> +		 xfs_inode_is_dax(ip));
> +}

Urk. Naming - we're not "using dax" here, we are checkign to see if
we should enable DAX on this inode. IOWs:

static bool
xfs_inode_enable_dax(
	struct xfs_inode *ip)
{
	if (!xfs_inode_supports_dax(ip))
		return false;

	if (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX)
		return true;
	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
		return true;
	return false;
}

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 04/12] fs/xfs: Clean up DAX support check
  2020-02-08 19:34 ` [PATCH v3 04/12] fs/xfs: Clean up DAX support check ira.weiny
@ 2020-02-11  5:57   ` Dave Chinner
  2020-02-11 16:28     ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2020-02-11  5:57 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Sat, Feb 08, 2020 at 11:34:37AM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Rather than open coding xfs_inode_supports_dax() in
> xfs_ioctl_setattr_dax_invalidate() export xfs_inode_supports_dax() and
> call it in preparation for swapping dax flags.
> 
> This also means updating xfs_inode_supports_dax() to return true for a
> directory.

That's not correct. This now means S_DAX gets set on directory inodes
because both xfs_inode_supports_dax() and the on-disk inode flag
checks return true in xfs_diflags_to_iflags(). Hence when we
instantiate a directory inode with a DAX inherit hint set on it
we'll set S_DAX on the inode and so IS_DAX() will return true for
directory inodes...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 06/12] fs/xfs: Check if the inode supports DAX under lock
  2020-02-08 19:34 ` [PATCH v3 06/12] fs/xfs: Check if the inode supports DAX under lock ira.weiny
@ 2020-02-11  6:16   ` Dave Chinner
  2020-02-11 17:55     ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2020-02-11  6:16 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Sat, Feb 08, 2020 at 11:34:39AM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> One of the checks for an inode supporting DAX is if the inode is
> reflinked.  During a non-DAX to DAX state change we could race with
> the file being reflinked and end up with a reflinked file being in DAX
> state.
> 
> Prevent this race by checking for DAX support under the MMAP_LOCK.

The on disk inode flags are protected by the XFS_ILOCK, not the
MMAP_LOCK. i.e. the MMAPLOCK provides data access serialisation, not
metadata modification serialisation.

> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
>  fs/xfs/xfs_ioctl.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index da1eb2bdb386..4ff402fd6636 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1194,10 +1194,6 @@ xfs_ioctl_setattr_dax_invalidate(
>  
>  	*join_flags = 0;
>  
> -	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
> -	    !xfs_inode_supports_dax(ip))
> -		return -EINVAL;
> -
>  	/* If the DAX state is not changing, we have nothing to do here. */
>  	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
>  	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> @@ -1211,6 +1207,13 @@ xfs_ioctl_setattr_dax_invalidate(
>  
>  	/* lock, flush and invalidate mapping in preparation for flag change */
>  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
> +
> +	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
> +	    !xfs_inode_supports_dax(ip)) {
> +		error = -EINVAL;
> +		goto out_unlock;
> +	}

Yes, you might be able to get away with reflink vs dax flag
serialisation on the inode flag modification, but it is not correct and
leaves a landmine for future inode flag modifications that are done
without holding either the MMAP or IOLOCK.

e.g. concurrent calls to xfs_ioctl_setattr() setting/clearing flags
other than the on disk DAX flag are all serialised by the ILOCK_EXCL
and will no be serialised against this DAX check. Hence if there are
other flags that we add in future that affect the result of
xfs_inode_supports_dax(), this code will not be correctly
serialised.

This raciness in checking the DAX flags is the reason that
xfs_ioctl_setattr_xflags() redoes all the reflink vs dax checks once
it's called under the XFS_ILOCK_EXCL during the actual change
transaction....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 07/12] fs: Add locking for a dynamic DAX state
  2020-02-08 19:34 ` [PATCH v3 07/12] fs: Add locking for a dynamic DAX state ira.weiny
@ 2020-02-11  8:00   ` Dave Chinner
  2020-02-11 20:14     ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2020-02-11  8:00 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Sat, Feb 08, 2020 at 11:34:40AM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> DAX requires special address space operations but many other functions
> check the IS_DAX() state.
> 
> While DAX is a property of the inode we perfer a lock at the super block
> level because of the overhead of a rwsem within the inode.
> 
> Define a vfs per superblock percpu rs semaphore to lock the DAX state

????

> while performing various VFS layer operations.  Write lock calls are
> provided here but are used in subsequent patches by the file systems
> themselves.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from V2
> 
> 	Rebase on linux-next-08-02-2020
> 
> 	Fix locking order
> 	Change all references from mode to state where appropriate
> 	add CONFIG_FS_DAX requirement for state change
> 	Use a static branch to enable locking only when a dax capable
> 		device has been seen.
> 
> 	Move the lock to a global vfs lock
> 
> 		this does a few things
> 			1) preps us better for ext4 support
> 			2) removes funky callbacks from inode ops
> 			3) remove complexity from XFS and probably from
> 			   ext4 later
> 
> 		We can do this because
> 			1) the locking order is required to be at the
> 			   highest level anyway, so why complicate xfs
> 			2) We had to move the sem to the super_block
> 			   because it is too heavy for the inode.
> 			3) After internal discussions with Dan we
> 			   decided that this would be easier, just as
> 			   performant, and with slightly less overhead
> 			   than in the VFS SB.
> 
> 		We also change the functions names to up/down;
> 		read/write as appropriate.  Previous names were over
> 		simplified.

This, IMO, is a bit of a train wreck.

This patch has nothing to do with "DAX state", it's about
serialising access to the aops vector. There should be zero
references to DAX in this patch at all, except maybe to say
"switching DAX on dynamically requires atomic switching of address
space ops".

Big problems I see here:

1. static key to turn it on globally.
  - just a gross hack around doing it properly with a per-sb
    mechanism and enbaling it only on filesystems that are on DAX
    capable block devices.
  - you're already passing in an inode to all these functions. It's
    trivial to do:

	if (!inode->i_sb->s_flags & S_DYNAMIC_AOPS)
		return
	/* do sb->s_aops_lock manipulation */

2. global lock
  - OMG!
  - global lock will cause entire system IO/page fault stalls
    when someone does recursive/bulk DAX flag modification
    operations. Per-cpu rwsem Contention on  large systems will be
    utterly awful.
  - ext4's use case almost never hits the exclusive lock side of the
    percpu-rwsem - only when changing the journal mode flag on the
    inode. And it only affects writeback in progress, so it's not
    going to have massive concurrency on it like a VFS level global
    lock has. -> Bad model to follow.
  - per-sb lock is trivial - see #1 - which limits scope to single
    filesystem
  - per-inode rwsem would make this problem go away entirely.

3. If we can use a global per-cpu rwsem, why can't we just use a
   per-inode rwsem?
  - locking context rules are the same
  - rwsem scales pretty damn well for shared ops
  - no "global" contention problems
  - small enough that we can put another rwsem in the inode.

4. "inode_dax_state_up_read"
  - Eye bleeds.
  - this is about the aops structure serialisation, not dax.
  - The name makes no sense in the places that it has been added.

5. We already have code that does almost exactly what we need: the
   superblock freezing infrastructure.
  - freezing implements a "hold operations on this superblock until
    further notice" model we can easily follow.
  - sb_start_write/sb_end_write provides a simple API model and a
    clean, clear and concise naming convention we can use, too.


Really, I'm struggling to understand how we got to "global locking
that stops the world" from "need to change per-inode state
atomically". Can someone please explain to me why this patch isn't
just a simple set of largely self-explanitory functions like this:

XX_start_aop()
XX_end_aop()

XX_lock_aops()
XX_switch_aops(aops)
XX_unlock_aops()

where "XX" is "sb" or "inode" depending on what locking granularity
is used...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 03/12] fs/xfs: Separate functionality of xfs_inode_supports_dax()
  2020-02-11  5:47   ` Dave Chinner
@ 2020-02-11 16:13     ` Ira Weiny
  0 siblings, 0 replies; 52+ messages in thread
From: Ira Weiny @ 2020-02-11 16:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Tue, Feb 11, 2020 at 04:47:48PM +1100, Dave Chinner wrote:
> On Sat, Feb 08, 2020 at 11:34:36AM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 

[snip]

> >  
> > +static bool
> > +xfs_inode_is_dax(
> > +	struct xfs_inode *ip)
> > +{
> > +	return (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) == XFS_DIFLAG2_DAX;
> > +}
> 
> I don't think these wrappers add any value at all - the naming of
> them is entirely confusing, too. e.g. "inode is dax" doesn't tell me
> that it is checking the on disk flags - it doesn't tell me how it is
> different to IS_DAX, or why I'd use one versus the other. And then
> xfs_inode_mount_is_dax() is just... worse.
> 
> Naming is hard. :)

Sure...  I'm particularly bad as well...

FWIW I don't see how xfs_inode_mount_is_dax() is worse, I rather think that is
pretty clear but I'm not going to quibble over names because I know I'm rubbish
at it and I'm certainly not enough of a FS person to make them clear...  ;-)

> 
> > +
> > +static bool
> > +xfs_inode_use_dax(
> > +	struct xfs_inode *ip)
> > +{
> > +	return xfs_inode_supports_dax(ip) &&
> > +		(xfs_inode_mount_is_dax(ip) ||
> > +		 xfs_inode_is_dax(ip));
> > +}
> 
> Urk. Naming - we're not "using dax" here, we are checkign to see if
> we should enable DAX on this inode. IOWs:

Well just to defend myself a little bit.  My thought was:

"When setting i_flags, should I use dax?"

> 
> static bool
> xfs_inode_enable_dax(
> 	struct xfs_inode *ip)
> {
> 	if (!xfs_inode_supports_dax(ip))
> 		return false;
> 
> 	if (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX)
> 		return true;
> 	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
> 		return true;
> 	return false;
> }

Anyway, I'm good with this.

Changed for V4.

Thanks!
Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 04/12] fs/xfs: Clean up DAX support check
  2020-02-11  5:57   ` Dave Chinner
@ 2020-02-11 16:28     ` Ira Weiny
  2020-02-11 20:38       ` Dave Chinner
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-11 16:28 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Tue, Feb 11, 2020 at 04:57:45PM +1100, Dave Chinner wrote:
> On Sat, Feb 08, 2020 at 11:34:37AM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Rather than open coding xfs_inode_supports_dax() in
> > xfs_ioctl_setattr_dax_invalidate() export xfs_inode_supports_dax() and
> > call it in preparation for swapping dax flags.
> > 
> > This also means updating xfs_inode_supports_dax() to return true for a
> > directory.
> 
> That's not correct. This now means S_DAX gets set on directory inodes
> because both xfs_inode_supports_dax() and the on-disk inode flag
> checks return true in xfs_diflags_to_iflags(). Hence when we
> instantiate a directory inode with a DAX inherit hint set on it
> we'll set S_DAX on the inode and so IS_DAX() will return true for
> directory inodes...

I'm not following.  Don't we want S_DAX to get set on directory inodes?

IIRC what we wanted was something like this where IS_DAX is the current state
and "dax" is the inode flag:

/ <IS_DAX=0 dax=0>
	dir1 <IS_DAX=0 dax=0>
		f0 <IS_DAX=0 dax=0>
		f1 <IS_DAX=1 dax=1>
	dir2 <IS_DAX=1 dax=1>
		f2 <IS_DAX=1 dax=1>
		f3 <IS_DAX=0 dax=0>
		dir3 <IS_DAX=1 dax=1>
			f4 <IS_DAX=1 dax=1>
		dir4 <IS_DAX=0 dax=0>
			f5 <IS_DAX=0 dax=0>
		f6 <IS_DAX=1 dax=1>

Where f1, dir2, f3, and dir4 required explicit state changes when they were
created.  Because they inherited their dax state from the parent.  All the
other creations happened based on the DAX state of the parent directory.  So we
need to store and know the state of the directories.  What am I missing?

Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 05/12] fs: remove unneeded IS_DAX() check
  2020-02-11  5:34   ` Dave Chinner
@ 2020-02-11 16:38     ` Ira Weiny
  2020-02-11 20:41       ` Dave Chinner
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-11 16:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Jan Kara, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Tue, Feb 11, 2020 at 04:34:01PM +1100, Dave Chinner wrote:
> On Sat, Feb 08, 2020 at 11:34:38AM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > The IS_DAX() check in io_is_direct() causes a race between changing the
> > DAX state and creating the iocb flags.
> > 
> > Remove the check because DAX now emulates the page cache API and
> > therefore it does not matter if the file state is DAX or not when the
> > iocb flags are created.
> 
> This statement is ... weird.
> 
> DAX doesn't "emulate" the page cache API at all

ah...  yea emulate is a bad word here.

> - it has it's own
> read/write methods that filesystems call based on the iomap
> infrastructure (dax_iomap_rw()). i.e. there are 3 different IO paths
> through the filesystems: the DAX IO path, the direct IO path, and
> the buffered IO path.
> 
> Indeed, it seems like this works a bit by luck: Ext4 and XFS always
> check IS_DAX(inode) in the read/write_iter methods before checking
> for IOCB_DIRECT, and hence the IOCB_DIRECT flag is ignored by the
> filesystems. i.e. when we got rid of the O_DIRECT paths from DAX, we
> forgot to clean up io_is_direct() and it's only due to the ordering
> of checks that we went down the DAX path correctly....
> 
> That said, the code change is good, but the commit message needs a
> rewrite.

How about?

<commit msg>
  fs: Remove unneeded IS_DAX() check
  
  The IS_DAX() check in io_is_direct() causes a race between changing the
  DAX state and creating the iocb flags.

  Remove the check because DAX now has it's own read/write methods and
  file systems which support DAX check IS_DAX() prior to IOCB_DIRECT.
  Therefore, it does not matter if the file state is DAX when the iocb
  flags are created, and we can avoid the race.

  Reviewed-by: Jan Kara <jack@suse.cz>
  Signed-off-by: Ira Weiny <ira.weiny@intel.com>
</commit msg>

Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 06/12] fs/xfs: Check if the inode supports DAX under lock
  2020-02-11  6:16   ` Dave Chinner
@ 2020-02-11 17:55     ` Ira Weiny
  2020-02-11 20:42       ` Dave Chinner
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-11 17:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Tue, Feb 11, 2020 at 05:16:39PM +1100, Dave Chinner wrote:
> On Sat, Feb 08, 2020 at 11:34:39AM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > One of the checks for an inode supporting DAX is if the inode is
> > reflinked.  During a non-DAX to DAX state change we could race with
> > the file being reflinked and end up with a reflinked file being in DAX
> > state.
> > 
> > Prevent this race by checking for DAX support under the MMAP_LOCK.
> 
> The on disk inode flags are protected by the XFS_ILOCK, not the
> MMAP_LOCK. i.e. the MMAPLOCK provides data access serialisation, not
> metadata modification serialisation.

Ah...

> 
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > ---
> >  fs/xfs/xfs_ioctl.c | 11 +++++++----
> >  1 file changed, 7 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > index da1eb2bdb386..4ff402fd6636 100644
> > --- a/fs/xfs/xfs_ioctl.c
> > +++ b/fs/xfs/xfs_ioctl.c
> > @@ -1194,10 +1194,6 @@ xfs_ioctl_setattr_dax_invalidate(
> >  
> >  	*join_flags = 0;
> >  
> > -	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
> > -	    !xfs_inode_supports_dax(ip))
> > -		return -EINVAL;
> > -
> >  	/* If the DAX state is not changing, we have nothing to do here. */
> >  	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
> >  	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> > @@ -1211,6 +1207,13 @@ xfs_ioctl_setattr_dax_invalidate(
> >  
> >  	/* lock, flush and invalidate mapping in preparation for flag change */
> >  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
> > +
> > +	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
> > +	    !xfs_inode_supports_dax(ip)) {
> > +		error = -EINVAL;
> > +		goto out_unlock;
> > +	}
> 
> Yes, you might be able to get away with reflink vs dax flag
> serialisation on the inode flag modification, but it is not correct and
> leaves a landmine for future inode flag modifications that are done
> without holding either the MMAP or IOLOCK.
> 
> e.g. concurrent calls to xfs_ioctl_setattr() setting/clearing flags
> other than the on disk DAX flag are all serialised by the ILOCK_EXCL
> and will no be serialised against this DAX check. Hence if there are
> other flags that we add in future that affect the result of
> xfs_inode_supports_dax(), this code will not be correctly
> serialised.
> 
> This raciness in checking the DAX flags is the reason that
> xfs_ioctl_setattr_xflags() redoes all the reflink vs dax checks once
> it's called under the XFS_ILOCK_EXCL during the actual change
> transaction....

Ok I found this by trying to make sure that the xfs_inode_supports_dax() call
was always returning valid data.  So I don't have a specific test which was
failing.

Looking at the code again, it sounds like I was wrong about which locks protect
what and with your explanation above it looks like there is nothing to be done
here and I can drop the patch.

Would you agree?

Thanks for the review!
Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 07/12] fs: Add locking for a dynamic DAX state
  2020-02-11  8:00   ` Dave Chinner
@ 2020-02-11 20:14     ` Ira Weiny
  2020-02-11 20:59       ` Dan Williams
  2020-02-11 21:49       ` Dave Chinner
  0 siblings, 2 replies; 52+ messages in thread
From: Ira Weiny @ 2020-02-11 20:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Tue, Feb 11, 2020 at 07:00:35PM +1100, Dave Chinner wrote:
> On Sat, Feb 08, 2020 at 11:34:40AM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > DAX requires special address space operations but many other functions
> > check the IS_DAX() state.
> > 
> > While DAX is a property of the inode we perfer a lock at the super block
> > level because of the overhead of a rwsem within the inode.
> > 
> > Define a vfs per superblock percpu rs semaphore to lock the DAX state
> 
> ????

oops...  I must have forgotten to update the commit message when I did the
global RW sem.  I implemented the per-SB, percpu rwsem first but it was
suggested that the percpu nature of the lock combined with the anticipated
infrequent use of the write side made using a global easier.

But before I proceed on V4 I'd like to get consensus on which of the 2 locking
models to go with.

	1) percpu per superblock lock
	2) per inode rwsem

Depending on my mood I can convince myself of both being performant but the
percpu is very attractive because I don't anticipate many changes of state
during run time.  OTOH concurrent threads accessing the same file at run time
may also be low so there is likely to be little read contention across CPU's on
the per-inode lock?

Opinions?

> 
> > while performing various VFS layer operations.  Write lock calls are
> > provided here but are used in subsequent patches by the file systems
> > themselves.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes from V2
> > 
> > 	Rebase on linux-next-08-02-2020
> > 
> > 	Fix locking order
> > 	Change all references from mode to state where appropriate
> > 	add CONFIG_FS_DAX requirement for state change
> > 	Use a static branch to enable locking only when a dax capable
> > 		device has been seen.
> > 
> > 	Move the lock to a global vfs lock
> > 
> > 		this does a few things
> > 			1) preps us better for ext4 support
> > 			2) removes funky callbacks from inode ops
> > 			3) remove complexity from XFS and probably from
> > 			   ext4 later
> > 
> > 		We can do this because
> > 			1) the locking order is required to be at the
> > 			   highest level anyway, so why complicate xfs
> > 			2) We had to move the sem to the super_block
> > 			   because it is too heavy for the inode.
> > 			3) After internal discussions with Dan we
> > 			   decided that this would be easier, just as
> > 			   performant, and with slightly less overhead
> > 			   than in the VFS SB.
> > 
> > 		We also change the functions names to up/down;
> > 		read/write as appropriate.  Previous names were over
> > 		simplified.
> 
> This, IMO, is a bit of a train wreck.
> 
> This patch has nothing to do with "DAX state", it's about
> serialising access to the aops vector.

It is a bit more than just the aops vector.  It also has to protect against
checks of IS_DAX() which occur through many of the file operations.

>
> There should be zero
> references to DAX in this patch at all, except maybe to say
> "switching DAX on dynamically requires atomic switching of address
> space ops".

I'm ok with removing the dax references.  However, it very specifically
protects against IS_DAX checks.  Adding in protection for other states will
require care IMO.  This is why I preserved the dax naming because I did not
want to over step what is protected with the locking.  I do _want_ it to
protect more.  But I can't guarantee it, have not tested it, and so made the
calls specific to dax.

> 
> Big problems I see here:
> 
> 1. static key to turn it on globally.
>   - just a gross hack around doing it properly with a per-sb
>     mechanism and enbaling it only on filesystems that are on DAX
>     capable block devices.

Fair enough.  This was a reaction to Darricks desire to get this out of the way
when DAX was not in the system.  The static branch seemed like a good thing for
users who don't have DAX capable hardware while running kernels and FS's which
have DAX enabled.

http://lkml.iu.edu/hypermail/linux/kernel/2001.1/05691.html

>   - you're already passing in an inode to all these functions. It's
>     trivial to do:
> 
> 	if (!inode->i_sb->s_flags & S_DYNAMIC_AOPS)
> 		return
> 	/* do sb->s_aops_lock manipulation */

Yea that would be ok IMO.

Darrick would just having this be CONFIG_FS_DAX as per this patch be ok with
you.  I guess the static key may have been a bit of overkill?

> 
> 2. global lock
>   - OMG!

I know.  The thinking on this is that the locking is percpu which is near
0 overhead in the read case and we are rarely going to take exclusive access.

>   - global lock will cause entire system IO/page fault stalls
>     when someone does recursive/bulk DAX flag modification
>     operations. Per-cpu rwsem Contention on  large systems will be
>     utterly awful.

Yea that is probably bad...  I certainly did not test the responsiveness of
this.  FWIW if the system only has 1 FS the per-SB lock is not going to be any
different from the global which was part of our thinking.

>   - ext4's use case almost never hits the exclusive lock side of the
>     percpu-rwsem - only when changing the journal mode flag on the
>     inode. And it only affects writeback in progress, so it's not
>     going to have massive concurrency on it like a VFS level global
>     lock has. -> Bad model to follow.

I admit I did not research the ext4's journal mode locking that much.

>   - per-sb lock is trivial - see #1 - which limits scope to single
>     filesystem

I agree and...  well the commit message shows I actually implemented it that
way at first...  :-/

>   - per-inode rwsem would make this problem go away entirely.

But would that be ok for concurrent read performance which is going to be used
99.99% of the time?  Maybe Darricks comments made me panic a little bit too
much overhead WRT locking and its potential impact on users not using DAX?

> 
> 3. If we can use a global per-cpu rwsem, why can't we just use a
>    per-inode rwsem?

Per-cpu lock was attractive because of its near 0 overhead to take the read
lock which happens a lot during normal operations.

>   - locking context rules are the same
>   - rwsem scales pretty damn well for shared ops

Does it?  I'm not sure.

>   - no "global" contention problems
>   - small enough that we can put another rwsem in the inode.

Everything else I agree with...  :-D

> 
> 4. "inode_dax_state_up_read"
>   - Eye bleeds.
>   - this is about the aops structure serialisation, not dax.
>   - The name makes no sense in the places that it has been added.

Again it is about more than just the aops.  IS_DAX() needs to be protected in
all the file operations calls as well or we can get races with the logic in
those calls and a state switch.

> 
> 5. We already have code that does almost exactly what we need: the
>    superblock freezing infrastructure.

I think freeze does a lot more than we need.

>   - freezing implements a "hold operations on this superblock until
>     further notice" model we can easily follow.
>   - sb_start_write/sb_end_write provides a simple API model and a
>     clean, clear and concise naming convention we can use, too.

Ok as a model...  If going with the per-SB lock.  After working through my
response I'm leaning toward a per-inode lock again.  This was the way I did
this to begin with.

I want feedback before reworking for V4, please.

> 
> Really, I'm struggling to understand how we got to "global locking
> that stops the world" from "need to change per-inode state
> atomically". Can someone please explain to me why this patch isn't
> just a simple set of largely self-explanitory functions like this:
> 
> XX_start_aop()
> XX_end_aop()
> 
> XX_lock_aops()
> XX_switch_aops(aops)
> XX_unlock_aops()
> 
> where "XX" is "sb" or "inode" depending on what locking granularity
> is used...

Because failing to protect the logic around IS_DAX() results in failures in the
read/write and direct IO paths.

So we need to lock the file operations as well.

Thanks for the review, :-D
Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-10 15:15 ` [PATCH v3 00/12] Enable per-file/directory DAX operations V3 Jeff Moyer
@ 2020-02-11 20:17   ` Ira Weiny
  2020-02-12 19:49     ` Jeff Moyer
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-11 20:17 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Mon, Feb 10, 2020 at 10:15:47AM -0500, Jeff Moyer wrote:
> Hi, Ira,
> 
> Could you please include documentation patches as part of this series?

I do have an update to the vfs.rst doc in

	fs: Add locking for a dynamic DAX state

I'm happy to do more but was there something specific you would like to see?
Or documentation in xfs perhaps?

Ira

> 
> Thanks,
> Jeff
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 04/12] fs/xfs: Clean up DAX support check
  2020-02-11 16:28     ` Ira Weiny
@ 2020-02-11 20:38       ` Dave Chinner
  0 siblings, 0 replies; 52+ messages in thread
From: Dave Chinner @ 2020-02-11 20:38 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Tue, Feb 11, 2020 at 08:28:30AM -0800, Ira Weiny wrote:
> On Tue, Feb 11, 2020 at 04:57:45PM +1100, Dave Chinner wrote:
> > On Sat, Feb 08, 2020 at 11:34:37AM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > Rather than open coding xfs_inode_supports_dax() in
> > > xfs_ioctl_setattr_dax_invalidate() export xfs_inode_supports_dax() and
> > > call it in preparation for swapping dax flags.
> > > 
> > > This also means updating xfs_inode_supports_dax() to return true for a
> > > directory.
> > 
> > That's not correct. This now means S_DAX gets set on directory inodes
> > because both xfs_inode_supports_dax() and the on-disk inode flag
> > checks return true in xfs_diflags_to_iflags(). Hence when we
> > instantiate a directory inode with a DAX inherit hint set on it
> > we'll set S_DAX on the inode and so IS_DAX() will return true for
> > directory inodes...
> 
> I'm not following.  Don't we want S_DAX to get set on directory inodes?

No, because S_DAX is used for controlling direct user data access,
and we *never* let users directly access directory data. Even the
filesystems don't access directory data directly - the transactional
change model of journaling filesystems requires metadata to be
buffered in memory and never directly modified.

Hence when filesystems like ext4 keep their directory data in the
page cache, we do not want the kernel to think that this inode is
accessed through the DAX subsystem - it is accessed via the buffered
IO interfaces like page_cache_sync_readahead() and writeback is
controlled by the journal. Hence setting S_DAX on these inodes is
incorrect.

Whilst XFS doesn't use the page cache for it's metadata
buffering, the issue is the same as it's also a journalling
filesystem. hence setting S_DAX on XFS directory inodes is also
incorrect.

> IIRC what we wanted was something like this where IS_DAX is the current state
> and "dax" is the inode flag:
> 
> / <IS_DAX=0 dax=0>
> 	dir1 <IS_DAX=0 dax=0>
> 		f0 <IS_DAX=0 dax=0>
> 		f1 <IS_DAX=1 dax=1>
> 	dir2 <IS_DAX=1 dax=1>
> 		f2 <IS_DAX=1 dax=1>
> 		f3 <IS_DAX=0 dax=0>
> 		dir3 <IS_DAX=1 dax=1>
> 			f4 <IS_DAX=1 dax=1>
> 		dir4 <IS_DAX=0 dax=0>
> 			f5 <IS_DAX=0 dax=0>
> 		f6 <IS_DAX=1 dax=1>
> 
> Where f1, dir2, f3, and dir4 required explicit state changes when they were
> created.  Because they inherited their dax state from the parent.  All the
> other creations happened based on the DAX state of the parent directory.  So we
> need to store and know the state of the directories.  What am I missing?

I think that you are conflating internal filesystem feature
management details (the inheritance of the DAX flag feature of
directories) with kernel IO path behaviour (the inode S_DAX flag).

i.e. IS_DAX() indicates whether DAX is _actively being used_ to
access the data of a regular file inode, not to indicate whether the
on-disk flags used to manage default behaviour are set or not.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 05/12] fs: remove unneeded IS_DAX() check
  2020-02-11 16:38     ` Ira Weiny
@ 2020-02-11 20:41       ` Dave Chinner
  2020-02-12 16:04         ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2020-02-11 20:41 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, Jan Kara, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Tue, Feb 11, 2020 at 08:38:31AM -0800, Ira Weiny wrote:
> On Tue, Feb 11, 2020 at 04:34:01PM +1100, Dave Chinner wrote:
> > On Sat, Feb 08, 2020 at 11:34:38AM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > The IS_DAX() check in io_is_direct() causes a race between changing the
> > > DAX state and creating the iocb flags.
> > > 
> > > Remove the check because DAX now emulates the page cache API and
> > > therefore it does not matter if the file state is DAX or not when the
> > > iocb flags are created.
> > 
> > This statement is ... weird.
> > 
> > DAX doesn't "emulate" the page cache API at all
> 
> ah...  yea emulate is a bad word here.
> 
> > - it has it's own
> > read/write methods that filesystems call based on the iomap
> > infrastructure (dax_iomap_rw()). i.e. there are 3 different IO paths
> > through the filesystems: the DAX IO path, the direct IO path, and
> > the buffered IO path.
> > 
> > Indeed, it seems like this works a bit by luck: Ext4 and XFS always
> > check IS_DAX(inode) in the read/write_iter methods before checking
> > for IOCB_DIRECT, and hence the IOCB_DIRECT flag is ignored by the
> > filesystems. i.e. when we got rid of the O_DIRECT paths from DAX, we
> > forgot to clean up io_is_direct() and it's only due to the ordering
> > of checks that we went down the DAX path correctly....
> > 
> > That said, the code change is good, but the commit message needs a
> > rewrite.
> 
> How about?
> 
> <commit msg>
>   fs: Remove unneeded IS_DAX() check
>   
>   The IS_DAX() check in io_is_direct() causes a race between changing the
>   DAX state and creating the iocb flags.

This is irrelevant - the check is simply wrong and has been since
~2016 when we moved DAX to use the iomap infrastructure...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 06/12] fs/xfs: Check if the inode supports DAX under lock
  2020-02-11 17:55     ` Ira Weiny
@ 2020-02-11 20:42       ` Dave Chinner
  2020-02-12 16:10         ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2020-02-11 20:42 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Tue, Feb 11, 2020 at 09:55:09AM -0800, Ira Weiny wrote:
> On Tue, Feb 11, 2020 at 05:16:39PM +1100, Dave Chinner wrote:
> > On Sat, Feb 08, 2020 at 11:34:39AM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > One of the checks for an inode supporting DAX is if the inode is
> > > reflinked.  During a non-DAX to DAX state change we could race with
> > > the file being reflinked and end up with a reflinked file being in DAX
> > > state.
> > > 
> > > Prevent this race by checking for DAX support under the MMAP_LOCK.
> > 
> > The on disk inode flags are protected by the XFS_ILOCK, not the
> > MMAP_LOCK. i.e. the MMAPLOCK provides data access serialisation, not
> > metadata modification serialisation.
> 
> Ah...
> 
> > 
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > ---
> > >  fs/xfs/xfs_ioctl.c | 11 +++++++----
> > >  1 file changed, 7 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > > index da1eb2bdb386..4ff402fd6636 100644
> > > --- a/fs/xfs/xfs_ioctl.c
> > > +++ b/fs/xfs/xfs_ioctl.c
> > > @@ -1194,10 +1194,6 @@ xfs_ioctl_setattr_dax_invalidate(
> > >  
> > >  	*join_flags = 0;
> > >  
> > > -	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
> > > -	    !xfs_inode_supports_dax(ip))
> > > -		return -EINVAL;
> > > -
> > >  	/* If the DAX state is not changing, we have nothing to do here. */
> > >  	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
> > >  	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> > > @@ -1211,6 +1207,13 @@ xfs_ioctl_setattr_dax_invalidate(
> > >  
> > >  	/* lock, flush and invalidate mapping in preparation for flag change */
> > >  	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
> > > +
> > > +	if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
> > > +	    !xfs_inode_supports_dax(ip)) {
> > > +		error = -EINVAL;
> > > +		goto out_unlock;
> > > +	}
> > 
> > Yes, you might be able to get away with reflink vs dax flag
> > serialisation on the inode flag modification, but it is not correct and
> > leaves a landmine for future inode flag modifications that are done
> > without holding either the MMAP or IOLOCK.
> > 
> > e.g. concurrent calls to xfs_ioctl_setattr() setting/clearing flags
> > other than the on disk DAX flag are all serialised by the ILOCK_EXCL
> > and will no be serialised against this DAX check. Hence if there are
> > other flags that we add in future that affect the result of
> > xfs_inode_supports_dax(), this code will not be correctly
> > serialised.
> > 
> > This raciness in checking the DAX flags is the reason that
> > xfs_ioctl_setattr_xflags() redoes all the reflink vs dax checks once
> > it's called under the XFS_ILOCK_EXCL during the actual change
> > transaction....
> 
> Ok I found this by trying to make sure that the xfs_inode_supports_dax() call
> was always returning valid data.  So I don't have a specific test which was
> failing.
> 
> Looking at the code again, it sounds like I was wrong about which locks protect
> what and with your explanation above it looks like there is nothing to be done
> here and I can drop the patch.
> 
> Would you agree?

*nod*

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 07/12] fs: Add locking for a dynamic DAX state
  2020-02-11 20:14     ` Ira Weiny
@ 2020-02-11 20:59       ` Dan Williams
  2020-02-11 21:49       ` Dave Chinner
  1 sibling, 0 replies; 52+ messages in thread
From: Dan Williams @ 2020-02-11 20:59 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dave Chinner, Linux Kernel Mailing List, Alexander Viro,
	Darrick J. Wong, Christoph Hellwig, Theodore Y. Ts'o,
	Jan Kara, linux-ext4, linux-xfs, linux-fsdevel

On Tue, Feb 11, 2020 at 12:14 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Tue, Feb 11, 2020 at 07:00:35PM +1100, Dave Chinner wrote:
> > On Sat, Feb 08, 2020 at 11:34:40AM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > >
> > > DAX requires special address space operations but many other functions
> > > check the IS_DAX() state.
> > >
> > > While DAX is a property of the inode we perfer a lock at the super block
> > > level because of the overhead of a rwsem within the inode.
> > >
> > > Define a vfs per superblock percpu rs semaphore to lock the DAX state
> >
> > ????
>
> oops...  I must have forgotten to update the commit message when I did the
> global RW sem.  I implemented the per-SB, percpu rwsem first but it was
> suggested that the percpu nature of the lock combined with the anticipated
> infrequent use of the write side made using a global easier.
>
> But before I proceed on V4 I'd like to get consensus on which of the 2 locking
> models to go with.
>
>         1) percpu per superblock lock
>         2) per inode rwsem
>
> Depending on my mood I can convince myself of both being performant but the
> percpu is very attractive because I don't anticipate many changes of state
> during run time.  OTOH concurrent threads accessing the same file at run time
> may also be low so there is likely to be little read contention across CPU's on
> the per-inode lock?
>
> Opinions?

As one who thought a global lock would be reasonable relative to how
often dax address_space_operations are swapped out (on the order of
taking cgroup_threadgroup_rwsem for write), I think a per-superblock
lock is also an ok starting point. We can always go to finer grained
locking in the future if we see evidence that the benefits of percpu
are lost to stopping the world at the superblock level.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 07/12] fs: Add locking for a dynamic DAX state
  2020-02-11 20:14     ` Ira Weiny
  2020-02-11 20:59       ` Dan Williams
@ 2020-02-11 21:49       ` Dave Chinner
  2020-02-12  6:31         ` Darrick J. Wong
  1 sibling, 1 reply; 52+ messages in thread
From: Dave Chinner @ 2020-02-11 21:49 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Tue, Feb 11, 2020 at 12:14:31PM -0800, Ira Weiny wrote:
> On Tue, Feb 11, 2020 at 07:00:35PM +1100, Dave Chinner wrote:
> > On Sat, Feb 08, 2020 at 11:34:40AM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > DAX requires special address space operations but many other functions
> > > check the IS_DAX() state.
> > > 
> > > While DAX is a property of the inode we perfer a lock at the super block
> > > level because of the overhead of a rwsem within the inode.
> > > 
> > > Define a vfs per superblock percpu rs semaphore to lock the DAX state
> > 
> > ????
> 
> oops...  I must have forgotten to update the commit message when I did the
> global RW sem.  I implemented the per-SB, percpu rwsem first but it was
> suggested that the percpu nature of the lock combined with the anticipated
> infrequent use of the write side made using a global easier.
> 
> But before I proceed on V4 I'd like to get consensus on which of the 2 locking
> models to go with.
> 
> 	1) percpu per superblock lock
> 	2) per inode rwsem
> 
> Depending on my mood I can convince myself of both being performant but the
> percpu is very attractive because I don't anticipate many changes of state
> during run time.  OTOH concurrent threads accessing the same file at run time
> may also be low so there is likely to be little read contention across CPU's on
> the per-inode lock?
> 
> Opinions?
> 
> > 
> > > while performing various VFS layer operations.  Write lock calls are
> > > provided here but are used in subsequent patches by the file systems
> > > themselves.
> > > 
> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > ---
> > > Changes from V2
> > > 
> > > 	Rebase on linux-next-08-02-2020
> > > 
> > > 	Fix locking order
> > > 	Change all references from mode to state where appropriate
> > > 	add CONFIG_FS_DAX requirement for state change
> > > 	Use a static branch to enable locking only when a dax capable
> > > 		device has been seen.
> > > 
> > > 	Move the lock to a global vfs lock
> > > 
> > > 		this does a few things
> > > 			1) preps us better for ext4 support
> > > 			2) removes funky callbacks from inode ops
> > > 			3) remove complexity from XFS and probably from
> > > 			   ext4 later
> > > 
> > > 		We can do this because
> > > 			1) the locking order is required to be at the
> > > 			   highest level anyway, so why complicate xfs
> > > 			2) We had to move the sem to the super_block
> > > 			   because it is too heavy for the inode.
> > > 			3) After internal discussions with Dan we
> > > 			   decided that this would be easier, just as
> > > 			   performant, and with slightly less overhead
> > > 			   than in the VFS SB.
> > > 
> > > 		We also change the functions names to up/down;
> > > 		read/write as appropriate.  Previous names were over
> > > 		simplified.
> > 
> > This, IMO, is a bit of a train wreck.
> > 
> > This patch has nothing to do with "DAX state", it's about
> > serialising access to the aops vector.
> 
> It is a bit more than just the aops vector.  It also has to protect against
> checks of IS_DAX() which occur through many of the file operations.

I think you are looking at this incorrectly. The IS_DAX() construct
is just determining what path to the aops method we take. regardless
of whether IS_DAX() is true or not, we need to execute an aop
method, and so aops vector protection is required regardless of how
IS_DAX() evaluates.

But we do require IS_DAX() to evaluate consistently through an
entire aops protected region, as there may be multiple aops method
calls in a aops context (e.g. write page faults). Hence we have to
ensure that IS_DAX() changes atomically w.r.t. to the aops vector
switches. This is simply:

	sb_aops_lock()
	inode->i_flags |= S_DAX
	sb_aops_switch(new_aops)
	sb_aops_unlock().

This guarantees that inside sb_aops_start/end(), IS_DAX() checks
are stable because we change the state atomically with the aops
vector.

We are *not* providing serialisation of inode->i_flags access or
updates here; all we need to do is ensure that the S_DAX flag is
consistent and stable across an aops access region.  If we are not
in an aops call chain or will not call an aops method, we just don't
care if the IS_DAX() call is racy as whatever we call is still
static and if it's DAAX sensitive it can call IS_DAX() itself when
needed.

Again, this isn't about DAX at all, it's about being able to switch
aops vectors in a safe and reliable manner. The IS_DAX() constraints
are really a minor addition on top of the "stable aops vector"
regions we are laying down here.


> > Big problems I see here:
> > 
> > 1. static key to turn it on globally.
> >   - just a gross hack around doing it properly with a per-sb
> >     mechanism and enbaling it only on filesystems that are on DAX
> >     capable block devices.
> 
> Fair enough.  This was a reaction to Darricks desire to get this out of the way
> when DAX was not in the system.  The static branch seemed like a good thing for
> users who don't have DAX capable hardware while running kernels and FS's which
> have DAX enabled.
> 
> http://lkml.iu.edu/hypermail/linux/kernel/2001.1/05691.html

I think that's largely premature optimisation, and it backs us into
the "all or nothing" corner which is a bad place to be for what is
per-filesystem functionality.

> >   - you're already passing in an inode to all these functions. It's
> >     trivial to do:
> > 
> > 	if (!inode->i_sb->s_flags & S_DYNAMIC_AOPS)
> > 		return
> > 	/* do sb->s_aops_lock manipulation */
> 
> Yea that would be ok IMO.
> 
> Darrick would just having this be CONFIG_FS_DAX as per this patch be ok with
> you.  I guess the static key may have been a bit of overkill?
> 
> > 
> > 2. global lock
> >   - OMG!
> 
> I know.  The thinking on this is that the locking is percpu which is near
> 0 overhead in the read case and we are rarely going to take exclusive access.

The problem is that users can effectively run:

$ xfs_io -c "chattr -R -D -x" /path/to/dax_fs

And then it walks over millions of individual inodes turning off
the DAX flag on each of them. And if each of those inodes takes a
global per-cpu rwsem that blocks all read/write IO and page faults
on everything even for a short time, then this is will have a major
impact on the entire system and users will be very unhappy.

> >   - global lock will cause entire system IO/page fault stalls
> >     when someone does recursive/bulk DAX flag modification
> >     operations. Per-cpu rwsem Contention on  large systems will be
> >     utterly awful.
> 
> Yea that is probably bad...  I certainly did not test the responsiveness of
> this.  FWIW if the system only has 1 FS the per-SB lock is not going to be any
> different from the global which was part of our thinking.

Don't forget that things like /proc, /sys, etc are all filesystems.
Hence the global lock will affect accesses to -everything- in the
system, not just the DAX enabled filesystem. Global locks are -bad-.

> >   - ext4's use case almost never hits the exclusive lock side of the
> >     percpu-rwsem - only when changing the journal mode flag on the
> >     inode. And it only affects writeback in progress, so it's not
> >     going to have massive concurrency on it like a VFS level global
> >     lock has. -> Bad model to follow.
> 
> I admit I did not research the ext4's journal mode locking that much.
> 
> >   - per-sb lock is trivial - see #1 - which limits scope to single
> >     filesystem
> 
> I agree and...  well the commit message shows I actually implemented it that
> way at first...  :-/
> 
> >   - per-inode rwsem would make this problem go away entirely.
> 
> But would that be ok for concurrent read performance which is going to be used
> 99.99% of the time?  Maybe Darricks comments made me panic a little bit too
> much overhead WRT locking and its potential impact on users not using DAX?

I know that a rwsem in shared mode can easily handle 2M lock
cycles/s across a 32p machine without contention (concurrent AIO-DIO
read/writes to a single file) so the baseline performance of a rwsem
is likely good enough to start from.

It's simple, and we can run tests easily enough to find out where it
starts to become a performance limitation. This whole "global percpu
rwsem thing stinks like premature optimisation. Do the simple,
straight forward thing first, then get numbers and analysis the
limitations to determine what the second generation of the
functionality needs to fix.

IMO, we don't have to solve every scalability problem with the
initial implementation. Let's just make it work first, then worry
about extreme scalability when we have some idea of where those
scalability problems are.

> > 3. If we can use a global per-cpu rwsem, why can't we just use a
> >    per-inode rwsem?
> 
> Per-cpu lock was attractive because of its near 0 overhead to take the read
> lock which happens a lot during normal operations.
> 
> >   - locking context rules are the same
> >   - rwsem scales pretty damn well for shared ops
> 
> Does it?  I'm not sure.

If you haven't tested it, then we are definitely in the realm of
premature optimisation...

> 
> >   - no "global" contention problems
> >   - small enough that we can put another rwsem in the inode.
> 
> Everything else I agree with...  :-D
> 
> > 
> > 4. "inode_dax_state_up_read"
> >   - Eye bleeds.
> >   - this is about the aops structure serialisation, not dax.
> >   - The name makes no sense in the places that it has been added.
> 
> Again it is about more than just the aops.  IS_DAX() needs to be protected in
> all the file operations calls as well or we can get races with the logic in
> those calls and a state switch.
> 
> > 
> > 5. We already have code that does almost exactly what we need: the
> >    superblock freezing infrastructure.
> 
> I think freeze does a lot more than we need.
> 
> >   - freezing implements a "hold operations on this superblock until
> >     further notice" model we can easily follow.
> >   - sb_start_write/sb_end_write provides a simple API model and a
> >     clean, clear and concise naming convention we can use, too.
> 
> Ok as a model...  If going with the per-SB lock.

Ok, you completely missed my point. You're still looking at this as
a set of "locks" and serialisation.

Freezing is *not a lock* - it provides a series of "drain points"
where we can transparently block new operations, then wait for all
existing operations to complete so we can make a state change, and
then once that is done we unblock all the waiters....

IOWs, the freeze model provides an ordered barrier mechanism, and
that's precisely what we need for aops protection...

Yes, it may be implemented with locks internally, but that's
implementation detail of the barrier mechanism, not an indication
what functionality it is actually providing.

> After working through my
> response I'm leaning toward a per-inode lock again.  This was the way I did
> this to begin with.
> 
> I want feedback before reworking for V4, please.

IMO, always do the simple thing first.

Only do a more complex thing if the simple thing doesn't work or
doesn't perform sufficiently well for an initial implemenation.
Otherwise we end up with crazy solutions from premature optimisation
that simply aren't viable.

> > Really, I'm struggling to understand how we got to "global locking
> > that stops the world" from "need to change per-inode state
> > atomically". Can someone please explain to me why this patch isn't
> > just a simple set of largely self-explanitory functions like this:
> > 
> > XX_start_aop()
> > XX_end_aop()
> > 
> > XX_lock_aops()
> > XX_switch_aops(aops)
> > XX_unlock_aops()
> > 
> > where "XX" is "sb" or "inode" depending on what locking granularity
> > is used...
> 
> Because failing to protect the logic around IS_DAX() results in failures in the
> read/write and direct IO paths.
> 
> So we need to lock the file operations as well.

Again, there is no "locking" here. We just need to annotate regions
where the aops vector must remain stable. If a fop calls an aop that
the aops vector does not change while it is the middle of executing
the fop. Hence such fops calls will need to be inside
XX_start_aop()/XX_end_aop() markers.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 07/12] fs: Add locking for a dynamic DAX state
  2020-02-11 21:49       ` Dave Chinner
@ 2020-02-12  6:31         ` Darrick J. Wong
  0 siblings, 0 replies; 52+ messages in thread
From: Darrick J. Wong @ 2020-02-12  6:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ira Weiny, linux-kernel, Alexander Viro, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Wed, Feb 12, 2020 at 08:49:17AM +1100, Dave Chinner wrote:
> On Tue, Feb 11, 2020 at 12:14:31PM -0800, Ira Weiny wrote:
> > On Tue, Feb 11, 2020 at 07:00:35PM +1100, Dave Chinner wrote:
> > > On Sat, Feb 08, 2020 at 11:34:40AM -0800, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > 
> > > > DAX requires special address space operations but many other functions
> > > > check the IS_DAX() state.
> > > > 
> > > > While DAX is a property of the inode we perfer a lock at the super block
> > > > level because of the overhead of a rwsem within the inode.
> > > > 
> > > > Define a vfs per superblock percpu rs semaphore to lock the DAX state
> > > 
> > > ????
> > 
> > oops...  I must have forgotten to update the commit message when I did the
> > global RW sem.  I implemented the per-SB, percpu rwsem first but it was
> > suggested that the percpu nature of the lock combined with the anticipated
> > infrequent use of the write side made using a global easier.
> > 
> > But before I proceed on V4 I'd like to get consensus on which of the 2 locking
> > models to go with.
> > 
> > 	1) percpu per superblock lock
> > 	2) per inode rwsem
> > 
> > Depending on my mood I can convince myself of both being performant but the
> > percpu is very attractive because I don't anticipate many changes of state
> > during run time.  OTOH concurrent threads accessing the same file at run time
> > may also be low so there is likely to be little read contention across CPU's on
> > the per-inode lock?
> > 
> > Opinions?
> > 
> > > 
> > > > while performing various VFS layer operations.  Write lock calls are
> > > > provided here but are used in subsequent patches by the file systems
> > > > themselves.
> > > > 
> > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > > > 
> > > > ---
> > > > Changes from V2
> > > > 
> > > > 	Rebase on linux-next-08-02-2020
> > > > 
> > > > 	Fix locking order
> > > > 	Change all references from mode to state where appropriate
> > > > 	add CONFIG_FS_DAX requirement for state change
> > > > 	Use a static branch to enable locking only when a dax capable
> > > > 		device has been seen.
> > > > 
> > > > 	Move the lock to a global vfs lock
> > > > 
> > > > 		this does a few things
> > > > 			1) preps us better for ext4 support
> > > > 			2) removes funky callbacks from inode ops
> > > > 			3) remove complexity from XFS and probably from
> > > > 			   ext4 later
> > > > 
> > > > 		We can do this because
> > > > 			1) the locking order is required to be at the
> > > > 			   highest level anyway, so why complicate xfs
> > > > 			2) We had to move the sem to the super_block
> > > > 			   because it is too heavy for the inode.
> > > > 			3) After internal discussions with Dan we
> > > > 			   decided that this would be easier, just as
> > > > 			   performant, and with slightly less overhead
> > > > 			   than in the VFS SB.
> > > > 
> > > > 		We also change the functions names to up/down;
> > > > 		read/write as appropriate.  Previous names were over
> > > > 		simplified.
> > > 
> > > This, IMO, is a bit of a train wreck.
> > > 
> > > This patch has nothing to do with "DAX state", it's about
> > > serialising access to the aops vector.
> > 
> > It is a bit more than just the aops vector.  It also has to protect against
> > checks of IS_DAX() which occur through many of the file operations.
> 
> I think you are looking at this incorrectly. The IS_DAX() construct
> is just determining what path to the aops method we take. regardless
> of whether IS_DAX() is true or not, we need to execute an aop
> method, and so aops vector protection is required regardless of how
> IS_DAX() evaluates.
> 
> But we do require IS_DAX() to evaluate consistently through an
> entire aops protected region, as there may be multiple aops method
> calls in a aops context (e.g. write page faults). Hence we have to
> ensure that IS_DAX() changes atomically w.r.t. to the aops vector
> switches. This is simply:
> 
> 	sb_aops_lock()
> 	inode->i_flags |= S_DAX
> 	sb_aops_switch(new_aops)
> 	sb_aops_unlock().
> 
> This guarantees that inside sb_aops_start/end(), IS_DAX() checks
> are stable because we change the state atomically with the aops
> vector.
> 
> We are *not* providing serialisation of inode->i_flags access or
> updates here; all we need to do is ensure that the S_DAX flag is
> consistent and stable across an aops access region.  If we are not
> in an aops call chain or will not call an aops method, we just don't
> care if the IS_DAX() call is racy as whatever we call is still
> static and if it's DAAX sensitive it can call IS_DAX() itself when
> needed.
> 
> Again, this isn't about DAX at all, it's about being able to switch
> aops vectors in a safe and reliable manner. The IS_DAX() constraints
> are really a minor addition on top of the "stable aops vector"
> regions we are laying down here.
> 
> 
> > > Big problems I see here:
> > > 
> > > 1. static key to turn it on globally.
> > >   - just a gross hack around doing it properly with a per-sb
> > >     mechanism and enbaling it only on filesystems that are on DAX
> > >     capable block devices.
> > 
> > Fair enough.  This was a reaction to Darricks desire to get this out of the way
> > when DAX was not in the system.  The static branch seemed like a good thing for
> > users who don't have DAX capable hardware while running kernels and FS's which
> > have DAX enabled.
> > 
> > http://lkml.iu.edu/hypermail/linux/kernel/2001.1/05691.html
> 
> I think that's largely premature optimisation, and it backs us into
> the "all or nothing" corner which is a bad place to be for what is
> per-filesystem functionality.

Oh, wow, uh... this was a total misunderstanding.  Having a per-inode
primitive to grant ourselves the ability to change fops/aops safely was
fine, I just wanted to have it compile out of existence if CONFIG_DAX=n
(or some other cleverish way if this mount will never support DAX (e.g.
scsi disk)).  I wasn't asking for it to move to the superblock or become
a Big Kernel Primitive.

--D

> 
> > >   - you're already passing in an inode to all these functions. It's
> > >     trivial to do:
> > > 
> > > 	if (!inode->i_sb->s_flags & S_DYNAMIC_AOPS)
> > > 		return
> > > 	/* do sb->s_aops_lock manipulation */
> > 
> > Yea that would be ok IMO.
> > 
> > Darrick would just having this be CONFIG_FS_DAX as per this patch be ok with
> > you.  I guess the static key may have been a bit of overkill?
> > 
> > > 
> > > 2. global lock
> > >   - OMG!
> > 
> > I know.  The thinking on this is that the locking is percpu which is near
> > 0 overhead in the read case and we are rarely going to take exclusive access.
> 
> The problem is that users can effectively run:
> 
> $ xfs_io -c "chattr -R -D -x" /path/to/dax_fs
> 
> And then it walks over millions of individual inodes turning off
> the DAX flag on each of them. And if each of those inodes takes a
> global per-cpu rwsem that blocks all read/write IO and page faults
> on everything even for a short time, then this is will have a major
> impact on the entire system and users will be very unhappy.
> 
> > >   - global lock will cause entire system IO/page fault stalls
> > >     when someone does recursive/bulk DAX flag modification
> > >     operations. Per-cpu rwsem Contention on  large systems will be
> > >     utterly awful.
> > 
> > Yea that is probably bad...  I certainly did not test the responsiveness of
> > this.  FWIW if the system only has 1 FS the per-SB lock is not going to be any
> > different from the global which was part of our thinking.
> 
> Don't forget that things like /proc, /sys, etc are all filesystems.
> Hence the global lock will affect accesses to -everything- in the
> system, not just the DAX enabled filesystem. Global locks are -bad-.
> 
> > >   - ext4's use case almost never hits the exclusive lock side of the
> > >     percpu-rwsem - only when changing the journal mode flag on the
> > >     inode. And it only affects writeback in progress, so it's not
> > >     going to have massive concurrency on it like a VFS level global
> > >     lock has. -> Bad model to follow.
> > 
> > I admit I did not research the ext4's journal mode locking that much.
> > 
> > >   - per-sb lock is trivial - see #1 - which limits scope to single
> > >     filesystem
> > 
> > I agree and...  well the commit message shows I actually implemented it that
> > way at first...  :-/
> > 
> > >   - per-inode rwsem would make this problem go away entirely.
> > 
> > But would that be ok for concurrent read performance which is going to be used
> > 99.99% of the time?  Maybe Darricks comments made me panic a little bit too
> > much overhead WRT locking and its potential impact on users not using DAX?
> 
> I know that a rwsem in shared mode can easily handle 2M lock
> cycles/s across a 32p machine without contention (concurrent AIO-DIO
> read/writes to a single file) so the baseline performance of a rwsem
> is likely good enough to start from.
> 
> It's simple, and we can run tests easily enough to find out where it
> starts to become a performance limitation. This whole "global percpu
> rwsem thing stinks like premature optimisation. Do the simple,
> straight forward thing first, then get numbers and analysis the
> limitations to determine what the second generation of the
> functionality needs to fix.
> 
> IMO, we don't have to solve every scalability problem with the
> initial implementation. Let's just make it work first, then worry
> about extreme scalability when we have some idea of where those
> scalability problems are.
> 
> > > 3. If we can use a global per-cpu rwsem, why can't we just use a
> > >    per-inode rwsem?
> > 
> > Per-cpu lock was attractive because of its near 0 overhead to take the read
> > lock which happens a lot during normal operations.
> > 
> > >   - locking context rules are the same
> > >   - rwsem scales pretty damn well for shared ops
> > 
> > Does it?  I'm not sure.
> 
> If you haven't tested it, then we are definitely in the realm of
> premature optimisation...
> 
> > 
> > >   - no "global" contention problems
> > >   - small enough that we can put another rwsem in the inode.
> > 
> > Everything else I agree with...  :-D
> > 
> > > 
> > > 4. "inode_dax_state_up_read"
> > >   - Eye bleeds.
> > >   - this is about the aops structure serialisation, not dax.
> > >   - The name makes no sense in the places that it has been added.
> > 
> > Again it is about more than just the aops.  IS_DAX() needs to be protected in
> > all the file operations calls as well or we can get races with the logic in
> > those calls and a state switch.
> > 
> > > 
> > > 5. We already have code that does almost exactly what we need: the
> > >    superblock freezing infrastructure.
> > 
> > I think freeze does a lot more than we need.
> > 
> > >   - freezing implements a "hold operations on this superblock until
> > >     further notice" model we can easily follow.
> > >   - sb_start_write/sb_end_write provides a simple API model and a
> > >     clean, clear and concise naming convention we can use, too.
> > 
> > Ok as a model...  If going with the per-SB lock.
> 
> Ok, you completely missed my point. You're still looking at this as
> a set of "locks" and serialisation.
> 
> Freezing is *not a lock* - it provides a series of "drain points"
> where we can transparently block new operations, then wait for all
> existing operations to complete so we can make a state change, and
> then once that is done we unblock all the waiters....
> 
> IOWs, the freeze model provides an ordered barrier mechanism, and
> that's precisely what we need for aops protection...
> 
> Yes, it may be implemented with locks internally, but that's
> implementation detail of the barrier mechanism, not an indication
> what functionality it is actually providing.
> 
> > After working through my
> > response I'm leaning toward a per-inode lock again.  This was the way I did
> > this to begin with.
> > 
> > I want feedback before reworking for V4, please.
> 
> IMO, always do the simple thing first.
> 
> Only do a more complex thing if the simple thing doesn't work or
> doesn't perform sufficiently well for an initial implemenation.
> Otherwise we end up with crazy solutions from premature optimisation
> that simply aren't viable.
> 
> > > Really, I'm struggling to understand how we got to "global locking
> > > that stops the world" from "need to change per-inode state
> > > atomically". Can someone please explain to me why this patch isn't
> > > just a simple set of largely self-explanitory functions like this:
> > > 
> > > XX_start_aop()
> > > XX_end_aop()
> > > 
> > > XX_lock_aops()
> > > XX_switch_aops(aops)
> > > XX_unlock_aops()
> > > 
> > > where "XX" is "sb" or "inode" depending on what locking granularity
> > > is used...
> > 
> > Because failing to protect the logic around IS_DAX() results in failures in the
> > read/write and direct IO paths.
> > 
> > So we need to lock the file operations as well.
> 
> Again, there is no "locking" here. We just need to annotate regions
> where the aops vector must remain stable. If a fop calls an aop that
> the aops vector does not change while it is the middle of executing
> the fop. Hence such fops calls will need to be inside
> XX_start_aop()/XX_end_aop() markers.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 05/12] fs: remove unneeded IS_DAX() check
  2020-02-11 20:41       ` Dave Chinner
@ 2020-02-12 16:04         ` Ira Weiny
  0 siblings, 0 replies; 52+ messages in thread
From: Ira Weiny @ 2020-02-12 16:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Jan Kara, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Wed, Feb 12, 2020 at 07:41:07AM +1100, Dave Chinner wrote:
> On Tue, Feb 11, 2020 at 08:38:31AM -0800, Ira Weiny wrote:
> > On Tue, Feb 11, 2020 at 04:34:01PM +1100, Dave Chinner wrote:
> > > On Sat, Feb 08, 2020 at 11:34:38AM -0800, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > 
> > > > The IS_DAX() check in io_is_direct() causes a race between changing the
> > > > DAX state and creating the iocb flags.
> > > > 
> > > > Remove the check because DAX now emulates the page cache API and
> > > > therefore it does not matter if the file state is DAX or not when the
> > > > iocb flags are created.
> > > 
> > > This statement is ... weird.
> > > 
> > > DAX doesn't "emulate" the page cache API at all
> > 
> > ah...  yea emulate is a bad word here.
> > 
> > > - it has it's own
> > > read/write methods that filesystems call based on the iomap
> > > infrastructure (dax_iomap_rw()). i.e. there are 3 different IO paths
> > > through the filesystems: the DAX IO path, the direct IO path, and
> > > the buffered IO path.
> > > 
> > > Indeed, it seems like this works a bit by luck: Ext4 and XFS always
> > > check IS_DAX(inode) in the read/write_iter methods before checking
> > > for IOCB_DIRECT, and hence the IOCB_DIRECT flag is ignored by the
> > > filesystems. i.e. when we got rid of the O_DIRECT paths from DAX, we
> > > forgot to clean up io_is_direct() and it's only due to the ordering
> > > of checks that we went down the DAX path correctly....
> > > 
> > > That said, the code change is good, but the commit message needs a
> > > rewrite.
> > 
> > How about?
> > 
> > <commit msg>
> >   fs: Remove unneeded IS_DAX() check
> >   
> >   The IS_DAX() check in io_is_direct() causes a race between changing the
> >   DAX state and creating the iocb flags.
> 
> This is irrelevant - the check is simply wrong and has been since
> ~2016 when we moved DAX to use the iomap infrastructure...

Deleted.
Ira

> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 06/12] fs/xfs: Check if the inode supports DAX under lock
  2020-02-11 20:42       ` Dave Chinner
@ 2020-02-12 16:10         ` Ira Weiny
  0 siblings, 0 replies; 52+ messages in thread
From: Ira Weiny @ 2020-02-12 16:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Wed, Feb 12, 2020 at 07:42:20AM +1100, Dave Chinner wrote:
> On Tue, Feb 11, 2020 at 09:55:09AM -0800, Ira Weiny wrote:
> > On Tue, Feb 11, 2020 at 05:16:39PM +1100, Dave Chinner wrote:
> > > On Sat, Feb 08, 2020 at 11:34:39AM -0800, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > 

[snip]

> > > 
> > > This raciness in checking the DAX flags is the reason that
> > > xfs_ioctl_setattr_xflags() redoes all the reflink vs dax checks once
> > > it's called under the XFS_ILOCK_EXCL during the actual change
> > > transaction....
> > 
> > Ok I found this by trying to make sure that the xfs_inode_supports_dax() call
> > was always returning valid data.  So I don't have a specific test which was
> > failing.
> > 
> > Looking at the code again, it sounds like I was wrong about which locks protect
> > what and with your explanation above it looks like there is nothing to be done
> > here and I can drop the patch.
> > 
> > Would you agree?
> 
> *nod*

Thanks! done.
Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-11 20:17   ` Ira Weiny
@ 2020-02-12 19:49     ` Jeff Moyer
  2020-02-13 19:01       ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Jeff Moyer @ 2020-02-12 19:49 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

Ira Weiny <ira.weiny@intel.com> writes:

> On Mon, Feb 10, 2020 at 10:15:47AM -0500, Jeff Moyer wrote:
>> Hi, Ira,
>> 
>> Could you please include documentation patches as part of this series?
>
> I do have an update to the vfs.rst doc in
>
> 	fs: Add locking for a dynamic DAX state
>
> I'm happy to do more but was there something specific you would like to see?
> Or documentation in xfs perhaps?

Sorry, I was referring to your statx man page addition.  It would be
nice if we could find a home for the information in your cover letter,
too.  Right now, I'm not sure how application developers are supposed to
figure out how to use the per-inode settings.

If I read your cover letter correctly, the mount option overrides any
on-disk setting.  Is that right?  Given that we document the dax mount
option as "the way to get dax," it may be a good idea to allow for a
user to selectively disable dax, even when -o dax is specified.  Is that
possible?

-Jeff


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-12 19:49     ` Jeff Moyer
@ 2020-02-13 19:01       ` Ira Weiny
  2020-02-13 19:05         ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-13 19:01 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Wed, Feb 12, 2020 at 02:49:48PM -0500, Jeff Moyer wrote:
> Ira Weiny <ira.weiny@intel.com> writes:
> 
> > On Mon, Feb 10, 2020 at 10:15:47AM -0500, Jeff Moyer wrote:
> >> Hi, Ira,
> >> 
> >> Could you please include documentation patches as part of this series?
> >
> > I do have an update to the vfs.rst doc in
> >
> > 	fs: Add locking for a dynamic DAX state
> >
> > I'm happy to do more but was there something specific you would like to see?
> > Or documentation in xfs perhaps?
> 
> Sorry, I was referring to your statx man page addition.

Ah yea I guess I could include that as a patch.  I just wanted to get buy off
on the whole thing prior to setting documentation in.

> It would be
> nice if we could find a home for the information in your cover letter,
> too.  Right now, I'm not sure how application developers are supposed to
> figure out how to use the per-inode settings.

I'm not sure either.  But this is probably a good start:

https://www.kernel.org/doc/Documentation/filesystems/dax.txt

Something under the Usage section like:

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 679729442fd2..1bab5d5d775b 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -20,8 +20,18 @@ Usage
 If you have a block device which supports DAX, you can make a filesystem
 on it as usual.  The DAX code currently only supports files with a block
 size equal to your kernel's PAGE_SIZE, so you may need to specify a block
-size when creating the filesystem.  When mounting it, use the "-o dax"
-option on the command line or add 'dax' to the options in /etc/fstab.
+size when creating the filesystem.
+
+Files can then be enabled to use dax using the statx system call or an
+application using it like 'xfs_io'.  Directories can also be enabled for dax
+to have the file system automatically enable dax on all files within those
+directories.
+
+Alternately, when mounting it one can use the "-o dax" option on the command
+line or add 'dax' to the options in /etc/fstab to globaly override all files to
+use dax on that filesystem.  Using the "-o dax" does not change the state of
+individual files so remounting without "-o dax" will revert them to the state
+saved in the filesystem meta data.
 
 
 Implementation Tips for Block Driver Writers

> 
> If I read your cover letter correctly, the mount option overrides any
> on-disk setting.  Is that right?

Yes

> Given that we document the dax mount
> option as "the way to get dax," it may be a good idea to allow for a
> user to selectively disable dax, even when -o dax is specified.  Is that
> possible?

Not with this patch set.  And I'm not sure how that would work.  The idea was
that -o dax was simply an override for users who were used to having their
entire FS be dax.  We wanted to depreciate the use of "-o dax" in general.  The
individual settings are saved so I don't think it makes sense to ignore the -o
dax in favor of those settings.  Basically that would IMO make the -o dax
useless.

Ira


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-13 19:01       ` Ira Weiny
@ 2020-02-13 19:05         ` Ira Weiny
  2020-02-13 19:58           ` Darrick J. Wong
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-13 19:05 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Thu, Feb 13, 2020 at 11:01:57AM -0800, 'Ira Weiny' wrote:
> On Wed, Feb 12, 2020 at 02:49:48PM -0500, Jeff Moyer wrote:
> > Ira Weiny <ira.weiny@intel.com> writes:
> > 
 
[snip]

> > Given that we document the dax mount
> > option as "the way to get dax," it may be a good idea to allow for a
> > user to selectively disable dax, even when -o dax is specified.  Is that
> > possible?
> 
> Not with this patch set.  And I'm not sure how that would work.  The idea was
> that -o dax was simply an override for users who were used to having their
> entire FS be dax.  We wanted to depreciate the use of "-o dax" in general.  The
> individual settings are saved so I don't think it makes sense to ignore the -o
> dax in favor of those settings.  Basically that would IMO make the -o dax
> useless.

Oh and I forgot to mention that setting 'dax' on the root of the FS basically
provides '-o dax' functionality by default with the ability to "turn it off"
for files.

Ira

> 
> Ira
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-13 19:05         ` Ira Weiny
@ 2020-02-13 19:58           ` Darrick J. Wong
  2020-02-13 23:29             ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Darrick J. Wong @ 2020-02-13 19:58 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jeff Moyer, linux-kernel, Alexander Viro, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Thu, Feb 13, 2020 at 11:05:13AM -0800, Ira Weiny wrote:
> On Thu, Feb 13, 2020 at 11:01:57AM -0800, 'Ira Weiny' wrote:
> > On Wed, Feb 12, 2020 at 02:49:48PM -0500, Jeff Moyer wrote:
> > > Ira Weiny <ira.weiny@intel.com> writes:
> > > 
>  
> [snip]
> 
> > > Given that we document the dax mount
> > > option as "the way to get dax," it may be a good idea to allow for a
> > > user to selectively disable dax, even when -o dax is specified.  Is that
> > > possible?
> > 
> > Not with this patch set.  And I'm not sure how that would work.  The idea was
> > that -o dax was simply an override for users who were used to having their
> > entire FS be dax.  We wanted to depreciate the use of "-o dax" in general.  The
> > individual settings are saved so I don't think it makes sense to ignore the -o
> > dax in favor of those settings.  Basically that would IMO make the -o dax
> > useless.
> 
> Oh and I forgot to mention that setting 'dax' on the root of the FS basically
> provides '-o dax' functionality by default with the ability to "turn it off"
> for files.

Please don't further confuse FS_XFLAG_DAX and S_DAX.  They are two
separate flags with two separate behaviors:

FS_XFLAG_DAX is a filesystem inode metadata flag.

Setting FS_XFLAG_DAX on a directory causes all files and directories
created within that directory to inherit FS_XFLAG_DAX.

Mounting with -o dax causes all files and directories created to have
FS_XFLAG_DAX set regardless of the parent's status.

The FS_XFLAG_DAX can be get and set via the fs[g]etxattr ioctl.

-------

S_DAX is the flag that controls the IO path in the kernel for a given
inode.

Loading a file inode into the kernel (via _iget) with FS_XFLAG_DAX set
or creating a file inode that inherits FS_XFLAG_DAX causes the incore
inode to have the S_DAX flag set if the storage device supports it.

Files with S_DAX set use the dax IO paths through the kernel.

The S_DAX flag can be queried via statx.

--D

> Ira
> 
> > 
> > Ira
> > 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-13 19:58           ` Darrick J. Wong
@ 2020-02-13 23:29             ` Ira Weiny
  2020-02-14  0:16               ` Dan Williams
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-13 23:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jeff Moyer, linux-kernel, Alexander Viro, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Thu, Feb 13, 2020 at 11:58:39AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 13, 2020 at 11:05:13AM -0800, Ira Weiny wrote:
> > On Thu, Feb 13, 2020 at 11:01:57AM -0800, 'Ira Weiny' wrote:
> > > On Wed, Feb 12, 2020 at 02:49:48PM -0500, Jeff Moyer wrote:
> > > > Ira Weiny <ira.weiny@intel.com> writes:
> > > > 
> >  
> > [snip]
> > 
> > > > Given that we document the dax mount
> > > > option as "the way to get dax," it may be a good idea to allow for a
> > > > user to selectively disable dax, even when -o dax is specified.  Is that
> > > > possible?
> > > 
> > > Not with this patch set.  And I'm not sure how that would work.  The idea was
> > > that -o dax was simply an override for users who were used to having their
> > > entire FS be dax.  We wanted to depreciate the use of "-o dax" in general.  The
> > > individual settings are saved so I don't think it makes sense to ignore the -o
> > > dax in favor of those settings.  Basically that would IMO make the -o dax
> > > useless.
> > 
> > Oh and I forgot to mention that setting 'dax' on the root of the FS basically
> > provides '-o dax' functionality by default with the ability to "turn it off"
> > for files.
> 
> Please don't further confuse FS_XFLAG_DAX and S_DAX.

Yes...  the above text is wrong WRT statx.  But setting the physical
XFS_DIFLAG2_DAX flag on the root directory will by default cause all files and
directories created there to be XFS_DIFLAG2_DAX and so forth on down the tree
unless explicitly changed.  This will be the same as mounting with '-o dax' but
with the ability to turn off dax for individual files.  Which I think is the
functionality Jeff is wanting.

>
> They are two
> separate flags with two separate behaviors:
> 
> FS_XFLAG_DAX is a filesystem inode metadata flag.
> 
> Setting FS_XFLAG_DAX on a directory causes all files and directories
> created within that directory to inherit FS_XFLAG_DAX.
> 
> Mounting with -o dax causes all files and directories created to have
> FS_XFLAG_DAX set regardless of the parent's status.

I don't believe this is true, either before _or_ after this patch set.

'-o dax' only causes XFS_MOUNT_DAX to be set which then cause S_DAX to be set.
It does not affect FS_XFLAG_DAX.  This is important because we don't want '-o
dax' to suddenly convert all files to DAX if '-o dax' is not used.

> 
> The FS_XFLAG_DAX can be get and set via the fs[g]etxattr ioctl.

Right statx was the wrong tool...

fs[g|s]etattr via the xfs_io -c 'chatttr|lsattr' is the correct tool.

> 
> -------
> 
> S_DAX is the flag that controls the IO path in the kernel for a given
> inode.
> 
> Loading a file inode into the kernel (via _iget) with FS_XFLAG_DAX set
> or creating a file inode that inherits FS_XFLAG_DAX causes the incore
> inode to have the S_DAX flag set if the storage device supports it.

Yes after reworking "Clean up DAX support check" I believe I've got it correct
now.  Soon to be in V4.

> 
> Files with S_DAX set use the dax IO paths through the kernel.
> 
> The S_DAX flag can be queried via statx.

Yes as a verification that the file is at that moment operating as dax.  It
will not return true for a directory ever.  My bad for saying that.  Sorry I
got my tools flags mixed up...

Ira


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-13 23:29             ` Ira Weiny
@ 2020-02-14  0:16               ` Dan Williams
  2020-02-14 20:06                 ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Dan Williams @ 2020-02-14  0:16 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Darrick J. Wong, Jeff Moyer, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

On Thu, Feb 13, 2020 at 3:29 PM Ira Weiny <ira.weiny@intel.com> wrote:
>
> On Thu, Feb 13, 2020 at 11:58:39AM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 13, 2020 at 11:05:13AM -0800, Ira Weiny wrote:
> > > On Thu, Feb 13, 2020 at 11:01:57AM -0800, 'Ira Weiny' wrote:
> > > > On Wed, Feb 12, 2020 at 02:49:48PM -0500, Jeff Moyer wrote:
> > > > > Ira Weiny <ira.weiny@intel.com> writes:
> > > > >
> > >
> > > [snip]
> > >
> > > > > Given that we document the dax mount
> > > > > option as "the way to get dax," it may be a good idea to allow for a
> > > > > user to selectively disable dax, even when -o dax is specified.  Is that
> > > > > possible?
> > > >
> > > > Not with this patch set.  And I'm not sure how that would work.  The idea was
> > > > that -o dax was simply an override for users who were used to having their
> > > > entire FS be dax.  We wanted to depreciate the use of "-o dax" in general.  The
> > > > individual settings are saved so I don't think it makes sense to ignore the -o
> > > > dax in favor of those settings.  Basically that would IMO make the -o dax
> > > > useless.
> > >
> > > Oh and I forgot to mention that setting 'dax' on the root of the FS basically
> > > provides '-o dax' functionality by default with the ability to "turn it off"
> > > for files.
> >
> > Please don't further confuse FS_XFLAG_DAX and S_DAX.
>
> Yes...  the above text is wrong WRT statx.  But setting the physical
> XFS_DIFLAG2_DAX flag on the root directory will by default cause all files and
> directories created there to be XFS_DIFLAG2_DAX and so forth on down the tree
> unless explicitly changed.  This will be the same as mounting with '-o dax' but
> with the ability to turn off dax for individual files.  Which I think is the
> functionality Jeff is wanting.

To be clear you mean turn off XFS_DIFLAG2_DAX, not mask S_DAX when you
say "turn off dax", right?

The mount option simply forces "S_DAX" on all regular files as long as
the underlying device (or soon to be superblock for virtiofs) supports
it. There is no method to mask S_DAX when the filesystem was mounted
with -o dax. Otherwise we would seem to need yet another physical flag
to "always disable" dax.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-14  0:16               ` Dan Williams
@ 2020-02-14 20:06                 ` Ira Weiny
  2020-02-14 21:23                   ` Jeff Moyer
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-14 20:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Darrick J. Wong, Jeff Moyer, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

On Thu, Feb 13, 2020 at 04:16:17PM -0800, Dan Williams wrote:
> On Thu, Feb 13, 2020 at 3:29 PM Ira Weiny <ira.weiny@intel.com> wrote:
> >
> > On Thu, Feb 13, 2020 at 11:58:39AM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 13, 2020 at 11:05:13AM -0800, Ira Weiny wrote:
> > > > On Thu, Feb 13, 2020 at 11:01:57AM -0800, 'Ira Weiny' wrote:
> > > > > On Wed, Feb 12, 2020 at 02:49:48PM -0500, Jeff Moyer wrote:
> > > > > > Ira Weiny <ira.weiny@intel.com> writes:
> > > > > >
> > > >
> > > > [snip]
> > > >
> > > > > > Given that we document the dax mount
> > > > > > option as "the way to get dax," it may be a good idea to allow for a
> > > > > > user to selectively disable dax, even when -o dax is specified.  Is that
> > > > > > possible?
> > > > >
> > > > > Not with this patch set.  And I'm not sure how that would work.  The idea was
> > > > > that -o dax was simply an override for users who were used to having their
> > > > > entire FS be dax.  We wanted to depreciate the use of "-o dax" in general.  The
> > > > > individual settings are saved so I don't think it makes sense to ignore the -o
> > > > > dax in favor of those settings.  Basically that would IMO make the -o dax
> > > > > useless.
> > > >
> > > > Oh and I forgot to mention that setting 'dax' on the root of the FS basically
> > > > provides '-o dax' functionality by default with the ability to "turn it off"
> > > > for files.
> > >
> > > Please don't further confuse FS_XFLAG_DAX and S_DAX.
> >
> > Yes...  the above text is wrong WRT statx.  But setting the physical
> > XFS_DIFLAG2_DAX flag on the root directory will by default cause all files and
> > directories created there to be XFS_DIFLAG2_DAX and so forth on down the tree
> > unless explicitly changed.  This will be the same as mounting with '-o dax' but
> > with the ability to turn off dax for individual files.  Which I think is the
> > functionality Jeff is wanting.
> 
> To be clear you mean turn off XFS_DIFLAG2_DAX, not mask S_DAX when you
> say "turn off dax", right?

Yes.

[disclaimer: the following assumes the underlying 'device' (superblock)
supports DAX]

... which results in S_DAX == false when the file is opened without the mount
option.  The key would be that all directories/files created under a root with
XFS_DIFLAG2_DAX == true would inherit their flag and be XFS_DIFLAG2_DAX == true
all the way down the tree.  Any file not wanting DAX would need to set
XFS_DIFLAG2_DAX == false.  And setting false could be used on a directory to
allow a user or group to not use dax on files in that sub-tree.

Then without '-o dax' (XFS_MOUNT_DAX == false) all files when opened set S_DAX
equal to XFS_DIFLAG2_DAX value.  (Directories, as of V4, never get S_DAX set.)

If '-o dax' (XFS_MOUNT_DAX == true) then S_DAX is set on all files.


[IF the underlying 'device' (superblock) does _not_ support DAX]

... S_DAX is _never_ set but the underlying XFS_DIFLAG2_DAX flags can be
toggled and will be inherited as above.  Because S_DAX is never set access to
that file will be restricted to "not dax"...[1]

I could go into that level of detail in the doc if needed?  I feel like we need
a more general name for XFS_DIFLAG2_DAX if I do.[2]

> 
> The mount option simply forces "S_DAX" on all regular files as long as
> the underlying device (or soon to be superblock for virtiofs) supports
> it. There is no method to mask S_DAX when the filesystem was mounted
> with -o dax. Otherwise we would seem to need yet another physical flag
> to "always disable" dax.

Exactly.  I don't think we want to support that.  From this thread alone it
seems we have enough complexity and that would be another layer...

;-)

Ira

[1] I'm beginning to think that if I type dax one more time I'm going to go
crazy...  :-P

[2] I have patches in the wings to introduce EXT4_DAX_FL as an ext4 on disk bit
which would be equivalent to XFS_DIFLAG2_DAX.  If anyone wants a better name
let me know.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-14 20:06                 ` Ira Weiny
@ 2020-02-14 21:23                   ` Jeff Moyer
  2020-02-14 21:58                     ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Jeff Moyer @ 2020-02-14 21:23 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Darrick J. Wong, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

Ira Weiny <ira.weiny@intel.com> writes:

> [disclaimer: the following assumes the underlying 'device' (superblock)
> supports DAX]
>
> ... which results in S_DAX == false when the file is opened without the mount
> option.  The key would be that all directories/files created under a root with
> XFS_DIFLAG2_DAX == true would inherit their flag and be XFS_DIFLAG2_DAX == true
> all the way down the tree.  Any file not wanting DAX would need to set
> XFS_DIFLAG2_DAX == false.  And setting false could be used on a directory to
> allow a user or group to not use dax on files in that sub-tree.
>
> Then without '-o dax' (XFS_MOUNT_DAX == false) all files when opened set S_DAX
> equal to XFS_DIFLAG2_DAX value.  (Directories, as of V4, never get S_DAX set.)
>
> If '-o dax' (XFS_MOUNT_DAX == true) then S_DAX is set on all files.

One more clarifying question.  Let's say I set XFS_DIFLAG2_DAX on an
inode.  I then open the file, and perform mmap/load/store/etc.  I close
the file, and I unset XFS_DIFLAG2_DAX.  Will the next open treat the
file as S_DAX or not?  My guess is the inode won't be evicted, and so
S_DAX will remain set.

The reason I ask is I've had requests from application developers to do
just this.  They want to be able to switch back and forth between dax
modes.

Thanks,
Jeff

> [1] I'm beginning to think that if I type dax one more time I'm going to go
> crazy...  :-P

dax dax dax!


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-14 21:23                   ` Jeff Moyer
@ 2020-02-14 21:58                     ` Ira Weiny
  2020-02-14 22:06                       ` Jeff Moyer
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-14 21:58 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dan Williams, Darrick J. Wong, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

On Fri, Feb 14, 2020 at 04:23:19PM -0500, Jeff Moyer wrote:
> Ira Weiny <ira.weiny@intel.com> writes:
> 
> > [disclaimer: the following assumes the underlying 'device' (superblock)
> > supports DAX]
> >
> > ... which results in S_DAX == false when the file is opened without the mount
> > option.  The key would be that all directories/files created under a root with
> > XFS_DIFLAG2_DAX == true would inherit their flag and be XFS_DIFLAG2_DAX == true
> > all the way down the tree.  Any file not wanting DAX would need to set
> > XFS_DIFLAG2_DAX == false.  And setting false could be used on a directory to
> > allow a user or group to not use dax on files in that sub-tree.
> >
> > Then without '-o dax' (XFS_MOUNT_DAX == false) all files when opened set S_DAX
> > equal to XFS_DIFLAG2_DAX value.  (Directories, as of V4, never get S_DAX set.)
> >
> > If '-o dax' (XFS_MOUNT_DAX == true) then S_DAX is set on all files.
> 
> One more clarifying question.  Let's say I set XFS_DIFLAG2_DAX on an
> inode.  I then open the file, and perform mmap/load/store/etc.  I close
> the file, and I unset XFS_DIFLAG2_DAX.  Will the next open treat the
> file as S_DAX or not?  My guess is the inode won't be evicted, and so
> S_DAX will remain set.

The inode will not be evicted, or even it happens to be xfs_io will reload it
to unset the XFS_DIFLAG2_DAX flag.  And the S_DAX flag changes _with_ the
XFS_DIFLAG2_DAX change when it can (when the underlying storage supports
S_DAX).

Trying to change XFS_DIFLAG2_DAX while the file is mmap'ed returns -EBUSY.

Ira

> 
> The reason I ask is I've had requests from application developers to do
> just this.  They want to be able to switch back and forth between dax
> modes.
> 
> Thanks,
> Jeff
> 
> > [1] I'm beginning to think that if I type dax one more time I'm going to go
> > crazy...  :-P
> 
> dax dax dax!
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-14 21:58                     ` Ira Weiny
@ 2020-02-14 22:06                       ` Jeff Moyer
  2020-02-14 22:58                         ` Jeff Moyer
  0 siblings, 1 reply; 52+ messages in thread
From: Jeff Moyer @ 2020-02-14 22:06 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Darrick J. Wong, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

Ira Weiny <ira.weiny@intel.com> writes:

> On Fri, Feb 14, 2020 at 04:23:19PM -0500, Jeff Moyer wrote:
>> Ira Weiny <ira.weiny@intel.com> writes:
>> 
>> > [disclaimer: the following assumes the underlying 'device' (superblock)
>> > supports DAX]
>> >
>> > ... which results in S_DAX == false when the file is opened without the mount
>> > option.  The key would be that all directories/files created under a root with
>> > XFS_DIFLAG2_DAX == true would inherit their flag and be XFS_DIFLAG2_DAX == true
>> > all the way down the tree.  Any file not wanting DAX would need to set
>> > XFS_DIFLAG2_DAX == false.  And setting false could be used on a directory to
>> > allow a user or group to not use dax on files in that sub-tree.
>> >
>> > Then without '-o dax' (XFS_MOUNT_DAX == false) all files when opened set S_DAX
>> > equal to XFS_DIFLAG2_DAX value.  (Directories, as of V4, never get S_DAX set.)
>> >
>> > If '-o dax' (XFS_MOUNT_DAX == true) then S_DAX is set on all files.
>> 
>> One more clarifying question.  Let's say I set XFS_DIFLAG2_DAX on an
>> inode.  I then open the file, and perform mmap/load/store/etc.  I close
>> the file, and I unset XFS_DIFLAG2_DAX.  Will the next open treat the
>> file as S_DAX or not?  My guess is the inode won't be evicted, and so
>> S_DAX will remain set.
>
> The inode will not be evicted, or even it happens to be xfs_io will reload it
> to unset the XFS_DIFLAG2_DAX flag.  And the S_DAX flag changes _with_ the
> XFS_DIFLAG2_DAX change when it can (when the underlying storage supports
> S_DAX).

OK, so it will be possible to change the effective mode.

I'll try to get some testing in on this series, now.

Thanks!
Jeff


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-14 22:06                       ` Jeff Moyer
@ 2020-02-14 22:58                         ` Jeff Moyer
  2020-02-14 23:03                           ` Jeff Moyer
  2020-02-18  2:35                           ` Ira Weiny
  0 siblings, 2 replies; 52+ messages in thread
From: Jeff Moyer @ 2020-02-14 22:58 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Darrick J. Wong, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

Hi, Ira,

Jeff Moyer <jmoyer@redhat.com> writes:

> I'll try to get some testing in on this series, now.

This series panics in xfstests generic/013, when run like so:

MKFS_OPTIONS="-m reflink=0" MOUNT_OPTIONS="-o dax" ./check -g auto

I'd dig in further, but it's late on a Friday.  You understand.  :)

Cheers,
Jeff


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-14 22:58                         ` Jeff Moyer
@ 2020-02-14 23:03                           ` Jeff Moyer
  2020-02-18  2:35                           ` Ira Weiny
  1 sibling, 0 replies; 52+ messages in thread
From: Jeff Moyer @ 2020-02-14 23:03 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Darrick J. Wong, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

Jeff Moyer <jmoyer@redhat.com> writes:

> Hi, Ira,
>
> Jeff Moyer <jmoyer@redhat.com> writes:
>
>> I'll try to get some testing in on this series, now.
>
> This series panics in xfstests generic/013, when run like so:
>
> MKFS_OPTIONS="-m reflink=0" MOUNT_OPTIONS="-o dax" ./check -g auto
>
> I'd dig in further, but it's late on a Friday.  You understand.  :)

Sorry, I should have at least given you a clue.  Below is the stack
trace.  We're going down the buffered I/O path, even though the fs is
mounted with -o dax.  Somewhere the inode isn't getting marked properly.

-Jeff

[  549.461099] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  549.468053] #PF: supervisor instruction fetch in kernel mode
[  549.473713] #PF: error_code(0x0010) - not-present page
[  549.478851] PGD 17c7e06067 P4D 17c7e06067 PUD 17c7e01067 PMD 0 
[  549.484773] Oops: 0010 [#1] SMP NOPTI
[  549.488438] CPU: 68 PID: 19851 Comm: fsstress Not tainted 5.6.0-rc1+ #42
[  549.495134] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.0D.01.0395.022720191340 02/27/2019
[  549.505562] RIP: 0010:0x0
[  549.508186] Code: Bad RIP value.
[  549.511418] RSP: 0018:ffffab132dc9fa98 EFLAGS: 00010246
[  549.516642] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
[  549.523768] RDX: 0000000000000000 RSI: ffffdcf75e7060c0 RDI: ffff8c3805d22300
[  549.530900] RBP: ffffab132dc9fb08 R08: 0000000000000000 R09: 00002308a18f9f3f
[  549.538030] R10: 0000000000000000 R11: ffffdcf75f4b19c0 R12: ffff8c37cfe6d2b8
[  549.545155] R13: ffffab132dc9fb60 R14: ffffdcf75e7060c8 R15: ffffdcf75e7060c0
[  549.552288] FS:  00007f849d20cb80(0000) GS:ffff8c3821100000(0000) knlGS:0000000000000000
[  549.560373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  549.566117] CR2: ffffffffffffffd6 CR3: 00000017d3088005 CR4: 00000000007606e0
[  549.573250] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  549.580383] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  549.587515] PKRU: 55555554
[  549.590228] Call Trace:
[  549.592683]  read_pages+0x120/0x190
[  549.596173]  __do_page_cache_readahead+0x1c1/0x1e0
[  549.600965]  ondemand_readahead+0x182/0x2f0
[  549.605152]  generic_file_buffered_read+0x5a6/0xaf0
[  549.610032]  ? security_inode_permission+0x30/0x50
[  549.614824]  ? _cond_resched+0x15/0x30
[  549.618620]  xfs_file_buffered_aio_read+0x47/0xe0 [xfs]
[  549.623861]  xfs_file_read_iter+0x6e/0xd0 [xfs]
[  549.628394]  generic_file_splice_read+0x100/0x220
[  549.633099]  splice_direct_to_actor+0xd5/0x220
[  549.637543]  ? pipe_to_sendpage+0xa0/0xa0
[  549.641557]  do_splice_direct+0x9a/0xd0
[  549.645396]  vfs_copy_file_range+0x153/0x320
[  549.649667]  __x64_sys_copy_file_range+0xdd/0x200
[  549.654375]  do_syscall_64+0x55/0x1d0
[  549.658039]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  549.663091] RIP: 0033:0x7f849c7086bd
[  549.666671] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 9b 67 2c 00 f7 d8 64 89 01 48
[  549.685414] RSP: 002b:00007fff81b78678 EFLAGS: 00000246 ORIG_RAX: 0000000000000146
[  549.692980] RAX: ffffffffffffffda RBX: 00000000000000c9 RCX: 00007f849c7086bd
[  549.700112] RDX: 0000000000000004 RSI: 00007fff81b786b0 RDI: 0000000000000003
[  549.707242] RBP: 00000000005d92ef R08: 0000000000006cb4 R09: 0000000000000000
[  549.714375] R10: 00007fff81b786b8 R11: 0000000000000246 R12: 0000000000000003
[  549.721508] R13: 0000000000006cb4 R14: 0000000000037f58 R15: 0000000000365871
[  549.728641] Modules linked in: xt_CHECKSUM nft_chain_nat xt_MASQUERADE nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 nft_counter nft_compat nf_tables nfnetlink tun bridge stp llc rfkill sunrpc vfat fat intel_rapl_msr intel_rapl_common skx_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt iTCO_vendor_support kvm irqbypass crct10dif_pclmul ipmi_ssif crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mei_me ipmi_si joydev intel_rapl_perf ioatdma pcspkr ipmi_devintf sg i2c_i801 lpc_ich mei dca ipmi_msghandler dax_pmem dax_pmem_core acpi_power_meter acpi_pad xfs libcrc32c nd_pmem nd_btt sd_mod sr_mod cdrom ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec crc32c_intel i40e drm nvme ahci nvme_core libahci t10_pi libata wmi nfit libnvdimm
[  549.805384] CR2: 0000000000000000
[  549.808744] ---[ end trace 62568a4ecc43ee90 ]---


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-14 22:58                         ` Jeff Moyer
  2020-02-14 23:03                           ` Jeff Moyer
@ 2020-02-18  2:35                           ` Ira Weiny
  2020-02-18 14:22                             ` Jeff Moyer
  1 sibling, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-18  2:35 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dan Williams, Darrick J. Wong, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

On Fri, Feb 14, 2020 at 05:58:10PM -0500, Jeff Moyer wrote:
> Hi, Ira,
> 
> Jeff Moyer <jmoyer@redhat.com> writes:
> 
> > I'll try to get some testing in on this series, now.
> 
> This series panics in xfstests generic/013, when run like so:
> 
> MKFS_OPTIONS="-m reflink=0" MOUNT_OPTIONS="-o dax" ./check -g auto
> 
> I'd dig in further, but it's late on a Friday.  You understand.  :)

Yep...  and a long weekend if you are in the US...  I ran the test with V4 and
got the panic below.

Is this similar to what you see?  If so I'll work on it in V4.  FWIW with '-o
dax' specified I don't see how fsstress is causing an issue with my patch set.
Does fsstress attempt to change dax states?  I don't see that in the test but
I'm not real familiar with generic/013 and fsstress.

If my disassembly of read_pages is correct it looks like readpage is null which
makes sense because all files should be IS_DAX() == true due to the mount option...

But tracing code indicates that the patch:

	fs: remove unneeded IS_DAX() check

... may be the culprit and the following fix may work...

diff --git a/mm/filemap.c b/mm/filemap.c
index 3a7863ba51b9..7eaf74a2a39b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
        if (!count)
                goto out; /* skip atime */
 
-       if (iocb->ki_flags & IOCB_DIRECT) {
+       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
                struct file *file = iocb->ki_filp;
                struct address_space *mapping = file->f_mapping;
                struct inode *inode = mapping->host;

And of course now my server is not responding so I can't reboot to test it...
:-(

I'll continue tomorrow after I go press the power button on the machine since
IPMI has decided not to work...  :-(

Ira

[ 1204.461801] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1204.472375] #PF: supervisor instruction fetch in kernel mode
[ 1204.481440] #PF: error_code(0x0010) - not-present page
[ 1204.489920] PGD 80000003c273d067 P4D 80000003c273d067 PUD 36a73b067 PMD 0 
[ 1204.500396] Oops: 0010 [#1] SMP KASAN PTI
[ 1204.507617] CPU: 6 PID: 15714 Comm: fsstress Not tainted 5.5.0-next-20200207+ #1
[ 1204.518632] Hardware name: Intel Corporation SandyBridge Platform/00, BIOS SE5C600.86B.02.04.0003.102320141138 10/23/2014
[ 1204.533715] RIP: 0010:0x0
[ 1204.539444] Code: Bad RIP value.
[ 1204.545813] RSP: 0018:ffff88837dedf528 EFLAGS: 00010246
[ 1204.554454] RAX: 0000000000000000 RBX: ffffea000cb6ae08 RCX: ffffffff813765fc
[ 1204.565223] RDX: dffffc0000000000 RSI: ffffea000cb6ae00 RDI: ffff8887b032a800
[ 1204.575943] RBP: ffff88837dedf618 R08: fffff9400196d5c1 R09: fffff9400196d5c1
[ 1204.586657] R10: fffff9400196d5c0 R11: ffffea000cb6ae07 R12: ffffea000cb6ae00
[ 1204.597362] R13: ffffffffa0ac3da0 R14: 0000000000000000 R15: ffff888342a14040
[ 1204.608061] FS:  00007fc47c0a8b80(0000) GS:ffff8883c7100000(0000) knlGS:0000000000000000
[ 1204.619869] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1204.629010] CR2: ffffffffffffffd6 CR3: 0000000325892002 CR4: 00000000000606e0
[ 1204.639695] Call Trace:
[ 1204.645128]  read_pages+0x23d/0x2f0
[ 1204.651691]  ? read_cache_pages+0x2b0/0x2b0
[ 1204.659030]  ? policy_node+0x56/0x60
[ 1204.665675]  __do_page_cache_readahead+0x28b/0x2b0
[ 1204.673641]  ? read_pages+0x2f0/0x2f0
[ 1204.680286]  ondemand_readahead+0x2bf/0x5d0
[ 1204.687561]  generic_file_buffered_read+0x992/0x1170
[ 1204.695703]  ? read_cache_page_gfp+0x20/0x20
[ 1204.703052]  ? down_read_nested+0x10b/0x2d0
[ 1204.710266]  ? downgrade_write+0x270/0x270
[ 1204.717399]  ? lock_acquire+0x101/0x200
[ 1204.724171]  ? generic_file_splice_read+0x20d/0x350
[ 1204.732067]  ? generic_file_read_iter+0x3b/0x220
[ 1204.739736]  ? xfs_file_buffered_aio_read+0x87/0x1d0 [xfs]
[ 1204.748351]  xfs_file_buffered_aio_read+0x92/0x1d0 [xfs]
[ 1204.756759]  xfs_file_read_iter+0x120/0x1f0 [xfs]
[ 1204.764420]  generic_file_splice_read+0x239/0x350
[ 1204.772072]  ? pipe_to_user+0x80/0x80
[ 1204.778476]  splice_direct_to_actor+0x1d8/0x460
[ 1204.785831]  ? pipe_to_sendpage+0x1a0/0x1a0
[ 1204.792769]  ? do_splice_to+0xc0/0xc0
[ 1204.799144]  ? selinux_file_permission+0x1d2/0x210
[ 1204.806734]  do_splice_direct+0x10c/0x170
[ 1204.813393]  ? splice_direct_to_actor+0x460/0x460
[ 1204.820830]  ? debug_lockdep_rcu_enabled+0x23/0x60
[ 1204.828349]  ? __sb_start_write+0x12c/0x1f0
[ 1204.835177]  vfs_copy_file_range+0x309/0x5c0
[ 1204.842085]  ? __x64_sys_sendfile+0x160/0x160
[ 1204.849039]  ? from_kgid+0xa0/0xa0
[ 1204.854879]  ? _copy_to_user+0x6a/0x80
[ 1204.861067]  ? cp_new_stat+0x271/0x2c0
[ 1204.867238]  ? __ia32_sys_lstat+0x30/0x30
[ 1204.873672]  ? down_read_non_owner+0x2e0/0x2e0
[ 1204.880579]  __x64_sys_copy_file_range+0x17a/0x310
[ 1204.887844]  ? __ia32_sys_copy_file_range+0x320/0x320
[ 1204.895369]  ? lockdep_hardirqs_off+0x1a/0x140
[ 1204.902142]  do_syscall_64+0x78/0x300
[ 1204.908056]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1204.915507] RIP: 0033:0x7fc47c1a1d6d
[ 1204.921283] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d eb 80 0c 00 f7 d8 64 89 01 48
[ 1204.946109] RSP: 002b:00007ffd9cc4cc08 EFLAGS: 00000202 ORIG_RAX: 0000000000000146
[ 1204.956531] RAX: ffffffffffffffda RBX: 0000000000000047 RCX: 00007fc47c1a1d6d
[ 1204.966477] RDX: 0000000000000004 RSI: 00007ffd9cc4cc40 RDI: 0000000000000003
[ 1204.976394] RBP: 000000000061a35e R08: 000000000000f15b R09: 0000000000000000
[ 1204.986279] R10: 00007ffd9cc4cc48 R11: 0000000000000202 R12: 0000000000000003
[ 1204.996142] R13: 000000000000f15b R14: 00000000000326f7 R15: 0000000000159373
[ 1205.005983] Modules linked in: vfat fat isofs rfkill ib_isert
iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp opa_vnic
rpcrdma sunrpc rdma_ucm ib_iser ib_umad rdma_cm ib_ipoib iw_cm dax_pmem_compat
libiscsi iTCO_wdt device_dax nd_pmem ib_cm dax_pmem_core nd_btt
iTCO_vendor_support scsi_transport_iscsi snd_hda_codec_realtek
snd_hda_codec_generic ledtrig_audio sb_edac x86_pkg_temp_thermal
intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
snd_hda_intel hfi1 snd_intel_dspcfg snd_hda_codec snd_hda_core aesni_intel
snd_hwdep crypto_simd snd_pcm rdmavt snd_timer nd_e820 ib_uverbs cryptd
libnvdimm snd ib_core soundcore glue_helper ipmi_si mei_me ipmi_devintf pcspkr
mei i2c_i801 ipmi_msghandler lpc_ich mfd_core ioatdma wmi acpi_cpufreq
sch_fq_codel xfs libcrc32c mlx4_en sr_mod cdrom sd_mod t10_pi mgag200
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_vram_helper
drm_ttm_helper ahci ttm crc32c_intel libahci mlx4_core isci igb libsas drm
[ 1205.006047]  dca scsi_transport_sas firewire_ohci i2c_algo_bit firewire_core
crc_itu_t libata i2c_core dm_mod [last unloaded: mlx4_ib]
[ 1205.140277] CR2: 0000000000000000
[ 1205.146689] ---[ end trace cf133ac3f2876827 ]---

> 
> Cheers,
> Jeff
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-18  2:35                           ` Ira Weiny
@ 2020-02-18 14:22                             ` Jeff Moyer
  2020-02-18 23:54                               ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Jeff Moyer @ 2020-02-18 14:22 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Dan Williams, Darrick J. Wong, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

Ira Weiny <ira.weiny@intel.com> writes:

> Yep...  and a long weekend if you are in the US...  I ran the test with V4 and
> got the panic below.
>
> Is this similar to what you see?  If so I'll work on it in V4.  FWIW with '-o

Yes, precisely.

> dax' specified I don't see how fsstress is causing an issue with my patch set.
> Does fsstress attempt to change dax states?  I don't see that in the test but
> I'm not real familiar with generic/013 and fsstress.

Not that I'm aware of, no.

> If my disassembly of read_pages is correct it looks like readpage is null which
> makes sense because all files should be IS_DAX() == true due to the mount option...
>
> But tracing code indicates that the patch:
>
> 	fs: remove unneeded IS_DAX() check
>
> ... may be the culprit and the following fix may work...
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 3a7863ba51b9..7eaf74a2a39b 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>         if (!count)
>                 goto out; /* skip atime */
>  
> -       if (iocb->ki_flags & IOCB_DIRECT) {
> +       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
>                 struct file *file = iocb->ki_filp;
>                 struct address_space *mapping = file->f_mapping;
>                 struct inode *inode = mapping->host;

Well, you'll have to up-level the inode variable instantiation,
obviously.  That solves this particular issue.  The next traceback
you'll hit is in the writeback path:

[  116.044545] ------------[ cut here ]------------
[  116.049163] WARNING: CPU: 48 PID: 4469 at fs/dax.c:862 dax_writeback_mapping_range+0x397/0x530
...
[  116.134509] CPU: 48 PID: 4469 Comm: fsstress Not tainted 5.6.0-rc1+ #43
[  116.141121] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.0D.01.0395.022720191340 02/27/2019
[  116.151549] RIP: 0010:dax_writeback_mapping_range+0x397/0x530
[  116.157294] Code: ff ff 31 db 48 8b 7c 24 28 c6 07 00 0f 1f 40 00 fb 48 8b 7c 24 10 e8 98 fc 29 00 0f 1f 44 00 00 e9 f1 fc ff ff 4c 8b 64 24 08 <0f> 0b be fb ff ff ff 4c 89 e7 e8 fa 87 ed ff f0 41 80 8c 24 80 00
[  116.176036] RSP: 0018:ffffb9b162fa7c18 EFLAGS: 00010046
[  116.181261] RAX: 0000000000000000 RBX: 00000000000001ac RCX: 0000000000000020
[  116.188387] RDX: 0000000000000000 RSI: 00000000000001ac RDI: ffffb9b162fa7c40
[  116.195519] RBP: 0000000000000020 R08: ffff9a73dc24d6b0 R09: 0000000000000020
[  116.202648] R10: 0000000000000000 R11: 0000000000000238 R12: ffff9a73d92c66b8
[  116.209774] R13: ffffe4a09f0cb200 R14: 0000000000000000 R15: ffffe4a09f0cb200
[  116.216907] FS:  00007f2dbcd22b80(0000) GS:ffff9a7420c00000(0000) knlGS:0000000000000000
[  116.224992] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  116.230735] CR2: 00007fa21808b648 CR3: 000000179e0a2003 CR4: 00000000007606e0
[  116.237860] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  116.244990] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  116.252115] PKRU: 55555554
[  116.254827] Call Trace:
[  116.257286]  do_writepages+0x41/0xd0
[  116.260862]  __filemap_fdatawrite_range+0xcb/0x100
[  116.265653]  filemap_write_and_wait_range+0x38/0x90
[  116.270579]  xfs_setattr_size+0x2c2/0x3e0 [xfs]
[  116.275126]  xfs_file_fallocate+0x239/0x440 [xfs]
[  116.279831]  ? selinux_file_permission+0x108/0x140
[  116.284622]  vfs_fallocate+0x14d/0x2f0
[  116.288374]  ksys_fallocate+0x3c/0x80
[  116.292039]  __x64_sys_fallocate+0x1a/0x20
[  116.296139]  do_syscall_64+0x55/0x1d0
[  116.299806]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  116.304856] RIP: 0033:0x7f2dbc21983b
[  116.308435] Code: ff ff eb ba 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 8d 05 25 0e 2d 00 49 89 ca 8b 00 85 c0 75 14 b8 1d 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5d c3 0f 1f 40 00 41 55 49 89 cd 41 54 49 89

That's here:

        /*
         * A page got tagged dirty in DAX mapping? Something is seriously
         * wrong.
         */
        if (WARN_ON(!xa_is_value(entry)))
                return -EIO;

Cheers,
Jeff


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-18 14:22                             ` Jeff Moyer
@ 2020-02-18 23:54                               ` Ira Weiny
  2020-02-20 16:20                                 ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-18 23:54 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dan Williams, Darrick J. Wong, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

On Tue, Feb 18, 2020 at 09:22:58AM -0500, Jeff Moyer wrote:
> Ira Weiny <ira.weiny@intel.com> writes:
> 
> > Yep...  and a long weekend if you are in the US...  I ran the test with V4 and
> > got the panic below.
> >
> > Is this similar to what you see?  If so I'll work on it in V4.  FWIW with '-o
> 
> Yes, precisely.

Ok...

> 
> > dax' specified I don't see how fsstress is causing an issue with my patch set.
> > Does fsstress attempt to change dax states?  I don't see that in the test but
> > I'm not real familiar with generic/013 and fsstress.
> 
> Not that I'm aware of, no.
> 
> > If my disassembly of read_pages is correct it looks like readpage is null which
> > makes sense because all files should be IS_DAX() == true due to the mount option...
> >
> > But tracing code indicates that the patch:
> >
> > 	fs: remove unneeded IS_DAX() check
> >
> > ... may be the culprit and the following fix may work...
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 3a7863ba51b9..7eaf74a2a39b 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> >         if (!count)
> >                 goto out; /* skip atime */
> >  
> > -       if (iocb->ki_flags & IOCB_DIRECT) {
> > +       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
> >                 struct file *file = iocb->ki_filp;
> >                 struct address_space *mapping = file->f_mapping;
> >                 struct inode *inode = mapping->host;
> 
> Well, you'll have to up-level the inode variable instantiation,
> obviously.  That solves this particular issue.

Well...  This seems to be a random issue.  I've had BMC issues with
my server most of the day...  But even with this patch I still get the failure
in read_pages().  :-/

And I have gotten it to both succeed and fail with qemu...  :-/

> The next traceback
> you'll hit is in the writeback path:

> 
> [  116.044545] ------------[ cut here ]------------
> [  116.049163] WARNING: CPU: 48 PID: 4469 at fs/dax.c:862 dax_writeback_mapping_range+0x397/0x530
> ...
> [  116.134509] CPU: 48 PID: 4469 Comm: fsstress Not tainted 5.6.0-rc1+ #43
> [  116.141121] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.0D.01.0395.022720191340 02/27/2019
> [  116.151549] RIP: 0010:dax_writeback_mapping_range+0x397/0x530
> [  116.157294] Code: ff ff 31 db 48 8b 7c 24 28 c6 07 00 0f 1f 40 00 fb 48 8b 7c 24 10 e8 98 fc 29 00 0f 1f 44 00 00 e9 f1 fc ff ff 4c 8b 64 24 08 <0f> 0b be fb ff ff ff 4c 89 e7 e8 fa 87 ed ff f0 41 80 8c 24 80 00
> [  116.176036] RSP: 0018:ffffb9b162fa7c18 EFLAGS: 00010046
> [  116.181261] RAX: 0000000000000000 RBX: 00000000000001ac RCX: 0000000000000020
> [  116.188387] RDX: 0000000000000000 RSI: 00000000000001ac RDI: ffffb9b162fa7c40
> [  116.195519] RBP: 0000000000000020 R08: ffff9a73dc24d6b0 R09: 0000000000000020
> [  116.202648] R10: 0000000000000000 R11: 0000000000000238 R12: ffff9a73d92c66b8
> [  116.209774] R13: ffffe4a09f0cb200 R14: 0000000000000000 R15: ffffe4a09f0cb200
> [  116.216907] FS:  00007f2dbcd22b80(0000) GS:ffff9a7420c00000(0000) knlGS:0000000000000000
> [  116.224992] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  116.230735] CR2: 00007fa21808b648 CR3: 000000179e0a2003 CR4: 00000000007606e0
> [  116.237860] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  116.244990] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  116.252115] PKRU: 55555554
> [  116.254827] Call Trace:
> [  116.257286]  do_writepages+0x41/0xd0
> [  116.260862]  __filemap_fdatawrite_range+0xcb/0x100
> [  116.265653]  filemap_write_and_wait_range+0x38/0x90
> [  116.270579]  xfs_setattr_size+0x2c2/0x3e0 [xfs]
> [  116.275126]  xfs_file_fallocate+0x239/0x440 [xfs]
> [  116.279831]  ? selinux_file_permission+0x108/0x140
> [  116.284622]  vfs_fallocate+0x14d/0x2f0
> [  116.288374]  ksys_fallocate+0x3c/0x80
> [  116.292039]  __x64_sys_fallocate+0x1a/0x20
> [  116.296139]  do_syscall_64+0x55/0x1d0
> [  116.299806]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  116.304856] RIP: 0033:0x7f2dbc21983b
> [  116.308435] Code: ff ff eb ba 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 8d 05 25 0e 2d 00 49 89 ca 8b 00 85 c0 75 14 b8 1d 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5d c3 0f 1f 40 00 41 55 49 89 cd 41 54 49 89
> 
> That's here:
> 
>         /*
>          * A page got tagged dirty in DAX mapping? Something is seriously
>          * wrong.
>          */
>         if (WARN_ON(!xa_is_value(entry)))
>                 return -EIO;

I have not gotten this.  Having to walk to the lab to power cycle the machine
has slowed my progress...

Ira

> 
> Cheers,
> Jeff
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-18 23:54                               ` Ira Weiny
@ 2020-02-20 16:20                                 ` Ira Weiny
  2020-02-20 16:30                                   ` Darrick J. Wong
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-20 16:20 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Dan Williams, Darrick J. Wong, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

On Tue, Feb 18, 2020 at 03:54:30PM -0800, 'Ira Weiny' wrote:
> On Tue, Feb 18, 2020 at 09:22:58AM -0500, Jeff Moyer wrote:
> > Ira Weiny <ira.weiny@intel.com> writes:
> > > If my disassembly of read_pages is correct it looks like readpage is null which
> > > makes sense because all files should be IS_DAX() == true due to the mount option...
> > >
> > > But tracing code indicates that the patch:
> > >
> > > 	fs: remove unneeded IS_DAX() check
> > >
> > > ... may be the culprit and the following fix may work...
> > >
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 3a7863ba51b9..7eaf74a2a39b 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> > >         if (!count)
> > >                 goto out; /* skip atime */
> > >  
> > > -       if (iocb->ki_flags & IOCB_DIRECT) {
> > > +       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
> > >                 struct file *file = iocb->ki_filp;
> > >                 struct address_space *mapping = file->f_mapping;
> > >                 struct inode *inode = mapping->host;
> > 
> > Well, you'll have to up-level the inode variable instantiation,
> > obviously.  That solves this particular issue.
> 
> Well...  This seems to be a random issue.  I've had BMC issues with
> my server most of the day...  But even with this patch I still get the failure
> in read_pages().  :-/
> 
> And I have gotten it to both succeed and fail with qemu...  :-/

... here is the fix.  I made the change in xfs_diflags_to_linux() early on with
out factoring in the flag logic changes we have agreed upon...

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 62d9f622bad1..d592949ad396 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1123,11 +1123,11 @@ xfs_diflags_to_linux(
                inode->i_flags |= S_NOATIME;
        else
                inode->i_flags &= ~S_NOATIME;
-       if (xflags & FS_XFLAG_DAX)
+
+       if (xfs_inode_enable_dax(ip))
                inode->i_flags |= S_DAX;
        else
                inode->i_flags &= ~S_DAX;
-
 }

But the one thing which tripped me up, and concerns me, is we have 2 functions
which set the inode flags.

xfs_diflags_to_iflags()
xfs_diflags_to_linux()

xfs_diflags_to_iflags() is geared toward initialization but logically they do
the same thing.  I see no reason to keep them separate.  Does anyone?

Based on this find, the discussion on behavior in this thread, and the comments
from Dave I'm reworking the series because the flag check/set functions have
all changed and I really want to be as clear as possible with both the patches
and the resulting code.[*]  So v4 should be out today including attempting to
document what we have discussed here and being as clear as possible on the
behavior.  :-D

Thanks so much for testing this!

Ira

[*] I will probably throw in a patch to remove xfs_diflags_to_iflags() as I
really don't see a reason to keep it.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-20 16:20                                 ` Ira Weiny
@ 2020-02-20 16:30                                   ` Darrick J. Wong
  2020-02-20 16:49                                     ` Ira Weiny
  0 siblings, 1 reply; 52+ messages in thread
From: Darrick J. Wong @ 2020-02-20 16:30 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jeff Moyer, Dan Williams, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

On Thu, Feb 20, 2020 at 08:20:28AM -0800, Ira Weiny wrote:
> On Tue, Feb 18, 2020 at 03:54:30PM -0800, 'Ira Weiny' wrote:
> > On Tue, Feb 18, 2020 at 09:22:58AM -0500, Jeff Moyer wrote:
> > > Ira Weiny <ira.weiny@intel.com> writes:
> > > > If my disassembly of read_pages is correct it looks like readpage is null which
> > > > makes sense because all files should be IS_DAX() == true due to the mount option...
> > > >
> > > > But tracing code indicates that the patch:
> > > >
> > > > 	fs: remove unneeded IS_DAX() check
> > > >
> > > > ... may be the culprit and the following fix may work...
> > > >
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 3a7863ba51b9..7eaf74a2a39b 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> > > >         if (!count)
> > > >                 goto out; /* skip atime */
> > > >  
> > > > -       if (iocb->ki_flags & IOCB_DIRECT) {
> > > > +       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
> > > >                 struct file *file = iocb->ki_filp;
> > > >                 struct address_space *mapping = file->f_mapping;
> > > >                 struct inode *inode = mapping->host;
> > > 
> > > Well, you'll have to up-level the inode variable instantiation,
> > > obviously.  That solves this particular issue.
> > 
> > Well...  This seems to be a random issue.  I've had BMC issues with
> > my server most of the day...  But even with this patch I still get the failure
> > in read_pages().  :-/
> > 
> > And I have gotten it to both succeed and fail with qemu...  :-/
> 
> ... here is the fix.  I made the change in xfs_diflags_to_linux() early on with
> out factoring in the flag logic changes we have agreed upon...
> 
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 62d9f622bad1..d592949ad396 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1123,11 +1123,11 @@ xfs_diflags_to_linux(
>                 inode->i_flags |= S_NOATIME;
>         else
>                 inode->i_flags &= ~S_NOATIME;
> -       if (xflags & FS_XFLAG_DAX)
> +
> +       if (xfs_inode_enable_dax(ip))
>                 inode->i_flags |= S_DAX;
>         else
>                 inode->i_flags &= ~S_DAX;
> -
>  }
> 
> But the one thing which tripped me up, and concerns me, is we have 2 functions
> which set the inode flags.
> 
> xfs_diflags_to_iflags()
> xfs_diflags_to_linux()
> 
> xfs_diflags_to_iflags() is geared toward initialization but logically they do
> the same thing.  I see no reason to keep them separate.  Does anyone?
> 
> Based on this find, the discussion on behavior in this thread, and the comments
> from Dave I'm reworking the series because the flag check/set functions have
> all changed and I really want to be as clear as possible with both the patches
> and the resulting code.[*]  So v4 should be out today including attempting to
> document what we have discussed here and being as clear as possible on the
> behavior.  :-D
> 
> Thanks so much for testing this!
> 
> Ira
> 
> [*] I will probably throw in a patch to remove xfs_diflags_to_iflags() as I
> really don't see a reason to keep it.
> 

I prefer you keep the one in xfs_iops.c since ioctls are a higher level
function than general inode operations.

--D

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-20 16:30                                   ` Darrick J. Wong
@ 2020-02-20 16:49                                     ` Ira Weiny
  2020-02-20 17:00                                       ` Darrick J. Wong
  0 siblings, 1 reply; 52+ messages in thread
From: Ira Weiny @ 2020-02-20 16:49 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jeff Moyer, Dan Williams, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

On Thu, Feb 20, 2020 at 08:30:24AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 20, 2020 at 08:20:28AM -0800, Ira Weiny wrote:
> > On Tue, Feb 18, 2020 at 03:54:30PM -0800, 'Ira Weiny' wrote:
> > > On Tue, Feb 18, 2020 at 09:22:58AM -0500, Jeff Moyer wrote:
> > > > Ira Weiny <ira.weiny@intel.com> writes:
> > > > > If my disassembly of read_pages is correct it looks like readpage is null which
> > > > > makes sense because all files should be IS_DAX() == true due to the mount option...
> > > > >
> > > > > But tracing code indicates that the patch:
> > > > >
> > > > > 	fs: remove unneeded IS_DAX() check
> > > > >
> > > > > ... may be the culprit and the following fix may work...
> > > > >
> > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > index 3a7863ba51b9..7eaf74a2a39b 100644
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> > > > >         if (!count)
> > > > >                 goto out; /* skip atime */
> > > > >  
> > > > > -       if (iocb->ki_flags & IOCB_DIRECT) {
> > > > > +       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
> > > > >                 struct file *file = iocb->ki_filp;
> > > > >                 struct address_space *mapping = file->f_mapping;
> > > > >                 struct inode *inode = mapping->host;
> > > > 
> > > > Well, you'll have to up-level the inode variable instantiation,
> > > > obviously.  That solves this particular issue.
> > > 
> > > Well...  This seems to be a random issue.  I've had BMC issues with
> > > my server most of the day...  But even with this patch I still get the failure
> > > in read_pages().  :-/
> > > 
> > > And I have gotten it to both succeed and fail with qemu...  :-/
> > 
> > ... here is the fix.  I made the change in xfs_diflags_to_linux() early on with
> > out factoring in the flag logic changes we have agreed upon...
> > 
> > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > index 62d9f622bad1..d592949ad396 100644
> > --- a/fs/xfs/xfs_ioctl.c
> > +++ b/fs/xfs/xfs_ioctl.c
> > @@ -1123,11 +1123,11 @@ xfs_diflags_to_linux(
> >                 inode->i_flags |= S_NOATIME;
> >         else
> >                 inode->i_flags &= ~S_NOATIME;
> > -       if (xflags & FS_XFLAG_DAX)
> > +
> > +       if (xfs_inode_enable_dax(ip))
> >                 inode->i_flags |= S_DAX;
> >         else
> >                 inode->i_flags &= ~S_DAX;
> > -
> >  }
> > 
> > But the one thing which tripped me up, and concerns me, is we have 2 functions
> > which set the inode flags.
> > 
> > xfs_diflags_to_iflags()
> > xfs_diflags_to_linux()
> > 
> > xfs_diflags_to_iflags() is geared toward initialization but logically they do
> > the same thing.  I see no reason to keep them separate.  Does anyone?
> > 
> > Based on this find, the discussion on behavior in this thread, and the comments
> > from Dave I'm reworking the series because the flag check/set functions have
> > all changed and I really want to be as clear as possible with both the patches
> > and the resulting code.[*]  So v4 should be out today including attempting to
> > document what we have discussed here and being as clear as possible on the
> > behavior.  :-D
> > 
> > Thanks so much for testing this!
> > 
> > Ira
> > 
> > [*] I will probably throw in a patch to remove xfs_diflags_to_iflags() as I
> > really don't see a reason to keep it.
> > 
> 
> I prefer you keep the one in xfs_iops.c since ioctls are a higher level
> function than general inode operations.

Makes sense.  Do you prefer the xfs_diflags_to_iflags() name as well?

Ira

> 
> --D

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 00/12] Enable per-file/directory DAX operations V3
  2020-02-20 16:49                                     ` Ira Weiny
@ 2020-02-20 17:00                                       ` Darrick J. Wong
  0 siblings, 0 replies; 52+ messages in thread
From: Darrick J. Wong @ 2020-02-20 17:00 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jeff Moyer, Dan Williams, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

On Thu, Feb 20, 2020 at 08:49:57AM -0800, Ira Weiny wrote:
> On Thu, Feb 20, 2020 at 08:30:24AM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 20, 2020 at 08:20:28AM -0800, Ira Weiny wrote:
> > > On Tue, Feb 18, 2020 at 03:54:30PM -0800, 'Ira Weiny' wrote:
> > > > On Tue, Feb 18, 2020 at 09:22:58AM -0500, Jeff Moyer wrote:
> > > > > Ira Weiny <ira.weiny@intel.com> writes:
> > > > > > If my disassembly of read_pages is correct it looks like readpage is null which
> > > > > > makes sense because all files should be IS_DAX() == true due to the mount option...
> > > > > >
> > > > > > But tracing code indicates that the patch:
> > > > > >
> > > > > > 	fs: remove unneeded IS_DAX() check
> > > > > >
> > > > > > ... may be the culprit and the following fix may work...
> > > > > >
> > > > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > > index 3a7863ba51b9..7eaf74a2a39b 100644
> > > > > > --- a/mm/filemap.c
> > > > > > +++ b/mm/filemap.c
> > > > > > @@ -2257,7 +2257,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> > > > > >         if (!count)
> > > > > >                 goto out; /* skip atime */
> > > > > >  
> > > > > > -       if (iocb->ki_flags & IOCB_DIRECT) {
> > > > > > +       if (iocb->ki_flags & IOCB_DIRECT || IS_DAX(inode)) {
> > > > > >                 struct file *file = iocb->ki_filp;
> > > > > >                 struct address_space *mapping = file->f_mapping;
> > > > > >                 struct inode *inode = mapping->host;
> > > > > 
> > > > > Well, you'll have to up-level the inode variable instantiation,
> > > > > obviously.  That solves this particular issue.
> > > > 
> > > > Well...  This seems to be a random issue.  I've had BMC issues with
> > > > my server most of the day...  But even with this patch I still get the failure
> > > > in read_pages().  :-/
> > > > 
> > > > And I have gotten it to both succeed and fail with qemu...  :-/
> > > 
> > > ... here is the fix.  I made the change in xfs_diflags_to_linux() early on with
> > > out factoring in the flag logic changes we have agreed upon...
> > > 
> > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > > index 62d9f622bad1..d592949ad396 100644
> > > --- a/fs/xfs/xfs_ioctl.c
> > > +++ b/fs/xfs/xfs_ioctl.c
> > > @@ -1123,11 +1123,11 @@ xfs_diflags_to_linux(
> > >                 inode->i_flags |= S_NOATIME;
> > >         else
> > >                 inode->i_flags &= ~S_NOATIME;
> > > -       if (xflags & FS_XFLAG_DAX)
> > > +
> > > +       if (xfs_inode_enable_dax(ip))
> > >                 inode->i_flags |= S_DAX;
> > >         else
> > >                 inode->i_flags &= ~S_DAX;
> > > -
> > >  }
> > > 
> > > But the one thing which tripped me up, and concerns me, is we have 2 functions
> > > which set the inode flags.
> > > 
> > > xfs_diflags_to_iflags()
> > > xfs_diflags_to_linux()
> > > 
> > > xfs_diflags_to_iflags() is geared toward initialization but logically they do
> > > the same thing.  I see no reason to keep them separate.  Does anyone?
> > > 
> > > Based on this find, the discussion on behavior in this thread, and the comments
> > > from Dave I'm reworking the series because the flag check/set functions have
> > > all changed and I really want to be as clear as possible with both the patches
> > > and the resulting code.[*]  So v4 should be out today including attempting to
> > > document what we have discussed here and being as clear as possible on the
> > > behavior.  :-D
> > > 
> > > Thanks so much for testing this!
> > > 
> > > Ira
> > > 
> > > [*] I will probably throw in a patch to remove xfs_diflags_to_iflags() as I
> > > really don't see a reason to keep it.
> > > 
> > 
> > I prefer you keep the one in xfs_iops.c since ioctls are a higher level
> > function than general inode operations.
> 
> Makes sense.  Do you prefer the xfs_diflags_to_iflags() name as well?

I don't really care one way or another, so ... iflags wins by arbitrary
choice! 8)

--D

> Ira
> 
> > 
> > --D

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, back to index

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-08 19:34 [PATCH v3 00/12] Enable per-file/directory DAX operations V3 ira.weiny
2020-02-08 19:34 ` [PATCH v3 01/12] fs/stat: Define DAX statx attribute ira.weiny
2020-02-08 19:34 ` [PATCH v3 02/12] fs/xfs: Isolate the physical DAX flag from effective ira.weiny
2020-02-08 19:34 ` [PATCH v3 03/12] fs/xfs: Separate functionality of xfs_inode_supports_dax() ira.weiny
2020-02-11  5:47   ` Dave Chinner
2020-02-11 16:13     ` Ira Weiny
2020-02-08 19:34 ` [PATCH v3 04/12] fs/xfs: Clean up DAX support check ira.weiny
2020-02-11  5:57   ` Dave Chinner
2020-02-11 16:28     ` Ira Weiny
2020-02-11 20:38       ` Dave Chinner
2020-02-08 19:34 ` [PATCH v3 05/12] fs: remove unneeded IS_DAX() check ira.weiny
2020-02-11  5:34   ` Dave Chinner
2020-02-11 16:38     ` Ira Weiny
2020-02-11 20:41       ` Dave Chinner
2020-02-12 16:04         ` Ira Weiny
2020-02-08 19:34 ` [PATCH v3 06/12] fs/xfs: Check if the inode supports DAX under lock ira.weiny
2020-02-11  6:16   ` Dave Chinner
2020-02-11 17:55     ` Ira Weiny
2020-02-11 20:42       ` Dave Chinner
2020-02-12 16:10         ` Ira Weiny
2020-02-08 19:34 ` [PATCH v3 07/12] fs: Add locking for a dynamic DAX state ira.weiny
2020-02-11  8:00   ` Dave Chinner
2020-02-11 20:14     ` Ira Weiny
2020-02-11 20:59       ` Dan Williams
2020-02-11 21:49       ` Dave Chinner
2020-02-12  6:31         ` Darrick J. Wong
2020-02-08 19:34 ` [PATCH v3 08/12] fs/xfs: Clarify lockdep dependency for xfs_isilocked() ira.weiny
2020-02-08 19:34 ` [PATCH v3 09/12] fs/xfs: Add write DAX lock to xfs layer ira.weiny
2020-02-08 19:34 ` [PATCH v3 10/12] fs: Prevent DAX state change if file is mmap'ed ira.weiny
2020-02-08 19:34 ` [PATCH v3 11/12] fs/xfs: Clean up locking in dax invalidate ira.weiny
2020-02-08 19:34 ` [PATCH v3 12/12] fs/xfs: Allow toggle of effective DAX flag ira.weiny
2020-02-10 15:15 ` [PATCH v3 00/12] Enable per-file/directory DAX operations V3 Jeff Moyer
2020-02-11 20:17   ` Ira Weiny
2020-02-12 19:49     ` Jeff Moyer
2020-02-13 19:01       ` Ira Weiny
2020-02-13 19:05         ` Ira Weiny
2020-02-13 19:58           ` Darrick J. Wong
2020-02-13 23:29             ` Ira Weiny
2020-02-14  0:16               ` Dan Williams
2020-02-14 20:06                 ` Ira Weiny
2020-02-14 21:23                   ` Jeff Moyer
2020-02-14 21:58                     ` Ira Weiny
2020-02-14 22:06                       ` Jeff Moyer
2020-02-14 22:58                         ` Jeff Moyer
2020-02-14 23:03                           ` Jeff Moyer
2020-02-18  2:35                           ` Ira Weiny
2020-02-18 14:22                             ` Jeff Moyer
2020-02-18 23:54                               ` Ira Weiny
2020-02-20 16:20                                 ` Ira Weiny
2020-02-20 16:30                                   ` Darrick J. Wong
2020-02-20 16:49                                     ` Ira Weiny
2020-02-20 17:00                                       ` Darrick J. Wong

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git