linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4
@ 2020-02-21  0:41 ira.weiny
  2020-02-21  0:41 ` [PATCH V4 01/13] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
                   ` (13 more replies)
  0 siblings, 14 replies; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

https://github.com/weiny2/linux-kernel/pull/new/dax-file-state-change-v4

Changes from V3: 
https://lore.kernel.org/lkml/20200208193445.27421-1-ira.weiny@intel.com/

	* Remove global locking...  :-D
	* put back per inode locking and remove pre-mature optimizations
	* Fix issues with Directories having IS_DAX() set
	* Fix kernel crash issues reported by Jeff
	* Add some clean up patches
	* Consolidate diflags to iflags functions
	* Update/add documentation
	* Reorder/rename patches quite a bit

Changes from V2:
https://lore.kernel.org/lkml/20200110192942.25021-1-ira.weiny@intel.com/

	* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
		Internal discussions with Dan determined this would be easier,
		just as performant, and slightly less overhead that having it
		in the SB as suggested by Jan
	* Fix locking order in comments and throughout code
	* Change "mode" to "state" throughout commits
	* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
		configured
	* Add static branch for which is activated by a device which supports
		DAX in XFS
	* Change "lock/unlock" to up/down read/write as appropriate
		Previous names were over simplified
	* Update comments/documentation

	* Remove the xfs specific lock to the vfs (global) layer.
	* Fix i_dax_sem locking order and comments

	* Move 'i_mapped' count from struct inode to struct address_space and
		rename it to mmap_count
	* Add inode_has_mappings() call

	* Fix build issues
	* Clean up syntax spacing and minor issues
	* Update man page text for STATX_ATTR_DAX
	* Add reviewed-by's
	* Rebase to 5.6

	Rename patch:
		from: fs/xfs: Add lock/unlock state to xfs
		to: fs/xfs: Add write DAX lock to xfs layer
	Add patch:
		fs/xfs: Clarify lockdep dependency for xfs_isilocked()
	Drop patch:
		fs/xfs: Fix truncate up


At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption due to their inability to detect whether the kernel will
instantiate page cache for a file, and cases where a global dax enable via a
mount option is too coarse.

The following patch series enables selecting the use of DAX on individual files
and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
scheme the dax mount option can be omitted to allow the per-file property to
take effect.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
capability switch from an "effective" attribute for the file.

At LSF/MM we discussed the difficulties of switching the DAX state of a file
with active mappings / page cache.  It was thought the races could be avoided
by limiting DAX state flips to 0-length files.

However, this turns out to not be true.[3] This is because address space
operations (a_ops) may be in use at any time the inode is referenced and users
have expressed a desire to be able to change the DAX state on a file with data
in it.  For those reasons this patch set allows changing the DAX state flag on
a file as long as it is not current mapped.

Details of when and how DAX state can be changed on a file is included in a
documentation patch.

It should be noted that the physical DAX flag inheritance is not shown in this
patch set as it was maintained from previous work on XFS.  The physical DAX
flag and it's inheritance will need to be added to other file systems for user
control. 

As submitted this works on real hardware testing.


[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/
[3] https://lkml.org/lkml/2019/10/20/96
[4] https://patchwork.kernel.org/patch/11310511/


To: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org

Ira Weiny (13):
  fs/xfs: Remove unnecessary initialization of i_rwsem
  fs/xfs: Clarify lockdep dependency for xfs_isilocked()
  fs: Remove unneeded IS_DAX() check
  fs/stat: Define DAX statx attribute
  fs/xfs: Isolate the physical DAX flag from enabled
  fs/xfs: Create function xfs_inode_enable_dax()
  fs: Add locking for a dynamic address space operations state
  fs: Prevent DAX state change if file is mmap'ed
  fs/xfs: Add write aops lock to xfs layer
  fs/xfs: Clean up locking in dax invalidate
  fs/xfs: Allow toggle of effective DAX flag
  fs/xfs: Remove xfs_diflags_to_linux()
  Documentation/dax: Update Usage section

 Documentation/filesystems/dax.txt | 84 ++++++++++++++++++++++++++++-
 Documentation/filesystems/vfs.rst | 16 ++++++
 fs/attr.c                         |  1 +
 fs/inode.c                        | 16 ++++--
 fs/iomap/buffered-io.c            |  1 +
 fs/open.c                         |  4 ++
 fs/stat.c                         |  5 ++
 fs/xfs/xfs_icache.c               |  5 +-
 fs/xfs/xfs_inode.c                | 24 +++++++--
 fs/xfs/xfs_inode.h                |  9 +++-
 fs/xfs/xfs_ioctl.c                | 89 +++++++++++++------------------
 fs/xfs/xfs_iops.c                 | 69 ++++++++++++++++--------
 include/linux/fs.h                | 75 ++++++++++++++++++++++++--
 include/uapi/linux/stat.h         |  1 +
 mm/fadvise.c                      |  7 ++-
 mm/filemap.c                      |  4 ++
 mm/huge_memory.c                  |  1 +
 mm/khugepaged.c                   |  2 +
 mm/mmap.c                         | 19 ++++++-
 mm/util.c                         |  9 +++-
 20 files changed, 347 insertions(+), 94 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH V4 01/13] fs/xfs: Remove unnecessary initialization of i_rwsem
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-21  1:26   ` Dave Chinner
  2020-02-21  0:41 ` [PATCH V4 02/13] fs/xfs: Clarify lockdep dependency for xfs_isilocked() ira.weiny
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_reinit_inode() -> inode_init_always() already handles calling
init_rwsem(i_rwsem).  Doing so again is unneeded.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
New for V4:

NOTE: This was found while ensuring the new i_aops_sem was properly
handled.  It seems like this is a layering violation so I think it is
worth cleaning up so as to not confuse others.
---
 fs/xfs/xfs_icache.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8dc2e5414276..836a1f09be03 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -419,6 +419,7 @@ xfs_iget_cache_hit(
 		spin_unlock(&ip->i_flags_lock);
 		rcu_read_unlock();
 
+		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
 		error = xfs_reinit_inode(mp, inode);
 		if (error) {
 			bool wake;
@@ -452,9 +453,6 @@ xfs_iget_cache_hit(
 		ip->i_sick = 0;
 		ip->i_checked = 0;
 
-		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
-		init_rwsem(&inode->i_rwsem);
-
 		spin_unlock(&ip->i_flags_lock);
 		spin_unlock(&pag->pag_ici_lock);
 	} else {
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH V4 02/13] fs/xfs: Clarify lockdep dependency for xfs_isilocked()
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
  2020-02-21  0:41 ` [PATCH V4 01/13] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-21  1:34   ` Dave Chinner
  2020-02-21  0:41 ` [PATCH V4 03/13] fs: Remove unneeded IS_DAX() check ira.weiny
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_isilocked() can't work fully without CONFIG_LOCKDEP.  However,
making xfs_isilocked() dependant on CONFIG_LOCKDEP is not feasible
because it is used for more than the i_rwsem.  Therefore a short-circuit
was provided via debug_locks.  However, this caused confusion while
working through the xfs locking.

Rather than use debug_locks as a flag specify this clearly using
IS_ENABLED(CONFIG_LOCKDEP).

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V3:
	Reordered to be a "pre-cleanup" patch

Changes from V2:
	This patch is new for V3
---
 fs/xfs/xfs_inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c5077e6326c7..35df324875db 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -364,7 +364,7 @@ xfs_isilocked(
 
 	if (lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) {
 		if (!(lock_flags & XFS_IOLOCK_SHARED))
-			return !debug_locks ||
+			return !IS_ENABLED(CONFIG_LOCKDEP) ||
 				lockdep_is_held_type(&VFS_I(ip)->i_rwsem, 0);
 		return rwsem_is_locked(&VFS_I(ip)->i_rwsem);
 	}
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH V4 03/13] fs: Remove unneeded IS_DAX() check
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
  2020-02-21  0:41 ` [PATCH V4 01/13] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
  2020-02-21  0:41 ` [PATCH V4 02/13] fs/xfs: Clarify lockdep dependency for xfs_isilocked() ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-21  1:42   ` Dave Chinner
  2020-02-21 17:42   ` Christoph Hellwig
  2020-02-21  0:41 ` [PATCH V4 04/13] fs/stat: Define DAX statx attribute ira.weiny
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Jan Kara, Alexander Viro, Darrick J. Wong,
	Dan Williams, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Remove the check because DAX now has it's own read/write methods and
file systems which support DAX check IS_DAX() prior to IOCB_DIRECT on
their own.  Therefore, it does not matter if the file state is DAX when
the iocb flags are created.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from v3:
	Reword commit message.
	Reordered to be a 'pre-cleanup' patch
---
 include/linux/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..63d1e533a07d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3388,7 +3388,7 @@ extern int file_update_time(struct file *file);
 
 static inline bool io_is_direct(struct file *filp)
 {
-	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
+	return (filp->f_flags & O_DIRECT);
 }
 
 static inline bool vma_is_dax(struct vm_area_struct *vma)
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH V4 04/13] fs/stat: Define DAX statx attribute
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
                   ` (2 preceding siblings ...)
  2020-02-21  0:41 ` [PATCH V4 03/13] fs: Remove unneeded IS_DAX() check ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-21  0:41 ` [PATCH V4 05/13] fs/xfs: Isolate the physical DAX flag from enabled ira.weiny
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Jan Kara, Darrick J . Wong, Alexander Viro,
	Dan Williams, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

In order for users to determine if a file is currently operating in DAX
state (effective DAX).  Define a statx attribute value and set that
attribute if the effective DAX flag is set.

To go along with this we propose the following addition to the statx man
page:

STATX_ATTR_DAX

	The file is in the DAX (cpu direct access) state.  DAX state
	attempts to minimize software cache effects for both I/O and
	memory mappings of this file.  It requires a file system which
	has been configured to support DAX.

	DAX generally assumes all accesses are via cpu load / store
	instructions which can minimize overhead for small accesses, but
	may adversely affect cpu utilization for large transfers.

	File I/O is done directly to/from user-space buffers and memory
	mapped I/O may be performed with direct memory mappings that
	bypass kernel page cache.

	While the DAX property tends to result in data being transferred
	synchronously, it does not give the same guarantees of O_SYNC
	where data and the necessary metadata are transferred together.

	A DAX file may support being mapped with the MAP_SYNC flag,
	which enables a program to use CPU cache flush instructions to
	persist CPU store operations without an explicit fsync(2).  See
	mmap(2) for more information.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V2:
	Update man page text with comments from Darrick, Jan, Dan, and
	Dave.
---
 fs/stat.c                 | 3 +++
 include/uapi/linux/stat.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/fs/stat.c b/fs/stat.c
index 030008796479..894699c74dde 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	if (IS_DAX(inode))
+		stat->attributes |= STATX_ATTR_DAX;
+
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
 					    query_flags);
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index ad80a5c885d5..e5f9d5517f6b 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -169,6 +169,7 @@ struct statx {
 #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
 #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
+#define STATX_ATTR_DAX			0x00002000 /* [I] File is DAX */
 
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH V4 05/13] fs/xfs: Isolate the physical DAX flag from enabled
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
                   ` (3 preceding siblings ...)
  2020-02-21  0:41 ` [PATCH V4 04/13] fs/stat: Define DAX statx attribute ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-21  0:41 ` [PATCH V4 06/13] fs/xfs: Create function xfs_inode_enable_dax() ira.weiny
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_ioctl_setattr_dax_invalidate() currently checks if the DAX flag is
changing as a quick check.

But the implementation mixes the physical (XFS_DIFLAG2_DAX) and
the enabled (S_DAX) DAX flags.

Remove the use of the enabled flag when determining if a change of the
physical flag is required.

Furthermore, we want the physical flag, XFS_DIFLAG2_DAX, to be changed
regardless of if the underlying storage can support DAX or not.

The enabled flag, IS_DAX(), will be set later IFF the inode supports
dax in a follow on patch.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V3:
	Remove the underlying storage support check
	Rework commit message
	Reorder patch
---
 fs/xfs/xfs_ioctl.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d42de92cb283..25e12ce85075 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1190,28 +1190,16 @@ xfs_ioctl_setattr_dax_invalidate(
 	int			*join_flags)
 {
 	struct inode		*inode = VFS_I(ip);
-	struct super_block	*sb = inode->i_sb;
 	int			error;
 
 	*join_flags = 0;
 
-	/*
-	 * It is only valid to set the DAX flag on regular files and
-	 * directories on filesystems where the block size is equal to the page
-	 * size. On directories it serves as an inherited hint so we don't
-	 * have to check the device for dax support or flush pagecache.
-	 */
-	if (fa->fsx_xflags & FS_XFLAG_DAX) {
-		struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
-
-		if (!bdev_dax_supported(target->bt_bdev, sb->s_blocksize))
-			return -EINVAL;
-	}
-
 	/* If the DAX state is not changing, we have nothing to do here. */
-	if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
+	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		return 0;
-	if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
+	if (!(fa->fsx_xflags & FS_XFLAG_DAX) &&
+	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		return 0;
 
 	if (S_ISDIR(inode->i_mode))
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH V4 06/13] fs/xfs: Create function xfs_inode_enable_dax()
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
                   ` (4 preceding siblings ...)
  2020-02-21  0:41 ` [PATCH V4 05/13] fs/xfs: Isolate the physical DAX flag from enabled ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-22  0:28   ` Darrick J. Wong
  2020-02-21  0:41 ` [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state ira.weiny
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_inode_supports_dax() should reflect if the inode can support DAX not
that it is enabled for DAX.

Change the use of xfs_inode_supports_dax() to reflect only if the inode
and underlying storage support dax.

Add a new function xfs_inode_enable_dax() which reflects if the inode
should be enabled for DAX.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from v3:
	Update functions and names to be more clear
	Update commit message
	Merge with
		'fs/xfs: Clean up DAX support check'
		don't allow IS_DAX() on a directory
		use STATIC macro for static
		make xfs_inode_supports_dax() static
---
 fs/xfs/xfs_iops.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 81f2f93caec0..ff711efc5247 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1237,19 +1237,18 @@ static const struct inode_operations xfs_inline_symlink_inode_operations = {
 };
 
 /* Figure out if this file actually supports DAX. */
-static bool
+STATIC bool
 xfs_inode_supports_dax(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 
 	/* Only supported on non-reflinked files. */
-	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
+	if (xfs_is_reflink_inode(ip))
 		return false;
 
-	/* DAX mount option or DAX iflag must be set. */
-	if (!(mp->m_flags & XFS_MOUNT_DAX) &&
-	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
+	/* Only supported on regular files. */
+	if (!S_ISREG(VFS_I(ip)->i_mode))
 		return false;
 
 	/* Block size must match page size */
@@ -1260,6 +1259,20 @@ xfs_inode_supports_dax(
 	return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
 }
 
+STATIC bool
+xfs_inode_enable_dax(
+	struct xfs_inode *ip)
+{
+	if (!xfs_inode_supports_dax(ip))
+		return false;
+
+	if (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX)
+		return true;
+	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
+		return true;
+	return false;
+}
+
 STATIC void
 xfs_diflags_to_iflags(
 	struct inode		*inode,
@@ -1278,7 +1291,7 @@ xfs_diflags_to_iflags(
 		inode->i_flags |= S_SYNC;
 	if (flags & XFS_DIFLAG_NOATIME)
 		inode->i_flags |= S_NOATIME;
-	if (xfs_inode_supports_dax(ip))
+	if (xfs_inode_enable_dax(ip))
 		inode->i_flags |= S_DAX;
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
                   ` (5 preceding siblings ...)
  2020-02-21  0:41 ` [PATCH V4 06/13] fs/xfs: Create function xfs_inode_enable_dax() ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-21 17:44   ` Christoph Hellwig
  2020-02-22  0:33   ` Darrick J. Wong
  2020-02-21  0:41 ` [PATCH V4 08/13] fs: Prevent DAX state change if file is mmap'ed ira.weiny
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

DAX requires special address space operations (aops).  Changing DAX
state therefore requires changing those aops.

However, many functions require aops to remain consistent through a deep
call stack.

Define a vfs level inode rwsem to protect aops throughout call stacks
which require them.

Finally, define calls to be used in subsequent patches when aops usage
needs to be quiesced by the file system.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V3:
	Convert from global to a per-inode rwsem
		Remove pre-optimization
	Remove static branch stuff
	Change function names from inode_dax_state_* to inode_aops_*
		I kept 'inode' as the synchronization is at the inode
		level now (probably where it belongs)...

		and I still prefer *_[down|up]_[read|write] as those
		names better reflect the use and interaction between
		users (readers) and writers better.  'XX_start_aop()'
		would have to be matched with something like
		'XX_wait_for_aop_user()' and 'XX_release_aop_users()' or
		something which does not make sense on the 'writer'
		side.

Changes from V2

	Rebase on linux-next-08-02-2020

	Fix locking order
	Change all references from mode to state where appropriate
	add CONFIG_FS_DAX requirement for state change
	Use a static branch to enable locking only when a dax capable
		device has been seen.

	Move the lock to a global vfs lock

		this does a few things
			1) preps us better for ext4 support
			2) removes funky callbacks from inode ops
			3) remove complexity from XFS and probably from
			   ext4 later

		We can do this because
			1) the locking order is required to be at the
			   highest level anyway, so why complicate xfs
			2) We had to move the sem to the super_block
			   because it is too heavy for the inode.
			3) After internal discussions with Dan we
			   decided that this would be easier, just as
			   performant, and with slightly less overhead
			   than in the VFS SB.

		We also change the functions names to up/down;
		read/write as appropriate.  Previous names were over
		simplified.

	Update comments and documentation

squash: add locking
---
 Documentation/filesystems/vfs.rst | 16 ++++++++
 fs/attr.c                         |  1 +
 fs/inode.c                        | 15 +++++--
 fs/iomap/buffered-io.c            |  1 +
 fs/open.c                         |  4 ++
 fs/stat.c                         |  2 +
 fs/xfs/xfs_icache.c               |  1 +
 include/linux/fs.h                | 66 ++++++++++++++++++++++++++++++-
 mm/fadvise.c                      |  7 +++-
 mm/filemap.c                      |  4 ++
 mm/huge_memory.c                  |  1 +
 mm/khugepaged.c                   |  2 +
 mm/util.c                         |  9 ++++-
 13 files changed, 121 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..4a10a232f8e2 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -934,6 +934,22 @@ cache in your filesystem.  The following members are defined:
 	Called during swapoff on files where swap_activate was
 	successful.
 
+Changing DAX 'state' dynamically
+----------------------------------
+
+Some file systems which support DAX want to be able to change the DAX state
+dyanically.  To switch the state safely we lock the inode state in all "normal"
+file system operations and restrict state changes to those operations.  The
+specific rules are.
+
+        1) the direct_IO address_space_operation must be supported in all
+           potential a_ops vectors for any state suported by the inode.
+
+        3) DAX state changes shall not be allowed while the file is mmap'ed
+        4) For non-mmaped operations the VFS layer must take the read lock for any
+           use of IS_DAX()
+        5) Filesystems take the write lock when changing DAX states.
+
 
 The File Object
 ===============
diff --git a/fs/attr.c b/fs/attr.c
index b4bbdbd4c8ca..9b15f73d1079 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -332,6 +332,7 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	if (error)
 		return error;
 
+	/* DAX read state should already be held here */
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
diff --git a/fs/inode.c b/fs/inode.c
index 7d57068b6b7a..6e4f1cc872f2 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -200,6 +200,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 #endif
 	inode->i_flctx = NULL;
 	this_cpu_inc(nr_inodes);
+	init_rwsem(&inode->i_aops_sem);
 
 	return 0;
 out:
@@ -1616,11 +1617,19 @@ EXPORT_SYMBOL(iput);
  */
 int bmap(struct inode *inode, sector_t *block)
 {
-	if (!inode->i_mapping->a_ops->bmap)
-		return -EINVAL;
+	int ret = 0;
+
+	inode_aops_down_read(inode);
+	if (!inode->i_mapping->a_ops->bmap) {
+		ret = -EINVAL;
+		goto err;
+	}
 
 	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
-	return 0;
+
+err:
+	inode_aops_up_read(inode);
+	return ret;
 }
 EXPORT_SYMBOL(bmap);
 #endif
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7c84c4c027c4..e313a34d5fa6 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -999,6 +999,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		offset = offset_in_page(pos);
 		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
 
+		/* DAX state read should already be held here */
 		if (IS_DAX(inode))
 			status = iomap_dax_zero(pos, offset, bytes, iomap);
 		else
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..3abf0bfac462 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 	if (ret)
 		newattrs.ia_valid |= ret | ATTR_FORCE;
 
+	inode_aops_down_read(dentry->d_inode);
 	inode_lock(dentry->d_inode);
 	/* Note any delegations or leases have already been broken: */
 	ret = notify_change(dentry, &newattrs, NULL);
 	inode_unlock(dentry->d_inode);
+	inode_aops_up_read(dentry->d_inode);
 	return ret;
 }
 
@@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EOPNOTSUPP;
 
 	file_start_write(file);
+	inode_aops_down_read(inode);
 	ret = file->f_op->fallocate(file, mode, offset, len);
+	inode_aops_up_read(inode);
 
 	/*
 	 * Create inotify and fanotify events.
diff --git a/fs/stat.c b/fs/stat.c
index 894699c74dde..274b3ccc82b1 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -79,8 +79,10 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	inode_aops_down_read(inode);
 	if (IS_DAX(inode))
 		stat->attributes |= STATX_ATTR_DAX;
+	inode_aops_up_read(inode);
 
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 836a1f09be03..3e83a97dc047 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -420,6 +420,7 @@ xfs_iget_cache_hit(
 		rcu_read_unlock();
 
 		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
+		ASSERT(!rwsem_is_locked(&inode->i_aops_sem));
 		error = xfs_reinit_inode(mp, inode);
 		if (error) {
 			bool wake;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63d1e533a07d..ad0f2368031b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -40,6 +40,7 @@
 #include <linux/fs_types.h>
 #include <linux/build_bug.h>
 #include <linux/stddef.h>
+#include <linux/percpu-rwsem.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -359,6 +360,11 @@ typedef struct {
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
 		unsigned long, unsigned long);
 
+/**
+ * NOTE: DO NOT define new functions in address_space_operations without first
+ * considering how dynamic DAX states are to be supported.  See the
+ * inode_aops_*_read() functions
+ */
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
@@ -735,6 +741,9 @@ struct inode {
 #endif
 
 	void			*i_private; /* fs or device private pointer */
+#if defined(CONFIG_FS_DAX)
+	struct rw_semaphore	i_aops_sem;
+#endif
 } __randomize_layout;
 
 struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
@@ -1817,6 +1826,11 @@ struct block_device_operations;
 
 struct iov_iter;
 
+/**
+ * NOTE: DO NOT define new functions in file_operations without first
+ * considering how dynamic address_space operations are to be supported.  See
+ * the inode_aops_*_read() functions in this file.
+ */
 struct file_operations {
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
@@ -1889,16 +1903,64 @@ struct inode_operations {
 	int (*set_acl)(struct inode *, struct posix_acl *, int);
 } ____cacheline_aligned;
 
+#if defined(CONFIG_FS_DAX)
+/*
+ * Filesystems wishing to support dynamic DAX states must do the following.
+ *
+ * 1) the direct_IO address_space_operation must be supported in all potential
+ *    a_ops vectors for any state suported by the inode.  This is because the
+ *    direct_IO function is used as a flag long before the function is called.
+
+ * 3) DAX state changes shall not be allowed while the file is mmap'ed
+ * 4) For non-mmaped operations the VFS layer must take the read lock for any
+ *    use of IS_DAX()
+ * 5) Filesystems take the write lock when changing DAX states.
+ */
+static inline void inode_aops_down_read(struct inode *inode)
+{
+	down_read(&inode->i_aops_sem);
+}
+static inline void inode_aops_up_read(struct inode *inode)
+{
+	up_read(&inode->i_aops_sem);
+}
+static inline void inode_aops_down_write(struct inode *inode)
+{
+	down_write(&inode->i_aops_sem);
+}
+static inline void inode_aops_up_write(struct inode *inode)
+{
+	up_write(&inode->i_aops_sem);
+}
+#else /* !CONFIG_FS_DAX */
+#define inode_aops_down_read(inode) do { (void)(inode); } while (0)
+#define inode_aops_up_read(inode) do { (void)(inode); } while (0)
+#define inode_aops_down_write(inode) do { (void)(inode); } while (0)
+#define inode_aops_up_write(inode) do { (void)(inode); } while (0)
+#endif /* CONFIG_FS_DAX */
+
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 				     struct iov_iter *iter)
 {
-	return file->f_op->read_iter(kio, iter);
+	struct inode		*inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	inode_aops_down_read(inode);
+	ret = file->f_op->read_iter(kio, iter);
+	inode_aops_up_read(inode);
+	return ret;
 }
 
 static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
 				      struct iov_iter *iter)
 {
-	return file->f_op->write_iter(kio, iter);
+	struct inode		*inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	inode_aops_down_read(inode);
+	ret = file->f_op->write_iter(kio, iter);
+	inode_aops_up_read(inode);
+	return ret;
 }
 
 static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 4f17c83db575..6a30febb11e0 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -48,6 +48,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 	bdi = inode_to_bdi(mapping->host);
 
 	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
+		int ret = 0;
+
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
@@ -58,9 +60,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 			/* no bad return value, but ignore advice */
 			break;
 		default:
-			return -EINVAL;
+			ret = -EINVAL;
 		}
-		return 0;
+
+		return ret;
 	}
 
 	/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..3a7863ba51b9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		 * and return.  Otherwise fallthrough to buffered io for
 		 * the rest of the read.  Buffered reads will not work for
 		 * DAX files, so don't bother trying.
+		 *
+		 * IS_DAX is protected under ->read_iter lock
 		 */
 		if (retval < 0 || !count || iocb->ki_pos >= size ||
 		    IS_DAX(inode))
@@ -3377,6 +3379,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		 * holes, for example.  For DAX files, a buffered write will
 		 * not succeed (even if it did, DAX does not handle dirty
 		 * page-cache pages correctly).
+		 *
+		 * IS_DAX is protected under ->write_iter lock
 		 */
 		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
 			goto out;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b08b199f9a11..3d05bd10d83e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -572,6 +572,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 	unsigned long ret;
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
+	/* Should not need locking here because mmap is not allowed */
 	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
 		goto out;
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908743cb..f048178e2b93 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
 		} else {	/* !is_shmem */
 			if (!page || xa_is_value(page)) {
 				xas_unlock_irq(&xas);
+				inode_aops_down_read(file->f_inode);
 				page_cache_sync_readahead(mapping, &file->f_ra,
 							  file, index,
 							  PAGE_SIZE);
+				inode_aops_up_read(file->f_inode);
 				/* drain pagevecs to help isolate_lru_page() */
 				lru_add_drain();
 				page = find_lock_page(mapping, index);
diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..a4fb0670137d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 
 	ret = security_mmap_file(file, prot, flag);
 	if (!ret) {
-		if (down_write_killable(&mm->mmap_sem))
+		if (file)
+			inode_aops_down_read(file_inode(file));
+		if (down_write_killable(&mm->mmap_sem)) {
+			if (file)
+				inode_aops_up_read(file_inode(file));
 			return -EINTR;
+		}
 		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
 				    &populate, &uf);
 		up_write(&mm->mmap_sem);
+		if (file)
+			inode_aops_up_read(file_inode(file));
 		userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
 			mm_populate(ret, populate);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH V4 08/13] fs: Prevent DAX state change if file is mmap'ed
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
                   ` (6 preceding siblings ...)
  2020-02-21  0:41 ` [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-21  0:41 ` [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer ira.weiny
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Page faults need to ensure the inode DAX configuration is correct and
consistent with the vmf information at the time of the fault.  There is
no easy way to ensure the vmf information is correct if a DAX change is
in progress.  Furthermore, there is no good use case to require changing
DAX configs while the file is mmap'ed.

Track mmap's of the file and fail the DAX change if the file is mmap'ed.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V2:

	move 'i_mapped' to struct address_space and rename mmap_count
	Add inode_has_mappings() helper for FS's
	Change reference to "mode" to "state"
---
Changes from V3:
	Fix htmldoc error from the kbuild test robot.
	Reported-by: kbuild test robot <lkp@intel.com>
	Rebase cleanups
---
 fs/inode.c         |  1 +
 fs/xfs/xfs_ioctl.c |  9 +++++++++
 include/linux/fs.h |  7 +++++++
 mm/mmap.c          | 19 +++++++++++++++++--
 4 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 6e4f1cc872f2..613a045075bd 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -372,6 +372,7 @@ static void __address_space_init_once(struct address_space *mapping)
 	INIT_LIST_HEAD(&mapping->private_list);
 	spin_lock_init(&mapping->private_lock);
 	mapping->i_mmap = RB_ROOT_CACHED;
+	atomic64_set(&mapping->mmap_count, 0);
 }
 
 void address_space_init_once(struct address_space *mapping)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 25e12ce85075..498fae2ef9f6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1207,6 +1207,15 @@ xfs_ioctl_setattr_dax_invalidate(
 
 	/* lock, flush and invalidate mapping in preparation for flag change */
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+
+	/*
+	 * If there is a mapping in place we must remain in our current state.
+	 */
+	if (inode_has_mappings(inode)) {
+		error = -EBUSY;
+		goto out_unlock;
+	}
+
 	error = filemap_write_and_wait(inode->i_mapping);
 	if (error)
 		goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ad0f2368031b..971fb011d0f0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -438,6 +438,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
  * @nr_thps: Number of THPs in the pagecache (non-shmem only).
  * @i_mmap: Tree of private and shared mappings.
  * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
+ * @mmap_count: The number of times this AS has been mmap'ed
  * @nrpages: Number of page entries, protected by the i_pages lock.
  * @nrexceptional: Shadow or DAX entries, protected by the i_pages lock.
  * @writeback_index: Writeback starts here.
@@ -459,6 +460,7 @@ struct address_space {
 #endif
 	struct rb_root_cached	i_mmap;
 	struct rw_semaphore	i_mmap_rwsem;
+	atomic64_t              mmap_count;
 	unsigned long		nrpages;
 	unsigned long		nrexceptional;
 	pgoff_t			writeback_index;
@@ -1939,6 +1941,11 @@ static inline void inode_aops_up_write(struct inode *inode)
 #define inode_aops_up_write(inode) do { (void)(inode); } while (0)
 #endif /* CONFIG_FS_DAX */
 
+static inline bool inode_has_mappings(struct inode *inode)
+{
+	return (atomic64_read(&inode->i_mapping->mmap_count) != 0);
+}
+
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 				     struct iov_iter *iter)
 {
diff --git a/mm/mmap.c b/mm/mmap.c
index 7cc2562b99fd..6bb16a0996b5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -171,12 +171,17 @@ void unlink_file_vma(struct vm_area_struct *vma)
 static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *next = vma->vm_next;
+	struct file *f = vma->vm_file;
 
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
+	if (f) {
+		struct inode *inode = file_inode(f);
+		if (inode)
+			atomic64_dec(&inode->i_mapping->mmap_count);
+		fput(f);
+	}
 	mpol_put(vma_policy(vma));
 	vm_area_free(vma);
 	return next;
@@ -1830,6 +1835,16 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vma_set_page_prot(vma);
 
+	/*
+	 * Track if there is mapping in place such that a state change
+	 * does not occur on a file which is mapped
+	 */
+	if (file) {
+		struct inode		*inode = file_inode(file);
+
+		atomic64_inc(&inode->i_mapping->mmap_count);
+	}
+
 	return addr;
 
 unmap_and_free_vma:
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
                   ` (7 preceding siblings ...)
  2020-02-21  0:41 ` [PATCH V4 08/13] fs: Prevent DAX state change if file is mmap'ed ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-22  0:31   ` Darrick J. Wong
  2020-02-24  0:34   ` Dave Chinner
  2020-02-21  0:41 ` [PATCH V4 10/13] fs/xfs: Clean up locking in dax invalidate ira.weiny
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

XFS requires the use of the aops of an inode to quiesced prior to
changing it to/from the DAX aops vector.

Take the aops write lock while changing DAX state.

We define a new XFS_DAX_EXCL lock type to carry the lock through to
transaction completion.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from v3:
	Change locking function names to reflect changes in previous
	patches.

Changes from V2:
	Change name of patch (WAS: fs/xfs: Add lock/unlock state to xfs)
	Remove the xfs specific lock and move to the vfs layer.
		We still use XFS_LOCK_DAX_EXCL to be able to pass this
		flag through to the transaction code.  But we no longer
		have a lock specific to xfs.  This removes a lot of code
		from the XFS layer, preps us for using this in ext4, and
		is actually more straight forward now that all the
		locking requirements are better known.

	Fix locking order comment
	Rework for new 'state' names
	(Other comments on the previous patch are not applicable with
	new patch as much of the code was removed in favor of the vfs
	level lock)
---
 fs/xfs/xfs_inode.c | 22 ++++++++++++++++++++--
 fs/xfs/xfs_inode.h |  7 +++++--
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 35df324875db..5b014c428f0f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
  *
  * Basic locking order:
  *
- * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
+ * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
  *
  * mmap_sem locking order:
  *
  * i_rwsem -> page lock -> mmap_sem
- * mmap_sem -> i_mmap_lock -> page_lock
+ * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
  *
  * The difference in mmap_sem locking order mean that we cannot hold the
  * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
@@ -182,6 +182,9 @@ xfs_ilock(
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
 
+	if (lock_flags & XFS_DAX_EXCL)
+		inode_aops_down_write(VFS_I(ip));
+
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		down_write_nested(&VFS_I(ip)->i_rwsem,
 				  XFS_IOLOCK_DEP(lock_flags));
@@ -224,6 +227,8 @@ xfs_ilock_nowait(
 	 * You can't set both SHARED and EXCL for the same lock,
 	 * and only XFS_IOLOCK_SHARED, XFS_IOLOCK_EXCL, XFS_ILOCK_SHARED,
 	 * and XFS_ILOCK_EXCL are valid values to set in lock_flags.
+	 *
+	 * XFS_DAX_* is not allowed
 	 */
 	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
 	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
@@ -232,6 +237,7 @@ xfs_ilock_nowait(
 	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
 	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
 	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
+	ASSERT((lock_flags & XFS_DAX_EXCL) == 0);
 
 	if (lock_flags & XFS_IOLOCK_EXCL) {
 		if (!down_write_trylock(&VFS_I(ip)->i_rwsem))
@@ -318,6 +324,9 @@ xfs_iunlock(
 	else if (lock_flags & XFS_ILOCK_SHARED)
 		mrunlock_shared(&ip->i_lock);
 
+	if (lock_flags & XFS_DAX_EXCL)
+		inode_aops_up_write(VFS_I(ip));
+
 	trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
 }
 
@@ -333,6 +342,8 @@ xfs_ilock_demote(
 	ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL));
 	ASSERT((lock_flags &
 		~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
+	/* XFS_DAX_* is not allowed */
+	ASSERT((lock_flags & XFS_DAX_EXCL) == 0);
 
 	if (lock_flags & XFS_ILOCK_EXCL)
 		mrdemote(&ip->i_lock);
@@ -465,6 +476,9 @@ xfs_lock_inodes(
 	ASSERT(!(lock_mode & XFS_ILOCK_EXCL) ||
 		inodes <= XFS_ILOCK_MAX_SUBCLASS + 1);
 
+	/* XFS_DAX_* is not allowed */
+	ASSERT((lock_mode & XFS_DAX_EXCL) == 0);
+
 	if (lock_mode & XFS_IOLOCK_EXCL) {
 		ASSERT(!(lock_mode & (XFS_MMAPLOCK_EXCL | XFS_ILOCK_EXCL)));
 	} else if (lock_mode & XFS_MMAPLOCK_EXCL)
@@ -566,6 +580,10 @@ xfs_lock_two_inodes(
 	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
 	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
 
+	/* XFS_DAX_* is not allowed */
+	ASSERT((ip0_mode & XFS_DAX_EXCL) == 0);
+	ASSERT((ip1_mode & XFS_DAX_EXCL) == 0);
+
 	ASSERT(ip0->i_ino != ip1->i_ino);
 
 	if (ip0->i_ino > ip1->i_ino) {
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 492e53992fa9..25fe20740bf7 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -278,10 +278,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 #define	XFS_ILOCK_SHARED	(1<<3)
 #define	XFS_MMAPLOCK_EXCL	(1<<4)
 #define	XFS_MMAPLOCK_SHARED	(1<<5)
+#define	XFS_DAX_EXCL		(1<<6)
 
 #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
 				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
-				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
+				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
+				| XFS_DAX_EXCL)
 
 #define XFS_LOCK_FLAGS \
 	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
@@ -289,7 +291,8 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
 	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
 	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
 	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
-	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
+	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
+	{ XFS_DAX_EXCL,		"DAX_EXCL" }
 
 
 /*
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH V4 10/13] fs/xfs: Clean up locking in dax invalidate
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
                   ` (8 preceding siblings ...)
  2020-02-21  0:41 ` [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-21 17:45   ` Christoph Hellwig
  2020-02-21  0:41 ` [PATCH V4 11/13] fs/xfs: Allow toggle of effective DAX flag ira.weiny
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Define a variable to hold the lock flags to ensure that the correct
locks are returned or released on error.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 498fae2ef9f6..321f7789b667 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1190,7 +1190,7 @@ xfs_ioctl_setattr_dax_invalidate(
 	int			*join_flags)
 {
 	struct inode		*inode = VFS_I(ip);
-	int			error;
+	int			error, flags;
 
 	*join_flags = 0;
 
@@ -1205,8 +1205,10 @@ xfs_ioctl_setattr_dax_invalidate(
 	if (S_ISDIR(inode->i_mode))
 		return 0;
 
+	flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+
 	/* lock, flush and invalidate mapping in preparation for flag change */
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+	xfs_ilock(ip, flags);
 
 	/*
 	 * If there is a mapping in place we must remain in our current state.
@@ -1223,11 +1225,11 @@ xfs_ioctl_setattr_dax_invalidate(
 	if (error)
 		goto out_unlock;
 
-	*join_flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+	*join_flags = flags;
 	return 0;
 
 out_unlock:
-	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+	xfs_iunlock(ip, flags);
 	return error;
 
 }
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH V4 11/13] fs/xfs: Allow toggle of effective DAX flag
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
                   ` (9 preceding siblings ...)
  2020-02-21  0:41 ` [PATCH V4 10/13] fs/xfs: Clean up locking in dax invalidate ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-21  0:41 ` [PATCH V4 12/13] fs/xfs: Remove xfs_diflags_to_linux() ira.weiny
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Now that locking of the inode aops are in place we can allow a DAX state
change.

Also use the new xfs_inode_enable_dax() to set IS_DAX() as needed and
update the aops vector after the enabled state change.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V3:
	Remove static branch stuff.
	Fix bugs found by Jeff by using xfs_inode_enable_dax()
	Condition xfs_ioctl_setattr_dax_invalidate() on CONFIG_FS_DAX

Changes from V2:
	Add in lock_dax_state_static_key static branch enabling.
	Rebase updates
---
 fs/xfs/xfs_inode.h |  2 ++
 fs/xfs/xfs_ioctl.c | 24 ++++++++++++++++++++----
 fs/xfs/xfs_iops.c  | 17 ++++++++++++-----
 3 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 25fe20740bf7..12aa02461ee1 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -469,6 +469,8 @@ int	xfs_break_layouts(struct inode *inode, uint *iolock,
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
 extern void xfs_setup_iops(struct xfs_inode *ip);
+extern void xfs_setup_a_ops(struct xfs_inode *ip);
+extern bool xfs_inode_enable_dax(struct xfs_inode *ip);
 
 /*
  * When setting up a newly allocated inode, we need to call
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 321f7789b667..42a1639c974f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1123,12 +1123,11 @@ xfs_diflags_to_linux(
 		inode->i_flags |= S_NOATIME;
 	else
 		inode->i_flags &= ~S_NOATIME;
-#if 0	/* disabled until the flag switching races are sorted out */
-	if (xflags & FS_XFLAG_DAX)
+
+	if (xfs_inode_enable_dax(ip))
 		inode->i_flags |= S_DAX;
 	else
 		inode->i_flags &= ~S_DAX;
-#endif
 }
 
 static int
@@ -1183,6 +1182,7 @@ xfs_ioctl_setattr_xflags(
  * so that the cache invalidation is atomic with respect to the DAX flag
  * manipulation.
  */
+#if defined(CONFIG_FS_DAX)
 static int
 xfs_ioctl_setattr_dax_invalidate(
 	struct xfs_inode	*ip,
@@ -1205,7 +1205,7 @@ xfs_ioctl_setattr_dax_invalidate(
 	if (S_ISDIR(inode->i_mode))
 		return 0;
 
-	flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+	flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL | XFS_DAX_EXCL;
 
 	/* lock, flush and invalidate mapping in preparation for flag change */
 	xfs_ilock(ip, flags);
@@ -1233,6 +1233,16 @@ xfs_ioctl_setattr_dax_invalidate(
 	return error;
 
 }
+#else /* !CONFIG_FS_DAX */
+static int
+xfs_ioctl_setattr_dax_invalidate(
+	struct xfs_inode	*ip,
+	struct fsxattr		*fa,
+	int			*join_flags)
+{
+	return 0;
+}
+#endif
 
 /*
  * Set up the transaction structure for the setattr operation, checking that we
@@ -1520,6 +1530,9 @@ xfs_ioctl_setattr(
 	else
 		ip->i_d.di_cowextsize = 0;
 
+	if (join_flags & XFS_DAX_EXCL)
+		xfs_setup_a_ops(ip);
+
 	code = xfs_trans_commit(tp);
 
 	/*
@@ -1629,6 +1642,9 @@ xfs_ioc_setxflags(
 		goto out_drop_write;
 	}
 
+	if (join_flags & XFS_DAX_EXCL)
+		xfs_setup_a_ops(ip);
+
 	error = xfs_trans_commit(tp);
 out_drop_write:
 	mnt_drop_write_file(filp);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff711efc5247..8f023be39705 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1259,7 +1259,7 @@ xfs_inode_supports_dax(
 	return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
 }
 
-STATIC bool
+bool
 xfs_inode_enable_dax(
 	struct xfs_inode *ip)
 {
@@ -1355,6 +1355,16 @@ xfs_setup_inode(
 	}
 }
 
+void xfs_setup_a_ops(struct xfs_inode *ip)
+{
+	struct inode		*inode = &ip->i_vnode;
+
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &xfs_dax_aops;
+	else
+		inode->i_mapping->a_ops = &xfs_address_space_operations;
+}
+
 void
 xfs_setup_iops(
 	struct xfs_inode	*ip)
@@ -1365,10 +1375,7 @@ xfs_setup_iops(
 	case S_IFREG:
 		inode->i_op = &xfs_inode_operations;
 		inode->i_fop = &xfs_file_operations;
-		if (IS_DAX(inode))
-			inode->i_mapping->a_ops = &xfs_dax_aops;
-		else
-			inode->i_mapping->a_ops = &xfs_address_space_operations;
+		xfs_setup_a_ops(ip);
 		break;
 	case S_IFDIR:
 		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH V4 12/13] fs/xfs: Remove xfs_diflags_to_linux()
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
                   ` (10 preceding siblings ...)
  2020-02-21  0:41 ` [PATCH V4 11/13] fs/xfs: Allow toggle of effective DAX flag ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-21  0:41 ` [PATCH V4 13/13] Documentation/dax: Update Usage section ira.weiny
  2020-02-26 22:48 ` [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 Jeff Moyer
  13 siblings, 0 replies; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

The functionality in xfs_diflags_to_linux() is now identical to
xfs_diflags_to_iflags().

Remove xfs_diflags_to_linux() and call xfs_diflags_to_iflags() directly.

While we are here simplify xfs_diflags_to_iflags() to take just struct
xfs_inode.

And use xfs_ip2xflags() to ensure future diflags are included correctly.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_inode.h |  2 +-
 fs/xfs/xfs_ioctl.c | 32 +-------------------------------
 fs/xfs/xfs_iops.c  | 29 ++++++++++++++++++-----------
 3 files changed, 20 insertions(+), 43 deletions(-)

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 12aa02461ee1..2644f8239633 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -470,7 +470,7 @@ int	xfs_break_layouts(struct inode *inode, uint *iolock,
 extern void xfs_setup_inode(struct xfs_inode *ip);
 extern void xfs_setup_iops(struct xfs_inode *ip);
 extern void xfs_setup_a_ops(struct xfs_inode *ip);
-extern bool xfs_inode_enable_dax(struct xfs_inode *ip);
+extern void xfs_diflags_to_iflags(struct xfs_inode *ip);
 
 /*
  * When setting up a newly allocated inode, we need to call
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 42a1639c974f..3ac4ab94e21a 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1100,36 +1100,6 @@ xfs_flags2diflags2(
 	return di_flags2;
 }
 
-STATIC void
-xfs_diflags_to_linux(
-	struct xfs_inode	*ip)
-{
-	struct inode		*inode = VFS_I(ip);
-	unsigned int		xflags = xfs_ip2xflags(ip);
-
-	if (xflags & FS_XFLAG_IMMUTABLE)
-		inode->i_flags |= S_IMMUTABLE;
-	else
-		inode->i_flags &= ~S_IMMUTABLE;
-	if (xflags & FS_XFLAG_APPEND)
-		inode->i_flags |= S_APPEND;
-	else
-		inode->i_flags &= ~S_APPEND;
-	if (xflags & FS_XFLAG_SYNC)
-		inode->i_flags |= S_SYNC;
-	else
-		inode->i_flags &= ~S_SYNC;
-	if (xflags & FS_XFLAG_NOATIME)
-		inode->i_flags |= S_NOATIME;
-	else
-		inode->i_flags &= ~S_NOATIME;
-
-	if (xfs_inode_enable_dax(ip))
-		inode->i_flags |= S_DAX;
-	else
-		inode->i_flags &= ~S_DAX;
-}
-
 static int
 xfs_ioctl_setattr_xflags(
 	struct xfs_trans	*tp,
@@ -1167,7 +1137,7 @@ xfs_ioctl_setattr_xflags(
 	ip->i_d.di_flags = xfs_flags2diflags(ip, fa->fsx_xflags);
 	ip->i_d.di_flags2 = di_flags2;
 
-	xfs_diflags_to_linux(ip);
+	xfs_diflags_to_iflags(ip);
 	xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 	XFS_STATS_INC(mp, xs_ig_attrchg);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8f023be39705..c74a48f810f3 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1273,26 +1273,33 @@ xfs_inode_enable_dax(
 	return false;
 }
 
-STATIC void
+void
 xfs_diflags_to_iflags(
-	struct inode		*inode,
 	struct xfs_inode	*ip)
 {
-	uint16_t		flags = ip->i_d.di_flags;
-
-	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
-			    S_NOATIME | S_DAX);
+	struct inode		*inode = VFS_I(ip);
+	uint16_t		diflags = xfs_ip2xflags(ip);
 
-	if (flags & XFS_DIFLAG_IMMUTABLE)
+	if (diflags & FS_XFLAG_IMMUTABLE)
 		inode->i_flags |= S_IMMUTABLE;
-	if (flags & XFS_DIFLAG_APPEND)
+	else
+		inode->i_flags &= ~S_IMMUTABLE;
+	if (diflags & FS_XFLAG_APPEND)
 		inode->i_flags |= S_APPEND;
-	if (flags & XFS_DIFLAG_SYNC)
+	else
+		inode->i_flags &= ~S_APPEND;
+	if (diflags & FS_XFLAG_SYNC)
 		inode->i_flags |= S_SYNC;
-	if (flags & XFS_DIFLAG_NOATIME)
+	else
+		inode->i_flags &= ~S_SYNC;
+	if (diflags & FS_XFLAG_NOATIME)
 		inode->i_flags |= S_NOATIME;
+	else
+		inode->i_flags &= ~S_NOATIME;
 	if (xfs_inode_enable_dax(ip))
 		inode->i_flags |= S_DAX;
+	else
+		inode->i_flags &= ~S_DAX;
 }
 
 /*
@@ -1321,7 +1328,7 @@ xfs_setup_inode(
 	inode->i_gid    = xfs_gid_to_kgid(ip->i_d.di_gid);
 
 	i_size_write(inode, ip->i_d.di_size);
-	xfs_diflags_to_iflags(inode, ip);
+	xfs_diflags_to_iflags(ip);
 
 	if (S_ISDIR(inode->i_mode)) {
 		/*
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH V4 13/13] Documentation/dax: Update Usage section
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
                   ` (11 preceding siblings ...)
  2020-02-21  0:41 ` [PATCH V4 12/13] fs/xfs: Remove xfs_diflags_to_linux() ira.weiny
@ 2020-02-21  0:41 ` ira.weiny
  2020-02-26 22:48 ` [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 Jeff Moyer
  13 siblings, 0 replies; 54+ messages in thread
From: ira.weiny @ 2020-02-21  0:41 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Update the Usage section to reflect the new individual dax selection
functionality.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 Documentation/filesystems/dax.txt | 84 ++++++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 679729442fd2..32e37c550f76 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -20,8 +20,88 @@ Usage
 If you have a block device which supports DAX, you can make a filesystem
 on it as usual.  The DAX code currently only supports files with a block
 size equal to your kernel's PAGE_SIZE, so you may need to specify a block
-size when creating the filesystem.  When mounting it, use the "-o dax"
-option on the command line or add 'dax' to the options in /etc/fstab.
+size when creating the filesystem.
+
+Enabling DAX on an individual file basis (XFS)
+----------------------------------------------
+
+There are 2 per file dax flags.  One is a physical configuration setting and
+the other a currently enabled state.
+
+The physical configuration setting is maintained on individual file and
+directory inodes.  It is preserved within the file system.  This 'physical'
+config setting can be set using an ioctl and/or an application such as "xfs_io
+-c 'chattr [-+]x'".  Files and directories automatically inherit their physical
+dax setting from their parent directory when created.  Therefore, setting the
+physical dax setting at directory creation time can be used to set a default
+behavior for that sub-tree.  Doing so on the root directory acts to set a
+default for the entire file system.
+
+To clarify inheritance here are 3 examples:
+
+Example A:
+
+mkdir -p a/b/c
+xfs_io 'chattr +x' a
+mkdir a/b/c/d
+mkdir a/e
+
+	dax: a,e
+	no dax: b,c,d
+
+Example B:
+
+mkdir a
+xfs_io 'chattr +x' a
+mkdir -p a/b/c/d
+
+	dax: a,b,c,d
+	no dax:
+
+Example C:
+
+mkdir -p a/b/c
+xfs_io 'chattr +x' c
+mkdir a/b/c/d
+
+	dax: c,d
+	no dax: a,b
+
+
+The current inode enabled state is set when a file inode is loaded and it is
+determined that the underlying media supports dax.
+
+statx can be used to query the file's current enabled state.  NOTE that a
+directory will never be operating in a dax state.  Therefore, the dax config
+state must be queried to see what config state a file or sub-directory will
+inherit from a directory.
+
+NOTE: Setting a file or directory's config state with xfs_io is possible even
+if the underlying media does not support dax.
+
+
+Enabling dax on a file system wide basis ('-o dax' mount option)
+----------------------------------------------------------------
+
+The physical dax configuration of all files can be overridden using a mount
+option.  In summary:
+
+	(physical flag || mount option) && capable device == dax in effect
+	(  <xfs_io>    ||  <'-o dax'> ) && capable device == <statx dax true>
+
+To enable the mount override, use "-o dax" on the command line or add
+'dax' to the options in /etc/fstab
+
+Using the mount option does not change the physical configured state of
+individual files.  Therefore, remounting _without_ the mount option will allow
+the file system to set file's enabled state directly based on their config
+setting.
+
+NOTE: Setting a file or directory's physical config state is possible while the
+file system is mounted with the dax override.  However, the file's enabled
+state will continue to be overridden and "dax enabled" until the mount option
+is removed and a remount performed.  At that point the file's physical config
+state dictates the enabled state.
 
 
 Implementation Tips for Block Driver Writers
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 01/13] fs/xfs: Remove unnecessary initialization of i_rwsem
  2020-02-21  0:41 ` [PATCH V4 01/13] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
@ 2020-02-21  1:26   ` Dave Chinner
  2020-02-27 17:52     ` Ira Weiny
  0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2020-02-21  1:26 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Thu, Feb 20, 2020 at 04:41:22PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> xfs_reinit_inode() -> inode_init_always() already handles calling
> init_rwsem(i_rwsem).  Doing so again is unneeded.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Except that this inode has been destroyed and freed by the VFS, and
we are now recycling it back into the VFS before we actually
physically freed it.

Hence we have re-initialise the semaphore because the semaphore can
contain internal state that is specific to it's new life cycle (e.g.
the lockdep context) that will cause problems if we just assume that
the inode is the same inode as it was before we recycled it back
into the VFS caches.

So, yes, we actually do need to re-initialise the rwsem here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 02/13] fs/xfs: Clarify lockdep dependency for xfs_isilocked()
  2020-02-21  0:41 ` [PATCH V4 02/13] fs/xfs: Clarify lockdep dependency for xfs_isilocked() ira.weiny
@ 2020-02-21  1:34   ` Dave Chinner
  2020-02-21 23:00     ` Ira Weiny
  0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2020-02-21  1:34 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Thu, Feb 20, 2020 at 04:41:23PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> xfs_isilocked() can't work fully without CONFIG_LOCKDEP.  However,
> making xfs_isilocked() dependant on CONFIG_LOCKDEP is not feasible
> because it is used for more than the i_rwsem.  Therefore a short-circuit
> was provided via debug_locks.  However, this caused confusion while
> working through the xfs locking.
> 
> Rather than use debug_locks as a flag specify this clearly using
> IS_ENABLED(CONFIG_LOCKDEP).
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from V3:
> 	Reordered to be a "pre-cleanup" patch
> 
> Changes from V2:
> 	This patch is new for V3
> ---
>  fs/xfs/xfs_inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c5077e6326c7..35df324875db 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -364,7 +364,7 @@ xfs_isilocked(
>  
>  	if (lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) {
>  		if (!(lock_flags & XFS_IOLOCK_SHARED))
> -			return !debug_locks ||
> +			return !IS_ENABLED(CONFIG_LOCKDEP) ||
>  				lockdep_is_held_type(&VFS_I(ip)->i_rwsem, 0);

This breaks expected lockdep behaviour.

We need to use debug_locks here because lockdep turns off lock
checking via debug_locks when lockdep encounters a locking
inconsistency.  We only want to know about the first locking
problem, not spew cascading lock problems over and over once we
already know there is a locking problem.

IOWs, checking debug_locks is required here for the same reason it
is used in lockdep_assert_held_{read/write}(). essentially we are
open coding lockdep_assert_held_write() here because this function
is only called from within ASSERT() statements and we don't want
multiple WARN/BUGs being issued when this triggers....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 03/13] fs: Remove unneeded IS_DAX() check
  2020-02-21  0:41 ` [PATCH V4 03/13] fs: Remove unneeded IS_DAX() check ira.weiny
@ 2020-02-21  1:42   ` Dave Chinner
  2020-02-21 23:04     ` Ira Weiny
  2020-02-21 17:42   ` Christoph Hellwig
  1 sibling, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2020-02-21  1:42 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Jan Kara, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Thu, Feb 20, 2020 at 04:41:24PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Remove the check because DAX now has it's own read/write methods and
> file systems which support DAX check IS_DAX() prior to IOCB_DIRECT on
> their own.  Therefore, it does not matter if the file state is DAX when
> the iocb flags are created.
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Yup, looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 03/13] fs: Remove unneeded IS_DAX() check
  2020-02-21  0:41 ` [PATCH V4 03/13] fs: Remove unneeded IS_DAX() check ira.weiny
  2020-02-21  1:42   ` Dave Chinner
@ 2020-02-21 17:42   ` Christoph Hellwig
  1 sibling, 0 replies; 54+ messages in thread
From: Christoph Hellwig @ 2020-02-21 17:42 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Jan Kara, Alexander Viro, Darrick J. Wong,
	Dan Williams, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

On Thu, Feb 20, 2020 at 04:41:24PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Remove the check because DAX now has it's own read/write methods and
> file systems which support DAX check IS_DAX() prior to IOCB_DIRECT on
> their own.  Therefore, it does not matter if the file state is DAX when
> the iocb flags are created.
> 
> Reviewed-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from v3:
> 	Reword commit message.
> 	Reordered to be a 'pre-cleanup' patch
> ---
>  include/linux/fs.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3cd4fe6b845e..63d1e533a07d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3388,7 +3388,7 @@ extern int file_update_time(struct file *file);
>  
>  static inline bool io_is_direct(struct file *filp)
>  {
> -	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
> +	return (filp->f_flags & O_DIRECT);

Please just kill io_is_direct entirely.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-21  0:41 ` [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state ira.weiny
@ 2020-02-21 17:44   ` Christoph Hellwig
  2020-02-21 22:44     ` Dave Chinner
  2020-02-22  0:33   ` Darrick J. Wong
  1 sibling, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2020-02-21 17:44 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> DAX requires special address space operations (aops).  Changing DAX
> state therefore requires changing those aops.
> 
> However, many functions require aops to remain consistent through a deep
> call stack.
> 
> Define a vfs level inode rwsem to protect aops throughout call stacks
> which require them.
> 
> Finally, define calls to be used in subsequent patches when aops usage
> needs to be quiesced by the file system.

I am very much against this.  There is a reason why we don't support
changes of ops vectors at runtime anywhere else, because it is
horribly complicated and impossible to get right.  IFF we ever want
to change the DAX vs non-DAX mode (which I'm still not sold on) the
right way is to just add a few simple conditionals and merge the
aops, which is much easier to reason about, and less costly in overall
overhead.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 10/13] fs/xfs: Clean up locking in dax invalidate
  2020-02-21  0:41 ` [PATCH V4 10/13] fs/xfs: Clean up locking in dax invalidate ira.weiny
@ 2020-02-21 17:45   ` Christoph Hellwig
  2020-02-21 18:06     ` Ira Weiny
  0 siblings, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2020-02-21 17:45 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Thu, Feb 20, 2020 at 04:41:31PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Define a variable to hold the lock flags to ensure that the correct
> locks are returned or released on error.

I don't see how this cleans up anything..

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 10/13] fs/xfs: Clean up locking in dax invalidate
  2020-02-21 17:45   ` Christoph Hellwig
@ 2020-02-21 18:06     ` Ira Weiny
  0 siblings, 0 replies; 54+ messages in thread
From: Ira Weiny @ 2020-02-21 18:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Fri, Feb 21, 2020 at 06:45:22PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 20, 2020 at 04:41:31PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Define a variable to hold the lock flags to ensure that the correct
> > locks are returned or released on error.
> 
> I don't see how this cleans up anything..

It ensures the correct flags are released in the error path without having to
add the flag in the next patch in 2 places.

Ira


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-21 17:44   ` Christoph Hellwig
@ 2020-02-21 22:44     ` Dave Chinner
  2020-02-21 23:26       ` Dan Williams
  2020-02-24 17:56       ` Christoph Hellwig
  0 siblings, 2 replies; 54+ messages in thread
From: Dave Chinner @ 2020-02-21 22:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: ira.weiny, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Fri, Feb 21, 2020 at 06:44:49PM +0100, Christoph Hellwig wrote:
> On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > DAX requires special address space operations (aops).  Changing DAX
> > state therefore requires changing those aops.
> > 
> > However, many functions require aops to remain consistent through a deep
> > call stack.
> > 
> > Define a vfs level inode rwsem to protect aops throughout call stacks
> > which require them.
> > 
> > Finally, define calls to be used in subsequent patches when aops usage
> > needs to be quiesced by the file system.
> 
> I am very much against this.  There is a reason why we don't support
> changes of ops vectors at runtime anywhere else, because it is
> horribly complicated and impossible to get right.  IFF we ever want
> to change the DAX vs non-DAX mode (which I'm still not sold on) the
> right way is to just add a few simple conditionals and merge the
> aops, which is much easier to reason about, and less costly in overall
> overhead.

*cough*

That's exactly what the original code did. And it was broken
because page faults call multiple aops that are dependent on the
result of the previous aop calls setting up the state correctly for
the latter calls. And when S_DAX changes between those calls, shit
breaks.

It's exactly the same problem as switching aops between two
dependent aops method calls - we don't solve anything by merging
aops and checking IS_DAX in each method because the race condition
is still there.

/me throws his hands in the air and walks away

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 02/13] fs/xfs: Clarify lockdep dependency for xfs_isilocked()
  2020-02-21  1:34   ` Dave Chinner
@ 2020-02-21 23:00     ` Ira Weiny
  0 siblings, 0 replies; 54+ messages in thread
From: Ira Weiny @ 2020-02-21 23:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Fri, Feb 21, 2020 at 12:34:51PM +1100, Dave Chinner wrote:
> On Thu, Feb 20, 2020 at 04:41:23PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > xfs_isilocked() can't work fully without CONFIG_LOCKDEP.  However,
> > making xfs_isilocked() dependant on CONFIG_LOCKDEP is not feasible
> > because it is used for more than the i_rwsem.  Therefore a short-circuit
> > was provided via debug_locks.  However, this caused confusion while
> > working through the xfs locking.
> > 
> > Rather than use debug_locks as a flag specify this clearly using
> > IS_ENABLED(CONFIG_LOCKDEP).
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes from V3:
> > 	Reordered to be a "pre-cleanup" patch
> > 
> > Changes from V2:
> > 	This patch is new for V3
> > ---
> >  fs/xfs/xfs_inode.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index c5077e6326c7..35df324875db 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -364,7 +364,7 @@ xfs_isilocked(
> >  
> >  	if (lock_flags & (XFS_IOLOCK_EXCL|XFS_IOLOCK_SHARED)) {
> >  		if (!(lock_flags & XFS_IOLOCK_SHARED))
> > -			return !debug_locks ||
> > +			return !IS_ENABLED(CONFIG_LOCKDEP) ||
> >  				lockdep_is_held_type(&VFS_I(ip)->i_rwsem, 0);
> 
> This breaks expected lockdep behaviour.
> 
> We need to use debug_locks here because lockdep turns off lock
> checking via debug_locks when lockdep encounters a locking
> inconsistency.  We only want to know about the first locking
> problem, not spew cascading lock problems over and over once we
> already know there is a locking problem.

Ah...  ok...  Seems like that should be part of the lockdep interface...

I'll drop the patch for now,
Ira

> 
> IOWs, checking debug_locks is required here for the same reason it
> is used in lockdep_assert_held_{read/write}(). essentially we are
> open coding lockdep_assert_held_write() here because this function
> is only called from within ASSERT() statements and we don't want
> multiple WARN/BUGs being issued when this triggers....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 03/13] fs: Remove unneeded IS_DAX() check
  2020-02-21  1:42   ` Dave Chinner
@ 2020-02-21 23:04     ` Ira Weiny
  0 siblings, 0 replies; 54+ messages in thread
From: Ira Weiny @ 2020-02-21 23:04 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Jan Kara, Alexander Viro, Darrick J. Wong,
	Dan Williams, Christoph Hellwig, Theodore Y. Ts'o,
	linux-ext4, linux-xfs, linux-fsdevel

On Fri, Feb 21, 2020 at 12:42:25PM +1100, Dave Chinner wrote:
> On Thu, Feb 20, 2020 at 04:41:24PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Remove the check because DAX now has it's own read/write methods and
> > file systems which support DAX check IS_DAX() prior to IOCB_DIRECT on
> > their own.  Therefore, it does not matter if the file state is DAX when
> > the iocb flags are created.
> > 
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Yup, looks good.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>

Thanks,
Ira

> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-21 22:44     ` Dave Chinner
@ 2020-02-21 23:26       ` Dan Williams
  2020-02-24 17:56       ` Christoph Hellwig
  1 sibling, 0 replies; 54+ messages in thread
From: Dan Williams @ 2020-02-21 23:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Weiny, Ira, Linux Kernel Mailing List,
	Alexander Viro, Darrick J. Wong, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Fri, Feb 21, 2020 at 2:44 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Feb 21, 2020 at 06:44:49PM +0100, Christoph Hellwig wrote:
> > On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > >
> > > DAX requires special address space operations (aops).  Changing DAX
> > > state therefore requires changing those aops.
> > >
> > > However, many functions require aops to remain consistent through a deep
> > > call stack.
> > >
> > > Define a vfs level inode rwsem to protect aops throughout call stacks
> > > which require them.
> > >
> > > Finally, define calls to be used in subsequent patches when aops usage
> > > needs to be quiesced by the file system.
> >
> > I am very much against this.  There is a reason why we don't support
> > changes of ops vectors at runtime anywhere else, because it is
> > horribly complicated and impossible to get right.  IFF we ever want
> > to change the DAX vs non-DAX mode (which I'm still not sold on) the
> > right way is to just add a few simple conditionals and merge the
> > aops, which is much easier to reason about, and less costly in overall
> > overhead.
>
> *cough*
>
> That's exactly what the original code did. And it was broken
> because page faults call multiple aops that are dependent on the
> result of the previous aop calls setting up the state correctly for
> the latter calls. And when S_DAX changes between those calls, shit
> breaks.
>
> It's exactly the same problem as switching aops between two
> dependent aops method calls - we don't solve anything by merging
> aops and checking IS_DAX in each method because the race condition
> is still there.
>
> /me throws his hands in the air and walks away

Please come back, because I think it's also clear that the "we don't
support changes of ops vectors at runtime" assertion is already being
violated by ext4 [1]. So that leaves "IFF we ever want to change the
dax vs non-dax mode" which I thought was already consensus given the
lingering hopes about having some future where the kernel is able to
dynamically optimize for dax vs non-dax based on memory media
performance characteristics. I thought the only thing missing from the
conclusion of the last conversation [2] was the "physical" vs
"effective" split that we identified at LSF'19 [3]. Christoph, that
split allows for for your concern about application intent to be
considered / overridden by kernel policy, and it allows for
communication of the effective state which applications need for
resource planning [4] independent of MAP_SYNC and other dax semantics.

The status quo of globally enabling dax for all files is empirically
the wrong choice for page-cache friendly workloads running on
slower-than-DRAM pmem media.

I am struggling to see how we address the above items without first
having a dynamic / less than global-filesystem scope facility to
control dax.

[1]: https://lore.kernel.org/linux-fsdevel/20191108131238.GK20863@quack2.suse.cz
[2]: https://lore.kernel.org/linux-fsdevel/20170927064001.GA27601@infradead.org
[3]: https://lwn.net/Articles/787973/
[4]: https://lwn.net/Articles/787233/

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 06/13] fs/xfs: Create function xfs_inode_enable_dax()
  2020-02-21  0:41 ` [PATCH V4 06/13] fs/xfs: Create function xfs_inode_enable_dax() ira.weiny
@ 2020-02-22  0:28   ` Darrick J. Wong
  2020-02-23 15:07     ` Ira Weiny
  0 siblings, 1 reply; 54+ messages in thread
From: Darrick J. Wong @ 2020-02-22  0:28 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Thu, Feb 20, 2020 at 04:41:27PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> xfs_inode_supports_dax() should reflect if the inode can support DAX not
> that it is enabled for DAX.
> 
> Change the use of xfs_inode_supports_dax() to reflect only if the inode
> and underlying storage support dax.
> 
> Add a new function xfs_inode_enable_dax() which reflects if the inode

Heavily into bikeshedding here, but "enable" sounds like a verb, but
this function doesn't actually turn dax on for a file, it merely decides
if we /should/ turn it on.

xfs_inode_wants_dax() ?

--D

> should be enabled for DAX.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from v3:
> 	Update functions and names to be more clear
> 	Update commit message
> 	Merge with
> 		'fs/xfs: Clean up DAX support check'
> 		don't allow IS_DAX() on a directory
> 		use STATIC macro for static
> 		make xfs_inode_supports_dax() static
> ---
>  fs/xfs/xfs_iops.c | 25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 81f2f93caec0..ff711efc5247 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1237,19 +1237,18 @@ static const struct inode_operations xfs_inline_symlink_inode_operations = {
>  };
>  
>  /* Figure out if this file actually supports DAX. */
> -static bool
> +STATIC bool
>  xfs_inode_supports_dax(
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  
>  	/* Only supported on non-reflinked files. */
> -	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
> +	if (xfs_is_reflink_inode(ip))
>  		return false;
>  
> -	/* DAX mount option or DAX iflag must be set. */
> -	if (!(mp->m_flags & XFS_MOUNT_DAX) &&
> -	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> +	/* Only supported on regular files. */
> +	if (!S_ISREG(VFS_I(ip)->i_mode))
>  		return false;
>  
>  	/* Block size must match page size */
> @@ -1260,6 +1259,20 @@ xfs_inode_supports_dax(
>  	return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
>  }
>  
> +STATIC bool
> +xfs_inode_enable_dax(
> +	struct xfs_inode *ip)
> +{
> +	if (!xfs_inode_supports_dax(ip))
> +		return false;
> +
> +	if (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX)
> +		return true;
> +	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
> +		return true;
> +	return false;
> +}
> +
>  STATIC void
>  xfs_diflags_to_iflags(
>  	struct inode		*inode,
> @@ -1278,7 +1291,7 @@ xfs_diflags_to_iflags(
>  		inode->i_flags |= S_SYNC;
>  	if (flags & XFS_DIFLAG_NOATIME)
>  		inode->i_flags |= S_NOATIME;
> -	if (xfs_inode_supports_dax(ip))
> +	if (xfs_inode_enable_dax(ip))
>  		inode->i_flags |= S_DAX;
>  }
>  
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer
  2020-02-21  0:41 ` [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer ira.weiny
@ 2020-02-22  0:31   ` Darrick J. Wong
  2020-02-23 15:04     ` Ira Weiny
  2020-02-24  0:34   ` Dave Chinner
  1 sibling, 1 reply; 54+ messages in thread
From: Darrick J. Wong @ 2020-02-22  0:31 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Thu, Feb 20, 2020 at 04:41:30PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> XFS requires the use of the aops of an inode to quiesced prior to
> changing it to/from the DAX aops vector.
> 
> Take the aops write lock while changing DAX state.
> 
> We define a new XFS_DAX_EXCL lock type to carry the lock through to
> transaction completion.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from v3:
> 	Change locking function names to reflect changes in previous
> 	patches.
> 
> Changes from V2:
> 	Change name of patch (WAS: fs/xfs: Add lock/unlock state to xfs)
> 	Remove the xfs specific lock and move to the vfs layer.
> 		We still use XFS_LOCK_DAX_EXCL to be able to pass this
> 		flag through to the transaction code.  But we no longer
> 		have a lock specific to xfs.  This removes a lot of code
> 		from the XFS layer, preps us for using this in ext4, and
> 		is actually more straight forward now that all the
> 		locking requirements are better known.
> 
> 	Fix locking order comment
> 	Rework for new 'state' names
> 	(Other comments on the previous patch are not applicable with
> 	new patch as much of the code was removed in favor of the vfs
> 	level lock)
> ---
>  fs/xfs/xfs_inode.c | 22 ++++++++++++++++++++--
>  fs/xfs/xfs_inode.h |  7 +++++--
>  2 files changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 35df324875db..5b014c428f0f 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
>   *
>   * Basic locking order:
>   *
> - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock

"dax_sem"?  I thought this was now called i_aops_sem?

>   *
>   * mmap_sem locking order:
>   *
>   * i_rwsem -> page lock -> mmap_sem
> - * mmap_sem -> i_mmap_lock -> page_lock
> + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
>   *
>   * The difference in mmap_sem locking order mean that we cannot hold the
>   * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> @@ -182,6 +182,9 @@ xfs_ilock(
>  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
>  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
>  
> +	if (lock_flags & XFS_DAX_EXCL)

And similarly, I think this should be XFS_OPSLOCK_EXCL...

--D

> +		inode_aops_down_write(VFS_I(ip));
> +
>  	if (lock_flags & XFS_IOLOCK_EXCL) {
>  		down_write_nested(&VFS_I(ip)->i_rwsem,
>  				  XFS_IOLOCK_DEP(lock_flags));
> @@ -224,6 +227,8 @@ xfs_ilock_nowait(
>  	 * You can't set both SHARED and EXCL for the same lock,
>  	 * and only XFS_IOLOCK_SHARED, XFS_IOLOCK_EXCL, XFS_ILOCK_SHARED,
>  	 * and XFS_ILOCK_EXCL are valid values to set in lock_flags.
> +	 *
> +	 * XFS_DAX_* is not allowed
>  	 */
>  	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
>  	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
> @@ -232,6 +237,7 @@ xfs_ilock_nowait(
>  	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
>  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
>  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> +	ASSERT((lock_flags & XFS_DAX_EXCL) == 0);
>  
>  	if (lock_flags & XFS_IOLOCK_EXCL) {
>  		if (!down_write_trylock(&VFS_I(ip)->i_rwsem))
> @@ -318,6 +324,9 @@ xfs_iunlock(
>  	else if (lock_flags & XFS_ILOCK_SHARED)
>  		mrunlock_shared(&ip->i_lock);
>  
> +	if (lock_flags & XFS_DAX_EXCL)
> +		inode_aops_up_write(VFS_I(ip));
> +
>  	trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
>  }
>  
> @@ -333,6 +342,8 @@ xfs_ilock_demote(
>  	ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL));
>  	ASSERT((lock_flags &
>  		~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
> +	/* XFS_DAX_* is not allowed */
> +	ASSERT((lock_flags & XFS_DAX_EXCL) == 0);
>  
>  	if (lock_flags & XFS_ILOCK_EXCL)
>  		mrdemote(&ip->i_lock);
> @@ -465,6 +476,9 @@ xfs_lock_inodes(
>  	ASSERT(!(lock_mode & XFS_ILOCK_EXCL) ||
>  		inodes <= XFS_ILOCK_MAX_SUBCLASS + 1);
>  
> +	/* XFS_DAX_* is not allowed */
> +	ASSERT((lock_mode & XFS_DAX_EXCL) == 0);
> +
>  	if (lock_mode & XFS_IOLOCK_EXCL) {
>  		ASSERT(!(lock_mode & (XFS_MMAPLOCK_EXCL | XFS_ILOCK_EXCL)));
>  	} else if (lock_mode & XFS_MMAPLOCK_EXCL)
> @@ -566,6 +580,10 @@ xfs_lock_two_inodes(
>  	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
>  	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
>  
> +	/* XFS_DAX_* is not allowed */
> +	ASSERT((ip0_mode & XFS_DAX_EXCL) == 0);
> +	ASSERT((ip1_mode & XFS_DAX_EXCL) == 0);
> +
>  	ASSERT(ip0->i_ino != ip1->i_ino);
>  
>  	if (ip0->i_ino > ip1->i_ino) {
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 492e53992fa9..25fe20740bf7 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -278,10 +278,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
>  #define	XFS_ILOCK_SHARED	(1<<3)
>  #define	XFS_MMAPLOCK_EXCL	(1<<4)
>  #define	XFS_MMAPLOCK_SHARED	(1<<5)
> +#define	XFS_DAX_EXCL		(1<<6)
>  
>  #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
>  				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
> -				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
> +				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
> +				| XFS_DAX_EXCL)
>  
>  #define XFS_LOCK_FLAGS \
>  	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
> @@ -289,7 +291,8 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
>  	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
>  	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
>  	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
> -	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
> +	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
> +	{ XFS_DAX_EXCL,		"DAX_EXCL" }
>  
>  
>  /*
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-21  0:41 ` [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state ira.weiny
  2020-02-21 17:44   ` Christoph Hellwig
@ 2020-02-22  0:33   ` Darrick J. Wong
  2020-02-23 15:03     ` Ira Weiny
  1 sibling, 1 reply; 54+ messages in thread
From: Darrick J. Wong @ 2020-02-22  0:33 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> DAX requires special address space operations (aops).  Changing DAX
> state therefore requires changing those aops.
> 
> However, many functions require aops to remain consistent through a deep
> call stack.
> 
> Define a vfs level inode rwsem to protect aops throughout call stacks
> which require them.
> 
> Finally, define calls to be used in subsequent patches when aops usage
> needs to be quiesced by the file system.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from V3:
> 	Convert from global to a per-inode rwsem
> 		Remove pre-optimization
> 	Remove static branch stuff
> 	Change function names from inode_dax_state_* to inode_aops_*
> 		I kept 'inode' as the synchronization is at the inode
> 		level now (probably where it belongs)...
> 
> 		and I still prefer *_[down|up]_[read|write] as those
> 		names better reflect the use and interaction between
> 		users (readers) and writers better.  'XX_start_aop()'
> 		would have to be matched with something like
> 		'XX_wait_for_aop_user()' and 'XX_release_aop_users()' or
> 		something which does not make sense on the 'writer'
> 		side.
> 
> Changes from V2
> 
> 	Rebase on linux-next-08-02-2020
> 
> 	Fix locking order
> 	Change all references from mode to state where appropriate
> 	add CONFIG_FS_DAX requirement for state change
> 	Use a static branch to enable locking only when a dax capable
> 		device has been seen.
> 
> 	Move the lock to a global vfs lock
> 
> 		this does a few things
> 			1) preps us better for ext4 support
> 			2) removes funky callbacks from inode ops
> 			3) remove complexity from XFS and probably from
> 			   ext4 later
> 
> 		We can do this because
> 			1) the locking order is required to be at the
> 			   highest level anyway, so why complicate xfs
> 			2) We had to move the sem to the super_block
> 			   because it is too heavy for the inode.
> 			3) After internal discussions with Dan we
> 			   decided that this would be easier, just as
> 			   performant, and with slightly less overhead
> 			   than in the VFS SB.
> 
> 		We also change the functions names to up/down;
> 		read/write as appropriate.  Previous names were over
> 		simplified.
> 
> 	Update comments and documentation
> 
> squash: add locking
> ---
>  Documentation/filesystems/vfs.rst | 16 ++++++++
>  fs/attr.c                         |  1 +
>  fs/inode.c                        | 15 +++++--
>  fs/iomap/buffered-io.c            |  1 +
>  fs/open.c                         |  4 ++
>  fs/stat.c                         |  2 +
>  fs/xfs/xfs_icache.c               |  1 +
>  include/linux/fs.h                | 66 ++++++++++++++++++++++++++++++-
>  mm/fadvise.c                      |  7 +++-
>  mm/filemap.c                      |  4 ++
>  mm/huge_memory.c                  |  1 +
>  mm/khugepaged.c                   |  2 +
>  mm/util.c                         |  9 ++++-
>  13 files changed, 121 insertions(+), 8 deletions(-)
> 
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 7d4d09dd5e6d..4a10a232f8e2 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -934,6 +934,22 @@ cache in your filesystem.  The following members are defined:
>  	Called during swapoff on files where swap_activate was
>  	successful.
>  
> +Changing DAX 'state' dynamically
> +----------------------------------
> +
> +Some file systems which support DAX want to be able to change the DAX state
> +dyanically.  To switch the state safely we lock the inode state in all "normal"
> +file system operations and restrict state changes to those operations.  The
> +specific rules are.
> +
> +        1) the direct_IO address_space_operation must be supported in all
> +           potential a_ops vectors for any state suported by the inode.
> +
> +        3) DAX state changes shall not be allowed while the file is mmap'ed
> +        4) For non-mmaped operations the VFS layer must take the read lock for any
> +           use of IS_DAX()
> +        5) Filesystems take the write lock when changing DAX states.
> +
>  
>  The File Object
>  ===============
> diff --git a/fs/attr.c b/fs/attr.c
> index b4bbdbd4c8ca..9b15f73d1079 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -332,6 +332,7 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
>  	if (error)
>  		return error;
>  
> +	/* DAX read state should already be held here */
>  	if (inode->i_op->setattr)
>  		error = inode->i_op->setattr(dentry, attr);
>  	else
> diff --git a/fs/inode.c b/fs/inode.c
> index 7d57068b6b7a..6e4f1cc872f2 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -200,6 +200,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>  #endif
>  	inode->i_flctx = NULL;
>  	this_cpu_inc(nr_inodes);
> +	init_rwsem(&inode->i_aops_sem);
>  
>  	return 0;
>  out:
> @@ -1616,11 +1617,19 @@ EXPORT_SYMBOL(iput);
>   */
>  int bmap(struct inode *inode, sector_t *block)
>  {
> -	if (!inode->i_mapping->a_ops->bmap)
> -		return -EINVAL;
> +	int ret = 0;
> +
> +	inode_aops_down_read(inode);
> +	if (!inode->i_mapping->a_ops->bmap) {
> +		ret = -EINVAL;
> +		goto err;
> +	}
>  
>  	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> -	return 0;
> +
> +err:
> +	inode_aops_up_read(inode);
> +	return ret;
>  }
>  EXPORT_SYMBOL(bmap);
>  #endif
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 7c84c4c027c4..e313a34d5fa6 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -999,6 +999,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
>  		offset = offset_in_page(pos);
>  		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
>  
> +		/* DAX state read should already be held here */
>  		if (IS_DAX(inode))
>  			status = iomap_dax_zero(pos, offset, bytes, iomap);
>  		else
> diff --git a/fs/open.c b/fs/open.c
> index 0788b3715731..3abf0bfac462 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
>  	if (ret)
>  		newattrs.ia_valid |= ret | ATTR_FORCE;
>  
> +	inode_aops_down_read(dentry->d_inode);
>  	inode_lock(dentry->d_inode);
>  	/* Note any delegations or leases have already been broken: */
>  	ret = notify_change(dentry, &newattrs, NULL);
>  	inode_unlock(dentry->d_inode);
> +	inode_aops_up_read(dentry->d_inode);
>  	return ret;
>  }
>  
> @@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -EOPNOTSUPP;
>  
>  	file_start_write(file);
> +	inode_aops_down_read(inode);
>  	ret = file->f_op->fallocate(file, mode, offset, len);
> +	inode_aops_up_read(inode);
>  
>  	/*
>  	 * Create inotify and fanotify events.
> diff --git a/fs/stat.c b/fs/stat.c
> index 894699c74dde..274b3ccc82b1 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -79,8 +79,10 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>  	if (IS_AUTOMOUNT(inode))
>  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
>  
> +	inode_aops_down_read(inode);
>  	if (IS_DAX(inode))
>  		stat->attributes |= STATX_ATTR_DAX;
> +	inode_aops_up_read(inode);
>  
>  	if (inode->i_op->getattr)
>  		return inode->i_op->getattr(path, stat, request_mask,
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 836a1f09be03..3e83a97dc047 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -420,6 +420,7 @@ xfs_iget_cache_hit(
>  		rcu_read_unlock();
>  
>  		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> +		ASSERT(!rwsem_is_locked(&inode->i_aops_sem));
>  		error = xfs_reinit_inode(mp, inode);
>  		if (error) {
>  			bool wake;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 63d1e533a07d..ad0f2368031b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -40,6 +40,7 @@
>  #include <linux/fs_types.h>
>  #include <linux/build_bug.h>
>  #include <linux/stddef.h>
> +#include <linux/percpu-rwsem.h>
>  
>  #include <asm/byteorder.h>
>  #include <uapi/linux/fs.h>
> @@ -359,6 +360,11 @@ typedef struct {
>  typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
>  		unsigned long, unsigned long);
>  
> +/**
> + * NOTE: DO NOT define new functions in address_space_operations without first
> + * considering how dynamic DAX states are to be supported.  See the
> + * inode_aops_*_read() functions
> + */
>  struct address_space_operations {
>  	int (*writepage)(struct page *page, struct writeback_control *wbc);
>  	int (*readpage)(struct file *, struct page *);
> @@ -735,6 +741,9 @@ struct inode {
>  #endif
>  
>  	void			*i_private; /* fs or device private pointer */
> +#if defined(CONFIG_FS_DAX)
> +	struct rw_semaphore	i_aops_sem;
> +#endif
>  } __randomize_layout;
>  
>  struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
> @@ -1817,6 +1826,11 @@ struct block_device_operations;
>  
>  struct iov_iter;
>  
> +/**
> + * NOTE: DO NOT define new functions in file_operations without first
> + * considering how dynamic address_space operations are to be supported.  See
> + * the inode_aops_*_read() functions in this file.
> + */
>  struct file_operations {
>  	struct module *owner;
>  	loff_t (*llseek) (struct file *, loff_t, int);
> @@ -1889,16 +1903,64 @@ struct inode_operations {
>  	int (*set_acl)(struct inode *, struct posix_acl *, int);
>  } ____cacheline_aligned;
>  
> +#if defined(CONFIG_FS_DAX)
> +/*
> + * Filesystems wishing to support dynamic DAX states must do the following.
> + *
> + * 1) the direct_IO address_space_operation must be supported in all potential
> + *    a_ops vectors for any state suported by the inode.  This is because the
> + *    direct_IO function is used as a flag long before the function is called.
> +
> + * 3) DAX state changes shall not be allowed while the file is mmap'ed
> + * 4) For non-mmaped operations the VFS layer must take the read lock for any
> + *    use of IS_DAX()
> + * 5) Filesystems take the write lock when changing DAX states.

Do you envision ext4 migrating their aops changing code to use
i_aops_sem in the future?

--D

> + */
> +static inline void inode_aops_down_read(struct inode *inode)
> +{
> +	down_read(&inode->i_aops_sem);
> +}
> +static inline void inode_aops_up_read(struct inode *inode)
> +{
> +	up_read(&inode->i_aops_sem);
> +}
> +static inline void inode_aops_down_write(struct inode *inode)
> +{
> +	down_write(&inode->i_aops_sem);
> +}
> +static inline void inode_aops_up_write(struct inode *inode)
> +{
> +	up_write(&inode->i_aops_sem);
> +}
> +#else /* !CONFIG_FS_DAX */
> +#define inode_aops_down_read(inode) do { (void)(inode); } while (0)
> +#define inode_aops_up_read(inode) do { (void)(inode); } while (0)
> +#define inode_aops_down_write(inode) do { (void)(inode); } while (0)
> +#define inode_aops_up_write(inode) do { (void)(inode); } while (0)
> +#endif /* CONFIG_FS_DAX */
> +
>  static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
>  				     struct iov_iter *iter)
>  {
> -	return file->f_op->read_iter(kio, iter);
> +	struct inode		*inode = file_inode(kio->ki_filp);
> +	ssize_t ret;
> +
> +	inode_aops_down_read(inode);
> +	ret = file->f_op->read_iter(kio, iter);
> +	inode_aops_up_read(inode);
> +	return ret;
>  }
>  
>  static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
>  				      struct iov_iter *iter)
>  {
> -	return file->f_op->write_iter(kio, iter);
> +	struct inode		*inode = file_inode(kio->ki_filp);
> +	ssize_t ret;
> +
> +	inode_aops_down_read(inode);
> +	ret = file->f_op->write_iter(kio, iter);
> +	inode_aops_up_read(inode);
> +	return ret;
>  }
>  
>  static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index 4f17c83db575..6a30febb11e0 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -48,6 +48,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>  	bdi = inode_to_bdi(mapping->host);
>  
>  	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
> +		int ret = 0;
> +
>  		switch (advice) {
>  		case POSIX_FADV_NORMAL:
>  		case POSIX_FADV_RANDOM:
> @@ -58,9 +60,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>  			/* no bad return value, but ignore advice */
>  			break;
>  		default:
> -			return -EINVAL;
> +			ret = -EINVAL;
>  		}
> -		return 0;
> +
> +		return ret;
>  	}
>  
>  	/*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1784478270e1..3a7863ba51b9 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  		 * and return.  Otherwise fallthrough to buffered io for
>  		 * the rest of the read.  Buffered reads will not work for
>  		 * DAX files, so don't bother trying.
> +		 *
> +		 * IS_DAX is protected under ->read_iter lock
>  		 */
>  		if (retval < 0 || !count || iocb->ki_pos >= size ||
>  		    IS_DAX(inode))
> @@ -3377,6 +3379,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		 * holes, for example.  For DAX files, a buffered write will
>  		 * not succeed (even if it did, DAX does not handle dirty
>  		 * page-cache pages correctly).
> +		 *
> +		 * IS_DAX is protected under ->write_iter lock
>  		 */
>  		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
>  			goto out;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b08b199f9a11..3d05bd10d83e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -572,6 +572,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>  	unsigned long ret;
>  	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
>  
> +	/* Should not need locking here because mmap is not allowed */
>  	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
>  		goto out;
>  
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b679908743cb..f048178e2b93 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
>  		} else {	/* !is_shmem */
>  			if (!page || xa_is_value(page)) {
>  				xas_unlock_irq(&xas);
> +				inode_aops_down_read(file->f_inode);
>  				page_cache_sync_readahead(mapping, &file->f_ra,
>  							  file, index,
>  							  PAGE_SIZE);
> +				inode_aops_up_read(file->f_inode);
>  				/* drain pagevecs to help isolate_lru_page() */
>  				lru_add_drain();
>  				page = find_lock_page(mapping, index);
> diff --git a/mm/util.c b/mm/util.c
> index 988d11e6c17c..a4fb0670137d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
>  
>  	ret = security_mmap_file(file, prot, flag);
>  	if (!ret) {
> -		if (down_write_killable(&mm->mmap_sem))
> +		if (file)
> +			inode_aops_down_read(file_inode(file));
> +		if (down_write_killable(&mm->mmap_sem)) {
> +			if (file)
> +				inode_aops_up_read(file_inode(file));
>  			return -EINTR;
> +		}
>  		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
>  				    &populate, &uf);
>  		up_write(&mm->mmap_sem);
> +		if (file)
> +			inode_aops_up_read(file_inode(file));
>  		userfaultfd_unmap_complete(mm, &uf);
>  		if (populate)
>  			mm_populate(ret, populate);
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-22  0:33   ` Darrick J. Wong
@ 2020-02-23 15:03     ` Ira Weiny
  0 siblings, 0 replies; 54+ messages in thread
From: Ira Weiny @ 2020-02-23 15:03 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Fri, Feb 21, 2020 at 04:33:17PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 20, 2020 at 04:41:28PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > DAX requires special address space operations (aops).  Changing DAX
> > state therefore requires changing those aops.
> > 
> > However, many functions require aops to remain consistent through a deep
> > call stack.
> > 
> > Define a vfs level inode rwsem to protect aops throughout call stacks
> > which require them.
> > 
> > Finally, define calls to be used in subsequent patches when aops usage
> > needs to be quiesced by the file system.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes from V3:
> > 	Convert from global to a per-inode rwsem
> > 		Remove pre-optimization
> > 	Remove static branch stuff
> > 	Change function names from inode_dax_state_* to inode_aops_*
> > 		I kept 'inode' as the synchronization is at the inode
> > 		level now (probably where it belongs)...
> > 
> > 		and I still prefer *_[down|up]_[read|write] as those
> > 		names better reflect the use and interaction between
> > 		users (readers) and writers better.  'XX_start_aop()'
> > 		would have to be matched with something like
> > 		'XX_wait_for_aop_user()' and 'XX_release_aop_users()' or
> > 		something which does not make sense on the 'writer'
> > 		side.
> > 
> > Changes from V2
> > 
> > 	Rebase on linux-next-08-02-2020
> > 
> > 	Fix locking order
> > 	Change all references from mode to state where appropriate
> > 	add CONFIG_FS_DAX requirement for state change
> > 	Use a static branch to enable locking only when a dax capable
> > 		device has been seen.
> > 
> > 	Move the lock to a global vfs lock
> > 
> > 		this does a few things
> > 			1) preps us better for ext4 support
> > 			2) removes funky callbacks from inode ops
> > 			3) remove complexity from XFS and probably from
> > 			   ext4 later
> > 
> > 		We can do this because
> > 			1) the locking order is required to be at the
> > 			   highest level anyway, so why complicate xfs
> > 			2) We had to move the sem to the super_block
> > 			   because it is too heavy for the inode.
> > 			3) After internal discussions with Dan we
> > 			   decided that this would be easier, just as
> > 			   performant, and with slightly less overhead
> > 			   than in the VFS SB.
> > 
> > 		We also change the functions names to up/down;
> > 		read/write as appropriate.  Previous names were over
> > 		simplified.
> > 
> > 	Update comments and documentation
> > 
> > squash: add locking
> > ---
> >  Documentation/filesystems/vfs.rst | 16 ++++++++
> >  fs/attr.c                         |  1 +
> >  fs/inode.c                        | 15 +++++--
> >  fs/iomap/buffered-io.c            |  1 +
> >  fs/open.c                         |  4 ++
> >  fs/stat.c                         |  2 +
> >  fs/xfs/xfs_icache.c               |  1 +
> >  include/linux/fs.h                | 66 ++++++++++++++++++++++++++++++-
> >  mm/fadvise.c                      |  7 +++-
> >  mm/filemap.c                      |  4 ++
> >  mm/huge_memory.c                  |  1 +
> >  mm/khugepaged.c                   |  2 +
> >  mm/util.c                         |  9 ++++-
> >  13 files changed, 121 insertions(+), 8 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 7d4d09dd5e6d..4a10a232f8e2 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
> > @@ -934,6 +934,22 @@ cache in your filesystem.  The following members are defined:
> >  	Called during swapoff on files where swap_activate was
> >  	successful.
> >  
> > +Changing DAX 'state' dynamically
> > +----------------------------------
> > +
> > +Some file systems which support DAX want to be able to change the DAX state
> > +dyanically.  To switch the state safely we lock the inode state in all "normal"
> > +file system operations and restrict state changes to those operations.  The
> > +specific rules are.
> > +
> > +        1) the direct_IO address_space_operation must be supported in all
> > +           potential a_ops vectors for any state suported by the inode.
> > +
> > +        3) DAX state changes shall not be allowed while the file is mmap'ed
> > +        4) For non-mmaped operations the VFS layer must take the read lock for any
> > +           use of IS_DAX()
> > +        5) Filesystems take the write lock when changing DAX states.
> > +
> >  
> >  The File Object
> >  ===============
> > diff --git a/fs/attr.c b/fs/attr.c
> > index b4bbdbd4c8ca..9b15f73d1079 100644
> > --- a/fs/attr.c
> > +++ b/fs/attr.c
> > @@ -332,6 +332,7 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
> >  	if (error)
> >  		return error;
> >  
> > +	/* DAX read state should already be held here */
> >  	if (inode->i_op->setattr)
> >  		error = inode->i_op->setattr(dentry, attr);
> >  	else
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 7d57068b6b7a..6e4f1cc872f2 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -200,6 +200,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> >  #endif
> >  	inode->i_flctx = NULL;
> >  	this_cpu_inc(nr_inodes);
> > +	init_rwsem(&inode->i_aops_sem);
> >  
> >  	return 0;
> >  out:
> > @@ -1616,11 +1617,19 @@ EXPORT_SYMBOL(iput);
> >   */
> >  int bmap(struct inode *inode, sector_t *block)
> >  {
> > -	if (!inode->i_mapping->a_ops->bmap)
> > -		return -EINVAL;
> > +	int ret = 0;
> > +
> > +	inode_aops_down_read(inode);
> > +	if (!inode->i_mapping->a_ops->bmap) {
> > +		ret = -EINVAL;
> > +		goto err;
> > +	}
> >  
> >  	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
> > -	return 0;
> > +
> > +err:
> > +	inode_aops_up_read(inode);
> > +	return ret;
> >  }
> >  EXPORT_SYMBOL(bmap);
> >  #endif
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 7c84c4c027c4..e313a34d5fa6 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -999,6 +999,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
> >  		offset = offset_in_page(pos);
> >  		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
> >  
> > +		/* DAX state read should already be held here */
> >  		if (IS_DAX(inode))
> >  			status = iomap_dax_zero(pos, offset, bytes, iomap);
> >  		else
> > diff --git a/fs/open.c b/fs/open.c
> > index 0788b3715731..3abf0bfac462 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
> >  	if (ret)
> >  		newattrs.ia_valid |= ret | ATTR_FORCE;
> >  
> > +	inode_aops_down_read(dentry->d_inode);
> >  	inode_lock(dentry->d_inode);
> >  	/* Note any delegations or leases have already been broken: */
> >  	ret = notify_change(dentry, &newattrs, NULL);
> >  	inode_unlock(dentry->d_inode);
> > +	inode_aops_up_read(dentry->d_inode);
> >  	return ret;
> >  }
> >  
> > @@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >  		return -EOPNOTSUPP;
> >  
> >  	file_start_write(file);
> > +	inode_aops_down_read(inode);
> >  	ret = file->f_op->fallocate(file, mode, offset, len);
> > +	inode_aops_up_read(inode);
> >  
> >  	/*
> >  	 * Create inotify and fanotify events.
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 894699c74dde..274b3ccc82b1 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -79,8 +79,10 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
> >  	if (IS_AUTOMOUNT(inode))
> >  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
> >  
> > +	inode_aops_down_read(inode);
> >  	if (IS_DAX(inode))
> >  		stat->attributes |= STATX_ATTR_DAX;
> > +	inode_aops_up_read(inode);
> >  
> >  	if (inode->i_op->getattr)
> >  		return inode->i_op->getattr(path, stat, request_mask,
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 836a1f09be03..3e83a97dc047 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -420,6 +420,7 @@ xfs_iget_cache_hit(
> >  		rcu_read_unlock();
> >  
> >  		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> > +		ASSERT(!rwsem_is_locked(&inode->i_aops_sem));
> >  		error = xfs_reinit_inode(mp, inode);
> >  		if (error) {
> >  			bool wake;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 63d1e533a07d..ad0f2368031b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -40,6 +40,7 @@
> >  #include <linux/fs_types.h>
> >  #include <linux/build_bug.h>
> >  #include <linux/stddef.h>
> > +#include <linux/percpu-rwsem.h>
> >  
> >  #include <asm/byteorder.h>
> >  #include <uapi/linux/fs.h>
> > @@ -359,6 +360,11 @@ typedef struct {
> >  typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
> >  		unsigned long, unsigned long);
> >  
> > +/**
> > + * NOTE: DO NOT define new functions in address_space_operations without first
> > + * considering how dynamic DAX states are to be supported.  See the
> > + * inode_aops_*_read() functions
> > + */
> >  struct address_space_operations {
> >  	int (*writepage)(struct page *page, struct writeback_control *wbc);
> >  	int (*readpage)(struct file *, struct page *);
> > @@ -735,6 +741,9 @@ struct inode {
> >  #endif
> >  
> >  	void			*i_private; /* fs or device private pointer */
> > +#if defined(CONFIG_FS_DAX)
> > +	struct rw_semaphore	i_aops_sem;
> > +#endif
> >  } __randomize_layout;
> >  
> >  struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
> > @@ -1817,6 +1826,11 @@ struct block_device_operations;
> >  
> >  struct iov_iter;
> >  
> > +/**
> > + * NOTE: DO NOT define new functions in file_operations without first
> > + * considering how dynamic address_space operations are to be supported.  See
> > + * the inode_aops_*_read() functions in this file.
> > + */
> >  struct file_operations {
> >  	struct module *owner;
> >  	loff_t (*llseek) (struct file *, loff_t, int);
> > @@ -1889,16 +1903,64 @@ struct inode_operations {
> >  	int (*set_acl)(struct inode *, struct posix_acl *, int);
> >  } ____cacheline_aligned;
> >  
> > +#if defined(CONFIG_FS_DAX)
> > +/*
> > + * Filesystems wishing to support dynamic DAX states must do the following.
> > + *
> > + * 1) the direct_IO address_space_operation must be supported in all potential
> > + *    a_ops vectors for any state suported by the inode.  This is because the
> > + *    direct_IO function is used as a flag long before the function is called.
> > +
> > + * 3) DAX state changes shall not be allowed while the file is mmap'ed
> > + * 4) For non-mmaped operations the VFS layer must take the read lock for any
> > + *    use of IS_DAX()
> > + * 5) Filesystems take the write lock when changing DAX states.
> 
> Do you envision ext4 migrating their aops changing code to use
> i_aops_sem in the future?

Ah... yea, which, at that time, this comment would be completely wrong.  I
tried to remove references to dax as per Dave but I missed this one.

Thanks,
Ira

> 
> --D
> 
> > + */
> > +static inline void inode_aops_down_read(struct inode *inode)
> > +{
> > +	down_read(&inode->i_aops_sem);
> > +}
> > +static inline void inode_aops_up_read(struct inode *inode)
> > +{
> > +	up_read(&inode->i_aops_sem);
> > +}
> > +static inline void inode_aops_down_write(struct inode *inode)
> > +{
> > +	down_write(&inode->i_aops_sem);
> > +}
> > +static inline void inode_aops_up_write(struct inode *inode)
> > +{
> > +	up_write(&inode->i_aops_sem);
> > +}
> > +#else /* !CONFIG_FS_DAX */
> > +#define inode_aops_down_read(inode) do { (void)(inode); } while (0)
> > +#define inode_aops_up_read(inode) do { (void)(inode); } while (0)
> > +#define inode_aops_down_write(inode) do { (void)(inode); } while (0)
> > +#define inode_aops_up_write(inode) do { (void)(inode); } while (0)
> > +#endif /* CONFIG_FS_DAX */
> > +
> >  static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
> >  				     struct iov_iter *iter)
> >  {
> > -	return file->f_op->read_iter(kio, iter);
> > +	struct inode		*inode = file_inode(kio->ki_filp);
> > +	ssize_t ret;
> > +
> > +	inode_aops_down_read(inode);
> > +	ret = file->f_op->read_iter(kio, iter);
> > +	inode_aops_up_read(inode);
> > +	return ret;
> >  }
> >  
> >  static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
> >  				      struct iov_iter *iter)
> >  {
> > -	return file->f_op->write_iter(kio, iter);
> > +	struct inode		*inode = file_inode(kio->ki_filp);
> > +	ssize_t ret;
> > +
> > +	inode_aops_down_read(inode);
> > +	ret = file->f_op->write_iter(kio, iter);
> > +	inode_aops_up_read(inode);
> > +	return ret;
> >  }
> >  
> >  static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> > diff --git a/mm/fadvise.c b/mm/fadvise.c
> > index 4f17c83db575..6a30febb11e0 100644
> > --- a/mm/fadvise.c
> > +++ b/mm/fadvise.c
> > @@ -48,6 +48,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
> >  	bdi = inode_to_bdi(mapping->host);
> >  
> >  	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
> > +		int ret = 0;
> > +
> >  		switch (advice) {
> >  		case POSIX_FADV_NORMAL:
> >  		case POSIX_FADV_RANDOM:
> > @@ -58,9 +60,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
> >  			/* no bad return value, but ignore advice */
> >  			break;
> >  		default:
> > -			return -EINVAL;
> > +			ret = -EINVAL;
> >  		}
> > -		return 0;
> > +
> > +		return ret;
> >  	}
> >  
> >  	/*
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 1784478270e1..3a7863ba51b9 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> >  		 * and return.  Otherwise fallthrough to buffered io for
> >  		 * the rest of the read.  Buffered reads will not work for
> >  		 * DAX files, so don't bother trying.
> > +		 *
> > +		 * IS_DAX is protected under ->read_iter lock
> >  		 */
> >  		if (retval < 0 || !count || iocb->ki_pos >= size ||
> >  		    IS_DAX(inode))
> > @@ -3377,6 +3379,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >  		 * holes, for example.  For DAX files, a buffered write will
> >  		 * not succeed (even if it did, DAX does not handle dirty
> >  		 * page-cache pages correctly).
> > +		 *
> > +		 * IS_DAX is protected under ->write_iter lock
> >  		 */
> >  		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
> >  			goto out;
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index b08b199f9a11..3d05bd10d83e 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -572,6 +572,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> >  	unsigned long ret;
> >  	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
> >  
> > +	/* Should not need locking here because mmap is not allowed */
> >  	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> >  		goto out;
> >  
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index b679908743cb..f048178e2b93 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
> >  		} else {	/* !is_shmem */
> >  			if (!page || xa_is_value(page)) {
> >  				xas_unlock_irq(&xas);
> > +				inode_aops_down_read(file->f_inode);
> >  				page_cache_sync_readahead(mapping, &file->f_ra,
> >  							  file, index,
> >  							  PAGE_SIZE);
> > +				inode_aops_up_read(file->f_inode);
> >  				/* drain pagevecs to help isolate_lru_page() */
> >  				lru_add_drain();
> >  				page = find_lock_page(mapping, index);
> > diff --git a/mm/util.c b/mm/util.c
> > index 988d11e6c17c..a4fb0670137d 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
> >  
> >  	ret = security_mmap_file(file, prot, flag);
> >  	if (!ret) {
> > -		if (down_write_killable(&mm->mmap_sem))
> > +		if (file)
> > +			inode_aops_down_read(file_inode(file));
> > +		if (down_write_killable(&mm->mmap_sem)) {
> > +			if (file)
> > +				inode_aops_up_read(file_inode(file));
> >  			return -EINTR;
> > +		}
> >  		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
> >  				    &populate, &uf);
> >  		up_write(&mm->mmap_sem);
> > +		if (file)
> > +			inode_aops_up_read(file_inode(file));
> >  		userfaultfd_unmap_complete(mm, &uf);
> >  		if (populate)
> >  			mm_populate(ret, populate);
> > -- 
> > 2.21.0
> > 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer
  2020-02-22  0:31   ` Darrick J. Wong
@ 2020-02-23 15:04     ` Ira Weiny
  0 siblings, 0 replies; 54+ messages in thread
From: Ira Weiny @ 2020-02-23 15:04 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Fri, Feb 21, 2020 at 04:31:09PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 20, 2020 at 04:41:30PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > XFS requires the use of the aops of an inode to quiesced prior to
> > changing it to/from the DAX aops vector.
> > 
> > Take the aops write lock while changing DAX state.
> > 
> > We define a new XFS_DAX_EXCL lock type to carry the lock through to
> > transaction completion.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes from v3:
> > 	Change locking function names to reflect changes in previous
> > 	patches.
> > 
> > Changes from V2:
> > 	Change name of patch (WAS: fs/xfs: Add lock/unlock state to xfs)
> > 	Remove the xfs specific lock and move to the vfs layer.
> > 		We still use XFS_LOCK_DAX_EXCL to be able to pass this
> > 		flag through to the transaction code.  But we no longer
> > 		have a lock specific to xfs.  This removes a lot of code
> > 		from the XFS layer, preps us for using this in ext4, and
> > 		is actually more straight forward now that all the
> > 		locking requirements are better known.
> > 
> > 	Fix locking order comment
> > 	Rework for new 'state' names
> > 	(Other comments on the previous patch are not applicable with
> > 	new patch as much of the code was removed in favor of the vfs
> > 	level lock)
> > ---
> >  fs/xfs/xfs_inode.c | 22 ++++++++++++++++++++--
> >  fs/xfs/xfs_inode.h |  7 +++++--
> >  2 files changed, 25 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 35df324875db..5b014c428f0f 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> >   *
> >   * Basic locking order:
> >   *
> > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> 
> "dax_sem"?  I thought this was now called i_aops_sem?

:-/ yep...

> 
> >   *
> >   * mmap_sem locking order:
> >   *
> >   * i_rwsem -> page lock -> mmap_sem
> > - * mmap_sem -> i_mmap_lock -> page_lock
> > + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
> >   *
> >   * The difference in mmap_sem locking order mean that we cannot hold the
> >   * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> > @@ -182,6 +182,9 @@ xfs_ilock(
> >  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> >  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> >  
> > +	if (lock_flags & XFS_DAX_EXCL)
> 
> And similarly, I think this should be XFS_OPSLOCK_EXCL...

... and ... yes...

Thanks for the review, I'll clean it up.

Ira

> 
> --D
> 
> > +		inode_aops_down_write(VFS_I(ip));
> > +
> >  	if (lock_flags & XFS_IOLOCK_EXCL) {
> >  		down_write_nested(&VFS_I(ip)->i_rwsem,
> >  				  XFS_IOLOCK_DEP(lock_flags));
> > @@ -224,6 +227,8 @@ xfs_ilock_nowait(
> >  	 * You can't set both SHARED and EXCL for the same lock,
> >  	 * and only XFS_IOLOCK_SHARED, XFS_IOLOCK_EXCL, XFS_ILOCK_SHARED,
> >  	 * and XFS_ILOCK_EXCL are valid values to set in lock_flags.
> > +	 *
> > +	 * XFS_DAX_* is not allowed
> >  	 */
> >  	ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
> >  	       (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
> > @@ -232,6 +237,7 @@ xfs_ilock_nowait(
> >  	ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
> >  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> >  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> > +	ASSERT((lock_flags & XFS_DAX_EXCL) == 0);
> >  
> >  	if (lock_flags & XFS_IOLOCK_EXCL) {
> >  		if (!down_write_trylock(&VFS_I(ip)->i_rwsem))
> > @@ -318,6 +324,9 @@ xfs_iunlock(
> >  	else if (lock_flags & XFS_ILOCK_SHARED)
> >  		mrunlock_shared(&ip->i_lock);
> >  
> > +	if (lock_flags & XFS_DAX_EXCL)
> > +		inode_aops_up_write(VFS_I(ip));
> > +
> >  	trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
> >  }
> >  
> > @@ -333,6 +342,8 @@ xfs_ilock_demote(
> >  	ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL));
> >  	ASSERT((lock_flags &
> >  		~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
> > +	/* XFS_DAX_* is not allowed */
> > +	ASSERT((lock_flags & XFS_DAX_EXCL) == 0);
> >  
> >  	if (lock_flags & XFS_ILOCK_EXCL)
> >  		mrdemote(&ip->i_lock);
> > @@ -465,6 +476,9 @@ xfs_lock_inodes(
> >  	ASSERT(!(lock_mode & XFS_ILOCK_EXCL) ||
> >  		inodes <= XFS_ILOCK_MAX_SUBCLASS + 1);
> >  
> > +	/* XFS_DAX_* is not allowed */
> > +	ASSERT((lock_mode & XFS_DAX_EXCL) == 0);
> > +
> >  	if (lock_mode & XFS_IOLOCK_EXCL) {
> >  		ASSERT(!(lock_mode & (XFS_MMAPLOCK_EXCL | XFS_ILOCK_EXCL)));
> >  	} else if (lock_mode & XFS_MMAPLOCK_EXCL)
> > @@ -566,6 +580,10 @@ xfs_lock_two_inodes(
> >  	ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> >  	       !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> >  
> > +	/* XFS_DAX_* is not allowed */
> > +	ASSERT((ip0_mode & XFS_DAX_EXCL) == 0);
> > +	ASSERT((ip1_mode & XFS_DAX_EXCL) == 0);
> > +
> >  	ASSERT(ip0->i_ino != ip1->i_ino);
> >  
> >  	if (ip0->i_ino > ip1->i_ino) {
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index 492e53992fa9..25fe20740bf7 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -278,10 +278,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
> >  #define	XFS_ILOCK_SHARED	(1<<3)
> >  #define	XFS_MMAPLOCK_EXCL	(1<<4)
> >  #define	XFS_MMAPLOCK_SHARED	(1<<5)
> > +#define	XFS_DAX_EXCL		(1<<6)
> >  
> >  #define XFS_LOCK_MASK		(XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
> >  				| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
> > -				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
> > +				| XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
> > +				| XFS_DAX_EXCL)
> >  
> >  #define XFS_LOCK_FLAGS \
> >  	{ XFS_IOLOCK_EXCL,	"IOLOCK_EXCL" }, \
> > @@ -289,7 +291,8 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
> >  	{ XFS_ILOCK_EXCL,	"ILOCK_EXCL" }, \
> >  	{ XFS_ILOCK_SHARED,	"ILOCK_SHARED" }, \
> >  	{ XFS_MMAPLOCK_EXCL,	"MMAPLOCK_EXCL" }, \
> > -	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }
> > +	{ XFS_MMAPLOCK_SHARED,	"MMAPLOCK_SHARED" }, \
> > +	{ XFS_DAX_EXCL,		"DAX_EXCL" }
> >  
> >  
> >  /*
> > -- 
> > 2.21.0
> > 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 06/13] fs/xfs: Create function xfs_inode_enable_dax()
  2020-02-22  0:28   ` Darrick J. Wong
@ 2020-02-23 15:07     ` Ira Weiny
  0 siblings, 0 replies; 54+ messages in thread
From: Ira Weiny @ 2020-02-23 15:07 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-kernel, Alexander Viro, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Fri, Feb 21, 2020 at 04:28:21PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 20, 2020 at 04:41:27PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > xfs_inode_supports_dax() should reflect if the inode can support DAX not
> > that it is enabled for DAX.
> > 
> > Change the use of xfs_inode_supports_dax() to reflect only if the inode
> > and underlying storage support dax.
> > 
> > Add a new function xfs_inode_enable_dax() which reflects if the inode
> 
> Heavily into bikeshedding here, but "enable" sounds like a verb, but
> this function doesn't actually turn dax on for a file, it merely decides
> if we /should/ turn it on.

heheheheh...  As Dave said "names are hard"...  :-/

> 
> xfs_inode_wants_dax() ?

I viewed this as a question as in "enable dax"?

And I _think_ Dave actually suggested the name.

I'll contemplate it some,
Ira

> 
> --D
> 
> > should be enabled for DAX.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> > 
> > ---
> > Changes from v3:
> > 	Update functions and names to be more clear
> > 	Update commit message
> > 	Merge with
> > 		'fs/xfs: Clean up DAX support check'
> > 		don't allow IS_DAX() on a directory
> > 		use STATIC macro for static
> > 		make xfs_inode_supports_dax() static
> > ---
> >  fs/xfs/xfs_iops.c | 25 +++++++++++++++++++------
> >  1 file changed, 19 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 81f2f93caec0..ff711efc5247 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -1237,19 +1237,18 @@ static const struct inode_operations xfs_inline_symlink_inode_operations = {
> >  };
> >  
> >  /* Figure out if this file actually supports DAX. */
> > -static bool
> > +STATIC bool
> >  xfs_inode_supports_dax(
> >  	struct xfs_inode	*ip)
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  
> >  	/* Only supported on non-reflinked files. */
> > -	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
> > +	if (xfs_is_reflink_inode(ip))
> >  		return false;
> >  
> > -	/* DAX mount option or DAX iflag must be set. */
> > -	if (!(mp->m_flags & XFS_MOUNT_DAX) &&
> > -	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> > +	/* Only supported on regular files. */
> > +	if (!S_ISREG(VFS_I(ip)->i_mode))
> >  		return false;
> >  
> >  	/* Block size must match page size */
> > @@ -1260,6 +1259,20 @@ xfs_inode_supports_dax(
> >  	return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
> >  }
> >  
> > +STATIC bool
> > +xfs_inode_enable_dax(
> > +	struct xfs_inode *ip)
> > +{
> > +	if (!xfs_inode_supports_dax(ip))
> > +		return false;
> > +
> > +	if (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX)
> > +		return true;
> > +	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
> > +		return true;
> > +	return false;
> > +}
> > +
> >  STATIC void
> >  xfs_diflags_to_iflags(
> >  	struct inode		*inode,
> > @@ -1278,7 +1291,7 @@ xfs_diflags_to_iflags(
> >  		inode->i_flags |= S_SYNC;
> >  	if (flags & XFS_DIFLAG_NOATIME)
> >  		inode->i_flags |= S_NOATIME;
> > -	if (xfs_inode_supports_dax(ip))
> > +	if (xfs_inode_enable_dax(ip))
> >  		inode->i_flags |= S_DAX;
> >  }
> >  
> > -- 
> > 2.21.0
> > 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer
  2020-02-21  0:41 ` [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer ira.weiny
  2020-02-22  0:31   ` Darrick J. Wong
@ 2020-02-24  0:34   ` Dave Chinner
  2020-02-24 19:57     ` Ira Weiny
  1 sibling, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2020-02-24  0:34 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Thu, Feb 20, 2020 at 04:41:30PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> XFS requires the use of the aops of an inode to quiesced prior to
> changing it to/from the DAX aops vector.
> 
> Take the aops write lock while changing DAX state.
> 
> We define a new XFS_DAX_EXCL lock type to carry the lock through to
> transaction completion.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from v3:
> 	Change locking function names to reflect changes in previous
> 	patches.
> 
> Changes from V2:
> 	Change name of patch (WAS: fs/xfs: Add lock/unlock state to xfs)
> 	Remove the xfs specific lock and move to the vfs layer.
> 		We still use XFS_LOCK_DAX_EXCL to be able to pass this
> 		flag through to the transaction code.  But we no longer
> 		have a lock specific to xfs.  This removes a lot of code
> 		from the XFS layer, preps us for using this in ext4, and
> 		is actually more straight forward now that all the
> 		locking requirements are better known.
> 
> 	Fix locking order comment
> 	Rework for new 'state' names
> 	(Other comments on the previous patch are not applicable with
> 	new patch as much of the code was removed in favor of the vfs
> 	level lock)
> ---
>  fs/xfs/xfs_inode.c | 22 ++++++++++++++++++++--
>  fs/xfs/xfs_inode.h |  7 +++++--
>  2 files changed, 25 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 35df324875db..5b014c428f0f 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
>   *
>   * Basic locking order:
>   *
> - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
>   *
>   * mmap_sem locking order:
>   *
>   * i_rwsem -> page lock -> mmap_sem
> - * mmap_sem -> i_mmap_lock -> page_lock
> + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
>   *
>   * The difference in mmap_sem locking order mean that we cannot hold the
>   * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> @@ -182,6 +182,9 @@ xfs_ilock(
>  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
>  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
>  
> +	if (lock_flags & XFS_DAX_EXCL)
> +		inode_aops_down_write(VFS_I(ip));

I largely don't see the point of adding this to xfs_ilock/iunlock.

It's only got one caller, so I don't see much point in adding it to
an interface that has over a hundred other call sites that don't
need or use this lock. just open code it where it is needed in the
ioctl code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-21 22:44     ` Dave Chinner
  2020-02-21 23:26       ` Dan Williams
@ 2020-02-24 17:56       ` Christoph Hellwig
  2020-02-25  0:09         ` Dave Chinner
  1 sibling, 1 reply; 54+ messages in thread
From: Christoph Hellwig @ 2020-02-24 17:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, ira.weiny, linux-kernel, Alexander Viro,
	Darrick J. Wong, Dan Williams, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Sat, Feb 22, 2020 at 09:44:19AM +1100, Dave Chinner wrote:
> > I am very much against this.  There is a reason why we don't support
> > changes of ops vectors at runtime anywhere else, because it is
> > horribly complicated and impossible to get right.  IFF we ever want
> > to change the DAX vs non-DAX mode (which I'm still not sold on) the
> > right way is to just add a few simple conditionals and merge the
> > aops, which is much easier to reason about, and less costly in overall
> > overhead.
> 
> *cough*
> 
> That's exactly what the original code did. And it was broken
> because page faults call multiple aops that are dependent on the
> result of the previous aop calls setting up the state correctly for
> the latter calls. And when S_DAX changes between those calls, shit
> breaks.

No, the original code was broken because it didn't serialize between
DAX and buffer access.

Take a step back and look where the problems are, and they are not
mostly with the aops.  In fact the only aop useful for DAX is
is ->writepages, and how it uses ->writepages is actually a bit of
an abuse of that interface.

So what we really need it just a way to prevent switching the flag
when a file is mapped, and in the normal read/write path ensure the
flag can't be switch while I/O is going on, which could either be
done by ensuring it is only switched under i_rwsem or equivalent
protection, or by setting the DAX flag once in the iocb similar to
IOCB_DIRECT.

And they easiest way to get all this done is as a first step to
just allowing switching the flag when no blocks are allocated at
all, similar to how the rt flag works.  Once that works properly
and use cases show up we can relax the requirements, and we need
to do that in a way without bloating the inode and various VFS
code paths, as DAX is a complete fringe feature, and ѕwitching
the flag and runtime is the fringe of a fringe.  It just isn't worth
bloating the inode and wasting tons of developer time on it.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer
  2020-02-24  0:34   ` Dave Chinner
@ 2020-02-24 19:57     ` Ira Weiny
  2020-02-24 22:32       ` Dave Chinner
  0 siblings, 1 reply; 54+ messages in thread
From: Ira Weiny @ 2020-02-24 19:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Feb 24, 2020 at 11:34:55AM +1100, Dave Chinner wrote:
> On Thu, Feb 20, 2020 at 04:41:30PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 

[snip]

> > 
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 35df324875db..5b014c428f0f 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> >   *
> >   * Basic locking order:
> >   *
> > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> >   *
> >   * mmap_sem locking order:
> >   *
> >   * i_rwsem -> page lock -> mmap_sem
> > - * mmap_sem -> i_mmap_lock -> page_lock
> > + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
> >   *
> >   * The difference in mmap_sem locking order mean that we cannot hold the
> >   * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> > @@ -182,6 +182,9 @@ xfs_ilock(
> >  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> >  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> >  
> > +	if (lock_flags & XFS_DAX_EXCL)
> > +		inode_aops_down_write(VFS_I(ip));
> 
> I largely don't see the point of adding this to xfs_ilock/iunlock.
> 
> It's only got one caller, so I don't see much point in adding it to
> an interface that has over a hundred other call sites that don't
> need or use this lock. just open code it where it is needed in the
> ioctl code.

I know it seems overkill but if we don't do this we need to code a flag to be
returned from xfs_ioctl_setattr_dax_invalidate().  This flag is then used in
xfs_ioctl_setattr_get_trans() to create the transaction log item which can then
be properly used to unlock the lock in xfs_inode_item_release()

I don't know of a cleaner way to communicate to xfs_inode_item_release() to
unlock i_aops_sem after the transaction is complete.

Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer
  2020-02-24 19:57     ` Ira Weiny
@ 2020-02-24 22:32       ` Dave Chinner
  2020-02-25 21:12         ` Ira Weiny
  0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2020-02-24 22:32 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Feb 24, 2020 at 11:57:36AM -0800, Ira Weiny wrote:
> On Mon, Feb 24, 2020 at 11:34:55AM +1100, Dave Chinner wrote:
> > On Thu, Feb 20, 2020 at 04:41:30PM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> 
> [snip]
> 
> > > 
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 35df324875db..5b014c428f0f 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> > >   *
> > >   * Basic locking order:
> > >   *
> > > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > >   *
> > >   * mmap_sem locking order:
> > >   *
> > >   * i_rwsem -> page lock -> mmap_sem
> > > - * mmap_sem -> i_mmap_lock -> page_lock
> > > + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
> > >   *
> > >   * The difference in mmap_sem locking order mean that we cannot hold the
> > >   * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> > > @@ -182,6 +182,9 @@ xfs_ilock(
> > >  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> > >  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> > >  
> > > +	if (lock_flags & XFS_DAX_EXCL)
> > > +		inode_aops_down_write(VFS_I(ip));
> > 
> > I largely don't see the point of adding this to xfs_ilock/iunlock.
> > 
> > It's only got one caller, so I don't see much point in adding it to
> > an interface that has over a hundred other call sites that don't
> > need or use this lock. just open code it where it is needed in the
> > ioctl code.
> 
> I know it seems overkill but if we don't do this we need to code a flag to be
> returned from xfs_ioctl_setattr_dax_invalidate().  This flag is then used in
> xfs_ioctl_setattr_get_trans() to create the transaction log item which can then
> be properly used to unlock the lock in xfs_inode_item_release()
> 
> I don't know of a cleaner way to communicate to xfs_inode_item_release() to
> unlock i_aops_sem after the transaction is complete.

We manually unlock inodes after transactions in many cases -
anywhere we do a rolling transaction, the inode locks do not get
released by the transaction. Hence for a one-off case like this it
doesn't really make sense to push all this infrastructure into the
transaction subsystem. Especially as we can manually lock before and
unlock after the transaction context without any real complexity.

This also means that we can, if necessary, do aops manipulation work
/after/ the transaction that changes on-disk state completes and we
still hold the aops reference exclusively. While we don't do that
now, I think it is worthwhile keeping our options open here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-24 17:56       ` Christoph Hellwig
@ 2020-02-25  0:09         ` Dave Chinner
  2020-02-25 17:36           ` Christoph Hellwig
  2020-02-26 11:17           ` Jan Kara
  0 siblings, 2 replies; 54+ messages in thread
From: Dave Chinner @ 2020-02-25  0:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: ira.weiny, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Mon, Feb 24, 2020 at 06:56:03PM +0100, Christoph Hellwig wrote:
> On Sat, Feb 22, 2020 at 09:44:19AM +1100, Dave Chinner wrote:
> > > I am very much against this.  There is a reason why we don't support
> > > changes of ops vectors at runtime anywhere else, because it is
> > > horribly complicated and impossible to get right.  IFF we ever want
> > > to change the DAX vs non-DAX mode (which I'm still not sold on) the
> > > right way is to just add a few simple conditionals and merge the
> > > aops, which is much easier to reason about, and less costly in overall
> > > overhead.
> > 
> > *cough*
> > 
> > That's exactly what the original code did. And it was broken
> > because page faults call multiple aops that are dependent on the
> > result of the previous aop calls setting up the state correctly for
> > the latter calls. And when S_DAX changes between those calls, shit
> > breaks.
> 
> No, the original code was broken because it didn't serialize between
> DAX and buffer access.
> 
> Take a step back and look where the problems are, and they are not
> mostly with the aops.  In fact the only aop useful for DAX is
> is ->writepages, and how it uses ->writepages is actually a bit of
> an abuse of that interface.

The races are all through the fops, too, which is one of the reasons
Darrick mentioned we should probably move this up to file ops
level...

> So what we really need it just a way to prevent switching the flag
> when a file is mapped,

That's not sufficient.

We also have to prevent the file from being mapped *while we are
switching*. Nothing in the mmap() path actually serialises against
filesystem operations, and the initial behavioural checks in the
page fault path are similarly unserialised against changing the
S_DAX flag.

e.g. there's a race against ->mmap() with switching the the S_DAX
flag. In xfs_file_mmap():

        file_accessed(file);
        vma->vm_ops = &xfs_file_vm_ops;
        if (IS_DAX(inode))
                vma->vm_flags |= VM_HUGEPAGE;

So if this runs while a switch from true to false is occurring,
we've got a non-dax VMA with huge pages enabled, and the non-dax
page fault path doesn't support this.

If we are really lucky, we'll have IS_DAX() set just long
enough to get through this check in xfs_filemap_huge_fault():

        if (!IS_DAX(file_inode(vmf->vma->vm_file)))
                return VM_FAULT_FALLBACK;

and then we'll get into __xfs_filemap_fault() and block on the
MMAPLOCK. When we actually get that lock, S_DAX will now be false
and we have a huge page fault running through a path (page cache IO)
that doesn't support huge pages...

This is the sort of landmine switching S_DAX without serialisation
causes. The methods themselves look sane, but it's cross-method
dependencies for a stable S_DAX flag that is the real problem.

Yes, we can probably fix this by adding XFS_MMAPLOCK_SHARED here,
but means every filesystem has to solve the same race conditions
itself. That's hard to get right and easy to get wrong. If the
VFS/mm subsystem provides the infrastructure needed to use this
functionality safely, then that is hard for filesystem developers to
get wrong.....

> and in the normal read/write path ensure the
> flag can't be switch while I/O is going on, which could either be
> done by ensuring it is only switched under i_rwsem or equivalent
> protection, or by setting the DAX flag once in the iocb similar to
> IOCB_DIRECT.

The iocb path is not the problem - that's entirely serialised
against S_DAX changes by the i_rwsem. The problem is that we have no
equivalent filesystem level serialisation for the entire mmap/page
fault path, and it checks S_DAX all over the place.

It's basically the same limitation that we have with mmap vs direct
IO - we can't lock out mmap when we do direct IO, so we cannot
guarantee coherency between the two. Similar here - we cannot
lockout mmap in any sane way, so we cannot guarantee coherency
between mmap and changing the S_DAX flag.

That's the underlying problem we need to solve here.

/me wonders if the best thing to do is to add a ->fault callout to
tell the filesystem to lock/unlock the inode right up at the top of
the page fault path, outside even the mmap_sem.  That means all the
methods that the page fault calls are protected against S_DAX
changes, and it gives us a low cost method of serialising page
faults against DIO (e.g. via inode_dio_wait())....

> And they easiest way to get all this done is as a first step to
> just allowing switching the flag when no blocks are allocated at
> all, similar to how the rt flag works.

False equivalence - it is not similar because the RT flag changes
and their associated state checks are *already fully serialised* by
the XFS_ILOCK_EXCL. S_DAX accesses have no such serialisation, and
that's the problem we need to solve...

In reality, I don't really care how we solve this problem, but we've
been down the path you are suggesting at least twice before and each
time we've ended up in a "that doesn't work" corner and had to
persue other solutions. I don't like going around in circles.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-25  0:09         ` Dave Chinner
@ 2020-02-25 17:36           ` Christoph Hellwig
  2020-02-25 19:37             ` Jeff Moyer
  2020-02-25 21:03             ` Ira Weiny
  2020-02-26 11:17           ` Jan Kara
  1 sibling, 2 replies; 54+ messages in thread
From: Christoph Hellwig @ 2020-02-25 17:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, ira.weiny, linux-kernel, Alexander Viro,
	Darrick J. Wong, Dan Williams, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Tue, Feb 25, 2020 at 11:09:37AM +1100, Dave Chinner wrote:
> > No, the original code was broken because it didn't serialize between
> > DAX and buffer access.
> > 
> > Take a step back and look where the problems are, and they are not
> > mostly with the aops.  In fact the only aop useful for DAX is
> > is ->writepages, and how it uses ->writepages is actually a bit of
> > an abuse of that interface.
> 
> The races are all through the fops, too, which is one of the reasons
> Darrick mentioned we should probably move this up to file ops
> level...

But the file ops are very simple to use.  Pass the flag in the iocb,
and make sure the flag can only changed with i_rwsem held.  That part
is pretty trivial, the interesting case is mmap because it is so
spread out.

> > So what we really need it just a way to prevent switching the flag
> > when a file is mapped,
> 
> That's not sufficient.
> 
> We also have to prevent the file from being mapped *while we are
> switching*. Nothing in the mmap() path actually serialises against
> filesystem operations, and the initial behavioural checks in the
> page fault path are similarly unserialised against changing the
> S_DAX flag.

And the important part here is ->mmap.  If ->mmap doesn't get through
we are not going to see page faults.

> > and in the normal read/write path ensure the
> > flag can't be switch while I/O is going on, which could either be
> > done by ensuring it is only switched under i_rwsem or equivalent
> > protection, or by setting the DAX flag once in the iocb similar to
> > IOCB_DIRECT.
> 
> The iocb path is not the problem - that's entirely serialised
> against S_DAX changes by the i_rwsem. The problem is that we have no
> equivalent filesystem level serialisation for the entire mmap/page
> fault path, and it checks S_DAX all over the place.

Not right now.  We have various IS_DAX checks outside it.  But it is
easily fixable indeed.

> /me wonders if the best thing to do is to add a ->fault callout to
> tell the filesystem to lock/unlock the inode right up at the top of
> the page fault path, outside even the mmap_sem.  That means all the
> methods that the page fault calls are protected against S_DAX
> changes, and it gives us a low cost method of serialising page
> faults against DIO (e.g. via inode_dio_wait())....

Maybe.  Especially if it solves real problems and isn't just new
overhead to add an esoteric feature.

> 
> > And they easiest way to get all this done is as a first step to
> > just allowing switching the flag when no blocks are allocated at
> > all, similar to how the rt flag works.
> 
> False equivalence - it is not similar because the RT flag changes
> and their associated state checks are *already fully serialised* by
> the XFS_ILOCK_EXCL. S_DAX accesses have no such serialisation, and
> that's the problem we need to solve...

And my point is that if we ensure S_DAX can only be checked if there
are no blocks on the file, is is fairly easy to provide the same
guarantee.  And I've not heard any argument that we really need more
flexibility than that.  In fact I think just being able to change it
on the parent directory and inheriting the flag might be more than
plenty, which would lead to a very simple implementation without any
of the crazy overhead in this series.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-25 17:36           ` Christoph Hellwig
@ 2020-02-25 19:37             ` Jeff Moyer
  2020-02-26  9:28               ` Jonathan Halliday
  2020-02-25 21:03             ` Ira Weiny
  1 sibling, 1 reply; 54+ messages in thread
From: Jeff Moyer @ 2020-02-25 19:37 UTC (permalink / raw)
  To: Christoph Hellwig, jhallida
  Cc: Dave Chinner, ira.weiny, linux-kernel, Alexander Viro,
	Darrick J. Wong, Dan Williams, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

Christoph Hellwig <hch@lst.de> writes:

> And my point is that if we ensure S_DAX can only be checked if there
> are no blocks on the file, is is fairly easy to provide the same
> guarantee.  And I've not heard any argument that we really need more
> flexibility than that.  In fact I think just being able to change it
> on the parent directory and inheriting the flag might be more than
> plenty, which would lead to a very simple implementation without any
> of the crazy overhead in this series.

I know of one user who had at least mentioned it to me, so I cc'd him.
Jonathan, can you describe your use case for being able to change a
file between dax and non-dax modes?  Or, if I'm misremembering, just
correct me?

Thanks!
Jeff


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-25 17:36           ` Christoph Hellwig
  2020-02-25 19:37             ` Jeff Moyer
@ 2020-02-25 21:03             ` Ira Weiny
  1 sibling, 0 replies; 54+ messages in thread
From: Ira Weiny @ 2020-02-25 21:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, linux-kernel, Alexander Viro, Darrick J. Wong,
	Dan Williams, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Tue, Feb 25, 2020 at 06:36:33PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 25, 2020 at 11:09:37AM +1100, Dave Chinner wrote:
> > > No, the original code was broken because it didn't serialize between
> > > DAX and buffer access.
> > > 
> > > Take a step back and look where the problems are, and they are not
> > > mostly with the aops.  In fact the only aop useful for DAX is
> > > is ->writepages, and how it uses ->writepages is actually a bit of
> > > an abuse of that interface.
> > 
> > The races are all through the fops, too, which is one of the reasons
> > Darrick mentioned we should probably move this up to file ops
> > level...
> 
> But the file ops are very simple to use.  Pass the flag in the iocb,
> and make sure the flag can only changed with i_rwsem held.  That part
> is pretty trivial, the interesting case is mmap because it is so
> spread out.
> 
> > > So what we really need it just a way to prevent switching the flag
> > > when a file is mapped,
> > 
> > That's not sufficient.
> > 
> > We also have to prevent the file from being mapped *while we are
> > switching*. Nothing in the mmap() path actually serialises against
> > filesystem operations, and the initial behavioural checks in the
> > page fault path are similarly unserialised against changing the
> > S_DAX flag.
> 
> And the important part here is ->mmap.  If ->mmap doesn't get through
> we are not going to see page faults.

The series already blocks mmap'ed files from a switch.  That was not crazy
hard.  I don't see a use case to support changing the mode while the file is
mapped.

> 
> > > and in the normal read/write path ensure the
> > > flag can't be switch while I/O is going on, which could either be
> > > done by ensuring it is only switched under i_rwsem or equivalent
> > > protection, or by setting the DAX flag once in the iocb similar to
> > > IOCB_DIRECT.
> > 
> > The iocb path is not the problem - that's entirely serialised
> > against S_DAX changes by the i_rwsem. The problem is that we have no
> > equivalent filesystem level serialisation for the entire mmap/page
> > fault path, and it checks S_DAX all over the place.
> 
> Not right now.  We have various IS_DAX checks outside it.  But it is
> easily fixable indeed.
> 
> > /me wonders if the best thing to do is to add a ->fault callout to
> > tell the filesystem to lock/unlock the inode right up at the top of
> > the page fault path, outside even the mmap_sem.  That means all the
> > methods that the page fault calls are protected against S_DAX
> > changes, and it gives us a low cost method of serialising page
> > faults against DIO (e.g. via inode_dio_wait())....
> 
> Maybe.  Especially if it solves real problems and isn't just new
> overhead to add an esoteric feature.

I thought about doing something like this but it seemed easier to block the
change while the file was mapped.

> 
> > 
> > > And they easiest way to get all this done is as a first step to
> > > just allowing switching the flag when no blocks are allocated at
> > > all, similar to how the rt flag works.
> > 
> > False equivalence - it is not similar because the RT flag changes
> > and their associated state checks are *already fully serialised* by
> > the XFS_ILOCK_EXCL. S_DAX accesses have no such serialisation, and
> > that's the problem we need to solve...
> 
> And my point is that if we ensure S_DAX can only be checked if there
> are no blocks on the file, is is fairly easy to provide the same
> guarantee.  And I've not heard any argument that we really need more
> flexibility than that.  In fact I think just being able to change it
> on the parent directory and inheriting the flag might be more than
> plenty, which would lead to a very simple implementation without any
> of the crazy overhead in this series.

Inherit on file creation or also if a file was moved under the directory?

Do we need to consider hard or soft links?

Ira


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer
  2020-02-24 22:32       ` Dave Chinner
@ 2020-02-25 21:12         ` Ira Weiny
  2020-02-25 22:59           ` Dave Chinner
  0 siblings, 1 reply; 54+ messages in thread
From: Ira Weiny @ 2020-02-25 21:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Tue, Feb 25, 2020 at 09:32:45AM +1100, Dave Chinner wrote:
> On Mon, Feb 24, 2020 at 11:57:36AM -0800, Ira Weiny wrote:
> > On Mon, Feb 24, 2020 at 11:34:55AM +1100, Dave Chinner wrote:
> > > On Thu, Feb 20, 2020 at 04:41:30PM -0800, ira.weiny@intel.com wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > 
> > 
> > [snip]
> > 
> > > > 
> > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > index 35df324875db..5b014c428f0f 100644
> > > > --- a/fs/xfs/xfs_inode.c
> > > > +++ b/fs/xfs/xfs_inode.c
> > > > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> > > >   *
> > > >   * Basic locking order:
> > > >   *
> > > > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > > + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > >   *
> > > >   * mmap_sem locking order:
> > > >   *
> > > >   * i_rwsem -> page lock -> mmap_sem
> > > > - * mmap_sem -> i_mmap_lock -> page_lock
> > > > + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
> > > >   *
> > > >   * The difference in mmap_sem locking order mean that we cannot hold the
> > > >   * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> > > > @@ -182,6 +182,9 @@ xfs_ilock(
> > > >  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> > > >  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> > > >  
> > > > +	if (lock_flags & XFS_DAX_EXCL)
> > > > +		inode_aops_down_write(VFS_I(ip));
> > > 
> > > I largely don't see the point of adding this to xfs_ilock/iunlock.
> > > 
> > > It's only got one caller, so I don't see much point in adding it to
> > > an interface that has over a hundred other call sites that don't
> > > need or use this lock. just open code it where it is needed in the
> > > ioctl code.
> > 
> > I know it seems overkill but if we don't do this we need to code a flag to be
> > returned from xfs_ioctl_setattr_dax_invalidate().  This flag is then used in
> > xfs_ioctl_setattr_get_trans() to create the transaction log item which can then
> > be properly used to unlock the lock in xfs_inode_item_release()
> > 
> > I don't know of a cleaner way to communicate to xfs_inode_item_release() to
> > unlock i_aops_sem after the transaction is complete.
> 
> We manually unlock inodes after transactions in many cases -
> anywhere we do a rolling transaction, the inode locks do not get
> released by the transaction. Hence for a one-off case like this it
> doesn't really make sense to push all this infrastructure into the
> transaction subsystem. Especially as we can manually lock before and
> unlock after the transaction context without any real complexity.

So does xfs_trans_commit() operate synchronously?

I want to understand this better because I have fought with a lot of ABBA
issues with these locks.  So...  can I hold the lock until after
xfs_trans_commit() and safely unlock it there... because the XFS_MMAPLOCK_EXCL,
XFS_IOLOCK_EXCL, and XFS_ILOCK_EXCL will be released at that point?  Thus
preserving the following lock order.

...
 * Basic locking order:
 *
 * i_aops_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
 *
...

Thanks for the review!
Ira

> 
> This also means that we can, if necessary, do aops manipulation work
> /after/ the transaction that changes on-disk state completes and we
> still hold the aops reference exclusively. While we don't do that
> now, I think it is worthwhile keeping our options open here....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer
  2020-02-25 21:12         ` Ira Weiny
@ 2020-02-25 22:59           ` Dave Chinner
  2020-02-26 18:02             ` Ira Weiny
  0 siblings, 1 reply; 54+ messages in thread
From: Dave Chinner @ 2020-02-25 22:59 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Tue, Feb 25, 2020 at 01:12:28PM -0800, Ira Weiny wrote:
> On Tue, Feb 25, 2020 at 09:32:45AM +1100, Dave Chinner wrote:
> > On Mon, Feb 24, 2020 at 11:57:36AM -0800, Ira Weiny wrote:
> > > On Mon, Feb 24, 2020 at 11:34:55AM +1100, Dave Chinner wrote:
> > > > On Thu, Feb 20, 2020 at 04:41:30PM -0800, ira.weiny@intel.com wrote:
> > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > > 
> > > 
> > > [snip]
> > > 
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > index 35df324875db..5b014c428f0f 100644
> > > > > --- a/fs/xfs/xfs_inode.c
> > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> > > > >   *
> > > > >   * Basic locking order:
> > > > >   *
> > > > > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > > > + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > > >   *
> > > > >   * mmap_sem locking order:
> > > > >   *
> > > > >   * i_rwsem -> page lock -> mmap_sem
> > > > > - * mmap_sem -> i_mmap_lock -> page_lock
> > > > > + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
> > > > >   *
> > > > >   * The difference in mmap_sem locking order mean that we cannot hold the
> > > > >   * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> > > > > @@ -182,6 +182,9 @@ xfs_ilock(
> > > > >  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> > > > >  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> > > > >  
> > > > > +	if (lock_flags & XFS_DAX_EXCL)
> > > > > +		inode_aops_down_write(VFS_I(ip));
> > > > 
> > > > I largely don't see the point of adding this to xfs_ilock/iunlock.
> > > > 
> > > > It's only got one caller, so I don't see much point in adding it to
> > > > an interface that has over a hundred other call sites that don't
> > > > need or use this lock. just open code it where it is needed in the
> > > > ioctl code.
> > > 
> > > I know it seems overkill but if we don't do this we need to code a flag to be
> > > returned from xfs_ioctl_setattr_dax_invalidate().  This flag is then used in
> > > xfs_ioctl_setattr_get_trans() to create the transaction log item which can then
> > > be properly used to unlock the lock in xfs_inode_item_release()
> > > 
> > > I don't know of a cleaner way to communicate to xfs_inode_item_release() to
> > > unlock i_aops_sem after the transaction is complete.
> > 
> > We manually unlock inodes after transactions in many cases -
> > anywhere we do a rolling transaction, the inode locks do not get
> > released by the transaction. Hence for a one-off case like this it
> > doesn't really make sense to push all this infrastructure into the
> > transaction subsystem. Especially as we can manually lock before and
> > unlock after the transaction context without any real complexity.
> 
> So does xfs_trans_commit() operate synchronously?

What do you mean by "synchronously", and what are you expecting to
occur (a)synchronously with respect to filesystem objects and/or
on-disk state?

Keep in mid that the xfs transaction subsystem is a complex
asynchronous IO engine full of feedback loops and resource
management, so asking if something is "synchronous" without any
other context is a difficult question to answer :)

> I want to understand this better because I have fought with a lot of ABBA
> issues with these locks.  So...  can I hold the lock until after
> xfs_trans_commit() and safely unlock it there... because the XFS_MMAPLOCK_EXCL,
> XFS_IOLOCK_EXCL, and XFS_ILOCK_EXCL will be released at that point?  Thus
> preserving the following lock order.

See how operations like xfs_create, xfs_unlink, etc work. The don't
specify flags to xfs_ijoin(), and so the transaction commits don't
automatically unlock the inode. This is necessary so that rolling
transactions are executed atomically w.r.t. inode access - no-one
can lock and access the inode while a multi-commit rolling
transaction on the inode is on-going.

In this case it's just a single commit and we don't need to keep
it locked after the change is made, so we can unlock the inode
on commit. So for the XFS internal locks the code is fine and
doesn't need to change. We just need to wrap the VFS aops lock (if
we keep it) around the outside of all the XFS locking until the
transaction commits and unlocks the XFS locks...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-25 19:37             ` Jeff Moyer
@ 2020-02-26  9:28               ` Jonathan Halliday
  2020-02-26 11:31                 ` Jan Kara
                                   ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Jonathan Halliday @ 2020-02-26  9:28 UTC (permalink / raw)
  To: Jeff Moyer, Christoph Hellwig
  Cc: Dave Chinner, ira.weiny, linux-kernel, Alexander Viro,
	Darrick J. Wong, Dan Williams, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel


Hi All

I'm a middleware developer, focused on how Java (JVM) workloads can 
benefit from app-direct mode pmem. Initially the target is apps that 
need a fast binary log for fault tolerance: the classic database WAL use 
case; transaction coordination systems; enterprise message bus 
persistence and suchlike. Critically, there are cases where we use log 
based storage, i.e. it's not the strict 'read rarely, only on recovery' 
model that a classic db may have, but more of a 'append only, read many 
times' event stream model.

Think of the log oriented data storage as having logical segments (let's 
implement them as files), of which the most recent is being appended to 
(read_write) and the remaining N-1 older segments are full and sealed, 
so effectively immutable (read_only) until discarded. The tail segment 
needs to be in DAX mode for optimal write performance, as the size of 
the append may be sub-block and we don't want the overhead of the kernel 
call anyhow. So that's clearly a good fit for putting on a DAX fs mount 
and using mmap with MAP_SYNC.

However, we want fast read access into the segments, to retrieve stored 
records. The small access index can be built in volatile RAM (assuming 
we're willing to take the startup overhead of a full file scan at 
recovery time) but the data itself is big and we don't want to move it 
all off pmem. Which means the requirements are now different: we want 
the O/S cache to pull hot data into fast volatile RAM for us, which DAX 
explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather 
than app-direct mode, except here we're using the O/S rather than the 
hardware memory controller to do the cache management for us.

Currently this requires closing the full (read_write) file, then copying 
it to a non-DAX device and reopening it (read_only) there. Clearly 
that's expensive and rather tedious. Instead, I'd like to close the 
MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode 
that will instead go via the O/S cache in the traditional manner. Bonus 
points if I can do it over non-overlapping ranges in a file without 
closing the DAX mode mmap, since then the segments are entirely logical 
instead of needing separate physical files.

I note a comment below regarding a per-directly setting, but don't have 
the background to fully understand what's being suggested. However, I'll 
note here that I can live with a per-directory granularity, as relinking 
a file into a new dir is a constant time operation, whilst the move 
described above isn't. So if a per-directory granularity is easier than 
a per-file one that's fine, though as a person with only passing 
knowledge of filesystem design I don't see how having multiple links to 
a file can work cleanly in that case.

Hope that helps.

Jonathan

P.S. I'll cheekily take the opportunity of having your attention to tack 
on one minor gripe about the current system: The only way to know if a 
mmap with MAP_SYNC will work is to try it and catch the error. Which 
would be reasonable if it were free of side effects.  However, the 
process requires first expanding the file to at least the size of the 
desired map, which is done non-atomically i.e. is user visible. There 
are thus nasty race conditions in the cleanup, where after a failed mmap 
attempt (e.g the device doesn't support DAX), we try to shrink the file 
back to its original size, but something else has already opened it at 
its new, larger size. This is not theoretical: I got caught by it whilst 
adapting some of our middleware to use pmem.  Therefore, some way to 
probe the file path for its capability would be nice, much the same as I 
can e.g. inspect file permissions to (more or less) evaluate if I can 
write it without actually mutating it.  Thanks!



On 25/02/2020 19:37, Jeff Moyer wrote:
> Christoph Hellwig <hch@lst.de> writes:
> 
>> And my point is that if we ensure S_DAX can only be checked if there
>> are no blocks on the file, is is fairly easy to provide the same
>> guarantee.  And I've not heard any argument that we really need more
>> flexibility than that.  In fact I think just being able to change it
>> on the parent directory and inheriting the flag might be more than
>> plenty, which would lead to a very simple implementation without any
>> of the crazy overhead in this series.
> 
> I know of one user who had at least mentioned it to me, so I cc'd him.
> Jonathan, can you describe your use case for being able to change a
> file between dax and non-dax modes?  Or, if I'm misremembering, just
> correct me?
> 
> Thanks!
> Jeff
> 

-- 
Registered in England and Wales under Company Registration No. 03798903 
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-25  0:09         ` Dave Chinner
  2020-02-25 17:36           ` Christoph Hellwig
@ 2020-02-26 11:17           ` Jan Kara
  2020-02-26 15:57             ` Ira Weiny
  1 sibling, 1 reply; 54+ messages in thread
From: Jan Kara @ 2020-02-26 11:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, ira.weiny, linux-kernel, Alexander Viro,
	Darrick J. Wong, Dan Williams, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Tue 25-02-20 11:09:37, Dave Chinner wrote:
> /me wonders if the best thing to do is to add a ->fault callout to
> tell the filesystem to lock/unlock the inode right up at the top of
> the page fault path, outside even the mmap_sem.  That means all the
> methods that the page fault calls are protected against S_DAX
> changes, and it gives us a low cost method of serialising page
> faults against DIO (e.g. via inode_dio_wait())....

Well, that's going to be pretty hard. The main problem is: you cannot
lookup VMA until you hold mmap_sem and the inode is inside the VMA. And
this is a fundamental problem because until you hold mmap_sem, the address
space can change and thus the virtual address you are faulting can be
changing inode it is mapped to. So you would have to do some dance like:

lock mmap_sem
lookup vma
get inode reference
drop mmap_sem
tell fs about page fault
lock mmap_sem
is the vma still the same?

And I'm pretty confident the overhead will be visible in page fault
intensive workloads...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-26  9:28               ` Jonathan Halliday
@ 2020-02-26 11:31                 ` Jan Kara
  2020-02-26 11:56                   ` Jonathan Halliday
  2020-02-26 16:10                 ` Ira Weiny
  2020-02-26 16:46                 ` Dan Williams
  2 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2020-02-26 11:31 UTC (permalink / raw)
  To: Jonathan Halliday
  Cc: Jeff Moyer, Christoph Hellwig, Dave Chinner, ira.weiny,
	linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

Hello,

On Wed 26-02-20 09:28:57, Jonathan Halliday wrote:
> I'm a middleware developer, focused on how Java (JVM) workloads can benefit
> from app-direct mode pmem. Initially the target is apps that need a fast
> binary log for fault tolerance: the classic database WAL use case;
> transaction coordination systems; enterprise message bus persistence and
> suchlike. Critically, there are cases where we use log based storage, i.e.
> it's not the strict 'read rarely, only on recovery' model that a classic db
> may have, but more of a 'append only, read many times' event stream model.
> 
> Think of the log oriented data storage as having logical segments (let's
> implement them as files), of which the most recent is being appended to
> (read_write) and the remaining N-1 older segments are full and sealed, so
> effectively immutable (read_only) until discarded. The tail segment needs to
> be in DAX mode for optimal write performance, as the size of the append may
> be sub-block and we don't want the overhead of the kernel call anyhow. So
> that's clearly a good fit for putting on a DAX fs mount and using mmap with
> MAP_SYNC.
> 
> However, we want fast read access into the segments, to retrieve stored
> records. The small access index can be built in volatile RAM (assuming we're
> willing to take the startup overhead of a full file scan at recovery time)
> but the data itself is big and we don't want to move it all off pmem. Which
> means the requirements are now different: we want the O/S cache to pull hot
> data into fast volatile RAM for us, which DAX explicitly won't do.
> Effectively a poor man's 'memory mode' pmem, rather than app-direct mode,
> except here we're using the O/S rather than the hardware memory controller
> to do the cache management for us.
> 
> Currently this requires closing the full (read_write) file, then copying it
> to a non-DAX device and reopening it (read_only) there. Clearly that's
> expensive and rather tedious. Instead, I'd like to close the MAP_SYNC mmap,
> then, leaving the file where it is, reopen it in a mode that will instead go
> via the O/S cache in the traditional manner. Bonus points if I can do it
> over non-overlapping ranges in a file without closing the DAX mode mmap,
> since then the segments are entirely logical instead of needing separate
> physical files.
> 
> I note a comment below regarding a per-directly setting, but don't have the
> background to fully understand what's being suggested. However, I'll note
> here that I can live with a per-directory granularity, as relinking a file
> into a new dir is a constant time operation, whilst the move described above
> isn't. So if a per-directory granularity is easier than a per-file one
> that's fine, though as a person with only passing knowledge of filesystem
> design I don't see how having multiple links to a file can work cleanly in
> that case.

Well, with per-directory setting, relinking the file will not magically
make it stop using DAX. So your situation would be very similar to the
current one, except "copy to non-DAX device" can be replaced by "copy to
non-DAX directory". Maybe the "copy" part could be actually reflink which
would make it faster.

> P.S. I'll cheekily take the opportunity of having your attention to tack on
> one minor gripe about the current system: The only way to know if a mmap
> with MAP_SYNC will work is to try it and catch the error. Which would be
> reasonable if it were free of side effects.  However, the process requires
> first expanding the file to at least the size of the desired map, which is
> done non-atomically i.e. is user visible. There are thus nasty race
> conditions in the cleanup, where after a failed mmap attempt (e.g the device
> doesn't support DAX), we try to shrink the file back to its original size,
> but something else has already opened it at its new, larger size. This is
> not theoretical: I got caught by it whilst adapting some of our middleware
> to use pmem.  Therefore, some way to probe the file path for its capability
> would be nice, much the same as I can e.g. inspect file permissions to (more
> or less) evaluate if I can write it without actually mutating it.  Thanks!

Well, reporting error on mmap(2) is the only way how to avoid
time-to-check-time-to-use races. And these are very important when we are
speaking about data integrity guarantees. So that's not going to change.
But with Ira's patches you could use statx(2) to check whether file at
least supports DAX and so avoid doing mmap check with the side effects in
the common case where it's hopeless... I'd also think that you could
currently do mmap check with the current file size and if it succeeds,
expand the file to the desired size and mmap again. It's not ideal but it
should work.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-26 11:31                 ` Jan Kara
@ 2020-02-26 11:56                   ` Jonathan Halliday
  0 siblings, 0 replies; 54+ messages in thread
From: Jonathan Halliday @ 2020-02-26 11:56 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Moyer, Christoph Hellwig, Dave Chinner, ira.weiny,
	linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel



On 26/02/2020 11:31, Jan Kara wrote:
> Hello,
> 
> On Wed 26-02-20 09:28:57, Jonathan Halliday wrote:
>> I'm a middleware developer, focused on how Java (JVM) workloads can benefit
>> from app-direct mode pmem. Initially the target is apps that need a fast
>> binary log for fault tolerance: the classic database WAL use case;
>> transaction coordination systems; enterprise message bus persistence and
>> suchlike. Critically, there are cases where we use log based storage, i.e.
>> it's not the strict 'read rarely, only on recovery' model that a classic db
>> may have, but more of a 'append only, read many times' event stream model.
>>
>> Think of the log oriented data storage as having logical segments (let's
>> implement them as files), of which the most recent is being appended to
>> (read_write) and the remaining N-1 older segments are full and sealed, so
>> effectively immutable (read_only) until discarded. The tail segment needs to
>> be in DAX mode for optimal write performance, as the size of the append may
>> be sub-block and we don't want the overhead of the kernel call anyhow. So
>> that's clearly a good fit for putting on a DAX fs mount and using mmap with
>> MAP_SYNC.
>>
>> However, we want fast read access into the segments, to retrieve stored
>> records. The small access index can be built in volatile RAM (assuming we're
>> willing to take the startup overhead of a full file scan at recovery time)
>> but the data itself is big and we don't want to move it all off pmem. Which
>> means the requirements are now different: we want the O/S cache to pull hot
>> data into fast volatile RAM for us, which DAX explicitly won't do.
>> Effectively a poor man's 'memory mode' pmem, rather than app-direct mode,
>> except here we're using the O/S rather than the hardware memory controller
>> to do the cache management for us.
>>
>> Currently this requires closing the full (read_write) file, then copying it
>> to a non-DAX device and reopening it (read_only) there. Clearly that's
>> expensive and rather tedious. Instead, I'd like to close the MAP_SYNC mmap,
>> then, leaving the file where it is, reopen it in a mode that will instead go
>> via the O/S cache in the traditional manner. Bonus points if I can do it
>> over non-overlapping ranges in a file without closing the DAX mode mmap,
>> since then the segments are entirely logical instead of needing separate
>> physical files.
>>
>> I note a comment below regarding a per-directly setting, but don't have the
>> background to fully understand what's being suggested. However, I'll note
>> here that I can live with a per-directory granularity, as relinking a file
>> into a new dir is a constant time operation, whilst the move described above
>> isn't. So if a per-directory granularity is easier than a per-file one
>> that's fine, though as a person with only passing knowledge of filesystem
>> design I don't see how having multiple links to a file can work cleanly in
>> that case.
> 
> Well, with per-directory setting, relinking the file will not magically
> make it stop using DAX. So your situation would be very similar to the
> current one, except "copy to non-DAX device" can be replaced by "copy to
> non-DAX directory". Maybe the "copy" part could be actually reflink which
> would make it faster.

Indeed. The requirement is for 'change mode in constant time' rather 
than the current 'change mode in time proportional to file size'. That 
seems to imply requiring the approach to just change fs metadata, 
without relocating the file data bytes. Beyond that I'm largely 
indifferent to the implementation details.

>> P.S. I'll cheekily take the opportunity of having your attention to tack on
>> one minor gripe about the current system: The only way to know if a mmap
>> with MAP_SYNC will work is to try it and catch the error. Which would be
>> reasonable if it were free of side effects.  However, the process requires
>> first expanding the file to at least the size of the desired map, which is
>> done non-atomically i.e. is user visible. There are thus nasty race
>> conditions in the cleanup, where after a failed mmap attempt (e.g the device
>> doesn't support DAX), we try to shrink the file back to its original size,
>> but something else has already opened it at its new, larger size. This is
>> not theoretical: I got caught by it whilst adapting some of our middleware
>> to use pmem.  Therefore, some way to probe the file path for its capability
>> would be nice, much the same as I can e.g. inspect file permissions to (more
>> or less) evaluate if I can write it without actually mutating it.  Thanks!
> 
> Well, reporting error on mmap(2) is the only way how to avoid
> time-to-check-time-to-use races. And these are very important when we are
> speaking about data integrity guarantees. So that's not going to change.
> But with Ira's patches you could use statx(2) to check whether file at
> least supports DAX and so avoid doing mmap check with the side effects in
> the common case where it's hopeless... I'd also think that you could
> currently do mmap check with the current file size and if it succeeds,
> expand the file to the desired size and mmap again. It's not ideal but it
> should work.

Sure. Best effort is fine here, just as with looking at the permission 
bits on a file example - even in the absence of racing permission 
changes it's not definitive because of additional quota or selinux 
checks, but it's a reasonable approximation. That's a sufficiently 
useful improvement for my purposes, given the impractical nature of a 
100% solution.

Jonathan


-- 
Registered in England and Wales under Company Registration No. 03798903 
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-26 11:17           ` Jan Kara
@ 2020-02-26 15:57             ` Ira Weiny
  0 siblings, 0 replies; 54+ messages in thread
From: Ira Weiny @ 2020-02-26 15:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Christoph Hellwig, linux-kernel, Alexander Viro,
	Darrick J. Wong, Dan Williams, Theodore Y. Ts'o, linux-ext4,
	linux-xfs, linux-fsdevel

On Wed, Feb 26, 2020 at 12:17:40PM +0100, Jan Kara wrote:
> On Tue 25-02-20 11:09:37, Dave Chinner wrote:
> > /me wonders if the best thing to do is to add a ->fault callout to
> > tell the filesystem to lock/unlock the inode right up at the top of
> > the page fault path, outside even the mmap_sem.  That means all the
> > methods that the page fault calls are protected against S_DAX
> > changes, and it gives us a low cost method of serialising page
> > faults against DIO (e.g. via inode_dio_wait())....
> 
> Well, that's going to be pretty hard. The main problem is: you cannot
> lookup VMA until you hold mmap_sem and the inode is inside the VMA. And
> this is a fundamental problem because until you hold mmap_sem, the address
> space can change and thus the virtual address you are faulting can be
> changing inode it is mapped to. So you would have to do some dance like:
> 
> lock mmap_sem
> lookup vma
> get inode reference
> drop mmap_sem
> tell fs about page fault
> lock mmap_sem
> is the vma still the same?
> 
> And I'm pretty confident the overhead will be visible in page fault
> intensive workloads...

I did not get to this level of detail...  Rather I looked at it from a high
level perspective and thought "does the mode need to change while someone has
the mmap?"  My thought is, that it does not make a lot of sense.  Generally the
user has mmaped with some use case in mind (either DAX or non-DAX) and it seems
reasonable to keep that mode consistent while the map is in place.

So I punted and restricted the change.

Ira

> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-26  9:28               ` Jonathan Halliday
  2020-02-26 11:31                 ` Jan Kara
@ 2020-02-26 16:10                 ` Ira Weiny
  2020-02-26 16:46                 ` Dan Williams
  2 siblings, 0 replies; 54+ messages in thread
From: Ira Weiny @ 2020-02-26 16:10 UTC (permalink / raw)
  To: Jonathan Halliday
  Cc: Jeff Moyer, Christoph Hellwig, Dave Chinner, linux-kernel,
	Alexander Viro, Darrick J. Wong, Dan Williams,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

On Wed, Feb 26, 2020 at 09:28:57AM +0000, Jonathan Halliday wrote:
> 
> Hi All
> 
> I'm a middleware developer, focused on how Java (JVM) workloads can benefit
> from app-direct mode pmem. Initially the target is apps that need a fast
> binary log for fault tolerance: the classic database WAL use case;
> transaction coordination systems; enterprise message bus persistence and
> suchlike. Critically, there are cases where we use log based storage, i.e.
> it's not the strict 'read rarely, only on recovery' model that a classic db
> may have, but more of a 'append only, read many times' event stream model.
> 
> Think of the log oriented data storage as having logical segments (let's
> implement them as files), of which the most recent is being appended to
> (read_write) and the remaining N-1 older segments are full and sealed, so
> effectively immutable (read_only) until discarded. The tail segment needs to
> be in DAX mode for optimal write performance, as the size of the append may
> be sub-block and we don't want the overhead of the kernel call anyhow. So
> that's clearly a good fit for putting on a DAX fs mount and using mmap with
> MAP_SYNC.
> 
> However, we want fast read access into the segments, to retrieve stored
> records. The small access index can be built in volatile RAM (assuming we're
> willing to take the startup overhead of a full file scan at recovery time)
> but the data itself is big and we don't want to move it all off pmem. Which
> means the requirements are now different: we want the O/S cache to pull hot
> data into fast volatile RAM for us, which DAX explicitly won't do.
> Effectively a poor man's 'memory mode' pmem, rather than app-direct mode,
> except here we're using the O/S rather than the hardware memory controller
> to do the cache management for us.
> 
> Currently this requires closing the full (read_write) file, then copying it
> to a non-DAX device and reopening it (read_only) there. Clearly that's
> expensive and rather tedious. Instead, I'd like to close the MAP_SYNC mmap,
> then, leaving the file where it is, reopen it in a mode that will instead go
> via the O/S cache in the traditional manner.

This patch set implements this capability.

> Bonus points if I can do it
> over non-overlapping ranges in a file without closing the DAX mode mmap,
> since then the segments are entirely logical instead of needing separate
> physical files.

But it is too hard to do so while an mmap is in place so it can't do this.  So
no bonus points...

> 
> I note a comment below regarding a per-directly setting, but don't have the
> background to fully understand what's being suggested. However, I'll note
> here that I can live with a per-directory granularity, as relinking a file
> into a new dir is a constant time operation, whilst the move described above
> isn't. So if a per-directory granularity is easier than a per-file one
> that's fine, though as a person with only passing knowledge of filesystem
> design I don't see how having multiple links to a file can work cleanly in
> that case.

The more I think about it the more I'm not comfortable with a directory option.
soft links and hard links complicate that IMO.  The current inheritance model
is at file creation time and the file sticks with that state (mode).  That is
easy enough and carry's with the file without having the complexity of possibly
looking at the wrong parent directory.

Ira


> 
> Hope that helps.
> 
> Jonathan
> 
> P.S. I'll cheekily take the opportunity of having your attention to tack on
> one minor gripe about the current system: The only way to know if a mmap
> with MAP_SYNC will work is to try it and catch the error. Which would be
> reasonable if it were free of side effects.  However, the process requires
> first expanding the file to at least the size of the desired map, which is
> done non-atomically i.e. is user visible. There are thus nasty race
> conditions in the cleanup, where after a failed mmap attempt (e.g the device
> doesn't support DAX), we try to shrink the file back to its original size,
> but something else has already opened it at its new, larger size. This is
> not theoretical: I got caught by it whilst adapting some of our middleware
> to use pmem.  Therefore, some way to probe the file path for its capability
> would be nice, much the same as I can e.g. inspect file permissions to (more
> or less) evaluate if I can write it without actually mutating it.  Thanks!
> 
> 
> 
> On 25/02/2020 19:37, Jeff Moyer wrote:
> > Christoph Hellwig <hch@lst.de> writes:
> > 
> > > And my point is that if we ensure S_DAX can only be checked if there
> > > are no blocks on the file, is is fairly easy to provide the same
> > > guarantee.  And I've not heard any argument that we really need more
> > > flexibility than that.  In fact I think just being able to change it
> > > on the parent directory and inheriting the flag might be more than
> > > plenty, which would lead to a very simple implementation without any
> > > of the crazy overhead in this series.
> > 
> > I know of one user who had at least mentioned it to me, so I cc'd him.
> > Jonathan, can you describe your use case for being able to change a
> > file between dax and non-dax modes?  Or, if I'm misremembering, just
> > correct me?
> > 
> > Thanks!
> > Jeff
> > 
> 
> -- 
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-26  9:28               ` Jonathan Halliday
  2020-02-26 11:31                 ` Jan Kara
  2020-02-26 16:10                 ` Ira Weiny
@ 2020-02-26 16:46                 ` Dan Williams
  2020-02-26 17:20                   ` Jan Kara
  2 siblings, 1 reply; 54+ messages in thread
From: Dan Williams @ 2020-02-26 16:46 UTC (permalink / raw)
  To: Jonathan Halliday
  Cc: Jeff Moyer, Christoph Hellwig, Dave Chinner, Weiny, Ira,
	Linux Kernel Mailing List, Alexander Viro, Darrick J. Wong,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday
<jonathan.halliday@redhat.com> wrote:
>
>
> Hi All
>
> I'm a middleware developer, focused on how Java (JVM) workloads can
> benefit from app-direct mode pmem. Initially the target is apps that
> need a fast binary log for fault tolerance: the classic database WAL use
> case; transaction coordination systems; enterprise message bus
> persistence and suchlike. Critically, there are cases where we use log
> based storage, i.e. it's not the strict 'read rarely, only on recovery'
> model that a classic db may have, but more of a 'append only, read many
> times' event stream model.
>
> Think of the log oriented data storage as having logical segments (let's
> implement them as files), of which the most recent is being appended to
> (read_write) and the remaining N-1 older segments are full and sealed,
> so effectively immutable (read_only) until discarded. The tail segment
> needs to be in DAX mode for optimal write performance, as the size of
> the append may be sub-block and we don't want the overhead of the kernel
> call anyhow. So that's clearly a good fit for putting on a DAX fs mount
> and using mmap with MAP_SYNC.
>
> However, we want fast read access into the segments, to retrieve stored
> records. The small access index can be built in volatile RAM (assuming
> we're willing to take the startup overhead of a full file scan at
> recovery time) but the data itself is big and we don't want to move it
> all off pmem. Which means the requirements are now different: we want
> the O/S cache to pull hot data into fast volatile RAM for us, which DAX
> explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather
> than app-direct mode, except here we're using the O/S rather than the
> hardware memory controller to do the cache management for us.
>
> Currently this requires closing the full (read_write) file, then copying
> it to a non-DAX device and reopening it (read_only) there. Clearly
> that's expensive and rather tedious. Instead, I'd like to close the
> MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode
> that will instead go via the O/S cache in the traditional manner. Bonus
> points if I can do it over non-overlapping ranges in a file without
> closing the DAX mode mmap, since then the segments are entirely logical
> instead of needing separate physical files.

Hi John,

IIRC we chatted about this at PIRL, right?

At the time it sounded more like mixed mode dax, i.e. dax writes, but
cached reads. To me that's an optimization to optionally use dax for
direct-I/O writes, with its existing set of page-cache coherence
warts, and not a capability to dynamically switch the dax-mode.
mmap+MAP_SYNC seems the wrong interface for this. This writeup
mentions bypassing kernel call overhead, but I don't see how a
dax-write syscall is cheaper than an mmap syscall plus fault. If
direct-I/O to a dax capable file bypasses the block layer, isn't that
about the maximum of kernel overhead that can be cut out of this use
case? Otherwise MAP_SYNC is a facility to achieve efficient sub-block
update-in-place writes not append writes.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-26 16:46                 ` Dan Williams
@ 2020-02-26 17:20                   ` Jan Kara
  2020-02-26 17:54                     ` Dan Williams
  0 siblings, 1 reply; 54+ messages in thread
From: Jan Kara @ 2020-02-26 17:20 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jonathan Halliday, Jeff Moyer, Christoph Hellwig, Dave Chinner,
	Weiny, Ira, Linux Kernel Mailing List, Alexander Viro,
	Darrick J. Wong, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Wed 26-02-20 08:46:42, Dan Williams wrote:
> On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday
> <jonathan.halliday@redhat.com> wrote:
> >
> >
> > Hi All
> >
> > I'm a middleware developer, focused on how Java (JVM) workloads can
> > benefit from app-direct mode pmem. Initially the target is apps that
> > need a fast binary log for fault tolerance: the classic database WAL use
> > case; transaction coordination systems; enterprise message bus
> > persistence and suchlike. Critically, there are cases where we use log
> > based storage, i.e. it's not the strict 'read rarely, only on recovery'
> > model that a classic db may have, but more of a 'append only, read many
> > times' event stream model.
> >
> > Think of the log oriented data storage as having logical segments (let's
> > implement them as files), of which the most recent is being appended to
> > (read_write) and the remaining N-1 older segments are full and sealed,
> > so effectively immutable (read_only) until discarded. The tail segment
> > needs to be in DAX mode for optimal write performance, as the size of
> > the append may be sub-block and we don't want the overhead of the kernel
> > call anyhow. So that's clearly a good fit for putting on a DAX fs mount
> > and using mmap with MAP_SYNC.
> >
> > However, we want fast read access into the segments, to retrieve stored
> > records. The small access index can be built in volatile RAM (assuming
> > we're willing to take the startup overhead of a full file scan at
> > recovery time) but the data itself is big and we don't want to move it
> > all off pmem. Which means the requirements are now different: we want
> > the O/S cache to pull hot data into fast volatile RAM for us, which DAX
> > explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather
> > than app-direct mode, except here we're using the O/S rather than the
> > hardware memory controller to do the cache management for us.
> >
> > Currently this requires closing the full (read_write) file, then copying
> > it to a non-DAX device and reopening it (read_only) there. Clearly
> > that's expensive and rather tedious. Instead, I'd like to close the
> > MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode
> > that will instead go via the O/S cache in the traditional manner. Bonus
> > points if I can do it over non-overlapping ranges in a file without
> > closing the DAX mode mmap, since then the segments are entirely logical
> > instead of needing separate physical files.
> 
> Hi John,
> 
> IIRC we chatted about this at PIRL, right?
> 
> At the time it sounded more like mixed mode dax, i.e. dax writes, but
> cached reads. To me that's an optimization to optionally use dax for
> direct-I/O writes, with its existing set of page-cache coherence
> warts, and not a capability to dynamically switch the dax-mode.
> mmap+MAP_SYNC seems the wrong interface for this. This writeup
> mentions bypassing kernel call overhead, but I don't see how a
> dax-write syscall is cheaper than an mmap syscall plus fault. If
> direct-I/O to a dax capable file bypasses the block layer, isn't that
> about the maximum of kernel overhead that can be cut out of this use
> case? Otherwise MAP_SYNC is a facility to achieve efficient sub-block
> update-in-place writes not append writes.

Well, even for appends you'll pay the cost only once per page (or maybe even
once per huge page) when using MAP_SYNC. With a syscall you'll pay once per
write. So although it would be good to check real numbers, the design isn't
non-sensical to me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state
  2020-02-26 17:20                   ` Jan Kara
@ 2020-02-26 17:54                     ` Dan Williams
  0 siblings, 0 replies; 54+ messages in thread
From: Dan Williams @ 2020-02-26 17:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jonathan Halliday, Jeff Moyer, Christoph Hellwig, Dave Chinner,
	Weiny, Ira, Linux Kernel Mailing List, Alexander Viro,
	Darrick J. Wong, Theodore Y. Ts'o, linux-ext4, linux-xfs,
	linux-fsdevel

On Wed, Feb 26, 2020 at 9:20 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 26-02-20 08:46:42, Dan Williams wrote:
> > On Wed, Feb 26, 2020 at 1:29 AM Jonathan Halliday
> > <jonathan.halliday@redhat.com> wrote:
> > >
> > >
> > > Hi All
> > >
> > > I'm a middleware developer, focused on how Java (JVM) workloads can
> > > benefit from app-direct mode pmem. Initially the target is apps that
> > > need a fast binary log for fault tolerance: the classic database WAL use
> > > case; transaction coordination systems; enterprise message bus
> > > persistence and suchlike. Critically, there are cases where we use log
> > > based storage, i.e. it's not the strict 'read rarely, only on recovery'
> > > model that a classic db may have, but more of a 'append only, read many
> > > times' event stream model.
> > >
> > > Think of the log oriented data storage as having logical segments (let's
> > > implement them as files), of which the most recent is being appended to
> > > (read_write) and the remaining N-1 older segments are full and sealed,
> > > so effectively immutable (read_only) until discarded. The tail segment
> > > needs to be in DAX mode for optimal write performance, as the size of
> > > the append may be sub-block and we don't want the overhead of the kernel
> > > call anyhow. So that's clearly a good fit for putting on a DAX fs mount
> > > and using mmap with MAP_SYNC.
> > >
> > > However, we want fast read access into the segments, to retrieve stored
> > > records. The small access index can be built in volatile RAM (assuming
> > > we're willing to take the startup overhead of a full file scan at
> > > recovery time) but the data itself is big and we don't want to move it
> > > all off pmem. Which means the requirements are now different: we want
> > > the O/S cache to pull hot data into fast volatile RAM for us, which DAX
> > > explicitly won't do. Effectively a poor man's 'memory mode' pmem, rather
> > > than app-direct mode, except here we're using the O/S rather than the
> > > hardware memory controller to do the cache management for us.
> > >
> > > Currently this requires closing the full (read_write) file, then copying
> > > it to a non-DAX device and reopening it (read_only) there. Clearly
> > > that's expensive and rather tedious. Instead, I'd like to close the
> > > MAP_SYNC mmap, then, leaving the file where it is, reopen it in a mode
> > > that will instead go via the O/S cache in the traditional manner. Bonus
> > > points if I can do it over non-overlapping ranges in a file without
> > > closing the DAX mode mmap, since then the segments are entirely logical
> > > instead of needing separate physical files.
> >
> > Hi John,
> >
> > IIRC we chatted about this at PIRL, right?
> >
> > At the time it sounded more like mixed mode dax, i.e. dax writes, but
> > cached reads. To me that's an optimization to optionally use dax for
> > direct-I/O writes, with its existing set of page-cache coherence
> > warts, and not a capability to dynamically switch the dax-mode.
> > mmap+MAP_SYNC seems the wrong interface for this. This writeup
> > mentions bypassing kernel call overhead, but I don't see how a
> > dax-write syscall is cheaper than an mmap syscall plus fault. If
> > direct-I/O to a dax capable file bypasses the block layer, isn't that
> > about the maximum of kernel overhead that can be cut out of this use
> > case? Otherwise MAP_SYNC is a facility to achieve efficient sub-block
> > update-in-place writes not append writes.
>
> Well, even for appends you'll pay the cost only once per page (or maybe even
> once per huge page) when using MAP_SYNC. With a syscall you'll pay once per
> write. So although it would be good to check real numbers, the design isn't
> non-sensical to me.

True, Jonathan, how many writes per page are we talking about in this case?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer
  2020-02-25 22:59           ` Dave Chinner
@ 2020-02-26 18:02             ` Ira Weiny
  0 siblings, 0 replies; 54+ messages in thread
From: Ira Weiny @ 2020-02-26 18:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Wed, Feb 26, 2020 at 09:59:41AM +1100, Dave Chinner wrote:
> On Tue, Feb 25, 2020 at 01:12:28PM -0800, Ira Weiny wrote:
> > On Tue, Feb 25, 2020 at 09:32:45AM +1100, Dave Chinner wrote:
> > > On Mon, Feb 24, 2020 at 11:57:36AM -0800, Ira Weiny wrote:
> > > > On Mon, Feb 24, 2020 at 11:34:55AM +1100, Dave Chinner wrote:
> > > > > On Thu, Feb 20, 2020 at 04:41:30PM -0800, ira.weiny@intel.com wrote:
> > > > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > > > 
> > > > 
> > > > [snip]
> > > > 
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > > index 35df324875db..5b014c428f0f 100644
> > > > > > --- a/fs/xfs/xfs_inode.c
> > > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> > > > > >   *
> > > > > >   * Basic locking order:
> > > > > >   *
> > > > > > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > > > > + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > > > >   *
> > > > > >   * mmap_sem locking order:
> > > > > >   *
> > > > > >   * i_rwsem -> page lock -> mmap_sem
> > > > > > - * mmap_sem -> i_mmap_lock -> page_lock
> > > > > > + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
> > > > > >   *
> > > > > >   * The difference in mmap_sem locking order mean that we cannot hold the
> > > > > >   * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> > > > > > @@ -182,6 +182,9 @@ xfs_ilock(
> > > > > >  	       (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> > > > > >  	ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> > > > > >  
> > > > > > +	if (lock_flags & XFS_DAX_EXCL)
> > > > > > +		inode_aops_down_write(VFS_I(ip));
> > > > > 
> > > > > I largely don't see the point of adding this to xfs_ilock/iunlock.
> > > > > 
> > > > > It's only got one caller, so I don't see much point in adding it to
> > > > > an interface that has over a hundred other call sites that don't
> > > > > need or use this lock. just open code it where it is needed in the
> > > > > ioctl code.
> > > > 
> > > > I know it seems overkill but if we don't do this we need to code a flag to be
> > > > returned from xfs_ioctl_setattr_dax_invalidate().  This flag is then used in
> > > > xfs_ioctl_setattr_get_trans() to create the transaction log item which can then
> > > > be properly used to unlock the lock in xfs_inode_item_release()
> > > > 
> > > > I don't know of a cleaner way to communicate to xfs_inode_item_release() to
> > > > unlock i_aops_sem after the transaction is complete.
> > > 
> > > We manually unlock inodes after transactions in many cases -
> > > anywhere we do a rolling transaction, the inode locks do not get
> > > released by the transaction. Hence for a one-off case like this it
> > > doesn't really make sense to push all this infrastructure into the
> > > transaction subsystem. Especially as we can manually lock before and
> > > unlock after the transaction context without any real complexity.
> > 
> > So does xfs_trans_commit() operate synchronously?
> 
> What do you mean by "synchronously", and what are you expecting to
> occur (a)synchronously with respect to filesystem objects and/or
> on-disk state?
> 
> Keep in mid that the xfs transaction subsystem is a complex
> asynchronous IO engine full of feedback loops and resource
> management,

This is precisely why I added the lock to the transaction state.  So that I
could guarantee that the lock will be released in the proper order when the
complicated transaction subsystem was done with it.  I did not see any reason
to allow operations to proceed before that time.  And so this seemed safe...

> so asking if something is "synchronous" without any
> other context is a difficult question to answer :)

Or apparently it is difficult to even ask...  ;-)  (...not trying to be
sarcastic.)  Seriously, I'm not an expert in this area so I did what I thought
was most safe.  Which for me was the number 1 goal.

> 
> > I want to understand this better because I have fought with a lot of ABBA
> > issues with these locks.  So...  can I hold the lock until after
> > xfs_trans_commit() and safely unlock it there... because the XFS_MMAPLOCK_EXCL,
> > XFS_IOLOCK_EXCL, and XFS_ILOCK_EXCL will be released at that point?  Thus
> > preserving the following lock order.
> 
> See how operations like xfs_create, xfs_unlink, etc work. The don't
> specify flags to xfs_ijoin(), and so the transaction commits don't
> automatically unlock the inode.

xfs_ijoin()?  Do you mean xfs_trans_ijoin()?

> This is necessary so that rolling
> transactions are executed atomically w.r.t. inode access - no-one
> can lock and access the inode while a multi-commit rolling
> transaction on the inode is on-going.
> 
> In this case it's just a single commit and we don't need to keep
> it locked after the change is made, so we can unlock the inode
> on commit. So for the XFS internal locks the code is fine and
> doesn't need to change. We just need to wrap the VFS aops lock (if
> we keep it) around the outside of all the XFS locking until the
> transaction commits and unlocks the XFS locks...

Ok, I went ahead and coded it up and it is testing now.  Everything looks good.
I have to say that at this point I have to agree that I can't see how a
deadlock could occur so...

Thanks for the review,
Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4
  2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
                   ` (12 preceding siblings ...)
  2020-02-21  0:41 ` [PATCH V4 13/13] Documentation/dax: Update Usage section ira.weiny
@ 2020-02-26 22:48 ` Jeff Moyer
  2020-02-27  2:43   ` Ira Weiny
  13 siblings, 1 reply; 54+ messages in thread
From: Jeff Moyer @ 2020-02-26 22:48 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

Hi, Ira,

ira.weiny@intel.com writes:

> From: Ira Weiny <ira.weiny@intel.com>
>
> https://github.com/weiny2/linux-kernel/pull/new/dax-file-state-change-v4
>
> Changes from V3: 
> https://lore.kernel.org/lkml/20200208193445.27421-1-ira.weiny@intel.com/
>
> 	* Remove global locking...  :-D
> 	* put back per inode locking and remove pre-mature optimizations
> 	* Fix issues with Directories having IS_DAX() set
> 	* Fix kernel crash issues reported by Jeff
> 	* Add some clean up patches
> 	* Consolidate diflags to iflags functions
> 	* Update/add documentation
> 	* Reorder/rename patches quite a bit

I left out patches 1 and 2, but applied the rest and tested.  This
passes xfs tests in the following configurations:
1) MKFS_OPTIONS="-m reflink=0" MOUNT_OPTIONS="-o dax"
2) MKFS_OPTIONS="-m reflink=0"
   but with the added configuration step of setting the dax attribute on
   the mounted test directory.

I also tested to ensure that reflink fails when a file has the dax
attribute set.  I've got more testing to do, but figured I'd at least
let you know I've been looking at it.

Thanks!
Jeff


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4
  2020-02-26 22:48 ` [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 Jeff Moyer
@ 2020-02-27  2:43   ` Ira Weiny
  0 siblings, 0 replies; 54+ messages in thread
From: Ira Weiny @ 2020-02-27  2:43 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

On Wed, Feb 26, 2020 at 05:48:38PM -0500, Jeff Moyer wrote:
> Hi, Ira,
> 
> ira.weiny@intel.com writes:
> 
> > From: Ira Weiny <ira.weiny@intel.com>
> >
> > https://github.com/weiny2/linux-kernel/pull/new/dax-file-state-change-v4
> >
> > Changes from V3: 
> > https://lore.kernel.org/lkml/20200208193445.27421-1-ira.weiny@intel.com/
> >
> > 	* Remove global locking...  :-D
> > 	* put back per inode locking and remove pre-mature optimizations
> > 	* Fix issues with Directories having IS_DAX() set
> > 	* Fix kernel crash issues reported by Jeff
> > 	* Add some clean up patches
> > 	* Consolidate diflags to iflags functions
> > 	* Update/add documentation
> > 	* Reorder/rename patches quite a bit
> 
> I left out patches 1 and 2, but applied the rest and tested.  This
> passes xfs tests in the following configurations:
> 1) MKFS_OPTIONS="-m reflink=0" MOUNT_OPTIONS="-o dax"
> 2) MKFS_OPTIONS="-m reflink=0"
>    but with the added configuration step of setting the dax attribute on
>    the mounted test directory.
> 
> I also tested to ensure that reflink fails when a file has the dax
> attribute set.  I've got more testing to do, but figured I'd at least
> let you know I've been looking at it.

Thank you!

I need to update my xfstest which is specific to this as well...  I'll get to
that tomorrow and send an updated patch...

Thanks!
Ira

> 
> Thanks!
> Jeff
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH V4 01/13] fs/xfs: Remove unnecessary initialization of i_rwsem
  2020-02-21  1:26   ` Dave Chinner
@ 2020-02-27 17:52     ` Ira Weiny
  0 siblings, 0 replies; 54+ messages in thread
From: Ira Weiny @ 2020-02-27 17:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Fri, Feb 21, 2020 at 12:26:25PM +1100, Dave Chinner wrote:
> On Thu, Feb 20, 2020 at 04:41:22PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > xfs_reinit_inode() -> inode_init_always() already handles calling
> > init_rwsem(i_rwsem).  Doing so again is unneeded.
> > 
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Except that this inode has been destroyed and freed by the VFS, and
> we are now recycling it back into the VFS before we actually
> physically freed it.
> 
> Hence we have re-initialise the semaphore because the semaphore can
> contain internal state that is specific to it's new life cycle (e.g.
> the lockdep context) that will cause problems if we just assume that
> the inode is the same inode as it was before we recycled it back
> into the VFS caches.
> 
> So, yes, we actually do need to re-initialise the rwsem here.

I'm fine dropping the patch...

But I'm not following how the xfs_reinit_inode() on line 422 does not take care
of this?

fs/xfs/xfs_icache.c:

 422                 error = xfs_reinit_inode(mp, inode);
 423                 if (error) {
 424                         bool wake;
 425                         /*
 426                          * Re-initializing the inode failed, and we are in deep
 427                          * trouble.  Try to re-add it to the reclaim list.
 428                          */
 429                         rcu_read_lock();
 430                         spin_lock(&ip->i_flags_lock);
 431                         wake = !!__xfs_iflags_test(ip, XFS_INEW);
 432                         ip->i_flags &= ~(XFS_INEW | XFS_IRECLAIM);
 433                         if (wake)
 434                                 wake_up_bit(&ip->i_flags, __XFS_INEW_BIT);
 435                         ASSERT(ip->i_flags & XFS_IRECLAIMABLE);
 436                         trace_xfs_iget_reclaim_fail(ip);
 437                         goto out_error;
 438                 }
 439 
 440                 spin_lock(&pag->pag_ici_lock);
 441                 spin_lock(&ip->i_flags_lock);
 442 
 443                 /*
 444                  * Clear the per-lifetime state in the inode as we are now
 445                  * effectively a new inode and need to return to the initial
 446                  * state before reuse occurs.
 447                  */
 448                 ip->i_flags &= ~XFS_IRECLAIM_RESET_FLAGS;
 449                 ip->i_flags |= XFS_INEW;
 450                 xfs_inode_clear_reclaim_tag(pag, ip->i_ino);
 451                 inode->i_state = I_NEW;
 452                 ip->i_sick = 0;
 453                 ip->i_checked = 0;
 454 
 455                 ASSERT(!rwsem_is_locked(&inode->i_rwsem));
 456                 init_rwsem(&inode->i_rwsem);
 457 
 458                 spin_unlock(&ip->i_flags_lock);
 459                 spin_unlock(&pag->pag_ici_lock);

Does this need to be done under one of these locks?

Ira

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2020-02-27 17:52 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-21  0:41 [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 ira.weiny
2020-02-21  0:41 ` [PATCH V4 01/13] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
2020-02-21  1:26   ` Dave Chinner
2020-02-27 17:52     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 02/13] fs/xfs: Clarify lockdep dependency for xfs_isilocked() ira.weiny
2020-02-21  1:34   ` Dave Chinner
2020-02-21 23:00     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 03/13] fs: Remove unneeded IS_DAX() check ira.weiny
2020-02-21  1:42   ` Dave Chinner
2020-02-21 23:04     ` Ira Weiny
2020-02-21 17:42   ` Christoph Hellwig
2020-02-21  0:41 ` [PATCH V4 04/13] fs/stat: Define DAX statx attribute ira.weiny
2020-02-21  0:41 ` [PATCH V4 05/13] fs/xfs: Isolate the physical DAX flag from enabled ira.weiny
2020-02-21  0:41 ` [PATCH V4 06/13] fs/xfs: Create function xfs_inode_enable_dax() ira.weiny
2020-02-22  0:28   ` Darrick J. Wong
2020-02-23 15:07     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 07/13] fs: Add locking for a dynamic address space operations state ira.weiny
2020-02-21 17:44   ` Christoph Hellwig
2020-02-21 22:44     ` Dave Chinner
2020-02-21 23:26       ` Dan Williams
2020-02-24 17:56       ` Christoph Hellwig
2020-02-25  0:09         ` Dave Chinner
2020-02-25 17:36           ` Christoph Hellwig
2020-02-25 19:37             ` Jeff Moyer
2020-02-26  9:28               ` Jonathan Halliday
2020-02-26 11:31                 ` Jan Kara
2020-02-26 11:56                   ` Jonathan Halliday
2020-02-26 16:10                 ` Ira Weiny
2020-02-26 16:46                 ` Dan Williams
2020-02-26 17:20                   ` Jan Kara
2020-02-26 17:54                     ` Dan Williams
2020-02-25 21:03             ` Ira Weiny
2020-02-26 11:17           ` Jan Kara
2020-02-26 15:57             ` Ira Weiny
2020-02-22  0:33   ` Darrick J. Wong
2020-02-23 15:03     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 08/13] fs: Prevent DAX state change if file is mmap'ed ira.weiny
2020-02-21  0:41 ` [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer ira.weiny
2020-02-22  0:31   ` Darrick J. Wong
2020-02-23 15:04     ` Ira Weiny
2020-02-24  0:34   ` Dave Chinner
2020-02-24 19:57     ` Ira Weiny
2020-02-24 22:32       ` Dave Chinner
2020-02-25 21:12         ` Ira Weiny
2020-02-25 22:59           ` Dave Chinner
2020-02-26 18:02             ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 10/13] fs/xfs: Clean up locking in dax invalidate ira.weiny
2020-02-21 17:45   ` Christoph Hellwig
2020-02-21 18:06     ` Ira Weiny
2020-02-21  0:41 ` [PATCH V4 11/13] fs/xfs: Allow toggle of effective DAX flag ira.weiny
2020-02-21  0:41 ` [PATCH V4 12/13] fs/xfs: Remove xfs_diflags_to_linux() ira.weiny
2020-02-21  0:41 ` [PATCH V4 13/13] Documentation/dax: Update Usage section ira.weiny
2020-02-26 22:48 ` [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4 Jeff Moyer
2020-02-27  2:43   ` Ira Weiny

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).