Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
@ 2020-02-27  5:24 ira.weiny
  2020-02-27  5:24 ` [PATCH V5 01/12] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
                   ` (12 more replies)
  0 siblings, 13 replies; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Changes from V4:
	* Open code the aops lock rather than add it to the xfs_ilock()
	  subsystem (Darrick's comments were obsoleted by this change)
	* Fix lkp build suggestions and bugs

Changes from V3:
	* Remove global locking...  :-D
	* put back per inode locking and remove pre-mature optimizations
	* Fix issues with Directories having IS_DAX() set
	* Fix kernel crash issues reported by Jeff
	* Add some clean up patches
	* Consolidate diflags to iflags functions
	* Update/add documentation
	* Reorder/rename patches quite a bit

Changes from V2:

	* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
		Internal discussions with Dan determined this would be easier,
		just as performant, and slightly less overhead that having it
		in the SB as suggested by Jan
	* Fix locking order in comments and throughout code
	* Change "mode" to "state" throughout commits
	* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
		configured
	* Add static branch for which is activated by a device which supports
		DAX in XFS
	* Change "lock/unlock" to up/down read/write as appropriate
		Previous names were over simplified
	* Update comments/documentation

	* Remove the xfs specific lock to the vfs (global) layer.
	* Fix i_dax_sem locking order and comments

	* Move 'i_mapped' count from struct inode to struct address_space and
		rename it to mmap_count
	* Add inode_has_mappings() call

	* Fix build issues
	* Clean up syntax spacing and minor issues
	* Update man page text for STATX_ATTR_DAX
	* Add reviewed-by's
	* Rebase to 5.6

	Rename patch:
		from: fs/xfs: Add lock/unlock state to xfs
		to: fs/xfs: Add write DAX lock to xfs layer
	Add patch:
		fs/xfs: Clarify lockdep dependency for xfs_isilocked()
	Drop patch:
		fs/xfs: Fix truncate up


At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption due to their inability to detect whether the kernel will
instantiate page cache for a file, and cases where a global dax enable via a
mount option is too coarse.

The following patch series enables selecting the use of DAX on individual files
and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
scheme the dax mount option can be omitted to allow the per-file property to
take effect.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
capability switch from an "effective" attribute for the file.

At LSF/MM we discussed the difficulties of switching the DAX state of a file
with active mappings / page cache.  It was thought the races could be avoided
by limiting DAX state flips to 0-length files.

However, this turns out to not be true.[3] This is because address space
operations (a_ops) may be in use at any time the inode is referenced and users
have expressed a desire to be able to change the DAX state on a file with data
in it.  For those reasons this patch set allows changing the DAX state flag on
a file as long as it is not current mapped.

Details of when and how DAX state can be changed on a file is included in a
documentation patch.

It should be noted that the physical DAX flag inheritance is not shown in this
patch set as it was maintained from previous work on XFS.  The physical DAX
flag and it's inheritance will need to be added to other file systems for user
control. 

As submitted this works on real hardware testing.


[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/
[3] https://lkml.org/lkml/2019/10/20/96
[4] https://patchwork.kernel.org/patch/11310511/


To: linux-kernel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org


Ira Weiny (12):
  fs/xfs: Remove unnecessary initialization of i_rwsem
  fs: Remove unneeded IS_DAX() check
  fs/stat: Define DAX statx attribute
  fs/xfs: Isolate the physical DAX flag from enabled
  fs/xfs: Create function xfs_inode_enable_dax()
  fs: Add locking for a dynamic address space operations state
  fs: Prevent DAX state change if file is mmap'ed
  fs/xfs: Hold off aops users while changing DAX state
  fs/xfs: Clean up locking in dax invalidate
  fs/xfs: Allow toggle of effective DAX flag
  fs/xfs: Remove xfs_diflags_to_linux()
  Documentation/dax: Update Usage section

 Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++-
 Documentation/filesystems/vfs.rst | 16 +++++
 fs/attr.c                         |  1 +
 fs/inode.c                        | 16 ++++-
 fs/iomap/buffered-io.c            |  1 +
 fs/open.c                         |  4 ++
 fs/stat.c                         |  5 ++
 fs/xfs/xfs_icache.c               |  5 +-
 fs/xfs/xfs_inode.h                |  2 +
 fs/xfs/xfs_ioctl.c                | 98 +++++++++++++++----------------
 fs/xfs/xfs_iops.c                 | 69 +++++++++++++++-------
 include/linux/fs.h                | 73 ++++++++++++++++++++++-
 include/uapi/linux/stat.h         |  1 +
 mm/fadvise.c                      |  7 ++-
 mm/filemap.c                      |  4 ++
 mm/huge_memory.c                  |  1 +
 mm/khugepaged.c                   |  2 +
 mm/mmap.c                         | 19 +++++-
 mm/util.c                         |  9 ++-
 19 files changed, 328 insertions(+), 89 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH V5 01/12] fs/xfs: Remove unnecessary initialization of i_rwsem
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
@ 2020-02-27  5:24 ` ira.weiny
  2020-02-27 17:25   ` Ira Weiny
  2020-02-27  5:24 ` [PATCH V5 02/12] fs: Remove unneeded IS_DAX() check ira.weiny
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_reinit_inode() -> inode_init_always() already handles calling
init_rwsem(i_rwsem).  Doing so again is unneeded.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
New for V4:

NOTE: This was found while ensuring the new i_aops_sem was properly
handled.  It seems like this is a layering violation so I think it is
worth cleaning up so as to not confuse others.
---
 fs/xfs/xfs_icache.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8dc2e5414276..836a1f09be03 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -419,6 +419,7 @@ xfs_iget_cache_hit(
 		spin_unlock(&ip->i_flags_lock);
 		rcu_read_unlock();
 
+		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
 		error = xfs_reinit_inode(mp, inode);
 		if (error) {
 			bool wake;
@@ -452,9 +453,6 @@ xfs_iget_cache_hit(
 		ip->i_sick = 0;
 		ip->i_checked = 0;
 
-		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
-		init_rwsem(&inode->i_rwsem);
-
 		spin_unlock(&ip->i_flags_lock);
 		spin_unlock(&pag->pag_ici_lock);
 	} else {
-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH V5 02/12] fs: Remove unneeded IS_DAX() check
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
  2020-02-27  5:24 ` [PATCH V5 01/12] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
@ 2020-02-27  5:24 ` ira.weiny
  2020-02-27  5:24 ` [PATCH V5 03/12] fs/stat: Define DAX statx attribute ira.weiny
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Dave Chinner, Jan Kara, Alexander Viro,
	Darrick J. Wong, Dan Williams, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Remove the check because DAX now has it's own read/write methods and
file systems which support DAX check IS_DAX() prior to IOCB_DIRECT on
their own.  Therefore, it does not matter if the file state is DAX when
the iocb flags are created.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from v3:
	Reword commit message.
	Reordered to be a 'pre-cleanup' patch
---
 include/linux/fs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..63d1e533a07d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3388,7 +3388,7 @@ extern int file_update_time(struct file *file);
 
 static inline bool io_is_direct(struct file *filp)
 {
-	return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
+	return (filp->f_flags & O_DIRECT);
 }
 
 static inline bool vma_is_dax(struct vm_area_struct *vma)
-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH V5 03/12] fs/stat: Define DAX statx attribute
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
  2020-02-27  5:24 ` [PATCH V5 01/12] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
  2020-02-27  5:24 ` [PATCH V5 02/12] fs: Remove unneeded IS_DAX() check ira.weiny
@ 2020-02-27  5:24 ` ira.weiny
  2020-02-27  5:24 ` [PATCH V5 04/12] fs/xfs: Isolate the physical DAX flag from enabled ira.weiny
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Jan Kara, Darrick J . Wong, Alexander Viro,
	Dan Williams, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

In order for users to determine if a file is currently operating in DAX
state (effective DAX).  Define a statx attribute value and set that
attribute if the effective DAX flag is set.

To go along with this we propose the following addition to the statx man
page:

STATX_ATTR_DAX

	The file is in the DAX (cpu direct access) state.  DAX state
	attempts to minimize software cache effects for both I/O and
	memory mappings of this file.  It requires a file system which
	has been configured to support DAX.

	DAX generally assumes all accesses are via cpu load / store
	instructions which can minimize overhead for small accesses, but
	may adversely affect cpu utilization for large transfers.

	File I/O is done directly to/from user-space buffers and memory
	mapped I/O may be performed with direct memory mappings that
	bypass kernel page cache.

	While the DAX property tends to result in data being transferred
	synchronously, it does not give the same guarantees of O_SYNC
	where data and the necessary metadata are transferred together.

	A DAX file may support being mapped with the MAP_SYNC flag,
	which enables a program to use CPU cache flush instructions to
	persist CPU store operations without an explicit fsync(2).  See
	mmap(2) for more information.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V2:
	Update man page text with comments from Darrick, Jan, Dan, and
	Dave.
---
 fs/stat.c                 | 3 +++
 include/uapi/linux/stat.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/fs/stat.c b/fs/stat.c
index 030008796479..894699c74dde 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	if (IS_DAX(inode))
+		stat->attributes |= STATX_ATTR_DAX;
+
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
 					    query_flags);
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index ad80a5c885d5..e5f9d5517f6b 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -169,6 +169,7 @@ struct statx {
 #define STATX_ATTR_ENCRYPTED		0x00000800 /* [I] File requires key to decrypt in fs */
 #define STATX_ATTR_AUTOMOUNT		0x00001000 /* Dir: Automount trigger */
 #define STATX_ATTR_VERITY		0x00100000 /* [I] Verity protected file */
+#define STATX_ATTR_DAX			0x00002000 /* [I] File is DAX */
 
 
 #endif /* _UAPI_LINUX_STAT_H */
-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH V5 04/12] fs/xfs: Isolate the physical DAX flag from enabled
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
                   ` (2 preceding siblings ...)
  2020-02-27  5:24 ` [PATCH V5 03/12] fs/stat: Define DAX statx attribute ira.weiny
@ 2020-02-27  5:24 ` ira.weiny
  2020-02-27  5:24 ` [PATCH V5 05/12] fs/xfs: Create function xfs_inode_enable_dax() ira.weiny
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_ioctl_setattr_dax_invalidate() currently checks if the DAX flag is
changing as a quick check.

But the implementation mixes the physical (XFS_DIFLAG2_DAX) and
the enabled (S_DAX) DAX flags.

Remove the use of the enabled flag when determining if a change of the
physical flag is required.

Furthermore, we want the physical flag, XFS_DIFLAG2_DAX, to be changed
regardless of if the underlying storage can support DAX or not.

The enabled flag, IS_DAX(), will be set later IFF the inode supports
dax in a follow on patch.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V3:
	Remove the underlying storage support check
	Rework commit message
	Reorder patch
---
 fs/xfs/xfs_ioctl.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d42de92cb283..25e12ce85075 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1190,28 +1190,16 @@ xfs_ioctl_setattr_dax_invalidate(
 	int			*join_flags)
 {
 	struct inode		*inode = VFS_I(ip);
-	struct super_block	*sb = inode->i_sb;
 	int			error;
 
 	*join_flags = 0;
 
-	/*
-	 * It is only valid to set the DAX flag on regular files and
-	 * directories on filesystems where the block size is equal to the page
-	 * size. On directories it serves as an inherited hint so we don't
-	 * have to check the device for dax support or flush pagecache.
-	 */
-	if (fa->fsx_xflags & FS_XFLAG_DAX) {
-		struct xfs_buftarg	*target = xfs_inode_buftarg(ip);
-
-		if (!bdev_dax_supported(target->bt_bdev, sb->s_blocksize))
-			return -EINVAL;
-	}
-
 	/* If the DAX state is not changing, we have nothing to do here. */
-	if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
+	if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
+	    (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		return 0;
-	if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
+	if (!(fa->fsx_xflags & FS_XFLAG_DAX) &&
+	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
 		return 0;
 
 	if (S_ISDIR(inode->i_mode))
-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH V5 05/12] fs/xfs: Create function xfs_inode_enable_dax()
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
                   ` (3 preceding siblings ...)
  2020-02-27  5:24 ` [PATCH V5 04/12] fs/xfs: Isolate the physical DAX flag from enabled ira.weiny
@ 2020-02-27  5:24 ` ira.weiny
  2020-03-01 22:37   ` Dave Chinner
  2020-02-27  5:24 ` [PATCH V5 06/12] fs: Add locking for a dynamic address space operations state ira.weiny
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

xfs_inode_supports_dax() should reflect if the inode can support DAX not
that it is enabled for DAX.

Change the use of xfs_inode_supports_dax() to reflect only if the inode
and underlying storage support dax.

Add a new function xfs_inode_enable_dax() which reflects if the inode
should be enabled for DAX.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from v3:
	Update functions and names to be more clear
	Update commit message
	Merge with
		'fs/xfs: Clean up DAX support check'
		don't allow IS_DAX() on a directory
		use STATIC macro for static
		make xfs_inode_supports_dax() static
---
 fs/xfs/xfs_iops.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 81f2f93caec0..ff711efc5247 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1237,19 +1237,18 @@ static const struct inode_operations xfs_inline_symlink_inode_operations = {
 };
 
 /* Figure out if this file actually supports DAX. */
-static bool
+STATIC bool
 xfs_inode_supports_dax(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = ip->i_mount;
 
 	/* Only supported on non-reflinked files. */
-	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
+	if (xfs_is_reflink_inode(ip))
 		return false;
 
-	/* DAX mount option or DAX iflag must be set. */
-	if (!(mp->m_flags & XFS_MOUNT_DAX) &&
-	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
+	/* Only supported on regular files. */
+	if (!S_ISREG(VFS_I(ip)->i_mode))
 		return false;
 
 	/* Block size must match page size */
@@ -1260,6 +1259,20 @@ xfs_inode_supports_dax(
 	return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
 }
 
+STATIC bool
+xfs_inode_enable_dax(
+	struct xfs_inode *ip)
+{
+	if (!xfs_inode_supports_dax(ip))
+		return false;
+
+	if (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX)
+		return true;
+	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
+		return true;
+	return false;
+}
+
 STATIC void
 xfs_diflags_to_iflags(
 	struct inode		*inode,
@@ -1278,7 +1291,7 @@ xfs_diflags_to_iflags(
 		inode->i_flags |= S_SYNC;
 	if (flags & XFS_DIFLAG_NOATIME)
 		inode->i_flags |= S_NOATIME;
-	if (xfs_inode_supports_dax(ip))
+	if (xfs_inode_enable_dax(ip))
 		inode->i_flags |= S_DAX;
 }
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH V5 06/12] fs: Add locking for a dynamic address space operations state
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
                   ` (4 preceding siblings ...)
  2020-02-27  5:24 ` [PATCH V5 05/12] fs/xfs: Create function xfs_inode_enable_dax() ira.weiny
@ 2020-02-27  5:24 ` ira.weiny
  2020-03-02  1:26   ` Dave Chinner
  2020-02-27  5:24 ` [PATCH V5 07/12] fs: Prevent DAX state change if file is mmap'ed ira.weiny
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

DAX requires special address space operations (aops).  Changing DAX
state therefore requires changing those aops.

However, many functions require aops to remain consistent through a deep
call stack.

Define a vfs level inode rwsem to protect aops throughout call stacks
which require them.

Finally, define calls to be used in subsequent patches when aops usage
needs to be quiesced by the file system.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V4:
	Fix 0-day build issue.

Changes from V3:
	Convert from global to a per-inode rwsem
		Remove pre-optimization
	Remove static branch stuff
	Change function names from inode_dax_state_* to inode_aops_*
		I kept 'inode' as the synchronization is at the inode
		level now (probably where it belongs)...

		and I still prefer *_[down|up]_[read|write] as those
		names better reflect the use and interaction between
		users (readers) and writers better.  'XX_start_aop()'
		would have to be matched with something like
		'XX_wait_for_aop_user()' and 'XX_release_aop_users()' or
		something which does not make sense on the 'writer'
		side.

Changes from V2

	Rebase on linux-next-08-02-2020

	Fix locking order
	Change all references from mode to state where appropriate
	add CONFIG_FS_DAX requirement for state change
	Use a static branch to enable locking only when a dax capable
		device has been seen.

	Move the lock to a global vfs lock

		this does a few things
			1) preps us better for ext4 support
			2) removes funky callbacks from inode ops
			3) remove complexity from XFS and probably from
			   ext4 later

		We can do this because
			1) the locking order is required to be at the
			   highest level anyway, so why complicate xfs
			2) We had to move the sem to the super_block
			   because it is too heavy for the inode.
			3) After internal discussions with Dan we
			   decided that this would be easier, just as
			   performant, and with slightly less overhead
			   than in the VFS SB.

		We also change the functions names to up/down;
		read/write as appropriate.  Previous names were over
		simplified.

	Update comments and documentation

squash: add locking

squash:...
---
 Documentation/filesystems/vfs.rst | 16 ++++++++
 fs/attr.c                         |  1 +
 fs/inode.c                        | 15 ++++++--
 fs/iomap/buffered-io.c            |  1 +
 fs/open.c                         |  4 ++
 fs/stat.c                         |  2 +
 fs/xfs/xfs_icache.c               |  1 +
 include/linux/fs.h                | 64 ++++++++++++++++++++++++++++++-
 mm/fadvise.c                      |  7 +++-
 mm/filemap.c                      |  4 ++
 mm/huge_memory.c                  |  1 +
 mm/khugepaged.c                   |  2 +
 mm/util.c                         |  9 ++++-
 13 files changed, 119 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 7d4d09dd5e6d..4a10a232f8e2 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -934,6 +934,22 @@ cache in your filesystem.  The following members are defined:
 	Called during swapoff on files where swap_activate was
 	successful.
 
+Changing DAX 'state' dynamically
+----------------------------------
+
+Some file systems which support DAX want to be able to change the DAX state
+dyanically.  To switch the state safely we lock the inode state in all "normal"
+file system operations and restrict state changes to those operations.  The
+specific rules are.
+
+        1) the direct_IO address_space_operation must be supported in all
+           potential a_ops vectors for any state suported by the inode.
+
+        3) DAX state changes shall not be allowed while the file is mmap'ed
+        4) For non-mmaped operations the VFS layer must take the read lock for any
+           use of IS_DAX()
+        5) Filesystems take the write lock when changing DAX states.
+
 
 The File Object
 ===============
diff --git a/fs/attr.c b/fs/attr.c
index b4bbdbd4c8ca..9b15f73d1079 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -332,6 +332,7 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 	if (error)
 		return error;
 
+	/* DAX read state should already be held here */
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
diff --git a/fs/inode.c b/fs/inode.c
index 7d57068b6b7a..6e4f1cc872f2 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -200,6 +200,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 #endif
 	inode->i_flctx = NULL;
 	this_cpu_inc(nr_inodes);
+	init_rwsem(&inode->i_aops_sem);
 
 	return 0;
 out:
@@ -1616,11 +1617,19 @@ EXPORT_SYMBOL(iput);
  */
 int bmap(struct inode *inode, sector_t *block)
 {
-	if (!inode->i_mapping->a_ops->bmap)
-		return -EINVAL;
+	int ret = 0;
+
+	inode_aops_down_read(inode);
+	if (!inode->i_mapping->a_ops->bmap) {
+		ret = -EINVAL;
+		goto err;
+	}
 
 	*block = inode->i_mapping->a_ops->bmap(inode->i_mapping, *block);
-	return 0;
+
+err:
+	inode_aops_up_read(inode);
+	return ret;
 }
 EXPORT_SYMBOL(bmap);
 #endif
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 7c84c4c027c4..e313a34d5fa6 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -999,6 +999,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
 		offset = offset_in_page(pos);
 		bytes = min_t(loff_t, PAGE_SIZE - offset, count);
 
+		/* DAX state read should already be held here */
 		if (IS_DAX(inode))
 			status = iomap_dax_zero(pos, offset, bytes, iomap);
 		else
diff --git a/fs/open.c b/fs/open.c
index 0788b3715731..3abf0bfac462 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -59,10 +59,12 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 	if (ret)
 		newattrs.ia_valid |= ret | ATTR_FORCE;
 
+	inode_aops_down_read(dentry->d_inode);
 	inode_lock(dentry->d_inode);
 	/* Note any delegations or leases have already been broken: */
 	ret = notify_change(dentry, &newattrs, NULL);
 	inode_unlock(dentry->d_inode);
+	inode_aops_up_read(dentry->d_inode);
 	return ret;
 }
 
@@ -306,7 +308,9 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EOPNOTSUPP;
 
 	file_start_write(file);
+	inode_aops_down_read(inode);
 	ret = file->f_op->fallocate(file, mode, offset, len);
+	inode_aops_up_read(inode);
 
 	/*
 	 * Create inotify and fanotify events.
diff --git a/fs/stat.c b/fs/stat.c
index 894699c74dde..274b3ccc82b1 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -79,8 +79,10 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
 	if (IS_AUTOMOUNT(inode))
 		stat->attributes |= STATX_ATTR_AUTOMOUNT;
 
+	inode_aops_down_read(inode);
 	if (IS_DAX(inode))
 		stat->attributes |= STATX_ATTR_DAX;
+	inode_aops_up_read(inode);
 
 	if (inode->i_op->getattr)
 		return inode->i_op->getattr(path, stat, request_mask,
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 836a1f09be03..3e83a97dc047 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -420,6 +420,7 @@ xfs_iget_cache_hit(
 		rcu_read_unlock();
 
 		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
+		ASSERT(!rwsem_is_locked(&inode->i_aops_sem));
 		error = xfs_reinit_inode(mp, inode);
 		if (error) {
 			bool wake;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 63d1e533a07d..22cc1aa980f5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -40,6 +40,7 @@
 #include <linux/fs_types.h>
 #include <linux/build_bug.h>
 #include <linux/stddef.h>
+#include <linux/percpu-rwsem.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -359,6 +360,11 @@ typedef struct {
 typedef int (*read_actor_t)(read_descriptor_t *, struct page *,
 		unsigned long, unsigned long);
 
+/**
+ * NOTE: DO NOT define new functions in address_space_operations without first
+ * considering how dynamic DAX states are to be supported.  See the
+ * inode_aops_*_read() functions
+ */
 struct address_space_operations {
 	int (*writepage)(struct page *page, struct writeback_control *wbc);
 	int (*readpage)(struct file *, struct page *);
@@ -735,6 +741,7 @@ struct inode {
 #endif
 
 	void			*i_private; /* fs or device private pointer */
+	struct rw_semaphore	i_aops_sem;
 } __randomize_layout;
 
 struct timespec64 timestamp_truncate(struct timespec64 t, struct inode *inode);
@@ -1817,6 +1824,11 @@ struct block_device_operations;
 
 struct iov_iter;
 
+/**
+ * NOTE: DO NOT define new functions in file_operations without first
+ * considering how dynamic address_space operations are to be supported.  See
+ * the inode_aops_*_read() functions in this file.
+ */
 struct file_operations {
 	struct module *owner;
 	loff_t (*llseek) (struct file *, loff_t, int);
@@ -1889,16 +1901,64 @@ struct inode_operations {
 	int (*set_acl)(struct inode *, struct posix_acl *, int);
 } ____cacheline_aligned;
 
+#if defined(CONFIG_FS_DAX)
+/*
+ * Filesystems wishing to support dynamic DAX states must do the following.
+ *
+ * 1) the direct_IO address_space_operation must be supported in all potential
+ *    a_ops vectors for any state suported by the inode.  This is because the
+ *    direct_IO function is used as a flag long before the function is called.
+
+ * 3) DAX state changes shall not be allowed while the file is mmap'ed
+ * 4) For non-mmaped operations the VFS layer must take the read lock for any
+ *    use of IS_DAX()
+ * 5) Filesystems take the write lock when changing DAX states.
+ */
+static inline void inode_aops_down_read(struct inode *inode)
+{
+	down_read(&inode->i_aops_sem);
+}
+static inline void inode_aops_up_read(struct inode *inode)
+{
+	up_read(&inode->i_aops_sem);
+}
+static inline void inode_aops_down_write(struct inode *inode)
+{
+	down_write(&inode->i_aops_sem);
+}
+static inline void inode_aops_up_write(struct inode *inode)
+{
+	up_write(&inode->i_aops_sem);
+}
+#else /* !CONFIG_FS_DAX */
+#define inode_aops_down_read(inode) do { (void)(inode); } while (0)
+#define inode_aops_up_read(inode) do { (void)(inode); } while (0)
+#define inode_aops_down_write(inode) do { (void)(inode); } while (0)
+#define inode_aops_up_write(inode) do { (void)(inode); } while (0)
+#endif /* CONFIG_FS_DAX */
+
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 				     struct iov_iter *iter)
 {
-	return file->f_op->read_iter(kio, iter);
+	struct inode		*inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	inode_aops_down_read(inode);
+	ret = file->f_op->read_iter(kio, iter);
+	inode_aops_up_read(inode);
+	return ret;
 }
 
 static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
 				      struct iov_iter *iter)
 {
-	return file->f_op->write_iter(kio, iter);
+	struct inode		*inode = file_inode(kio->ki_filp);
+	ssize_t ret;
+
+	inode_aops_down_read(inode);
+	ret = file->f_op->write_iter(kio, iter);
+	inode_aops_up_read(inode);
+	return ret;
 }
 
 static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 4f17c83db575..6a30febb11e0 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -48,6 +48,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 	bdi = inode_to_bdi(mapping->host);
 
 	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
+		int ret = 0;
+
 		switch (advice) {
 		case POSIX_FADV_NORMAL:
 		case POSIX_FADV_RANDOM:
@@ -58,9 +60,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
 			/* no bad return value, but ignore advice */
 			break;
 		default:
-			return -EINVAL;
+			ret = -EINVAL;
 		}
-		return 0;
+
+		return ret;
 	}
 
 	/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 1784478270e1..3a7863ba51b9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		 * and return.  Otherwise fallthrough to buffered io for
 		 * the rest of the read.  Buffered reads will not work for
 		 * DAX files, so don't bother trying.
+		 *
+		 * IS_DAX is protected under ->read_iter lock
 		 */
 		if (retval < 0 || !count || iocb->ki_pos >= size ||
 		    IS_DAX(inode))
@@ -3377,6 +3379,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		 * holes, for example.  For DAX files, a buffered write will
 		 * not succeed (even if it did, DAX does not handle dirty
 		 * page-cache pages correctly).
+		 *
+		 * IS_DAX is protected under ->write_iter lock
 		 */
 		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
 			goto out;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b08b199f9a11..3d05bd10d83e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -572,6 +572,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 	unsigned long ret;
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
+	/* Should not need locking here because mmap is not allowed */
 	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
 		goto out;
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b679908743cb..f048178e2b93 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
 		} else {	/* !is_shmem */
 			if (!page || xa_is_value(page)) {
 				xas_unlock_irq(&xas);
+				inode_aops_down_read(file->f_inode);
 				page_cache_sync_readahead(mapping, &file->f_ra,
 							  file, index,
 							  PAGE_SIZE);
+				inode_aops_up_read(file->f_inode);
 				/* drain pagevecs to help isolate_lru_page() */
 				lru_add_drain();
 				page = find_lock_page(mapping, index);
diff --git a/mm/util.c b/mm/util.c
index 988d11e6c17c..a4fb0670137d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 
 	ret = security_mmap_file(file, prot, flag);
 	if (!ret) {
-		if (down_write_killable(&mm->mmap_sem))
+		if (file)
+			inode_aops_down_read(file_inode(file));
+		if (down_write_killable(&mm->mmap_sem)) {
+			if (file)
+				inode_aops_up_read(file_inode(file));
 			return -EINTR;
+		}
 		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
 				    &populate, &uf);
 		up_write(&mm->mmap_sem);
+		if (file)
+			inode_aops_up_read(file_inode(file));
 		userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
 			mm_populate(ret, populate);
-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH V5 07/12] fs: Prevent DAX state change if file is mmap'ed
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
                   ` (5 preceding siblings ...)
  2020-02-27  5:24 ` [PATCH V5 06/12] fs: Add locking for a dynamic address space operations state ira.weiny
@ 2020-02-27  5:24 ` ira.weiny
  2020-02-27  5:24 ` [PATCH V5 08/12] fs/xfs: Hold off aops users while changing DAX state ira.weiny
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Page faults need to ensure the inode DAX configuration is correct and
consistent with the vmf information at the time of the fault.  There is
no easy way to ensure the vmf information is correct if a DAX change is
in progress.  Furthermore, there is no good use case to require changing
DAX configs while the file is mmap'ed.

Track mmap's of the file and fail the DAX change if the file is mmap'ed.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V2:

	move 'i_mapped' to struct address_space and rename mmap_count
	Add inode_has_mappings() helper for FS's
	Change reference to "mode" to "state"
---
Changes from V3:
	Fix htmldoc error from the kbuild test robot.
	Reported-by: kbuild test robot <lkp@intel.com>
	Rebase cleanups
---
 fs/inode.c         |  1 +
 fs/xfs/xfs_ioctl.c |  9 +++++++++
 include/linux/fs.h |  7 +++++++
 mm/mmap.c          | 19 +++++++++++++++++--
 4 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 6e4f1cc872f2..613a045075bd 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -372,6 +372,7 @@ static void __address_space_init_once(struct address_space *mapping)
 	INIT_LIST_HEAD(&mapping->private_list);
 	spin_lock_init(&mapping->private_lock);
 	mapping->i_mmap = RB_ROOT_CACHED;
+	atomic64_set(&mapping->mmap_count, 0);
 }
 
 void address_space_init_once(struct address_space *mapping)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 25e12ce85075..498fae2ef9f6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1207,6 +1207,15 @@ xfs_ioctl_setattr_dax_invalidate(
 
 	/* lock, flush and invalidate mapping in preparation for flag change */
 	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+
+	/*
+	 * If there is a mapping in place we must remain in our current state.
+	 */
+	if (inode_has_mappings(inode)) {
+		error = -EBUSY;
+		goto out_unlock;
+	}
+
 	error = filemap_write_and_wait(inode->i_mapping);
 	if (error)
 		goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 22cc1aa980f5..cf18a3c38562 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -438,6 +438,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
  * @nr_thps: Number of THPs in the pagecache (non-shmem only).
  * @i_mmap: Tree of private and shared mappings.
  * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable.
+ * @mmap_count: The number of times this AS has been mmap'ed
  * @nrpages: Number of page entries, protected by the i_pages lock.
  * @nrexceptional: Shadow or DAX entries, protected by the i_pages lock.
  * @writeback_index: Writeback starts here.
@@ -459,6 +460,7 @@ struct address_space {
 #endif
 	struct rb_root_cached	i_mmap;
 	struct rw_semaphore	i_mmap_rwsem;
+	atomic64_t              mmap_count;
 	unsigned long		nrpages;
 	unsigned long		nrexceptional;
 	pgoff_t			writeback_index;
@@ -1937,6 +1939,11 @@ static inline void inode_aops_up_write(struct inode *inode)
 #define inode_aops_up_write(inode) do { (void)(inode); } while (0)
 #endif /* CONFIG_FS_DAX */
 
+static inline bool inode_has_mappings(struct inode *inode)
+{
+	return (atomic64_read(&inode->i_mapping->mmap_count) != 0);
+}
+
 static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
 				     struct iov_iter *iter)
 {
diff --git a/mm/mmap.c b/mm/mmap.c
index 7cc2562b99fd..6bb16a0996b5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -171,12 +171,17 @@ void unlink_file_vma(struct vm_area_struct *vma)
 static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 {
 	struct vm_area_struct *next = vma->vm_next;
+	struct file *f = vma->vm_file;
 
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
-	if (vma->vm_file)
-		fput(vma->vm_file);
+	if (f) {
+		struct inode *inode = file_inode(f);
+		if (inode)
+			atomic64_dec(&inode->i_mapping->mmap_count);
+		fput(f);
+	}
 	mpol_put(vma_policy(vma));
 	vm_area_free(vma);
 	return next;
@@ -1830,6 +1835,16 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vma_set_page_prot(vma);
 
+	/*
+	 * Track if there is mapping in place such that a state change
+	 * does not occur on a file which is mapped
+	 */
+	if (file) {
+		struct inode		*inode = file_inode(file);
+
+		atomic64_inc(&inode->i_mapping->mmap_count);
+	}
+
 	return addr;
 
 unmap_and_free_vma:
-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH V5 08/12] fs/xfs: Hold off aops users while changing DAX state
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
                   ` (6 preceding siblings ...)
  2020-02-27  5:24 ` [PATCH V5 07/12] fs: Prevent DAX state change if file is mmap'ed ira.weiny
@ 2020-02-27  5:24 ` ira.weiny
  2020-02-27  5:24 ` [PATCH V5 09/12] fs/xfs: Clean up locking in dax invalidate ira.weiny
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

XFS requires the use of the aops of an inode to be quiesced prior to
changing the aops vector to/from the DAX vector.

Take the aops write lock while changing DAX state.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from v4:
	Open code the aops write lock
	Obsolete: Clean up lock name in comments
	Obsolete: Change #define: XFS_DAX_EXCL => XFS_AOPSLOCK_EXCL

Changes from v3:
	Change locking function names to reflect changes in previous
	patches.

Changes from V2:
	Change name of patch (WAS: fs/xfs: Add lock/unlock state to xfs)
	Remove the xfs specific lock and move to the vfs layer.
		We still use XFS_LOCK_DAX_EXCL to be able to pass this
		flag through to the transaction code.  But we no longer
		have a lock specific to xfs.  This removes a lot of code
		from the XFS layer, preps us for using this in ext4, and
		is actually more straight forward now that all the
		locking requirements are better known.

	Fix locking order comment
	Rework for new 'state' names
	(Other comments on the previous patch are not applicable with
	new patch as much of the code was removed in favor of the vfs
	level lock)
---
 fs/xfs/xfs_ioctl.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 498fae2ef9f6..6c4d4ea3b6b6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1444,7 +1444,11 @@ xfs_ioctl_setattr(
 	 * or cancel time, so need to be passed through to
 	 * xfs_ioctl_setattr_get_trans() so it can apply them to the join call
 	 * appropriately.
+	 *
+	 * Further we hold off aops users until we have completed any potential
+	 * changing of aops due to attribute changes.
 	 */
+	inode_aops_down_write(VFS_I(ip));
 	code = xfs_ioctl_setattr_dax_invalidate(ip, fa, &join_flags);
 	if (code)
 		goto error_free_dquots;
@@ -1527,6 +1531,7 @@ xfs_ioctl_setattr(
 	xfs_qm_dqrele(udqp);
 	xfs_qm_dqrele(pdqp);
 
+	inode_aops_up_write(VFS_I(ip));
 	return code;
 
 error_trans_cancel:
@@ -1534,6 +1539,7 @@ xfs_ioctl_setattr(
 error_free_dquots:
 	xfs_qm_dqrele(udqp);
 	xfs_qm_dqrele(pdqp);
+	inode_aops_up_write(VFS_I(ip));
 	return code;
 }
 
@@ -1603,7 +1609,11 @@ xfs_ioc_setxflags(
 	 * or cancel time, so need to be passed through to
 	 * xfs_ioctl_setattr_get_trans() so it can apply them to the join call
 	 * appropriately.
+	 *
+	 * Further we hold off aops users until we have completed any potential
+	 * changing of aops due to attribute changes.
 	 */
+	inode_aops_down_write(VFS_I(ip));
 	error = xfs_ioctl_setattr_dax_invalidate(ip, &fa, &join_flags);
 	if (error)
 		goto out_drop_write;
@@ -1630,6 +1640,7 @@ xfs_ioc_setxflags(
 	error = xfs_trans_commit(tp);
 out_drop_write:
 	mnt_drop_write_file(filp);
+	inode_aops_up_write(VFS_I(ip));
 	return error;
 }
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH V5 09/12] fs/xfs: Clean up locking in dax invalidate
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
                   ` (7 preceding siblings ...)
  2020-02-27  5:24 ` [PATCH V5 08/12] fs/xfs: Hold off aops users while changing DAX state ira.weiny
@ 2020-02-27  5:24 ` ira.weiny
  2020-02-27  5:24 ` [PATCH V5 10/12] fs/xfs: Allow toggle of effective DAX flag ira.weiny
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Define a variable to hold the lock flags to ensure that the correct
locks are returned or released on error.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 fs/xfs/xfs_ioctl.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6c4d4ea3b6b6..40ae791cfb41 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1190,7 +1190,7 @@ xfs_ioctl_setattr_dax_invalidate(
 	int			*join_flags)
 {
 	struct inode		*inode = VFS_I(ip);
-	int			error;
+	int			error, flags;
 
 	*join_flags = 0;
 
@@ -1205,8 +1205,10 @@ xfs_ioctl_setattr_dax_invalidate(
 	if (S_ISDIR(inode->i_mode))
 		return 0;
 
+	flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+
 	/* lock, flush and invalidate mapping in preparation for flag change */
-	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+	xfs_ilock(ip, flags);
 
 	/*
 	 * If there is a mapping in place we must remain in our current state.
@@ -1223,11 +1225,11 @@ xfs_ioctl_setattr_dax_invalidate(
 	if (error)
 		goto out_unlock;
 
-	*join_flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+	*join_flags = flags;
 	return 0;
 
 out_unlock:
-	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+	xfs_iunlock(ip, flags);
 	return error;
 
 }
-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH V5 10/12] fs/xfs: Allow toggle of effective DAX flag
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
                   ` (8 preceding siblings ...)
  2020-02-27  5:24 ` [PATCH V5 09/12] fs/xfs: Clean up locking in dax invalidate ira.weiny
@ 2020-02-27  5:24 ` ira.weiny
  2020-02-27  5:24 ` [PATCH V5 11/12] fs/xfs: Remove xfs_diflags_to_linux() ira.weiny
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Now that locking of the inode aops is in place we can allow a DAX state
change.

Use the new xfs_inode_enable_dax() to set IS_DAX() as needed and update
the aops vector after the enabled state change.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from v4:
	Adjust for open coding of the aops lock

Changes from V3:
	Remove static branch stuff.
	Fix bugs found by Jeff by using xfs_inode_enable_dax()
	Condition xfs_ioctl_setattr_dax_invalidate() on CONFIG_FS_DAX

Changes from V2:
	Add in lock_dax_state_static_key static branch enabling.
	Rebase updates
---
 fs/xfs/xfs_inode.h |  2 ++
 fs/xfs/xfs_ioctl.c | 20 +++++++++++++++++---
 fs/xfs/xfs_iops.c  | 17 ++++++++++++-----
 3 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 492e53992fa9..7d4968915dec 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -466,6 +466,8 @@ int	xfs_break_layouts(struct inode *inode, uint *iolock,
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
 extern void xfs_setup_iops(struct xfs_inode *ip);
+extern void xfs_setup_a_ops(struct xfs_inode *ip);
+extern bool xfs_inode_enable_dax(struct xfs_inode *ip);
 
 /*
  * When setting up a newly allocated inode, we need to call
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 40ae791cfb41..38e91de44a6f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1123,12 +1123,11 @@ xfs_diflags_to_linux(
 		inode->i_flags |= S_NOATIME;
 	else
 		inode->i_flags &= ~S_NOATIME;
-#if 0	/* disabled until the flag switching races are sorted out */
-	if (xflags & FS_XFLAG_DAX)
+
+	if (xfs_inode_enable_dax(ip))
 		inode->i_flags |= S_DAX;
 	else
 		inode->i_flags &= ~S_DAX;
-#endif
 }
 
 static int
@@ -1183,6 +1182,7 @@ xfs_ioctl_setattr_xflags(
  * so that the cache invalidation is atomic with respect to the DAX flag
  * manipulation.
  */
+#if defined(CONFIG_FS_DAX)
 static int
 xfs_ioctl_setattr_dax_invalidate(
 	struct xfs_inode	*ip,
@@ -1233,6 +1233,16 @@ xfs_ioctl_setattr_dax_invalidate(
 	return error;
 
 }
+#else /* !CONFIG_FS_DAX */
+static int
+xfs_ioctl_setattr_dax_invalidate(
+	struct xfs_inode	*ip,
+	struct fsxattr		*fa,
+	int			*join_flags)
+{
+	return 0;
+}
+#endif
 
 /*
  * Set up the transaction structure for the setattr operation, checking that we
@@ -1524,6 +1534,8 @@ xfs_ioctl_setattr(
 	else
 		ip->i_d.di_cowextsize = 0;
 
+	xfs_setup_a_ops(ip);
+
 	code = xfs_trans_commit(tp);
 
 	/*
@@ -1639,6 +1651,8 @@ xfs_ioc_setxflags(
 		goto out_drop_write;
 	}
 
+	xfs_setup_a_ops(ip);
+
 	error = xfs_trans_commit(tp);
 out_drop_write:
 	mnt_drop_write_file(filp);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff711efc5247..8f023be39705 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1259,7 +1259,7 @@ xfs_inode_supports_dax(
 	return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
 }
 
-STATIC bool
+bool
 xfs_inode_enable_dax(
 	struct xfs_inode *ip)
 {
@@ -1355,6 +1355,16 @@ xfs_setup_inode(
 	}
 }
 
+void xfs_setup_a_ops(struct xfs_inode *ip)
+{
+	struct inode		*inode = &ip->i_vnode;
+
+	if (IS_DAX(inode))
+		inode->i_mapping->a_ops = &xfs_dax_aops;
+	else
+		inode->i_mapping->a_ops = &xfs_address_space_operations;
+}
+
 void
 xfs_setup_iops(
 	struct xfs_inode	*ip)
@@ -1365,10 +1375,7 @@ xfs_setup_iops(
 	case S_IFREG:
 		inode->i_op = &xfs_inode_operations;
 		inode->i_fop = &xfs_file_operations;
-		if (IS_DAX(inode))
-			inode->i_mapping->a_ops = &xfs_dax_aops;
-		else
-			inode->i_mapping->a_ops = &xfs_address_space_operations;
+		xfs_setup_a_ops(ip);
 		break;
 	case S_IFDIR:
 		if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH V5 11/12] fs/xfs: Remove xfs_diflags_to_linux()
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
                   ` (9 preceding siblings ...)
  2020-02-27  5:24 ` [PATCH V5 10/12] fs/xfs: Allow toggle of effective DAX flag ira.weiny
@ 2020-02-27  5:24 ` ira.weiny
  2020-02-27  5:24 ` [PATCH V5 12/12] Documentation/dax: Update Usage section ira.weiny
  2020-03-05 15:51 ` [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 Christoph Hellwig
  12 siblings, 0 replies; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, kbuild test robot, Alexander Viro, Darrick J. Wong,
	Dan Williams, Dave Chinner, Christoph Hellwig,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

The functionality in xfs_diflags_to_linux() is now identical to
xfs_diflags_to_iflags().

Remove xfs_diflags_to_linux() and call xfs_diflags_to_iflags() directly.

While we are here simplify xfs_diflags_to_iflags() to take just struct
xfs_inode.

And use xfs_ip2xflags() to ensure future diflags are included correctly.

Signed-off-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V4:
	Pick up lkp build suggestion (make xfs_inode_enable_dax() static)
---
 fs/xfs/xfs_inode.h |  2 +-
 fs/xfs/xfs_ioctl.c | 32 +-------------------------------
 fs/xfs/xfs_iops.c  | 31 +++++++++++++++++++------------
 3 files changed, 21 insertions(+), 44 deletions(-)

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 7d4968915dec..2c8c2804f88b 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -467,7 +467,7 @@ int	xfs_break_layouts(struct inode *inode, uint *iolock,
 extern void xfs_setup_inode(struct xfs_inode *ip);
 extern void xfs_setup_iops(struct xfs_inode *ip);
 extern void xfs_setup_a_ops(struct xfs_inode *ip);
-extern bool xfs_inode_enable_dax(struct xfs_inode *ip);
+extern void xfs_diflags_to_iflags(struct xfs_inode *ip);
 
 /*
  * When setting up a newly allocated inode, we need to call
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 38e91de44a6f..bd67359a3251 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1100,36 +1100,6 @@ xfs_flags2diflags2(
 	return di_flags2;
 }
 
-STATIC void
-xfs_diflags_to_linux(
-	struct xfs_inode	*ip)
-{
-	struct inode		*inode = VFS_I(ip);
-	unsigned int		xflags = xfs_ip2xflags(ip);
-
-	if (xflags & FS_XFLAG_IMMUTABLE)
-		inode->i_flags |= S_IMMUTABLE;
-	else
-		inode->i_flags &= ~S_IMMUTABLE;
-	if (xflags & FS_XFLAG_APPEND)
-		inode->i_flags |= S_APPEND;
-	else
-		inode->i_flags &= ~S_APPEND;
-	if (xflags & FS_XFLAG_SYNC)
-		inode->i_flags |= S_SYNC;
-	else
-		inode->i_flags &= ~S_SYNC;
-	if (xflags & FS_XFLAG_NOATIME)
-		inode->i_flags |= S_NOATIME;
-	else
-		inode->i_flags &= ~S_NOATIME;
-
-	if (xfs_inode_enable_dax(ip))
-		inode->i_flags |= S_DAX;
-	else
-		inode->i_flags &= ~S_DAX;
-}
-
 static int
 xfs_ioctl_setattr_xflags(
 	struct xfs_trans	*tp,
@@ -1167,7 +1137,7 @@ xfs_ioctl_setattr_xflags(
 	ip->i_d.di_flags = xfs_flags2diflags(ip, fa->fsx_xflags);
 	ip->i_d.di_flags2 = di_flags2;
 
-	xfs_diflags_to_linux(ip);
+	xfs_diflags_to_iflags(ip);
 	xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 	XFS_STATS_INC(mp, xs_ig_attrchg);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8f023be39705..c63cf0b73b75 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1259,7 +1259,7 @@ xfs_inode_supports_dax(
 	return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
 }
 
-bool
+static bool
 xfs_inode_enable_dax(
 	struct xfs_inode *ip)
 {
@@ -1273,26 +1273,33 @@ xfs_inode_enable_dax(
 	return false;
 }
 
-STATIC void
+void
 xfs_diflags_to_iflags(
-	struct inode		*inode,
 	struct xfs_inode	*ip)
 {
-	uint16_t		flags = ip->i_d.di_flags;
-
-	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
-			    S_NOATIME | S_DAX);
+	struct inode		*inode = VFS_I(ip);
+	uint16_t		diflags = xfs_ip2xflags(ip);
 
-	if (flags & XFS_DIFLAG_IMMUTABLE)
+	if (diflags & FS_XFLAG_IMMUTABLE)
 		inode->i_flags |= S_IMMUTABLE;
-	if (flags & XFS_DIFLAG_APPEND)
+	else
+		inode->i_flags &= ~S_IMMUTABLE;
+	if (diflags & FS_XFLAG_APPEND)
 		inode->i_flags |= S_APPEND;
-	if (flags & XFS_DIFLAG_SYNC)
+	else
+		inode->i_flags &= ~S_APPEND;
+	if (diflags & FS_XFLAG_SYNC)
 		inode->i_flags |= S_SYNC;
-	if (flags & XFS_DIFLAG_NOATIME)
+	else
+		inode->i_flags &= ~S_SYNC;
+	if (diflags & FS_XFLAG_NOATIME)
 		inode->i_flags |= S_NOATIME;
+	else
+		inode->i_flags &= ~S_NOATIME;
 	if (xfs_inode_enable_dax(ip))
 		inode->i_flags |= S_DAX;
+	else
+		inode->i_flags &= ~S_DAX;
 }
 
 /*
@@ -1321,7 +1328,7 @@ xfs_setup_inode(
 	inode->i_gid    = xfs_gid_to_kgid(ip->i_d.di_gid);
 
 	i_size_write(inode, ip->i_d.di_size);
-	xfs_diflags_to_iflags(inode, ip);
+	xfs_diflags_to_iflags(ip);
 
 	if (S_ISDIR(inode->i_mode)) {
 		/*
-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH V5 12/12] Documentation/dax: Update Usage section
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
                   ` (10 preceding siblings ...)
  2020-02-27  5:24 ` [PATCH V5 11/12] fs/xfs: Remove xfs_diflags_to_linux() ira.weiny
@ 2020-02-27  5:24 ` ira.weiny
  2020-03-05 15:51 ` [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 Christoph Hellwig
  12 siblings, 0 replies; 45+ messages in thread
From: ira.weiny @ 2020-02-27  5:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ira Weiny, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel

From: Ira Weiny <ira.weiny@intel.com>

Update the Usage section to reflect the new individual dax selection
functionality.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
 Documentation/filesystems/dax.txt | 84 ++++++++++++++++++++++++++++++-
 1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 679729442fd2..32e37c550f76 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -20,8 +20,88 @@ Usage
 If you have a block device which supports DAX, you can make a filesystem
 on it as usual.  The DAX code currently only supports files with a block
 size equal to your kernel's PAGE_SIZE, so you may need to specify a block
-size when creating the filesystem.  When mounting it, use the "-o dax"
-option on the command line or add 'dax' to the options in /etc/fstab.
+size when creating the filesystem.
+
+Enabling DAX on an individual file basis (XFS)
+----------------------------------------------
+
+There are 2 per file dax flags.  One is a physical configuration setting and
+the other a currently enabled state.
+
+The physical configuration setting is maintained on individual file and
+directory inodes.  It is preserved within the file system.  This 'physical'
+config setting can be set using an ioctl and/or an application such as "xfs_io
+-c 'chattr [-+]x'".  Files and directories automatically inherit their physical
+dax setting from their parent directory when created.  Therefore, setting the
+physical dax setting at directory creation time can be used to set a default
+behavior for that sub-tree.  Doing so on the root directory acts to set a
+default for the entire file system.
+
+To clarify inheritance here are 3 examples:
+
+Example A:
+
+mkdir -p a/b/c
+xfs_io 'chattr +x' a
+mkdir a/b/c/d
+mkdir a/e
+
+	dax: a,e
+	no dax: b,c,d
+
+Example B:
+
+mkdir a
+xfs_io 'chattr +x' a
+mkdir -p a/b/c/d
+
+	dax: a,b,c,d
+	no dax:
+
+Example C:
+
+mkdir -p a/b/c
+xfs_io 'chattr +x' c
+mkdir a/b/c/d
+
+	dax: c,d
+	no dax: a,b
+
+
+The current inode enabled state is set when a file inode is loaded and it is
+determined that the underlying media supports dax.
+
+statx can be used to query the file's current enabled state.  NOTE that a
+directory will never be operating in a dax state.  Therefore, the dax config
+state must be queried to see what config state a file or sub-directory will
+inherit from a directory.
+
+NOTE: Setting a file or directory's config state with xfs_io is possible even
+if the underlying media does not support dax.
+
+
+Enabling dax on a file system wide basis ('-o dax' mount option)
+----------------------------------------------------------------
+
+The physical dax configuration of all files can be overridden using a mount
+option.  In summary:
+
+	(physical flag || mount option) && capable device == dax in effect
+	(  <xfs_io>    ||  <'-o dax'> ) && capable device == <statx dax true>
+
+To enable the mount override, use "-o dax" on the command line or add
+'dax' to the options in /etc/fstab
+
+Using the mount option does not change the physical configured state of
+individual files.  Therefore, remounting _without_ the mount option will allow
+the file system to set file's enabled state directly based on their config
+setting.
+
+NOTE: Setting a file or directory's physical config state is possible while the
+file system is mounted with the dax override.  However, the file's enabled
+state will continue to be overridden and "dax enabled" until the mount option
+is removed and a remount performed.  At that point the file's physical config
+state dictates the enabled state.
 
 
 Implementation Tips for Block Driver Writers
-- 
2.21.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 01/12] fs/xfs: Remove unnecessary initialization of i_rwsem
  2020-02-27  5:24 ` [PATCH V5 01/12] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
@ 2020-02-27 17:25   ` Ira Weiny
  0 siblings, 0 replies; 45+ messages in thread
From: Ira Weiny @ 2020-02-27 17:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: Alexander Viro, Darrick J. Wong, Dan Williams, Dave Chinner,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Wed, Feb 26, 2020 at 09:24:31PM -0800, 'Ira Weiny' wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> xfs_reinit_inode() -> inode_init_always() already handles calling
> init_rwsem(i_rwsem).  Doing so again is unneeded.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Ok...  I'm going to blame exchange...

Dan just pointed out that I mist this response from Dave...

Saying this patch was wrong.

https://lore.kernel.org/lkml/20200221012625.GT10776@dread.disaster.area/

Sorry,  I need to read that email and look at it.

Ira

> 
> ---
> New for V4:
> 
> NOTE: This was found while ensuring the new i_aops_sem was properly
> handled.  It seems like this is a layering violation so I think it is
> worth cleaning up so as to not confuse others.
> ---
>  fs/xfs/xfs_icache.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 8dc2e5414276..836a1f09be03 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -419,6 +419,7 @@ xfs_iget_cache_hit(
>  		spin_unlock(&ip->i_flags_lock);
>  		rcu_read_unlock();
>  
> +		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
>  		error = xfs_reinit_inode(mp, inode);
>  		if (error) {
>  			bool wake;
> @@ -452,9 +453,6 @@ xfs_iget_cache_hit(
>  		ip->i_sick = 0;
>  		ip->i_checked = 0;
>  
> -		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> -		init_rwsem(&inode->i_rwsem);
> -
>  		spin_unlock(&ip->i_flags_lock);
>  		spin_unlock(&pag->pag_ici_lock);
>  	} else {
> -- 
> 2.21.0
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 05/12] fs/xfs: Create function xfs_inode_enable_dax()
  2020-02-27  5:24 ` [PATCH V5 05/12] fs/xfs: Create function xfs_inode_enable_dax() ira.weiny
@ 2020-03-01 22:37   ` Dave Chinner
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Chinner @ 2020-03-01 22:37 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Wed, Feb 26, 2020 at 09:24:35PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> xfs_inode_supports_dax() should reflect if the inode can support DAX not
> that it is enabled for DAX.
> 
> Change the use of xfs_inode_supports_dax() to reflect only if the inode
> and underlying storage support dax.
> 
> Add a new function xfs_inode_enable_dax() which reflects if the inode
> should be enabled for DAX.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from v3:
> 	Update functions and names to be more clear
> 	Update commit message
> 	Merge with
> 		'fs/xfs: Clean up DAX support check'
> 		don't allow IS_DAX() on a directory
> 		use STATIC macro for static
> 		make xfs_inode_supports_dax() static
> ---
>  fs/xfs/xfs_iops.c | 25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 81f2f93caec0..ff711efc5247 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1237,19 +1237,18 @@ static const struct inode_operations xfs_inline_symlink_inode_operations = {
>  };
>  
>  /* Figure out if this file actually supports DAX. */
> -static bool
> +STATIC bool
>  xfs_inode_supports_dax(
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  
>  	/* Only supported on non-reflinked files. */
> -	if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
> +	if (xfs_is_reflink_inode(ip))
>  		return false;
>  
> -	/* DAX mount option or DAX iflag must be set. */
> -	if (!(mp->m_flags & XFS_MOUNT_DAX) &&
> -	    !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> +	/* Only supported on regular files. */
> +	if (!S_ISREG(VFS_I(ip)->i_mode))
>  		return false;

Why split the initial check? The "non-reflinked files" comment is
pretty explicit, so splitting it just to add a "only support on
regular files" comment doesn't actually add any value here.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 06/12] fs: Add locking for a dynamic address space operations state
  2020-02-27  5:24 ` [PATCH V5 06/12] fs: Add locking for a dynamic address space operations state ira.weiny
@ 2020-03-02  1:26   ` Dave Chinner
  2020-03-02  1:36     ` Dave Chinner
  0 siblings, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2020-03-02  1:26 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel

On Wed, Feb 26, 2020 at 09:24:36PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> DAX requires special address space operations (aops).  Changing DAX
> state therefore requires changing those aops.
> 
> However, many functions require aops to remain consistent through a deep
> call stack.
> 
> Define a vfs level inode rwsem to protect aops throughout call stacks
> which require them.
> 
> Finally, define calls to be used in subsequent patches when aops usage
> needs to be quiesced by the file system.
> 
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
....
> diff --git a/fs/stat.c b/fs/stat.c
> index 894699c74dde..274b3ccc82b1 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -79,8 +79,10 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
>  	if (IS_AUTOMOUNT(inode))
>  		stat->attributes |= STATX_ATTR_AUTOMOUNT;
>  
> +	inode_aops_down_read(inode);
>  	if (IS_DAX(inode))
>  		stat->attributes |= STATX_ATTR_DAX;
> +	inode_aops_up_read(inode);

No locking needed here. statx() is racy to begin with (i.e.
information will be wrong by the time the syscall gets back to
userspace. Hence it really doesn't matter if we race checking the
flag here or not, and because this is a lockless fast path for
many worklaods (e.g. git), keeping locking out of the stat() fast
path is desirable.

>  	if (inode->i_op->getattr)
>  		return inode->i_op->getattr(path, stat, request_mask,
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 836a1f09be03..3e83a97dc047 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -420,6 +420,7 @@ xfs_iget_cache_hit(
>  		rcu_read_unlock();
>  
>  		ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> +		ASSERT(!rwsem_is_locked(&inode->i_aops_sem));
>  		error = xfs_reinit_inode(mp, inode);
>  		if (error) {
>  			bool wake;

No need for this - aops can never be called on an XFS inode in
the reclaimable state - it's not visible to the VFS at this point.
If you need to assert that the lock has not been leaked, then this
needs to be done at the point where the VFS reclaims the inode (i.e.
the evict() path), not deep in XFS.

>  static inline ssize_t call_read_iter(struct file *file, struct kiocb *kio,
>  				     struct iov_iter *iter)
>  {
> -	return file->f_op->read_iter(kio, iter);
> +	struct inode		*inode = file_inode(kio->ki_filp);
> +	ssize_t ret;
> +
> +	inode_aops_down_read(inode);
> +	ret = file->f_op->read_iter(kio, iter);
> +	inode_aops_up_read(inode);
> +	return ret;
>  }
>  
>  static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio,
>  				      struct iov_iter *iter)
>  {
> -	return file->f_op->write_iter(kio, iter);
> +	struct inode		*inode = file_inode(kio->ki_filp);
> +	ssize_t ret;
> +
> +	inode_aops_down_read(inode);
> +	ret = file->f_op->write_iter(kio, iter);
> +	inode_aops_up_read(inode);
> +	return ret;
>  }

I'm really on the fence about this. I don't really like it, but I
can't really put my finger on why :/

>  
>  static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index 4f17c83db575..6a30febb11e0 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -48,6 +48,8 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>  	bdi = inode_to_bdi(mapping->host);
>  
>  	if (IS_DAX(inode) || (bdi == &noop_backing_dev_info)) {
> +		int ret = 0;
> +
>  		switch (advice) {
>  		case POSIX_FADV_NORMAL:
>  		case POSIX_FADV_RANDOM:
> @@ -58,9 +60,10 @@ int generic_fadvise(struct file *file, loff_t offset, loff_t len, int advice)
>  			/* no bad return value, but ignore advice */
>  			break;
>  		default:
> -			return -EINVAL;
> +			ret = -EINVAL;
>  		}
> -		return 0;
> +
> +		return ret;
>  	}

Completely spurious changes?

>  
>  	/*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1784478270e1..3a7863ba51b9 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
>  		 * and return.  Otherwise fallthrough to buffered io for
>  		 * the rest of the read.  Buffered reads will not work for
>  		 * DAX files, so don't bother trying.
> +		 *
> +		 * IS_DAX is protected under ->read_iter lock
>  		 */
>  		if (retval < 0 || !count || iocb->ki_pos >= size ||
>  		    IS_DAX(inode))

This check is in the DIO path, be we can't do DIO on DAX enabled
files to begin with, so we can only get here if S_DAX is not set on
the file.

Further, if IOCB_DIRECT is set, neither ext4 nor XFS call
generic_file_read_iter(); they run the iomap_dio_rw() path directly
instead. Only ext2 calls generic_file_read_iter() to do direct IO,
so it's the only filesystem that needs this IS_DAX() check in it.

I think we should fix ext2 to be implemented like ext4 and XFS -
they implement the buffered IO fallback, should it be required,
themselves and never use generic_file_read_iter() for direct IO.

That would allow us to add this to generic_file_read_iter():

	if (WARN_ON_ONCE(IS_DAX(inode))
		return -EINVAL;

to indicate that this should never be called directly on a DAX
capable filesystem. This places all the responsibility for managing
DAX behaviour on the filesystem, which then allows us to reason more
solidly about how the filesystem IO paths use and check the S_DAX
flag.

i.e. changing the on-disk flag already locks out the filesystem IO
path via the i_rwsem(), and all the filesystem IO paths (buffered,
direct IO and dax) are serialised by this flag. Hence we can check
once in the filesystem path once we have the i_rwsem held and
know that S_DAX will not change until we release it.

..... and now I realise what I was sitting on the fence about....

I don't like the aops locking in call_read/write_iter() because it
is actually redundant: the filesystem should be doing the necessary
locking in the IO path via the i_rwsem to prevent S_DAX from
changing while it is doing the IO.

IOWs, we need to restructure the locking inside the filesystem
read_iter and write_iter methods so that the i_rwsem protects the
S_DAX flag from changing dynamically. They all do:

	if (dax)
		do_dax_io()
	if (direct)
		do_direct_io()
	do_buffered_io()

And then we take the i_rwsem inside each of those functions and do
the IO. What we actually need to do is something like this:

	inode_lock_shared()
	if (dax)
		do_dax_io()
	if (direct)
		do_direct_io()
	do_buffered_io()
	inode_unlock_shared()

And remove the inode locking from inside the individual IO methods
themselves. It's a bit more complex than this because buffered
writes require exclusive locking, but this completely removes the
need for holding an aops lock over these methods.

I've attached a couple of untested patches (I've compiled them, so
they must be good!) to demonstrate what I mean for the XFS IO path.
The read side removes a heap of duplicate code, but the write side
is .... unfortunately complex. Have to think about that more.

> @@ -3377,6 +3379,8 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  		 * holes, for example.  For DAX files, a buffered write will
>  		 * not succeed (even if it did, DAX does not handle dirty
>  		 * page-cache pages correctly).
> +		 *
> +		 * IS_DAX is protected under ->write_iter lock
>  		 */
>  		if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
>  			goto out;

Same here - this should never be called for DAX+iomap capable
filesystems.

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index b08b199f9a11..3d05bd10d83e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -572,6 +572,7 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>  	unsigned long ret;
>  	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
>  
> +	/* Should not need locking here because mmap is not allowed */
>  	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
>  		goto out;
>  
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b679908743cb..f048178e2b93 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1592,9 +1592,11 @@ static void collapse_file(struct mm_struct *mm,
>  		} else {	/* !is_shmem */
>  			if (!page || xa_is_value(page)) {
>  				xas_unlock_irq(&xas);
> +				inode_aops_down_read(file->f_inode);
>  				page_cache_sync_readahead(mapping, &file->f_ra,
>  							  file, index,
>  							  PAGE_SIZE);
> +				inode_aops_up_read(file->f_inode);

Why is this readahead call needing aops protection, but not anywhere
else? And if this is not being done while holding a filesystem lock
(like i_rwsem or MMAPLOCK) then how is this safe against concurent
hole punch? i.e. doesn't it have exactly the same problems as
readahead in vfs_fadvise(WILL_NEED)?

>  				/* drain pagevecs to help isolate_lru_page() */
>  				lru_add_drain();
>  				page = find_lock_page(mapping, index);
> diff --git a/mm/util.c b/mm/util.c
> index 988d11e6c17c..a4fb0670137d 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -501,11 +501,18 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
>  
>  	ret = security_mmap_file(file, prot, flag);
>  	if (!ret) {
> -		if (down_write_killable(&mm->mmap_sem))
> +		if (file)
> +			inode_aops_down_read(file_inode(file));
> +		if (down_write_killable(&mm->mmap_sem)) {
> +			if (file)
> +				inode_aops_up_read(file_inode(file));
>  			return -EINTR;
> +		}
>  		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
>  				    &populate, &uf);
>  		up_write(&mm->mmap_sem);
> +		if (file)
> +			inode_aops_up_read(file_inode(file));
>  		userfaultfd_unmap_complete(mm, &uf);
>  		if (populate)
>  			mm_populate(ret, populate);

So this path calls the fops->mmap() filesystem method that we check
IS_DAX() in. As Christoph has mentioned, this could likely go away
if we took the XFS_MMAPLOCK_SHARED() inside xfs_file_mmap(), as that
would then serialise new mmaps against the transaction where we are
changing the on disk flag. i.e. all new attempts to mmap() the file
would then get blocked by the filesystem while the change is taking
place...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 06/12] fs: Add locking for a dynamic address space operations state
  2020-03-02  1:26   ` Dave Chinner
@ 2020-03-02  1:36     ` Dave Chinner
  0 siblings, 0 replies; 45+ messages in thread
From: Dave Chinner @ 2020-03-02  1:36 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Christoph Hellwig, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel


[-- Attachment #1: Type: text/plain, Size: 3459 bytes --]

On Mon, Mar 02, 2020 at 12:26:44PM +1100, Dave Chinner wrote:
> >  	/*
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 1784478270e1..3a7863ba51b9 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2293,6 +2293,8 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> >  		 * and return.  Otherwise fallthrough to buffered io for
> >  		 * the rest of the read.  Buffered reads will not work for
> >  		 * DAX files, so don't bother trying.
> > +		 *
> > +		 * IS_DAX is protected under ->read_iter lock
> >  		 */
> >  		if (retval < 0 || !count || iocb->ki_pos >= size ||
> >  		    IS_DAX(inode))
> 
> This check is in the DIO path, be we can't do DIO on DAX enabled
> files to begin with, so we can only get here if S_DAX is not set on
> the file.
> 
> Further, if IOCB_DIRECT is set, neither ext4 nor XFS call
> generic_file_read_iter(); they run the iomap_dio_rw() path directly
> instead. Only ext2 calls generic_file_read_iter() to do direct IO,
> so it's the only filesystem that needs this IS_DAX() check in it.
> 
> I think we should fix ext2 to be implemented like ext4 and XFS -
> they implement the buffered IO fallback, should it be required,
> themselves and never use generic_file_read_iter() for direct IO.
> 
> That would allow us to add this to generic_file_read_iter():
> 
> 	if (WARN_ON_ONCE(IS_DAX(inode))
> 		return -EINVAL;
> 
> to indicate that this should never be called directly on a DAX
> capable filesystem. This places all the responsibility for managing
> DAX behaviour on the filesystem, which then allows us to reason more
> solidly about how the filesystem IO paths use and check the S_DAX
> flag.
> 
> i.e. changing the on-disk flag already locks out the filesystem IO
> path via the i_rwsem(), and all the filesystem IO paths (buffered,
> direct IO and dax) are serialised by this flag. Hence we can check
> once in the filesystem path once we have the i_rwsem held and
> know that S_DAX will not change until we release it.
> 
> ..... and now I realise what I was sitting on the fence about....
> 
> I don't like the aops locking in call_read/write_iter() because it
> is actually redundant: the filesystem should be doing the necessary
> locking in the IO path via the i_rwsem to prevent S_DAX from
> changing while it is doing the IO.
> 
> IOWs, we need to restructure the locking inside the filesystem
> read_iter and write_iter methods so that the i_rwsem protects the
> S_DAX flag from changing dynamically. They all do:
> 
> 	if (dax)
> 		do_dax_io()
> 	if (direct)
> 		do_direct_io()
> 	do_buffered_io()
> 
> And then we take the i_rwsem inside each of those functions and do
> the IO. What we actually need to do is something like this:
> 
> 	inode_lock_shared()
> 	if (dax)
> 		do_dax_io()
> 	if (direct)
> 		do_direct_io()
> 	do_buffered_io()
> 	inode_unlock_shared()
> 
> And remove the inode locking from inside the individual IO methods
> themselves. It's a bit more complex than this because buffered
> writes require exclusive locking, but this completely removes the
> need for holding an aops lock over these methods.
> 
> I've attached a couple of untested patches (I've compiled them, so
> they must be good!) to demonstrate what I mean for the XFS IO path.
> The read side removes a heap of duplicate code, but the write side
> is .... unfortunately complex. Have to think about that more.

And now with patches...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

[-- Attachment #2: xfs-io-s_dax-locking --]
[-- Type: text/plain, Size: 3658 bytes --]

xfs: pulling read IO path locking up to protect S_DAX checks

From: Dave Chinner <dchinner@redhat.com>

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 77 +++++++++++++++++--------------------------------------
 1 file changed, 24 insertions(+), 53 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index b8a4a3f29b36..7d31a0e76b68 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -176,28 +176,12 @@ xfs_file_dio_aio_read(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
-	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
-	size_t			count = iov_iter_count(to);
-	ssize_t			ret;
-
-	trace_xfs_file_direct_read(ip, count, iocb->ki_pos);
-
-	if (!count)
-		return 0; /* skip atime */
+	trace_xfs_file_direct_read(XFS_I(file_inode(iocb->ki_filp)),
+					iov_iter_count(to), iocb->ki_pos);
 
 	file_accessed(iocb->ki_filp);
-
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED))
-			return -EAGAIN;
-	} else {
-		xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	}
-	ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL,
+	return iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL,
 			is_sync_kiocb(iocb));
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
-	return ret;
 }
 
 static noinline ssize_t
@@ -205,27 +189,12 @@ xfs_file_dax_read(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
-	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
-	size_t			count = iov_iter_count(to);
-	ssize_t			ret = 0;
-
-	trace_xfs_file_dax_read(ip, count, iocb->ki_pos);
-
-	if (!count)
-		return 0; /* skip atime */
-
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED))
-			return -EAGAIN;
-	} else {
-		xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	}
-
-	ret = dax_iomap_rw(iocb, to, &xfs_read_iomap_ops);
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+	trace_xfs_file_dax_read(XFS_I(file_inode(iocb->ki_filp)),
+					iov_iter_count(to), iocb->ki_pos);
 
 	file_accessed(iocb->ki_filp);
-	return ret;
+	return dax_iomap_rw(iocb, to, &xfs_read_iomap_ops);
+
 }
 
 STATIC ssize_t
@@ -233,21 +202,10 @@ xfs_file_buffered_aio_read(
 	struct kiocb		*iocb,
 	struct iov_iter		*to)
 {
-	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
-	ssize_t			ret;
-
-	trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos);
+	trace_xfs_file_buffered_read(XFS_I(file_inode(iocb->ki_filp)),
+					iov_iter_count(to), iocb->ki_pos);
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED))
-			return -EAGAIN;
-	} else {
-		xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	}
-	ret = generic_file_read_iter(iocb, to);
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
-	return ret;
+	return generic_file_read_iter(iocb, to);
 }
 
 STATIC ssize_t
@@ -256,7 +214,8 @@ xfs_file_read_iter(
 	struct iov_iter		*to)
 {
 	struct inode		*inode = file_inode(iocb->ki_filp);
-	struct xfs_mount	*mp = XFS_I(inode)->i_mount;
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
 	ssize_t			ret = 0;
 
 	XFS_STATS_INC(mp, xs_read_calls);
@@ -264,6 +223,16 @@ xfs_file_read_iter(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	if (!iov_iter_count(to))
+		return 0; /* skip atime */
+
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED))
+			return -EAGAIN;
+	} else {
+		xfs_ilock(ip, XFS_IOLOCK_SHARED);
+	}
+
 	if (IS_DAX(inode))
 		ret = xfs_file_dax_read(iocb, to);
 	else if (iocb->ki_flags & IOCB_DIRECT)
@@ -271,6 +240,8 @@ xfs_file_read_iter(
 	else
 		ret = xfs_file_buffered_aio_read(iocb, to);
 
+	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+
 	if (ret > 0)
 		XFS_STATS_ADD(mp, xs_read_bytes, ret);
 	return ret;

[-- Attachment #3: xfs-io-s_dax-write-locking --]
[-- Type: text/plain, Size: 10443 bytes --]

xfs: pull write IO locking up to protect S_DAX

From: Dave Chinner <dchinner@redhat.com>

Much more complex than the read side because of all the lock
juggling we for the different types of writes and all the various
metadata updates a write might need prior to dispatching the data
IO.

Not really happy about the way the high level logic turned out;
could probably be make cleaner and smarter with a bit more thought.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 205 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 115 insertions(+), 90 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7d31a0e76b68..b7c2bb09b945 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -250,9 +250,9 @@ xfs_file_read_iter(
 /*
  * Common pre-write limit and setup checks.
  *
- * Called with the iolocked held either shared and exclusive according to
- * @iolock, and returns with it held.  Might upgrade the iolock to exclusive
- * if called for a direct write beyond i_size.
+ * Called without the iolocked held, but will lock it according to @iolock, and
+ * returns with it held on both success and error.  Might upgrade the iolock to
+ * exclusive if called for a direct write beyond i_size.
  */
 STATIC ssize_t
 xfs_file_aio_write_checks(
@@ -268,6 +268,13 @@ xfs_file_aio_write_checks(
 	bool			drained_dio = false;
 	loff_t			isize;
 
+	if (iocb->ki_flags & IOCB_NOWAIT) {
+		if (!xfs_ilock_nowait(ip, *iolock))
+			return -EAGAIN;
+	} else {
+		xfs_ilock(ip, *iolock);
+	}
+
 restart:
 	error = generic_write_checks(iocb, from);
 	if (error <= 0)
@@ -325,7 +332,7 @@ xfs_file_aio_write_checks(
 			drained_dio = true;
 			goto restart;
 		}
-	
+
 		trace_xfs_zero_eof(ip, isize, iocb->ki_pos - isize);
 		error = iomap_zero_range(inode, isize, iocb->ki_pos - isize,
 				NULL, &xfs_buffered_write_iomap_ops);
@@ -449,19 +456,14 @@ static const struct iomap_dio_ops xfs_dio_write_ops = {
  * Returns with locks held indicated by @iolock and errors indicated by
  * negative return values.
  */
-STATIC ssize_t
-xfs_file_dio_aio_write(
+static int
+xfs_file_dio_write_lock_mode(
 	struct kiocb		*iocb,
 	struct iov_iter		*from)
 {
-	struct file		*file = iocb->ki_filp;
-	struct address_space	*mapping = file->f_mapping;
-	struct inode		*inode = mapping->host;
+	struct inode		*inode = file_inode(iocb->ki_filp);
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
-	ssize_t			ret = 0;
-	int			unaligned_io = 0;
-	int			iolock;
 	size_t			count = iov_iter_count(from);
 	struct xfs_buftarg      *target = xfs_inode_buftarg(ip);
 
@@ -478,35 +480,37 @@ xfs_file_dio_aio_write(
 	 */
 	if ((iocb->ki_pos & mp->m_blockmask) ||
 	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
-		unaligned_io = 1;
+		if (iocb->ki_flags & IOCB_NOWAIT) {
+			/* unaligned dio always waits, bail */
+			return -EAGAIN;
+		}
 
 		/*
 		 * We can't properly handle unaligned direct I/O to reflink
 		 * files yet, as we can't unshare a partial block.
 		 */
 		if (xfs_is_cow_inode(ip)) {
-			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count);
+			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos,
+							count);
 			return -EREMCHG;
 		}
-		iolock = XFS_IOLOCK_EXCL;
-	} else {
-		iolock = XFS_IOLOCK_SHARED;
-	}
-
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		/* unaligned dio always waits, bail */
-		if (unaligned_io)
-			return -EAGAIN;
-		if (!xfs_ilock_nowait(ip, iolock))
-			return -EAGAIN;
-	} else {
-		xfs_ilock(ip, iolock);
+		return XFS_IOLOCK_EXCL;
 	}
+	return XFS_IOLOCK_SHARED;
+}
 
-	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
-	if (ret)
-		goto out;
-	count = iov_iter_count(from);
+STATIC ssize_t
+xfs_file_dio_aio_write(
+	struct kiocb		*iocb,
+	struct iov_iter		*from,
+	int			iolock)
+{
+	struct inode		*inode = file_inode(iocb->ki_filp);
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
+	ssize_t			ret = 0;
+	bool			unaligned_io = 0;
+	size_t			count = iov_iter_count(from);
 
 	/*
 	 * If we are doing unaligned IO, we can't allow any other overlapping IO
@@ -515,7 +519,9 @@ xfs_file_dio_aio_write(
 	 * iolock if we had to take the exclusive lock in
 	 * xfs_file_aio_write_checks() for other reasons.
 	 */
-	if (unaligned_io) {
+	if ((iocb->ki_pos & mp->m_blockmask) ||
+	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
+		unaligned_io = true;
 		inode_dio_wait(inode);
 	} else if (iolock == XFS_IOLOCK_EXCL) {
 		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
@@ -530,7 +536,6 @@ xfs_file_dio_aio_write(
 	ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
 			   &xfs_dio_write_ops,
 			   is_sync_kiocb(iocb) || unaligned_io);
-out:
 	xfs_iunlock(ip, iolock);
 
 	/*
@@ -544,26 +549,15 @@ xfs_file_dio_aio_write(
 static noinline ssize_t
 xfs_file_dax_write(
 	struct kiocb		*iocb,
-	struct iov_iter		*from)
+	struct iov_iter		*from,
+	int			iolock)
 {
-	struct inode		*inode = iocb->ki_filp->f_mapping->host;
+	struct inode		*inode = file_inode(iocb->ki_filp);
 	struct xfs_inode	*ip = XFS_I(inode);
-	int			iolock = XFS_IOLOCK_EXCL;
 	ssize_t			ret, error = 0;
 	size_t			count;
 	loff_t			pos;
 
-	if (iocb->ki_flags & IOCB_NOWAIT) {
-		if (!xfs_ilock_nowait(ip, iolock))
-			return -EAGAIN;
-	} else {
-		xfs_ilock(ip, iolock);
-	}
-
-	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
-	if (ret)
-		goto out;
-
 	pos = iocb->ki_pos;
 	count = iov_iter_count(from);
 
@@ -573,7 +567,6 @@ xfs_file_dax_write(
 		i_size_write(inode, iocb->ki_pos);
 		error = xfs_setfilesize(ip, pos, ret);
 	}
-out:
 	xfs_iunlock(ip, iolock);
 	if (error)
 		return error;
@@ -590,26 +583,13 @@ xfs_file_dax_write(
 STATIC ssize_t
 xfs_file_buffered_aio_write(
 	struct kiocb		*iocb,
-	struct iov_iter		*from)
+	struct iov_iter		*from,
+	int			iolock,
+	bool			flush_on_enospc)
 {
-	struct file		*file = iocb->ki_filp;
-	struct address_space	*mapping = file->f_mapping;
-	struct inode		*inode = mapping->host;
+	struct inode		*inode = file_inode(iocb->ki_filp);
 	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			ret;
-	int			enospc = 0;
-	int			iolock;
-
-	if (iocb->ki_flags & IOCB_NOWAIT)
-		return -EOPNOTSUPP;
-
-write_retry:
-	iolock = XFS_IOLOCK_EXCL;
-	xfs_ilock(ip, iolock);
-
-	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
-	if (ret)
-		goto out;
 
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = inode_to_bdi(inode);
@@ -617,8 +597,6 @@ xfs_file_buffered_aio_write(
 	trace_xfs_file_buffered_write(ip, iov_iter_count(from), iocb->ki_pos);
 	ret = iomap_file_buffered_write(iocb, from,
 			&xfs_buffered_write_iomap_ops);
-	if (likely(ret >= 0))
-		iocb->ki_pos += ret;
 
 	/*
 	 * If we hit a space limit, try to free up some lingering preallocated
@@ -629,34 +607,35 @@ xfs_file_buffered_aio_write(
 	 * also behaves as a filter to prevent too many eofblocks scans from
 	 * running at the same time.
 	 */
-	if (ret == -EDQUOT && !enospc) {
+	if (ret == -EDQUOT && flush_on_enospc) {
+		int	made_progress;
+
 		xfs_iunlock(ip, iolock);
-		enospc = xfs_inode_free_quota_eofblocks(ip);
-		if (enospc)
-			goto write_retry;
-		enospc = xfs_inode_free_quota_cowblocks(ip);
-		if (enospc)
-			goto write_retry;
-		iolock = 0;
-	} else if (ret == -ENOSPC && !enospc) {
+		made_progress = xfs_inode_free_quota_eofblocks(ip);
+		if (made_progress)
+			return -EAGAIN;
+		made_progress = xfs_inode_free_quota_cowblocks(ip);
+		if (made_progress)
+			return -EAGAIN;
+		return ret;
+	}
+	if (ret == -ENOSPC && flush_on_enospc) {
 		struct xfs_eofblocks eofb = {0};
 
-		enospc = 1;
 		xfs_flush_inodes(ip->i_mount);
 
 		xfs_iunlock(ip, iolock);
 		eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
 		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
 		xfs_icache_free_cowblocks(ip->i_mount, &eofb);
-		goto write_retry;
+		return -EAGAIN;
 	}
 
 	current->backing_dev_info = NULL;
-out:
-	if (iolock)
-		xfs_iunlock(ip, iolock);
-
+	xfs_iunlock(ip, iolock);
 	if (ret > 0) {
+		iocb->ki_pos += ret;
+
 		XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
 		/* Handle various SYNC-type writes */
 		ret = generic_write_sync(iocb, ret);
@@ -675,6 +654,8 @@ xfs_file_write_iter(
 	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			ret;
 	size_t			ocount = iov_iter_count(from);
+	int			iolock;
+	bool			flush_on_enospc = true;
 
 	XFS_STATS_INC(ip->i_mount, xs_write_calls);
 
@@ -684,22 +665,66 @@ xfs_file_write_iter(
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
-	if (IS_DAX(inode))
-		return xfs_file_dax_write(iocb, from);
-
-	if (iocb->ki_flags & IOCB_DIRECT) {
+	/*
+	 * Set up the initial locking state - only direct IO uses shared locking
+	 * for writes, and buffered IO does not support IOCB_NOWAIT.
+	 */
+relock:
+	if (IS_DAX(inode)) {
+		iolock = XFS_IOLOCK_EXCL;
+	} else if (iocb->ki_flags & IOCB_DIRECT) {
 		/*
 		 * Allow a directio write to fall back to a buffered
 		 * write *only* in the case that we're doing a reflink
 		 * CoW.  In all other directio scenarios we do not
 		 * allow an operation to fall back to buffered mode.
 		 */
-		ret = xfs_file_dio_aio_write(iocb, from);
-		if (ret != -EREMCHG)
-			return ret;
+		iolock = xfs_file_dio_write_lock_mode(iocb, from);
+		if (iolock == -EREMCHG) {
+			iocb->ki_flags &= ~IOCB_DIRECT;
+			iolock = XFS_IOLOCK_EXCL;
+		}
+		if (iolock < 0)
+			return iolock;
+	} else if (iocb->ki_flags & IOCB_NOWAIT) {
+		/* buffered writes do not support IOCB_NOWAIT */
+		return -EOPNOTSUPP;
+	} else {
+		iolock = XFS_IOLOCK_EXCL;
 	}
 
-	return xfs_file_buffered_aio_write(iocb, from);
+	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
+	if (ret) {
+		xfs_iunlock(ip, iolock);
+		return ret;
+	}
+
+	/*
+	 * If the IO mode switched while we were locking meaning we hold
+	 * the wrong lock type right now, start again.
+	 */
+	if ((IS_DAX(inode) || !(iocb->ki_flags & IOCB_DIRECT)) &&
+	    iolock != XFS_IOLOCK_EXCL) {
+		xfs_iunlock(ip, iolock);
+		goto relock;
+	}
+
+	if (IS_DAX(inode)) {
+		ret = xfs_file_dax_write(iocb, from, iolock);
+	} else if (iocb->ki_flags & IOCB_DIRECT) {
+		ret = xfs_file_dio_aio_write(iocb, from, iolock);
+		ASSERT(ret != -EREMCHG);
+	} else {
+		ret = xfs_file_buffered_aio_write(iocb, from, iolock,
+						flush_on_enospc);
+		if (ret == -EAGAIN && flush_on_enospc) {
+			/* inode already unlocked, but need another attempt */
+			flush_on_enospc = false;
+			goto relock;
+		}
+		ASSERT(ret != -EAGAIN);
+	}
+	return ret;
 }
 
 static void

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
                   ` (11 preceding siblings ...)
  2020-02-27  5:24 ` [PATCH V5 12/12] Documentation/dax: Update Usage section ira.weiny
@ 2020-03-05 15:51 ` Christoph Hellwig
  2020-03-09 17:04   ` Ira Weiny
  12 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2020-03-05 15:51 UTC (permalink / raw)
  To: ira.weiny
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Christoph Hellwig, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel, akpm, torvalds

FYI, I still will fully NAK any series that adds additional locks
and thus atomic instructions to basically every fs call, and grows
the inode by a rw_semaphore plus and atomic64_t.  I also think the
whole idea of switching operation vectors at runtime is fatally flawed
and we should never add such code, nevermind just for a fringe usecase
of a fringe feature.

On Wed, Feb 26, 2020 at 09:24:30PM -0800, ira.weiny@intel.com wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> Changes from V4:
> 	* Open code the aops lock rather than add it to the xfs_ilock()
> 	  subsystem (Darrick's comments were obsoleted by this change)
> 	* Fix lkp build suggestions and bugs
> 
> Changes from V3:
> 	* Remove global locking...  :-D
> 	* put back per inode locking and remove pre-mature optimizations
> 	* Fix issues with Directories having IS_DAX() set
> 	* Fix kernel crash issues reported by Jeff
> 	* Add some clean up patches
> 	* Consolidate diflags to iflags functions
> 	* Update/add documentation
> 	* Reorder/rename patches quite a bit
> 
> Changes from V2:
> 
> 	* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
> 		Internal discussions with Dan determined this would be easier,
> 		just as performant, and slightly less overhead that having it
> 		in the SB as suggested by Jan
> 	* Fix locking order in comments and throughout code
> 	* Change "mode" to "state" throughout commits
> 	* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
> 		configured
> 	* Add static branch for which is activated by a device which supports
> 		DAX in XFS
> 	* Change "lock/unlock" to up/down read/write as appropriate
> 		Previous names were over simplified
> 	* Update comments/documentation
> 
> 	* Remove the xfs specific lock to the vfs (global) layer.
> 	* Fix i_dax_sem locking order and comments
> 
> 	* Move 'i_mapped' count from struct inode to struct address_space and
> 		rename it to mmap_count
> 	* Add inode_has_mappings() call
> 
> 	* Fix build issues
> 	* Clean up syntax spacing and minor issues
> 	* Update man page text for STATX_ATTR_DAX
> 	* Add reviewed-by's
> 	* Rebase to 5.6
> 
> 	Rename patch:
> 		from: fs/xfs: Add lock/unlock state to xfs
> 		to: fs/xfs: Add write DAX lock to xfs layer
> 	Add patch:
> 		fs/xfs: Clarify lockdep dependency for xfs_isilocked()
> 	Drop patch:
> 		fs/xfs: Fix truncate up
> 
> 
> At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> consumption due to their inability to detect whether the kernel will
> instantiate page cache for a file, and cases where a global dax enable via a
> mount option is too coarse.
> 
> The following patch series enables selecting the use of DAX on individual files
> and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
> scheme the dax mount option can be omitted to allow the per-file property to
> take effect.
> 
> The insight at LSF/MM was to separate the per-mount or per-file "physical"
> capability switch from an "effective" attribute for the file.
> 
> At LSF/MM we discussed the difficulties of switching the DAX state of a file
> with active mappings / page cache.  It was thought the races could be avoided
> by limiting DAX state flips to 0-length files.
> 
> However, this turns out to not be true.[3] This is because address space
> operations (a_ops) may be in use at any time the inode is referenced and users
> have expressed a desire to be able to change the DAX state on a file with data
> in it.  For those reasons this patch set allows changing the DAX state flag on
> a file as long as it is not current mapped.
> 
> Details of when and how DAX state can be changed on a file is included in a
> documentation patch.
> 
> It should be noted that the physical DAX flag inheritance is not shown in this
> patch set as it was maintained from previous work on XFS.  The physical DAX
> flag and it's inheritance will need to be added to other file systems for user
> control. 
> 
> As submitted this works on real hardware testing.
> 
> 
> [1] https://lwn.net/Articles/787973/
> [2] https://lwn.net/Articles/787233/
> [3] https://lkml.org/lkml/2019/10/20/96
> [4] https://patchwork.kernel.org/patch/11310511/
> 
> 
> To: linux-kernel@vger.kernel.org
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
> Cc: Jan Kara <jack@suse.cz>
> Cc: linux-ext4@vger.kernel.org
> Cc: linux-xfs@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> 
> 
> Ira Weiny (12):
>   fs/xfs: Remove unnecessary initialization of i_rwsem
>   fs: Remove unneeded IS_DAX() check
>   fs/stat: Define DAX statx attribute
>   fs/xfs: Isolate the physical DAX flag from enabled
>   fs/xfs: Create function xfs_inode_enable_dax()
>   fs: Add locking for a dynamic address space operations state
>   fs: Prevent DAX state change if file is mmap'ed
>   fs/xfs: Hold off aops users while changing DAX state
>   fs/xfs: Clean up locking in dax invalidate
>   fs/xfs: Allow toggle of effective DAX flag
>   fs/xfs: Remove xfs_diflags_to_linux()
>   Documentation/dax: Update Usage section
> 
>  Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++-
>  Documentation/filesystems/vfs.rst | 16 +++++
>  fs/attr.c                         |  1 +
>  fs/inode.c                        | 16 ++++-
>  fs/iomap/buffered-io.c            |  1 +
>  fs/open.c                         |  4 ++
>  fs/stat.c                         |  5 ++
>  fs/xfs/xfs_icache.c               |  5 +-
>  fs/xfs/xfs_inode.h                |  2 +
>  fs/xfs/xfs_ioctl.c                | 98 +++++++++++++++----------------
>  fs/xfs/xfs_iops.c                 | 69 +++++++++++++++-------
>  include/linux/fs.h                | 73 ++++++++++++++++++++++-
>  include/uapi/linux/stat.h         |  1 +
>  mm/fadvise.c                      |  7 ++-
>  mm/filemap.c                      |  4 ++
>  mm/huge_memory.c                  |  1 +
>  mm/khugepaged.c                   |  2 +
>  mm/mmap.c                         | 19 +++++-
>  mm/util.c                         |  9 ++-
>  19 files changed, 328 insertions(+), 89 deletions(-)
> 
> -- 
> 2.21.0
---end quoted text---

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-05 15:51 ` [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 Christoph Hellwig
@ 2020-03-09 17:04   ` Ira Weiny
  2020-03-11  3:36     ` Darrick J. Wong
  0 siblings, 1 reply; 45+ messages in thread
From: Ira Weiny @ 2020-03-09 17:04 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, Alexander Viro, Darrick J. Wong, Dan Williams,
	Dave Chinner, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel, akpm, torvalds

On Thu, Mar 05, 2020 at 04:51:44PM +0100, Christoph Hellwig wrote:
> FYI, I still will fully NAK any series that adds additional locks
> and thus atomic instructions to basically every fs call, and grows
> the inode by a rw_semaphore plus and atomic64_t.  I also think the
> whole idea of switching operation vectors at runtime is fatally flawed
> and we should never add such code, nevermind just for a fringe usecase
> of a fringe feature.

Being new to this area of the kernel I'm not clear on the history...

It was my understanding that the per-file flag support was a requirement to
removing the experimental designation from DAX.  Is this still the case?

Ira

> 
> On Wed, Feb 26, 2020 at 09:24:30PM -0800, ira.weiny@intel.com wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 
> > Changes from V4:
> > 	* Open code the aops lock rather than add it to the xfs_ilock()
> > 	  subsystem (Darrick's comments were obsoleted by this change)
> > 	* Fix lkp build suggestions and bugs
> > 
> > Changes from V3:
> > 	* Remove global locking...  :-D
> > 	* put back per inode locking and remove pre-mature optimizations
> > 	* Fix issues with Directories having IS_DAX() set
> > 	* Fix kernel crash issues reported by Jeff
> > 	* Add some clean up patches
> > 	* Consolidate diflags to iflags functions
> > 	* Update/add documentation
> > 	* Reorder/rename patches quite a bit
> > 
> > Changes from V2:
> > 
> > 	* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
> > 		Internal discussions with Dan determined this would be easier,
> > 		just as performant, and slightly less overhead that having it
> > 		in the SB as suggested by Jan
> > 	* Fix locking order in comments and throughout code
> > 	* Change "mode" to "state" throughout commits
> > 	* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
> > 		configured
> > 	* Add static branch for which is activated by a device which supports
> > 		DAX in XFS
> > 	* Change "lock/unlock" to up/down read/write as appropriate
> > 		Previous names were over simplified
> > 	* Update comments/documentation
> > 
> > 	* Remove the xfs specific lock to the vfs (global) layer.
> > 	* Fix i_dax_sem locking order and comments
> > 
> > 	* Move 'i_mapped' count from struct inode to struct address_space and
> > 		rename it to mmap_count
> > 	* Add inode_has_mappings() call
> > 
> > 	* Fix build issues
> > 	* Clean up syntax spacing and minor issues
> > 	* Update man page text for STATX_ATTR_DAX
> > 	* Add reviewed-by's
> > 	* Rebase to 5.6
> > 
> > 	Rename patch:
> > 		from: fs/xfs: Add lock/unlock state to xfs
> > 		to: fs/xfs: Add write DAX lock to xfs layer
> > 	Add patch:
> > 		fs/xfs: Clarify lockdep dependency for xfs_isilocked()
> > 	Drop patch:
> > 		fs/xfs: Fix truncate up
> > 
> > 
> > At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> > consumption due to their inability to detect whether the kernel will
> > instantiate page cache for a file, and cases where a global dax enable via a
> > mount option is too coarse.
> > 
> > The following patch series enables selecting the use of DAX on individual files
> > and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
> > scheme the dax mount option can be omitted to allow the per-file property to
> > take effect.
> > 
> > The insight at LSF/MM was to separate the per-mount or per-file "physical"
> > capability switch from an "effective" attribute for the file.
> > 
> > At LSF/MM we discussed the difficulties of switching the DAX state of a file
> > with active mappings / page cache.  It was thought the races could be avoided
> > by limiting DAX state flips to 0-length files.
> > 
> > However, this turns out to not be true.[3] This is because address space
> > operations (a_ops) may be in use at any time the inode is referenced and users
> > have expressed a desire to be able to change the DAX state on a file with data
> > in it.  For those reasons this patch set allows changing the DAX state flag on
> > a file as long as it is not current mapped.
> > 
> > Details of when and how DAX state can be changed on a file is included in a
> > documentation patch.
> > 
> > It should be noted that the physical DAX flag inheritance is not shown in this
> > patch set as it was maintained from previous work on XFS.  The physical DAX
> > flag and it's inheritance will need to be added to other file systems for user
> > control. 
> > 
> > As submitted this works on real hardware testing.
> > 
> > 
> > [1] https://lwn.net/Articles/787973/
> > [2] https://lwn.net/Articles/787233/
> > [3] https://lkml.org/lkml/2019/10/20/96
> > [4] https://patchwork.kernel.org/patch/11310511/
> > 
> > 
> > To: linux-kernel@vger.kernel.org
> > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: linux-ext4@vger.kernel.org
> > Cc: linux-xfs@vger.kernel.org
> > Cc: linux-fsdevel@vger.kernel.org
> > 
> > 
> > Ira Weiny (12):
> >   fs/xfs: Remove unnecessary initialization of i_rwsem
> >   fs: Remove unneeded IS_DAX() check
> >   fs/stat: Define DAX statx attribute
> >   fs/xfs: Isolate the physical DAX flag from enabled
> >   fs/xfs: Create function xfs_inode_enable_dax()
> >   fs: Add locking for a dynamic address space operations state
> >   fs: Prevent DAX state change if file is mmap'ed
> >   fs/xfs: Hold off aops users while changing DAX state
> >   fs/xfs: Clean up locking in dax invalidate
> >   fs/xfs: Allow toggle of effective DAX flag
> >   fs/xfs: Remove xfs_diflags_to_linux()
> >   Documentation/dax: Update Usage section
> > 
> >  Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++-
> >  Documentation/filesystems/vfs.rst | 16 +++++
> >  fs/attr.c                         |  1 +
> >  fs/inode.c                        | 16 ++++-
> >  fs/iomap/buffered-io.c            |  1 +
> >  fs/open.c                         |  4 ++
> >  fs/stat.c                         |  5 ++
> >  fs/xfs/xfs_icache.c               |  5 +-
> >  fs/xfs/xfs_inode.h                |  2 +
> >  fs/xfs/xfs_ioctl.c                | 98 +++++++++++++++----------------
> >  fs/xfs/xfs_iops.c                 | 69 +++++++++++++++-------
> >  include/linux/fs.h                | 73 ++++++++++++++++++++++-
> >  include/uapi/linux/stat.h         |  1 +
> >  mm/fadvise.c                      |  7 ++-
> >  mm/filemap.c                      |  4 ++
> >  mm/huge_memory.c                  |  1 +
> >  mm/khugepaged.c                   |  2 +
> >  mm/mmap.c                         | 19 +++++-
> >  mm/util.c                         |  9 ++-
> >  19 files changed, 328 insertions(+), 89 deletions(-)
> > 
> > -- 
> > 2.21.0
> ---end quoted text---

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-09 17:04   ` Ira Weiny
@ 2020-03-11  3:36     ` Darrick J. Wong
  2020-03-11  6:29       ` Christoph Hellwig
  2020-03-11  6:39       ` Dave Chinner
  0 siblings, 2 replies; 45+ messages in thread
From: Darrick J. Wong @ 2020-03-11  3:36 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Christoph Hellwig, linux-kernel, Alexander Viro, Dan Williams,
	Dave Chinner, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel, akpm, torvalds

On Mon, Mar 09, 2020 at 10:04:37AM -0700, Ira Weiny wrote:
> On Thu, Mar 05, 2020 at 04:51:44PM +0100, Christoph Hellwig wrote:
> > FYI, I still will fully NAK any series that adds additional locks
> > and thus atomic instructions to basically every fs call, and grows
> > the inode by a rw_semaphore plus and atomic64_t.  I also think the
> > whole idea of switching operation vectors at runtime is fatally flawed
> > and we should never add such code, nevermind just for a fringe usecase
> > of a fringe feature.
> 
> Being new to this area of the kernel I'm not clear on the history...

I /think/ the TLDR version in no particular order is:

- Some people expressed interest in being able to control page cache vs.
  direct access on pmem hardware at a higher granularity than the entire
  fs.

- Dave Chinner(?) added the per-inode flag intending it to be the sign
  that would flip on DAX.

- Someone (I forget who) made it a mount option that would enable it for
  all files regardless of inode flags and whatnot.

- Eric Sandeen(?) complained that the behavior of the dax inode flag was
  weird, particularly the part where you set the iflag and at some point
  if and when the inode gets reclaimed and then reconstituted then the
  change will finally take place.

- Christoph Hellwig objected on various grounds (the kernel is
  responsible for selecting the most appropriate hardware abstraction
  for the usage pattern; a binary flag doesn't capture enough detail for
  potential future pmem hardware; and now additional locking overhead).

- There's been (I hope) a long term understanding that the mount option
  will go away eventually, and not after we remove the EXPERIMENTAL
  tags.

(FWIW I tend to agree with Eric and Christoph, but I also thought it
would be useful at least to see what changeable file operations would be
like; if there were other users who had already implemented it; and how
much of an apetite there was for revoke().)

Hopefully I summarized that more or less accurately...

> It was my understanding that the per-file flag support was a requirement to
> removing the experimental designation from DAX.  Is this still the case?

Nailing down the meaning of the per-file dax flag is/was the requirement,
even if we kill it off entirely in the end.

Given Christoph's veto threat, I suppose that leaves the following
options?

1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX
status via statx.  Document that changes to FS_XFLAG_DAX do not take
effect immediately and that one must check statx to find out the real
mode.  If we choose this, I would also deprecate the dax mount option;
send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on
all files at mkfs time; and we can finally lay this whole thing to rest.
This is the closest to what we have today.

2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on
usage patterns, hardware heuristics, or spiteful arbitrariness.

Can we please pick (1) and just be done with this?  I want to move on.

--D

There are still other things that need to be ironed out WRT pmem:

a) reflink and page/pfn/whatever sharing -- fix the mm or (ab)use the
xfs buffer cache, or something worse?

b) getting our stories straight on how to clear poison, and whether or
not we can come up with a better story for ZERO_FILE_RANGE on pmem.  In
the ideal world I'd love to see Z_F_R actually memset(0) the pmem and
clear poison, at least if the file->pmem mappings were contiguous.

c) wiring up xfs to hwpoison, or wiring up hwpoison to xfs, or otherwise
figuring out how to get storage to tell xfs that it lost something so
that maybe xfs can fix it quickly

> Ira
> 
> > 
> > On Wed, Feb 26, 2020 at 09:24:30PM -0800, ira.weiny@intel.com wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> > > Changes from V4:
> > > 	* Open code the aops lock rather than add it to the xfs_ilock()
> > > 	  subsystem (Darrick's comments were obsoleted by this change)
> > > 	* Fix lkp build suggestions and bugs
> > > 
> > > Changes from V3:
> > > 	* Remove global locking...  :-D
> > > 	* put back per inode locking and remove pre-mature optimizations
> > > 	* Fix issues with Directories having IS_DAX() set
> > > 	* Fix kernel crash issues reported by Jeff
> > > 	* Add some clean up patches
> > > 	* Consolidate diflags to iflags functions
> > > 	* Update/add documentation
> > > 	* Reorder/rename patches quite a bit
> > > 
> > > Changes from V2:
> > > 
> > > 	* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
> > > 		Internal discussions with Dan determined this would be easier,
> > > 		just as performant, and slightly less overhead that having it
> > > 		in the SB as suggested by Jan
> > > 	* Fix locking order in comments and throughout code
> > > 	* Change "mode" to "state" throughout commits
> > > 	* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
> > > 		configured
> > > 	* Add static branch for which is activated by a device which supports
> > > 		DAX in XFS
> > > 	* Change "lock/unlock" to up/down read/write as appropriate
> > > 		Previous names were over simplified
> > > 	* Update comments/documentation
> > > 
> > > 	* Remove the xfs specific lock to the vfs (global) layer.
> > > 	* Fix i_dax_sem locking order and comments
> > > 
> > > 	* Move 'i_mapped' count from struct inode to struct address_space and
> > > 		rename it to mmap_count
> > > 	* Add inode_has_mappings() call
> > > 
> > > 	* Fix build issues
> > > 	* Clean up syntax spacing and minor issues
> > > 	* Update man page text for STATX_ATTR_DAX
> > > 	* Add reviewed-by's
> > > 	* Rebase to 5.6
> > > 
> > > 	Rename patch:
> > > 		from: fs/xfs: Add lock/unlock state to xfs
> > > 		to: fs/xfs: Add write DAX lock to xfs layer
> > > 	Add patch:
> > > 		fs/xfs: Clarify lockdep dependency for xfs_isilocked()
> > > 	Drop patch:
> > > 		fs/xfs: Fix truncate up
> > > 
> > > 
> > > At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> > > consumption due to their inability to detect whether the kernel will
> > > instantiate page cache for a file, and cases where a global dax enable via a
> > > mount option is too coarse.
> > > 
> > > The following patch series enables selecting the use of DAX on individual files
> > > and/or directories on xfs, and lays some groundwork to do so in ext4.  In this
> > > scheme the dax mount option can be omitted to allow the per-file property to
> > > take effect.
> > > 
> > > The insight at LSF/MM was to separate the per-mount or per-file "physical"
> > > capability switch from an "effective" attribute for the file.
> > > 
> > > At LSF/MM we discussed the difficulties of switching the DAX state of a file
> > > with active mappings / page cache.  It was thought the races could be avoided
> > > by limiting DAX state flips to 0-length files.
> > > 
> > > However, this turns out to not be true.[3] This is because address space
> > > operations (a_ops) may be in use at any time the inode is referenced and users
> > > have expressed a desire to be able to change the DAX state on a file with data
> > > in it.  For those reasons this patch set allows changing the DAX state flag on
> > > a file as long as it is not current mapped.
> > > 
> > > Details of when and how DAX state can be changed on a file is included in a
> > > documentation patch.
> > > 
> > > It should be noted that the physical DAX flag inheritance is not shown in this
> > > patch set as it was maintained from previous work on XFS.  The physical DAX
> > > flag and it's inheritance will need to be added to other file systems for user
> > > control. 
> > > 
> > > As submitted this works on real hardware testing.
> > > 
> > > 
> > > [1] https://lwn.net/Articles/787973/
> > > [2] https://lwn.net/Articles/787233/
> > > [3] https://lkml.org/lkml/2019/10/20/96
> > > [4] https://patchwork.kernel.org/patch/11310511/
> > > 
> > > 
> > > To: linux-kernel@vger.kernel.org
> > > Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> > > Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Cc: Christoph Hellwig <hch@lst.de>
> > > Cc: "Theodore Y. Ts'o" <tytso@mit.edu>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: linux-ext4@vger.kernel.org
> > > Cc: linux-xfs@vger.kernel.org
> > > Cc: linux-fsdevel@vger.kernel.org
> > > 
> > > 
> > > Ira Weiny (12):
> > >   fs/xfs: Remove unnecessary initialization of i_rwsem
> > >   fs: Remove unneeded IS_DAX() check
> > >   fs/stat: Define DAX statx attribute
> > >   fs/xfs: Isolate the physical DAX flag from enabled
> > >   fs/xfs: Create function xfs_inode_enable_dax()
> > >   fs: Add locking for a dynamic address space operations state
> > >   fs: Prevent DAX state change if file is mmap'ed
> > >   fs/xfs: Hold off aops users while changing DAX state
> > >   fs/xfs: Clean up locking in dax invalidate
> > >   fs/xfs: Allow toggle of effective DAX flag
> > >   fs/xfs: Remove xfs_diflags_to_linux()
> > >   Documentation/dax: Update Usage section
> > > 
> > >  Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++-
> > >  Documentation/filesystems/vfs.rst | 16 +++++
> > >  fs/attr.c                         |  1 +
> > >  fs/inode.c                        | 16 ++++-
> > >  fs/iomap/buffered-io.c            |  1 +
> > >  fs/open.c                         |  4 ++
> > >  fs/stat.c                         |  5 ++
> > >  fs/xfs/xfs_icache.c               |  5 +-
> > >  fs/xfs/xfs_inode.h                |  2 +
> > >  fs/xfs/xfs_ioctl.c                | 98 +++++++++++++++----------------
> > >  fs/xfs/xfs_iops.c                 | 69 +++++++++++++++-------
> > >  include/linux/fs.h                | 73 ++++++++++++++++++++++-
> > >  include/uapi/linux/stat.h         |  1 +
> > >  mm/fadvise.c                      |  7 ++-
> > >  mm/filemap.c                      |  4 ++
> > >  mm/huge_memory.c                  |  1 +
> > >  mm/khugepaged.c                   |  2 +
> > >  mm/mmap.c                         | 19 +++++-
> > >  mm/util.c                         |  9 ++-
> > >  19 files changed, 328 insertions(+), 89 deletions(-)
> > > 
> > > -- 
> > > 2.21.0
> > ---end quoted text---

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-11  3:36     ` Darrick J. Wong
@ 2020-03-11  6:29       ` Christoph Hellwig
  2020-03-11 17:07         ` Dan Williams
  2020-03-11  6:39       ` Dave Chinner
  1 sibling, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2020-03-11  6:29 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ira Weiny, Christoph Hellwig, linux-kernel, Alexander Viro,
	Dan Williams, Dave Chinner, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel, akpm, torvalds

On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote:
> 1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX
> status via statx.  Document that changes to FS_XFLAG_DAX do not take
> effect immediately and that one must check statx to find out the real
> mode.  If we choose this, I would also deprecate the dax mount option;
> send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on
> all files at mkfs time; and we can finally lay this whole thing to rest.
> This is the closest to what we have today.
> 
> 2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on
> usage patterns, hardware heuristics, or spiteful arbitrariness.

3) Only allow changing FS_XFLAG_DAX on directories or files that do
not have blocks allocated to them yet, and side step all the hard
problems.

Which of course still side steps the hard question of what it actually
is supposed to mean..

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-11  3:36     ` Darrick J. Wong
  2020-03-11  6:29       ` Christoph Hellwig
@ 2020-03-11  6:39       ` Dave Chinner
  2020-03-11  6:44         ` Christoph Hellwig
  1 sibling, 1 reply; 45+ messages in thread
From: Dave Chinner @ 2020-03-11  6:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Ira Weiny, Christoph Hellwig, linux-kernel, Alexander Viro,
	Dan Williams, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel, akpm, torvalds

On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote:
> There are still other things that need to be ironed out WRT pmem:
> 
> a) reflink and page/pfn/whatever sharing -- fix the mm or (ab)use the
> xfs buffer cache, or something worse?

I don't think we need either. We just need to remove the DAX page
association for hwpoison that requires the struct page to store the
mapping and index. Get rid of that and we should be able to  safely
map the same page into different inode address spaces at the same
time. When we write a shared page, we COW it immediately and replace
the page in the inode's mapping tree, so we can't actually write to
a shared page...

IOWs, the dax_associate_page() related functionality probably needs
to be a filesystem callout - part of the aops vector, I think, so
that device dax can still use it. That way XFS can go it's own way,
while ext4 and device dax can continue to use the existing mechanism
mechanisn that is currently implemented....

XFS can then make use of rmapbt to find the owners on a bad page
notification, and run the "kill userspace dead dead dead" lookup on
each mapping/index tuple rather than pass it around on a struct
page. i.e. we'll do a kill scan for each mapping/index owner tuple
we find, not just one.

That requires converting all the current vma killer code to pass
mapping/index tuples around rather than the struct page. That kill
code doesn't actually need the struct page, it just needs the
mapping/index tuple to match to the vmas that have it mapped into
userspace.

> b) getting our stories straight on how to clear poison, and whether or
> not we can come up with a better story for ZERO_FILE_RANGE on pmem.  In
> the ideal world I'd love to see Z_F_R actually memset(0) the pmem and
> clear poison, at least if the file->pmem mappings were contiguous.

Are you talking about ZFR from userspace through the filesystem (how
do you clear poison in free space?) or ZFR on the dax device fro
either userspace or the kernel filesystem code?

> c) wiring up xfs to hwpoison, or wiring up hwpoison to xfs, or otherwise
> figuring out how to get storage to tell xfs that it lost something so
> that maybe xfs can fix it quickly

Yup, I think that's a dax device callback into the filesystem. i.e
the hwpoison gets delivered to the dax device, which then calls the
fs function rather than do it's current "dax_lock_page(), kill
userspace dead dead dead, dax_unlock_page()" dance. The filesystem
can do a much more intricate dance and wreak far more havoc on
userspace than what the dax device can do.....

Copious amounts of unused time are things I don't have,
unfortunately. Only having 7 fingers to type with right now doesn't
help, either.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-11  6:39       ` Dave Chinner
@ 2020-03-11  6:44         ` Christoph Hellwig
  2020-03-11 17:07           ` Dan Williams
  2020-03-12  0:49           ` Dave Chinner
  0 siblings, 2 replies; 45+ messages in thread
From: Christoph Hellwig @ 2020-03-11  6:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Ira Weiny, Christoph Hellwig, linux-kernel,
	Alexander Viro, Dan Williams, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel, akpm, torvalds

On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote:
> IOWs, the dax_associate_page() related functionality probably needs
> to be a filesystem callout - part of the aops vector, I think, so
> that device dax can still use it. That way XFS can go it's own way,
> while ext4 and device dax can continue to use the existing mechanism
> mechanisn that is currently implemented....

s/XFS/XFS with rmap/, as most XFS file systems currently don't have
that enabled we'll also need to keep the legacy path around.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-11  6:29       ` Christoph Hellwig
@ 2020-03-11 17:07         ` Dan Williams
  2020-03-16  9:52           ` Jan Kara
  0 siblings, 1 reply; 45+ messages in thread
From: Dan Williams @ 2020-03-11 17:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Ira Weiny, Linux Kernel Mailing List,
	Alexander Viro, Dave Chinner, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel, Andrew Morton,
	Linus Torvalds

On Tue, Mar 10, 2020 at 11:30 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote:
> > 1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX
> > status via statx.  Document that changes to FS_XFLAG_DAX do not take
> > effect immediately and that one must check statx to find out the real
> > mode.  If we choose this, I would also deprecate the dax mount option;
> > send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on
> > all files at mkfs time; and we can finally lay this whole thing to rest.
> > This is the closest to what we have today.
> >
> > 2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on
> > usage patterns, hardware heuristics, or spiteful arbitrariness.
>
> 3) Only allow changing FS_XFLAG_DAX on directories or files that do
> not have blocks allocated to them yet, and side step all the hard
> problems.

This sounds reasonable to me.

As for deprecating the mount option, I think at a minimum it needs to
continue be accepted as an option even if it is ignored to not break
existing setups. We're currently going through the prolonged flag day
of people discovering that if they update xfsprogs they need to
specify "-m reflink=0" to mkfs.xfs. That pain seems to have only been
a road bump not a showstopper based on the bug reports I've seen. If
anything it has added helpful pressure towards getting reflink support
bumped up in the priority. Hopefully the xfs position that the dax
mount option can be ignored makes it possible to implement the same
policy on ext4, and we can just move on...

> Which of course still side steps the hard question of what it actually
> is supposed to mean..

If we have statx to indicate the effective dax-state that addresses
the pain for applications that want to account for dax in their page
cache pressure estimates, and lets FS_XFLAG_DAX not need to specify
precise semantics.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-11  6:44         ` Christoph Hellwig
@ 2020-03-11 17:07           ` Dan Williams
  2020-03-12  0:49           ` Dave Chinner
  1 sibling, 0 replies; 45+ messages in thread
From: Dan Williams @ 2020-03-11 17:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Chinner, Darrick J. Wong, Ira Weiny,
	Linux Kernel Mailing List, Alexander Viro, Theodore Y. Ts'o,
	Jan Kara, linux-ext4, linux-xfs, linux-fsdevel, Andrew Morton,
	Linus Torvalds

On Tue, Mar 10, 2020 at 11:44 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote:
> > IOWs, the dax_associate_page() related functionality probably needs
> > to be a filesystem callout - part of the aops vector, I think, so
> > that device dax can still use it. That way XFS can go it's own way,
> > while ext4 and device dax can continue to use the existing mechanism
> > mechanisn that is currently implemented....
>
> s/XFS/XFS with rmap/, as most XFS file systems currently don't have
> that enabled we'll also need to keep the legacy path around.

Agree, it needs to be an opt-in capability as ext4 and xfs w/o rmap
will stay on the legacy path for the foreseeable future.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-11  6:44         ` Christoph Hellwig
  2020-03-11 17:07           ` Dan Williams
@ 2020-03-12  0:49           ` Dave Chinner
  2020-03-12  3:00             ` Darrick J. Wong
  2020-03-12  7:27             ` Christoph Hellwig
  1 sibling, 2 replies; 45+ messages in thread
From: Dave Chinner @ 2020-03-12  0:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Ira Weiny, linux-kernel, Alexander Viro,
	Dan Williams, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel, akpm, torvalds

On Wed, Mar 11, 2020 at 07:44:12AM +0100, Christoph Hellwig wrote:
> On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote:
> > IOWs, the dax_associate_page() related functionality probably needs
> > to be a filesystem callout - part of the aops vector, I think, so
> > that device dax can still use it. That way XFS can go it's own way,
> > while ext4 and device dax can continue to use the existing mechanism
> > mechanisn that is currently implemented....
> 
> s/XFS/XFS with rmap/, as most XFS file systems currently don't have
> that enabled we'll also need to keep the legacy path around.

Sure, that's trivially easy to handle in the XFS code once the
callouts are in place.

But, quite frankly, we can enforce rmap to be enabled 
enabled because nobody is using a reflink enabled FS w/ DAX right
now. Everyone will have to mkfs their filesystems anyway to enable
reflink+dax, so we simply don't allow reflink+dax to be enabled
unless rmap is also enabled. Simple, easy, trivial.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-12  0:49           ` Dave Chinner
@ 2020-03-12  3:00             ` Darrick J. Wong
  2020-03-12  7:27             ` Christoph Hellwig
  1 sibling, 0 replies; 45+ messages in thread
From: Darrick J. Wong @ 2020-03-12  3:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Ira Weiny, linux-kernel, Alexander Viro,
	Dan Williams, Theodore Y. Ts'o, Jan Kara, linux-ext4,
	linux-xfs, linux-fsdevel, akpm, torvalds

On Thu, Mar 12, 2020 at 11:49:32AM +1100, Dave Chinner wrote:
> On Wed, Mar 11, 2020 at 07:44:12AM +0100, Christoph Hellwig wrote:
> > On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote:
> > > IOWs, the dax_associate_page() related functionality probably needs
> > > to be a filesystem callout - part of the aops vector, I think, so
> > > that device dax can still use it. That way XFS can go it's own way,
> > > while ext4 and device dax can continue to use the existing mechanism
> > > mechanisn that is currently implemented....
> > 
> > s/XFS/XFS with rmap/, as most XFS file systems currently don't have
> > that enabled we'll also need to keep the legacy path around.
> 
> Sure, that's trivially easy to handle in the XFS code once the
> callouts are in place.
> 
> But, quite frankly, we can enforce rmap to be enabled 
> enabled because nobody is using a reflink enabled FS w/ DAX right
> now. Everyone will have to mkfs their filesystems anyway to enable
> reflink+dax, so we simply don't allow reflink+dax to be enabled
> unless rmap is also enabled. Simple, easy, trivial.

Heh, this reminds me that I need to get that rmap performance analysis
report out to the list... it does have a fairly substantial performance
impact (in its current not-terribly-optimized state) but otoh enables
self-repair.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-12  0:49           ` Dave Chinner
  2020-03-12  3:00             ` Darrick J. Wong
@ 2020-03-12  7:27             ` Christoph Hellwig
  1 sibling, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2020-03-12  7:27 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, Darrick J. Wong, Ira Weiny, linux-kernel,
	Alexander Viro, Dan Williams, Theodore Y. Ts'o, Jan Kara,
	linux-ext4, linux-xfs, linux-fsdevel, akpm, torvalds

On Thu, Mar 12, 2020 at 11:49:32AM +1100, Dave Chinner wrote:
> > > IOWs, the dax_associate_page() related functionality probably needs
> > > to be a filesystem callout - part of the aops vector, I think, so
> > > that device dax can still use it. That way XFS can go it's own way,
> > > while ext4 and device dax can continue to use the existing mechanism
> > > mechanisn that is currently implemented....
> > 
> > s/XFS/XFS with rmap/, as most XFS file systems currently don't have
> > that enabled we'll also need to keep the legacy path around.
> 
> Sure, that's trivially easy to handle in the XFS code once the
> callouts are in place.
> 
> But, quite frankly, we can enforce rmap to be enabled 
> enabled because nobody is using a reflink enabled FS w/ DAX right
> now. Everyone will have to mkfs their filesystems anyway to enable
> reflink+dax, so we simply don't allow reflink+dax to be enabled
> unless rmap is also enabled. Simple, easy, trivial.

True, I think rmap will be required for DAX+reflink.  But we still
need the legacy infrastructure for error reporting.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-11 17:07         ` Dan Williams
@ 2020-03-16  9:52           ` Jan Kara
  2020-03-16  9:55             ` Christoph Hellwig
  0 siblings, 1 reply; 45+ messages in thread
From: Jan Kara @ 2020-03-16  9:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Darrick J. Wong, Ira Weiny,
	Linux Kernel Mailing List, Alexander Viro, Dave Chinner,
	Theodore Y. Ts'o, Jan Kara, linux-ext4, linux-xfs,
	linux-fsdevel, Andrew Morton, Linus Torvalds

On Wed 11-03-20 10:07:18, Dan Williams wrote:
> On Tue, Mar 10, 2020 at 11:30 PM Christoph Hellwig <hch@lst.de> wrote:
> >
> > On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote:
> > > 1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX
> > > status via statx.  Document that changes to FS_XFLAG_DAX do not take
> > > effect immediately and that one must check statx to find out the real
> > > mode.  If we choose this, I would also deprecate the dax mount option;
> > > send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on
> > > all files at mkfs time; and we can finally lay this whole thing to rest.
> > > This is the closest to what we have today.
> > >
> > > 2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on
> > > usage patterns, hardware heuristics, or spiteful arbitrariness.
> >
> > 3) Only allow changing FS_XFLAG_DAX on directories or files that do
> > not have blocks allocated to them yet, and side step all the hard
> > problems.
> 
> This sounds reasonable to me.
> 
> As for deprecating the mount option, I think at a minimum it needs to
> continue be accepted as an option even if it is ignored to not break
> existing setups.

Agreed. But that's how we usually deprecate mount options. Also I'd say
that statx() support for reporting DAX state and some education of
programmers using DAX is required before we deprecate the mount option
since currently applications check 'dax' mount option to determine how much
memory they need to set aside for page cache before they consume everything
else on the machine...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-16  9:52           ` Jan Kara
@ 2020-03-16  9:55             ` Christoph Hellwig
  2020-04-01  4:00               ` Darrick J. Wong
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2020-03-16  9:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, Christoph Hellwig, Darrick J. Wong, Ira Weiny,
	Linux Kernel Mailing List, Alexander Viro, Dave Chinner,
	Theodore Y. Ts'o, linux-ext4, linux-xfs, linux-fsdevel,
	Andrew Morton, Linus Torvalds

On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote:
> > This sounds reasonable to me.
> > 
> > As for deprecating the mount option, I think at a minimum it needs to
> > continue be accepted as an option even if it is ignored to not break
> > existing setups.
> 
> Agreed. But that's how we usually deprecate mount options. Also I'd say
> that statx() support for reporting DAX state and some education of
> programmers using DAX is required before we deprecate the mount option
> since currently applications check 'dax' mount option to determine how much
> memory they need to set aside for page cache before they consume everything
> else on the machine...

I don't even think we should deprecate it.  It isn't painful to maintain
and actually useful for testing.  Instead we should expand it into a
tristate:

  dax=off
  dax=flag
  dax=always

where the existing "dax" option maps to "dax=always" and nodax maps
to "dax=off". and dax=flag becomes the default for DAX capable devices.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-03-16  9:55             ` Christoph Hellwig
@ 2020-04-01  4:00               ` Darrick J. Wong
  2020-04-01 10:25                 ` Jan Kara
  2020-04-03  4:39                 ` Ira Weiny
  0 siblings, 2 replies; 45+ messages in thread
From: Darrick J. Wong @ 2020-04-01  4:00 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner, Theodore Y. Ts'o,
	Dan Williams, Jan Kara, Ira Weiny
  Cc: Linux Kernel Mailing List, Alexander Viro, linux-ext4, linux-xfs,
	linux-fsdevel, Andrew Morton, Linus Torvalds

On Mon, Mar 16, 2020 at 10:55:09AM +0100, Christoph Hellwig wrote:
> On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote:
> > > This sounds reasonable to me.
> > > 
> > > As for deprecating the mount option, I think at a minimum it needs to
> > > continue be accepted as an option even if it is ignored to not break
> > > existing setups.
> > 
> > Agreed. But that's how we usually deprecate mount options. Also I'd say
> > that statx() support for reporting DAX state and some education of
> > programmers using DAX is required before we deprecate the mount option
> > since currently applications check 'dax' mount option to determine how much
> > memory they need to set aside for page cache before they consume everything
> > else on the machine...
> 
> I don't even think we should deprecate it.  It isn't painful to maintain
> and actually useful for testing.  Instead we should expand it into a
> tristate:
> 
>   dax=off
>   dax=flag
>   dax=always
> 
> where the existing "dax" option maps to "dax=always" and nodax maps
> to "dax=off". and dax=flag becomes the default for DAX capable devices.

That works for me.  In summary:

 - Applications must call statx to discover the current S_DAX state.

 - There exists an advisory file inode flag FS_XFLAG_DAX that can be
   changed on files that have no blocks allocated to them.  Changing
   this flag does not necessarily change the S_DAX state immediately
   but programs can query the S_DAX state via statx.

   If FS_XFLAG_DAX is set and the fs is on pmem then it will always
   enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
   never enable S_DAX.  Unless overridden...

 - There exists a dax= mount option.  dax=off means "never set S_DAX,
   ignore FS_XFLAG_DAX"; dax=always means "always set S_DAX (at least on
   pmem), ignore FS_XFLAG_DAX"; and dax=iflag means "follow FS_XFLAG_DAX"
   and is the default.  "dax" by itself means "dax=always".  "nodax"
   means "dax=off".

 - There exists an advisory directory inode flag FS_XFLAG_DAX that can
   be changed at any time.  The flag state is copied into any files or
   subdirectories created within that directory.  If programs require
   that file access runs in S_DAX mode, they'll have to create those
   files themselves inside a directory with FS_XFLAG_DAX set, or mount
   the fs with dax=always.

Ok?  Let's please get this part finished for 5.8, then we can get back
to arguing about fs-rmap and reflink and dax and whatnot.

--D

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-01  4:00               ` Darrick J. Wong
@ 2020-04-01 10:25                 ` Jan Kara
  2020-04-02  8:53                   ` Christoph Hellwig
  2020-04-03  4:39                 ` Ira Weiny
  1 sibling, 1 reply; 45+ messages in thread
From: Jan Kara @ 2020-04-01 10:25 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Dave Chinner, Theodore Y. Ts'o,
	Dan Williams, Jan Kara, Ira Weiny, Linux Kernel Mailing List,
	Alexander Viro, linux-ext4, linux-xfs, linux-fsdevel,
	Andrew Morton, Linus Torvalds

On Wed 01-04-20 04:00:21, Darrick J. Wong wrote:
> On Mon, Mar 16, 2020 at 10:55:09AM +0100, Christoph Hellwig wrote:
> > On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote:
> > > > This sounds reasonable to me.
> > > > 
> > > > As for deprecating the mount option, I think at a minimum it needs to
> > > > continue be accepted as an option even if it is ignored to not break
> > > > existing setups.
> > > 
> > > Agreed. But that's how we usually deprecate mount options. Also I'd say
> > > that statx() support for reporting DAX state and some education of
> > > programmers using DAX is required before we deprecate the mount option
> > > since currently applications check 'dax' mount option to determine how much
> > > memory they need to set aside for page cache before they consume everything
> > > else on the machine...
> > 
> > I don't even think we should deprecate it.  It isn't painful to maintain
> > and actually useful for testing.  Instead we should expand it into a
> > tristate:
> > 
> >   dax=off
> >   dax=flag
> >   dax=always
> > 
> > where the existing "dax" option maps to "dax=always" and nodax maps
> > to "dax=off". and dax=flag becomes the default for DAX capable devices.
> 
> That works for me.  In summary:
> 
>  - Applications must call statx to discover the current S_DAX state.
> 
>  - There exists an advisory file inode flag FS_XFLAG_DAX that can be
>    changed on files that have no blocks allocated to them.  Changing
>    this flag does not necessarily change the S_DAX state immediately
>    but programs can query the S_DAX state via statx.

I generally like the proposal but I think the fact that toggling
FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite some
confusion and ultimately bug reports. I'm thinking whether we could somehow
improve this... For example an ioctl that would try to make set inode flags
effective by evicting the inode (and returning EBUSY if the eviction is
impossible for some reason)?

								Honza

> 
>    If FS_XFLAG_DAX is set and the fs is on pmem then it will always
>    enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
>    never enable S_DAX.  Unless overridden...
> 
>  - There exists a dax= mount option.  dax=off means "never set S_DAX,
>    ignore FS_XFLAG_DAX"; dax=always means "always set S_DAX (at least on
>    pmem), ignore FS_XFLAG_DAX"; and dax=iflag means "follow FS_XFLAG_DAX"
>    and is the default.  "dax" by itself means "dax=always".  "nodax"
>    means "dax=off".
> 
>  - There exists an advisory directory inode flag FS_XFLAG_DAX that can
>    be changed at any time.  The flag state is copied into any files or
>    subdirectories created within that directory.  If programs require
>    that file access runs in S_DAX mode, they'll have to create those
>    files themselves inside a directory with FS_XFLAG_DAX set, or mount
>    the fs with dax=always.
> 
> Ok?  Let's please get this part finished for 5.8, then we can get back
> to arguing about fs-rmap and reflink and dax and whatnot.
> 
> --D
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-01 10:25                 ` Jan Kara
@ 2020-04-02  8:53                   ` Christoph Hellwig
  2020-04-02 20:55                     ` Ira Weiny
  0 siblings, 1 reply; 45+ messages in thread
From: Christoph Hellwig @ 2020-04-02  8:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: Darrick J. Wong, Christoph Hellwig, Dave Chinner,
	Theodore Y. Ts'o, Dan Williams, Ira Weiny,
	Linux Kernel Mailing List, Alexander Viro, linux-ext4, linux-xfs,
	linux-fsdevel, Andrew Morton, Linus Torvalds

On Wed, Apr 01, 2020 at 12:25:11PM +0200, Jan Kara wrote:
> >  - Applications must call statx to discover the current S_DAX state.
> > 
> >  - There exists an advisory file inode flag FS_XFLAG_DAX that can be
> >    changed on files that have no blocks allocated to them.  Changing
> >    this flag does not necessarily change the S_DAX state immediately
> >    but programs can query the S_DAX state via statx.
> 
> I generally like the proposal but I think the fact that toggling
> FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite some
> confusion and ultimately bug reports. I'm thinking whether we could somehow
> improve this... For example an ioctl that would try to make set inode flags
> effective by evicting the inode (and returning EBUSY if the eviction is
> impossible for some reason)?

I'd just return an error for that case, don't play silly games like
evicting the inode.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-02  8:53                   ` Christoph Hellwig
@ 2020-04-02 20:55                     ` Ira Weiny
  2020-04-03  7:27                       ` Christoph Hellwig
  0 siblings, 1 reply; 45+ messages in thread
From: Ira Weiny @ 2020-04-02 20:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Darrick J. Wong, Dave Chinner, Theodore Y. Ts'o,
	Dan Williams, Linux Kernel Mailing List, Alexander Viro,
	linux-ext4, linux-xfs, linux-fsdevel, Andrew Morton,
	Linus Torvalds

On Thu, Apr 02, 2020 at 10:53:27AM +0200, Christoph Hellwig wrote:
> On Wed, Apr 01, 2020 at 12:25:11PM +0200, Jan Kara wrote:
> > >  - Applications must call statx to discover the current S_DAX state.
> > > 
> > >  - There exists an advisory file inode flag FS_XFLAG_DAX that can be
> > >    changed on files that have no blocks allocated to them.  Changing
> > >    this flag does not necessarily change the S_DAX state immediately
> > >    but programs can query the S_DAX state via statx.
> > 
> > I generally like the proposal but I think the fact that toggling
> > FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite some
> > confusion and ultimately bug reports. I'm thinking whether we could somehow
> > improve this... For example an ioctl that would try to make set inode flags
> > effective by evicting the inode (and returning EBUSY if the eviction is
> > impossible for some reason)?
> 
> I'd just return an error for that case, don't play silly games like
> evicting the inode.

I think I agree with Christoph here.  But I want to clarify.  I was heading in
a direction of failing the ioctl completely.  But we could have the flag change
with an appropriate error which could let the user know the change has been
delayed.

But I don't immediately see what error code is appropriate for such an
indication.  Candidates I can envision:

EAGAIN
ERESTART
EUSERS
EINPROGRESS

None are perfect but I'm leaning toward EINPROGRESS.

Ira


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-01  4:00               ` Darrick J. Wong
  2020-04-01 10:25                 ` Jan Kara
@ 2020-04-03  4:39                 ` Ira Weiny
  1 sibling, 0 replies; 45+ messages in thread
From: Ira Weiny @ 2020-04-03  4:39 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Dave Chinner, Theodore Y. Ts'o,
	Dan Williams, Jan Kara, Linux Kernel Mailing List,
	Alexander Viro, linux-ext4, linux-xfs, linux-fsdevel,
	Andrew Morton, Linus Torvalds

On Wed, Apr 01, 2020 at 04:00:21AM +0000, Darrick J. Wong wrote:
> On Mon, Mar 16, 2020 at 10:55:09AM +0100, Christoph Hellwig wrote:
> > On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote:
> > > > This sounds reasonable to me.
> > > > 
> > > > As for deprecating the mount option, I think at a minimum it needs to
> > > > continue be accepted as an option even if it is ignored to not break
> > > > existing setups.
> > > 
> > > Agreed. But that's how we usually deprecate mount options. Also I'd say
> > > that statx() support for reporting DAX state and some education of
> > > programmers using DAX is required before we deprecate the mount option
> > > since currently applications check 'dax' mount option to determine how much
> > > memory they need to set aside for page cache before they consume everything
> > > else on the machine...
> > 
> > I don't even think we should deprecate it.  It isn't painful to maintain
> > and actually useful for testing.  Instead we should expand it into a
> > tristate:
> > 
> >   dax=off
> >   dax=flag
> >   dax=always
> > 
> > where the existing "dax" option maps to "dax=always" and nodax maps
> > to "dax=off". and dax=flag becomes the default for DAX capable devices.
> 
> That works for me.  In summary:
> 
>  - Applications must call statx to discover the current S_DAX state.
> 
>  - There exists an advisory file inode flag FS_XFLAG_DAX that can be
>    changed on files that have no blocks allocated to them.  Changing
>    this flag does not necessarily change the S_DAX state immediately
>    but programs can query the S_DAX state via statx.
> 
>    If FS_XFLAG_DAX is set and the fs is on pmem then it will always
>    enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
>    never enable S_DAX.  Unless overridden...
> 
>  - There exists a dax= mount option.  dax=off means "never set S_DAX,
>    ignore FS_XFLAG_DAX"; dax=always means "always set S_DAX (at least on
>    pmem), ignore FS_XFLAG_DAX"; and dax=iflag means "follow FS_XFLAG_DAX"
>    and is the default.  "dax" by itself means "dax=always".  "nodax"
>    means "dax=off".
> 
>  - There exists an advisory directory inode flag FS_XFLAG_DAX that can
>    be changed at any time.  The flag state is copied into any files or
>    subdirectories created within that directory.  If programs require
>    that file access runs in S_DAX mode, they'll have to create those
>    files themselves inside a directory with FS_XFLAG_DAX set, or mount
>    the fs with dax=always.

One other thing to add here.  They _can_ set the FS_XFLAG_DAX on a file with
data and force an eviction to get S_DAX to change.

I think that is a nice reason to have a different error code returned.

> 
> Ok?  Let's please get this part finished for 5.8, then we can get back
> to arguing about fs-rmap and reflink and dax and whatnot.

I'm happy to see you motivated to get this in.

I'm starting with a new xfstest to make sure we agree on the semantics prior to
more patches.  I hope to have the xfstest patch sent tomorrow sometime.

Ira

> 
> --D

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-02 20:55                     ` Ira Weiny
@ 2020-04-03  7:27                       ` Christoph Hellwig
  2020-04-03 15:48                         ` Ira Weiny
  2020-04-03 16:05                         ` Darrick J. Wong
  0 siblings, 2 replies; 45+ messages in thread
From: Christoph Hellwig @ 2020-04-03  7:27 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Christoph Hellwig, Jan Kara, Darrick J. Wong, Dave Chinner,
	Theodore Y. Ts'o, Dan Williams, Linux Kernel Mailing List,
	Alexander Viro, linux-ext4, linux-xfs, linux-fsdevel,
	Andrew Morton, Linus Torvalds

On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > I'd just return an error for that case, don't play silly games like
> > evicting the inode.
> 
> I think I agree with Christoph here.  But I want to clarify.  I was heading in
> a direction of failing the ioctl completely.  But we could have the flag change
> with an appropriate error which could let the user know the change has been
> delayed.
> 
> But I don't immediately see what error code is appropriate for such an
> indication.  Candidates I can envision:
> 
> EAGAIN
> ERESTART
> EUSERS
> EINPROGRESS
> 
> None are perfect but I'm leaning toward EINPROGRESS.

I really, really dislike that idea.  The whole point of not forcing
evictions is to make it clear - no this inode is "busy" you can't
do that.  A reasonably smart application can try to evict itself.

But returning an error and doing a lazy change anyway is straight from
the playbook for arcane and confusing API designs.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-03  7:27                       ` Christoph Hellwig
@ 2020-04-03 15:48                         ` Ira Weiny
  2020-04-03 17:03                           ` Jan Kara
  2020-04-03 16:05                         ` Darrick J. Wong
  1 sibling, 1 reply; 45+ messages in thread
From: Ira Weiny @ 2020-04-03 15:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Darrick J. Wong, Dave Chinner, Theodore Y. Ts'o,
	Dan Williams, Linux Kernel Mailing List, Alexander Viro,
	linux-ext4, linux-xfs, linux-fsdevel, Andrew Morton,
	Linus Torvalds

On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > I'd just return an error for that case, don't play silly games like
> > > evicting the inode.
> > 
> > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > a direction of failing the ioctl completely.  But we could have the flag change
> > with an appropriate error which could let the user know the change has been
> > delayed.
> > 
> > But I don't immediately see what error code is appropriate for such an
> > indication.  Candidates I can envision:
> > 
> > EAGAIN
> > ERESTART
> > EUSERS
> > EINPROGRESS
> > 
> > None are perfect but I'm leaning toward EINPROGRESS.
> 
> I really, really dislike that idea.  The whole point of not forcing
> evictions is to make it clear - no this inode is "busy" you can't
> do that.  A reasonably smart application can try to evict itself.

I don't understand.  What Darrick proposed would never need any evictions.  If
the file has blocks allocated the FS_XFLAG_DAX flag can not be changed.  So I
don't see what good eviction would do at all.

> 
> But returning an error and doing a lazy change anyway is straight from
> the playbook for arcane and confusing API designs.

Jan countered with a proposal that the FS_XFLAG_DAX does change with blocks
allocated.  But that S_DAX would change on eviction.  Adding that some eviction
ioctl could be added.

You then proposed just returning an error for that case.  (This lead me to
believe that you were ok with an eviction based change of S_DAX.)

So I agreed that changing S_DAX could be delayed until an explicit eviction.
But, to aid the 'smart application', a different error code could be used to
indicate that the FS_XFLAG_DAX had been changed but that until that explicit
eviction occurs S_DAX would remain.

So I don't fully follow what you mean by 'lazy change'?

Do you still really, really dislike an explicit eviction method for changing
the S_DAX flag?

If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
user wants to change the mode of operations on their 'data'; they would have to
create a new file with the proper setting and move the data there.  For example
copy the file into a directory marked FS_XFLAG_DAX==true?

I'm ok with either interface as I think both could be clear if documented.

Jan?

Ira


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-03  7:27                       ` Christoph Hellwig
  2020-04-03 15:48                         ` Ira Weiny
@ 2020-04-03 16:05                         ` Darrick J. Wong
  1 sibling, 0 replies; 45+ messages in thread
From: Darrick J. Wong @ 2020-04-03 16:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ira Weiny, Jan Kara, Dave Chinner, Theodore Y. Ts'o,
	Dan Williams, Linux Kernel Mailing List, Alexander Viro,
	linux-ext4, linux-xfs, linux-fsdevel, Andrew Morton,
	Linus Torvalds

On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > I'd just return an error for that case, don't play silly games like
> > > evicting the inode.
> > 
> > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > a direction of failing the ioctl completely.  But we could have the flag change
> > with an appropriate error which could let the user know the change has been
> > delayed.
> > 
> > But I don't immediately see what error code is appropriate for such an
> > indication.  Candidates I can envision:
> > 
> > EAGAIN
> > ERESTART
> > EUSERS
> > EINPROGRESS
> > 
> > None are perfect but I'm leaning toward EINPROGRESS.
> 
> I really, really dislike that idea.  The whole point of not forcing
> evictions is to make it clear - no this inode is "busy" you can't
> do that.  A reasonably smart application can try to evict itself.
> 
> But returning an error and doing a lazy change anyway is straight from
> the playbook for arcane and confusing API designs.

Agreed.  That's why I wrote that applications can set FS_XFLAG_DAX and
then query statx for STATX_ATTR_DAX to find out if it actually took
effect, and that if applications require it immediately they can either
create a file in a FS_XFLAG_DAX directory, or the admin can mount with
dax=always.  No magic return values required or desired anywhere.

I don't know what "try to evict the inode" magic means, but I'm fairly
sure I don't want to. ;)

--D

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-03 15:48                         ` Ira Weiny
@ 2020-04-03 17:03                           ` Jan Kara
  2020-04-03 18:18                             ` Ira Weiny
  2020-04-03 18:29                             ` Darrick J. Wong
  0 siblings, 2 replies; 45+ messages in thread
From: Jan Kara @ 2020-04-03 17:03 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Christoph Hellwig, Jan Kara, Darrick J. Wong, Dave Chinner,
	Theodore Y. Ts'o, Dan Williams, Linux Kernel Mailing List,
	Alexander Viro, linux-ext4, linux-xfs, linux-fsdevel,
	Andrew Morton, Linus Torvalds

On Fri 03-04-20 08:48:29, Ira Weiny wrote:
> On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > > I'd just return an error for that case, don't play silly games like
> > > > evicting the inode.
> > > 
> > > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > > a direction of failing the ioctl completely.  But we could have the flag change
> > > with an appropriate error which could let the user know the change has been
> > > delayed.
> > > 
> > > But I don't immediately see what error code is appropriate for such an
> > > indication.  Candidates I can envision:
> > > 
> > > EAGAIN
> > > ERESTART
> > > EUSERS
> > > EINPROGRESS
> > > 
> > > None are perfect but I'm leaning toward EINPROGRESS.
> > 
> > I really, really dislike that idea.  The whole point of not forcing
> > evictions is to make it clear - no this inode is "busy" you can't
> > do that.  A reasonably smart application can try to evict itself.
> 
> I don't understand.  What Darrick proposed would never need any
> evictions.  If the file has blocks allocated the FS_XFLAG_DAX flag can
> not be changed.  So I don't see what good eviction would do at all.

I guess there's some confusion here (may well be than on my side). Darrick
propose that we can switch FS_XFLAG_DAX only when file has no blocks
allocated - fine by me. But that still does not mean than we can switch
S_DAX immediately, does it? Because that would still mean we need to switch
aops on living inode and that's ... difficult and Christoph didn't want to
clutter the code with it.

So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag,
S_DAX flag will magically switch when inode gets evicted and the inode gets
reloaded from the disk again. Did I misunderstand anything?

And my thinking was that this is surprising behavior for the user and so it
will likely generate lots of bug reports along the lines of "DAX inode flag
does not work!". So I was pondering how to make the behavior less
confusing... The ioctl I've suggested was just a poor attempt at that.

> > But returning an error and doing a lazy change anyway is straight from
> > the playbook for arcane and confusing API designs.
> 
> Jan countered with a proposal that the FS_XFLAG_DAX does change with
> blocks allocated.  But that S_DAX would change on eviction.  Adding that
> some eviction ioctl could be added.

No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I
was still speaking about the case without blocks allocated.

> You then proposed just returning an error for that case.  (This lead me to
> believe that you were ok with an eviction based change of S_DAX.)
> 
> So I agreed that changing S_DAX could be delayed until an explicit eviction.
> But, to aid the 'smart application', a different error code could be used to
> indicate that the FS_XFLAG_DAX had been changed but that until that explicit
> eviction occurs S_DAX would remain.
> 
> So I don't fully follow what you mean by 'lazy change'?
> 
> Do you still really, really dislike an explicit eviction method for changing
> the S_DAX flag?
> 
> If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
> user wants to change the mode of operations on their 'data'; they would have to
> create a new file with the proper setting and move the data there.  For example
> copy the file into a directory marked FS_XFLAG_DAX==true?
> 
> I'm ok with either interface as I think both could be clear if documented.

I agree that what Darrick suggested is technically easily doable and can be
documented. But it is not natural behavior (i.e., different than all inode
flags we have) and we know how careful people are when reading
documentation...


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-03 17:03                           ` Jan Kara
@ 2020-04-03 18:18                             ` Ira Weiny
  2020-04-03 18:21                               ` Ira Weiny
                                                 ` (2 more replies)
  2020-04-03 18:29                             ` Darrick J. Wong
  1 sibling, 3 replies; 45+ messages in thread
From: Ira Weiny @ 2020-04-03 18:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner,
	Theodore Y. Ts'o, Dan Williams, Linux Kernel Mailing List,
	Alexander Viro, linux-ext4, linux-xfs, linux-fsdevel,
	Andrew Morton, Linus Torvalds

On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote:
> On Fri 03-04-20 08:48:29, Ira Weiny wrote:
> > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > > > I'd just return an error for that case, don't play silly games like
> > > > > evicting the inode.
> > > > 
> > > > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > > > a direction of failing the ioctl completely.  But we could have the flag change
> > > > with an appropriate error which could let the user know the change has been
> > > > delayed.
> > > > 
> > > > But I don't immediately see what error code is appropriate for such an
> > > > indication.  Candidates I can envision:
> > > > 
> > > > EAGAIN
> > > > ERESTART
> > > > EUSERS
> > > > EINPROGRESS
> > > > 
> > > > None are perfect but I'm leaning toward EINPROGRESS.
> > > 
> > > I really, really dislike that idea.  The whole point of not forcing
> > > evictions is to make it clear - no this inode is "busy" you can't
> > > do that.  A reasonably smart application can try to evict itself.
> > 
> > I don't understand.  What Darrick proposed would never need any
> > evictions.  If the file has blocks allocated the FS_XFLAG_DAX flag can
> > not be changed.  So I don't see what good eviction would do at all.
> 
> I guess there's some confusion here (may well be than on my side). Darrick
> propose that we can switch FS_XFLAG_DAX only when file has no blocks
> allocated - fine by me. But that still does not mean than we can switch
> S_DAX immediately, does it? Because that would still mean we need to switch
> aops on living inode and that's ... difficult and Christoph didn't want to
> clutter the code with it.
> 
> So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag,
> S_DAX flag will magically switch when inode gets evicted and the inode gets
> reloaded from the disk again. Did I misunderstand anything?
> 
> And my thinking was that this is surprising behavior for the user and so it
> will likely generate lots of bug reports along the lines of "DAX inode flag
> does not work!". So I was pondering how to make the behavior less
> confusing... The ioctl I've suggested was just a poor attempt at that.

Ok but then I don't understand Christophs comment to "just return an error for
that case"?  Which case?

> 
> > > But returning an error and doing a lazy change anyway is straight from
> > > the playbook for arcane and confusing API designs.
> > 
> > Jan countered with a proposal that the FS_XFLAG_DAX does change with
> > blocks allocated.  But that S_DAX would change on eviction.  Adding that
> > some eviction ioctl could be added.
> 
> No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I
> was still speaking about the case without blocks allocated.

Ah ok good point.  But again what 'error' do we return when FS_XFLAG_DAX
changed but S_DAX did not?

> 
> > You then proposed just returning an error for that case.  (This lead me to
> > believe that you were ok with an eviction based change of S_DAX.)
> > 
> > So I agreed that changing S_DAX could be delayed until an explicit eviction.
> > But, to aid the 'smart application', a different error code could be used to
> > indicate that the FS_XFLAG_DAX had been changed but that until that explicit
> > eviction occurs S_DAX would remain.
> > 
> > So I don't fully follow what you mean by 'lazy change'?
> > 
> > Do you still really, really dislike an explicit eviction method for changing
> > the S_DAX flag?
> > 
> > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
> > user wants to change the mode of operations on their 'data'; they would have to
> > create a new file with the proper setting and move the data there.  For example
> > copy the file into a directory marked FS_XFLAG_DAX==true?
> > 
> > I'm ok with either interface as I think both could be clear if documented.
> 
> I agree that what Darrick suggested is technically easily doable and can be
> documented. But it is not natural behavior (i.e., different than all inode
> flags we have) and we know how careful people are when reading
> documentation...
> 

Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_
_all_...

In summary:

 - Applications must call statx to discover the current S_DAX state.

 - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
   the parent directory FS_XFLAG_DAX inode flag.  (There is no way to change
   this flag after file creation.)

   If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
   inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
   Unless overridden...

 - There exists a dax= mount option.

   "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX"
   	"-o nodax" means "dax=off"
   "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
   	"-o dax" by itself means "dax=always"
   "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default

 - There exists an advisory directory inode flag FS_XFLAG_DAX that can be
   changed at any time.  The flag state is copied into any files or
   subdirectories when they are created within that directory.  If programs
   require file access runs in S_DAX mode, they'll have to create those files
   inside a directory with FS_XFLAG_DAX set, or mount the fs with an
   appropriate dax mount option.


???

Ira


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-03 18:18                             ` Ira Weiny
@ 2020-04-03 18:21                               ` Ira Weiny
  2020-04-03 18:37                               ` Darrick J. Wong
  2020-04-06 10:00                               ` Jan Kara
  2 siblings, 0 replies; 45+ messages in thread
From: Ira Weiny @ 2020-04-03 18:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner,
	Theodore Y. Ts'o, Dan Williams, Linux Kernel Mailing List,
	Alexander Viro, linux-ext4, linux-xfs, linux-fsdevel,
	Andrew Morton, Linus Torvalds

On Fri, Apr 03, 2020 at 11:18:43AM -0700, 'Ira Weiny' wrote:
[snip]

> 
> Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_
> _all_...

To be fair, I believe Christoph advocated for this at one time or another...

Ira

> 
> In summary:
> 
>  - Applications must call statx to discover the current S_DAX state.
> 
>  - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
>    the parent directory FS_XFLAG_DAX inode flag.  (There is no way to change
>    this flag after file creation.)
> 
>    If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
>    inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
>    Unless overridden...
> 
>  - There exists a dax= mount option.
> 
>    "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX"
>    	"-o nodax" means "dax=off"
>    "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
>    	"-o dax" by itself means "dax=always"
>    "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default
> 
>  - There exists an advisory directory inode flag FS_XFLAG_DAX that can be
>    changed at any time.  The flag state is copied into any files or
>    subdirectories when they are created within that directory.  If programs
>    require file access runs in S_DAX mode, they'll have to create those files
>    inside a directory with FS_XFLAG_DAX set, or mount the fs with an
>    appropriate dax mount option.
> 
> 
> ???
> 
> Ira
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-03 17:03                           ` Jan Kara
  2020-04-03 18:18                             ` Ira Weiny
@ 2020-04-03 18:29                             ` Darrick J. Wong
  1 sibling, 0 replies; 45+ messages in thread
From: Darrick J. Wong @ 2020-04-03 18:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ira Weiny, Christoph Hellwig, Dave Chinner, Theodore Y. Ts'o,
	Dan Williams, Linux Kernel Mailing List, Alexander Viro,
	linux-ext4, linux-xfs, linux-fsdevel, Andrew Morton,
	Linus Torvalds

On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote:
> On Fri 03-04-20 08:48:29, Ira Weiny wrote:
> > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > > > I'd just return an error for that case, don't play silly games like
> > > > > evicting the inode.
> > > > 
> > > > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > > > a direction of failing the ioctl completely.  But we could have the flag change
> > > > with an appropriate error which could let the user know the change has been
> > > > delayed.
> > > > 
> > > > But I don't immediately see what error code is appropriate for such an
> > > > indication.  Candidates I can envision:
> > > > 
> > > > EAGAIN
> > > > ERESTART
> > > > EUSERS
> > > > EINPROGRESS
> > > > 
> > > > None are perfect but I'm leaning toward EINPROGRESS.
> > > 
> > > I really, really dislike that idea.  The whole point of not forcing
> > > evictions is to make it clear - no this inode is "busy" you can't
> > > do that.  A reasonably smart application can try to evict itself.
> > 
> > I don't understand.  What Darrick proposed would never need any
> > evictions.  If the file has blocks allocated the FS_XFLAG_DAX flag can
> > not be changed.  So I don't see what good eviction would do at all.
> 
> I guess there's some confusion here (may well be than on my side). Darrick
> propose that we can switch FS_XFLAG_DAX only when file has no blocks
> allocated - fine by me. But that still does not mean than we can switch
> S_DAX immediately, does it? Because that would still mean we need to switch
> aops on living inode and that's ... difficult and Christoph didn't want to
> clutter the code with it.

IIRC, the reason Ira was trying to introduce this file operations lock
is because there isn't any other safe way to change the operations
dynamically, because some of those operations don't start taking any
locks at all until after we've accessed the file operations pointer.

Now that I think about it some more, that's also means we can't change
S_DAX even on files without blocks allocated.  Look at fallocate, it
doesn't even take i_rwsem until we've called into (say)
xfs_file_fallocate.

Ok, so with that in mind, I think we simply have to say that
FS_XFLAG_DAX is 100% advisory, and S_DAX will magically switch some time
in the future or after the next umount/mount cycle.

FSSETXATTR can't evict an inode it has just set FS_XFLAG_DAX on, because
it applies that change to the fd passed into ioctl(), which means that the
caller has a reference to the fd -> file -> inode, which means the inode
is unevictable until after the call completes.

IOWs, 

> So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag,
> S_DAX flag will magically switch when inode gets evicted and the inode gets
> reloaded from the disk again. Did I misunderstand anything?

That was /my/ understanding. :)

> And my thinking was that this is surprising behavior for the user and so it
> will likely generate lots of bug reports along the lines of "DAX inode flag
> does not work!". So I was pondering how to make the behavior less
> confusing... The ioctl I've suggested was just a poor attempt at that.

At best, userspace uses FSSETXATTR to change FS_XFLAG_DAX, and then
calls some as-yet-undefined ioctl to try to evict the inode from memory.
I'm not sure how you'd actually do that, though, considering that you'd
have to close the fd and as soon as that happens the inode can disappear
permanently.

Ok, that's not sane.  Forget I ever wrote that.

> > > But returning an error and doing a lazy change anyway is straight from
> > > the playbook for arcane and confusing API designs.
> > 
> > Jan countered with a proposal that the FS_XFLAG_DAX does change with
> > blocks allocated.  But that S_DAX would change on eviction.  Adding that
> > some eviction ioctl could be added.
> 
> No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I
> was still speaking about the case without blocks allocated.
> 
> > You then proposed just returning an error for that case.  (This lead me to
> > believe that you were ok with an eviction based change of S_DAX.)
> > 
> > So I agreed that changing S_DAX could be delayed until an explicit eviction.
> > But, to aid the 'smart application', a different error code could be used to
> > indicate that the FS_XFLAG_DAX had been changed but that until that explicit
> > eviction occurs S_DAX would remain.

There's no point in returning a magic error code from FSSETXATTR, as I
realized above.  To restate: FSSETXATTR can't evict the inode, so it
would always have to return the magic error code.  The best we can do is
tell userspace that they can set the advisory FS_XFLAG_DAX flag and then
check STATX_ATTR_DAX immediately afterwards.  If some day in the future
we get smarter and can change it immediately, the statx output will
reflect that.

> > So I don't fully follow what you mean by 'lazy change'?
> > 
> > Do you still really, really dislike an explicit eviction method for changing
> > the S_DAX flag?
> > 
> > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
> > user wants to change the mode of operations on their 'data'; they would have to
> > create a new file with the proper setting and move the data there.  For example
> > copy the file into a directory marked FS_XFLAG_DAX==true?
> > 
> > I'm ok with either interface as I think both could be clear if documented.
> 
> I agree that what Darrick suggested is technically easily doable and can be
> documented. But it is not natural behavior (i.e., different than all inode
> flags we have) and we know how careful people are when reading
> documentation...

To reflect all that I've rambled in this thread, I withdraw the previous
paragraph and submit this one for consideration:

 - There exists an advisory file inode flag FS_XFLAG_DAX.

   If FS_XFLAG_DAX is set and the fs is on pmem then it will always
   enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
   never enable S_DAX.  The advice can be overridden by mount option.

   Changing this flag does not necessarily change the S_DAX state
   immediately but programs can query the S_DAX state via statx to
   detect when the new advice has gone into effect.

Consider this from the perspective of minimizing changes to userspace
programs between now and the future.  If your program really wants
S_DAX, you can write:

	fd = open(...);

	ioctl(fd, FSGETXATTR, &fsx);

	if (fsx.xflags & FS_XFLAG_DAX)
		return 0;

	fsx.xflags |= FS_XFLAG_DAX;
	ioctl(fd, FSSETXATTR, &fsx);

	do {
		statx(fd, STATX_GIVE_ME_EVERYTHING, &statx);
		if (statx.attrs & STATX_ATTR_DAX)
			return 0;

		/* do stupid magic to evict things */
	} while (we haven't gotten bored and wandered away);

This code snippet will work with the current limitations of the kernel,
and it'll continue working even on Linux v5000 where we finally figure
out how to change S_DAX on the fly.

--D

> 
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-03 18:18                             ` Ira Weiny
  2020-04-03 18:21                               ` Ira Weiny
@ 2020-04-03 18:37                               ` Darrick J. Wong
  2020-04-05  6:19                                 ` Ira Weiny
  2020-04-06 10:00                               ` Jan Kara
  2 siblings, 1 reply; 45+ messages in thread
From: Darrick J. Wong @ 2020-04-03 18:37 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jan Kara, Christoph Hellwig, Dave Chinner, Theodore Y. Ts'o,
	Dan Williams, Linux Kernel Mailing List, Alexander Viro,
	linux-ext4, linux-xfs, linux-fsdevel, Andrew Morton,
	Linus Torvalds

On Fri, Apr 03, 2020 at 11:18:43AM -0700, Ira Weiny wrote:
> On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote:
> > On Fri 03-04-20 08:48:29, Ira Weiny wrote:
> > > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> > > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > > > > I'd just return an error for that case, don't play silly games like
> > > > > > evicting the inode.
> > > > > 
> > > > > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > > > > a direction of failing the ioctl completely.  But we could have the flag change
> > > > > with an appropriate error which could let the user know the change has been
> > > > > delayed.
> > > > > 
> > > > > But I don't immediately see what error code is appropriate for such an
> > > > > indication.  Candidates I can envision:
> > > > > 
> > > > > EAGAIN
> > > > > ERESTART
> > > > > EUSERS
> > > > > EINPROGRESS
> > > > > 
> > > > > None are perfect but I'm leaning toward EINPROGRESS.
> > > > 
> > > > I really, really dislike that idea.  The whole point of not forcing
> > > > evictions is to make it clear - no this inode is "busy" you can't
> > > > do that.  A reasonably smart application can try to evict itself.
> > > 
> > > I don't understand.  What Darrick proposed would never need any
> > > evictions.  If the file has blocks allocated the FS_XFLAG_DAX flag can
> > > not be changed.  So I don't see what good eviction would do at all.
> > 
> > I guess there's some confusion here (may well be than on my side). Darrick
> > propose that we can switch FS_XFLAG_DAX only when file has no blocks
> > allocated - fine by me. But that still does not mean than we can switch
> > S_DAX immediately, does it? Because that would still mean we need to switch
> > aops on living inode and that's ... difficult and Christoph didn't want to
> > clutter the code with it.
> > 
> > So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag,
> > S_DAX flag will magically switch when inode gets evicted and the inode gets
> > reloaded from the disk again. Did I misunderstand anything?
> > 
> > And my thinking was that this is surprising behavior for the user and so it
> > will likely generate lots of bug reports along the lines of "DAX inode flag
> > does not work!". So I was pondering how to make the behavior less
> > confusing... The ioctl I've suggested was just a poor attempt at that.
> 
> Ok but then I don't understand Christophs comment to "just return an error for
> that case"?  Which case?
> 
> > 
> > > > But returning an error and doing a lazy change anyway is straight from
> > > > the playbook for arcane and confusing API designs.
> > > 
> > > Jan countered with a proposal that the FS_XFLAG_DAX does change with
> > > blocks allocated.  But that S_DAX would change on eviction.  Adding that
> > > some eviction ioctl could be added.
> > 
> > No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I
> > was still speaking about the case without blocks allocated.
> 
> Ah ok good point.  But again what 'error' do we return when FS_XFLAG_DAX
> changed but S_DAX did not?
> 
> > 
> > > You then proposed just returning an error for that case.  (This lead me to
> > > believe that you were ok with an eviction based change of S_DAX.)
> > > 
> > > So I agreed that changing S_DAX could be delayed until an explicit eviction.
> > > But, to aid the 'smart application', a different error code could be used to
> > > indicate that the FS_XFLAG_DAX had been changed but that until that explicit
> > > eviction occurs S_DAX would remain.
> > > 
> > > So I don't fully follow what you mean by 'lazy change'?
> > > 
> > > Do you still really, really dislike an explicit eviction method for changing
> > > the S_DAX flag?
> > > 
> > > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
> > > user wants to change the mode of operations on their 'data'; they would have to
> > > create a new file with the proper setting and move the data there.  For example
> > > copy the file into a directory marked FS_XFLAG_DAX==true?
> > > 
> > > I'm ok with either interface as I think both could be clear if documented.
> > 
> > I agree that what Darrick suggested is technically easily doable and can be
> > documented. But it is not natural behavior (i.e., different than all inode
> > flags we have) and we know how careful people are when reading
> > documentation...
> > 
> 
> Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_
> _all_...
> 
> In summary:
> 
>  - Applications must call statx to discover the current S_DAX state.

Ok.

>  - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
>    the parent directory FS_XFLAG_DAX inode flag.  (There is no way to change
>    this flag after file creation.)
> 
>    If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
>    inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
>    Unless overridden...

Ok, fine with me. :)

>  - There exists a dax= mount option.
> 
>    "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX"
>    	"-o nodax" means "dax=off"

I surveyed the three fses that support dax and found that none of the
filesystems actually have a 'nodax' flag.  Now would be the time not to
add such a thing, and make people specify dax=off instead.  It would
be handy if we could have a single fsparam_enum for figuring out the dax
mount options.

>    "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
>    	"-o dax" by itself means "dax=always"
>    "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default
> 
>  - There exists an advisory directory inode flag FS_XFLAG_DAX that can be
>    changed at any time.  The flag state is copied into any files or
>    subdirectories when they are created within that directory.  If programs
>    require file access runs in S_DAX mode, they'll have to create those files

"...they must create..."

>    inside a directory with FS_XFLAG_DAX set, or mount the fs with an
>    appropriate dax mount option.

Otherwise seems ok to me.

--D

> 
> 
> ???
> 
> Ira
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-03 18:37                               ` Darrick J. Wong
@ 2020-04-05  6:19                                 ` Ira Weiny
  0 siblings, 0 replies; 45+ messages in thread
From: Ira Weiny @ 2020-04-05  6:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, Christoph Hellwig, Dave Chinner, Theodore Y. Ts'o,
	Dan Williams, Linux Kernel Mailing List, Alexander Viro,
	linux-ext4, linux-xfs, linux-fsdevel, Andrew Morton,
	Linus Torvalds

> > 
> > In summary:
> > 
> >  - Applications must call statx to discover the current S_DAX state.
> 
> Ok.
> 
> >  - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
> >    the parent directory FS_XFLAG_DAX inode flag.  (There is no way to change
> >    this flag after file creation.)
> > 
> >    If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
> >    inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
> >    Unless overridden...
> 
> Ok, fine with me. :)

:-D

> 
> >  - There exists a dax= mount option.
> > 
> >    "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX"
> >    	"-o nodax" means "dax=off"
> 
> I surveyed the three fses that support dax and found that none of the
> filesystems actually have a 'nodax' flag.  Now would be the time not to
> add such a thing, and make people specify dax=off instead.  It would
> be handy if we could have a single fsparam_enum for figuring out the dax
> mount options.

yes good point.

I'm working on updating the documentation patch and I think this might also
be better as:

	-o dax=never

Which is the opposite of 'always'.

> 
> >    "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
> >    	"-o dax" by itself means "dax=always"
> >    "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default
> > 
> >  - There exists an advisory directory inode flag FS_XFLAG_DAX that can be
> >    changed at any time.  The flag state is copied into any files or
> >    subdirectories when they are created within that directory.  If programs
> >    require file access runs in S_DAX mode, they'll have to create those files
> 
> "...they must create..."

yes

> 
> >    inside a directory with FS_XFLAG_DAX set, or mount the fs with an
> >    appropriate dax mount option.
> 
> Otherwise seems ok to me.

Thanks!
Ira


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
  2020-04-03 18:18                             ` Ira Weiny
  2020-04-03 18:21                               ` Ira Weiny
  2020-04-03 18:37                               ` Darrick J. Wong
@ 2020-04-06 10:00                               ` Jan Kara
  2 siblings, 0 replies; 45+ messages in thread
From: Jan Kara @ 2020-04-06 10:00 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Jan Kara, Christoph Hellwig, Darrick J. Wong, Dave Chinner,
	Theodore Y. Ts'o, Dan Williams, Linux Kernel Mailing List,
	Alexander Viro, linux-ext4, linux-xfs, linux-fsdevel,
	Andrew Morton, Linus Torvalds

On Fri 03-04-20 11:18:43, Ira Weiny wrote:
> Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_
> _all_...
> 
> In summary:
> 
>  - Applications must call statx to discover the current S_DAX state.
> 
>  - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
>    the parent directory FS_XFLAG_DAX inode flag.  (There is no way to change
>    this flag after file creation.)
> 
>    If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
>    inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
>    Unless overridden...

OK, after considering all the options we were hashing out here, I think
this is the best API. There isn't the confusing "S_DAX will magically
switch on inode eviction" and although the functionality is limited, I
think 90% of users would end up using the functionality like this anyway.

>  - There exists a dax= mount option.
> 
>    "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX"
>    	"-o nodax" means "dax=off"
>    "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
>    	"-o dax" by itself means "dax=always"
>    "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default
> 
>  - There exists an advisory directory inode flag FS_XFLAG_DAX that can be
>    changed at any time.  The flag state is copied into any files or
>    subdirectories when they are created within that directory.  If programs
>    require file access runs in S_DAX mode, they'll have to create those files
>    inside a directory with FS_XFLAG_DAX set, or mount the fs with an
>    appropriate dax mount option.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, back to index

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
2020-02-27  5:24 ` [PATCH V5 01/12] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
2020-02-27 17:25   ` Ira Weiny
2020-02-27  5:24 ` [PATCH V5 02/12] fs: Remove unneeded IS_DAX() check ira.weiny
2020-02-27  5:24 ` [PATCH V5 03/12] fs/stat: Define DAX statx attribute ira.weiny
2020-02-27  5:24 ` [PATCH V5 04/12] fs/xfs: Isolate the physical DAX flag from enabled ira.weiny
2020-02-27  5:24 ` [PATCH V5 05/12] fs/xfs: Create function xfs_inode_enable_dax() ira.weiny
2020-03-01 22:37   ` Dave Chinner
2020-02-27  5:24 ` [PATCH V5 06/12] fs: Add locking for a dynamic address space operations state ira.weiny
2020-03-02  1:26   ` Dave Chinner
2020-03-02  1:36     ` Dave Chinner
2020-02-27  5:24 ` [PATCH V5 07/12] fs: Prevent DAX state change if file is mmap'ed ira.weiny
2020-02-27  5:24 ` [PATCH V5 08/12] fs/xfs: Hold off aops users while changing DAX state ira.weiny
2020-02-27  5:24 ` [PATCH V5 09/12] fs/xfs: Clean up locking in dax invalidate ira.weiny
2020-02-27  5:24 ` [PATCH V5 10/12] fs/xfs: Allow toggle of effective DAX flag ira.weiny
2020-02-27  5:24 ` [PATCH V5 11/12] fs/xfs: Remove xfs_diflags_to_linux() ira.weiny
2020-02-27  5:24 ` [PATCH V5 12/12] Documentation/dax: Update Usage section ira.weiny
2020-03-05 15:51 ` [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 Christoph Hellwig
2020-03-09 17:04   ` Ira Weiny
2020-03-11  3:36     ` Darrick J. Wong
2020-03-11  6:29       ` Christoph Hellwig
2020-03-11 17:07         ` Dan Williams
2020-03-16  9:52           ` Jan Kara
2020-03-16  9:55             ` Christoph Hellwig
2020-04-01  4:00               ` Darrick J. Wong
2020-04-01 10:25                 ` Jan Kara
2020-04-02  8:53                   ` Christoph Hellwig
2020-04-02 20:55                     ` Ira Weiny
2020-04-03  7:27                       ` Christoph Hellwig
2020-04-03 15:48                         ` Ira Weiny
2020-04-03 17:03                           ` Jan Kara
2020-04-03 18:18                             ` Ira Weiny
2020-04-03 18:21                               ` Ira Weiny
2020-04-03 18:37                               ` Darrick J. Wong
2020-04-05  6:19                                 ` Ira Weiny
2020-04-06 10:00                               ` Jan Kara
2020-04-03 18:29                             ` Darrick J. Wong
2020-04-03 16:05                         ` Darrick J. Wong
2020-04-03  4:39                 ` Ira Weiny
2020-03-11  6:39       ` Dave Chinner
2020-03-11  6:44         ` Christoph Hellwig
2020-03-11 17:07           ` Dan Williams
2020-03-12  0:49           ` Dave Chinner
2020-03-12  3:00             ` Darrick J. Wong
2020-03-12  7:27             ` Christoph Hellwig

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git