nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] re-enable XFS per-inode DAX
@ 2017-09-25 23:13 Ross Zwisler
  2017-09-25 23:13 ` [PATCH 1/7] xfs: always use DAX if mount option is used Ross Zwisler
                   ` (6 more replies)
  0 siblings, 7 replies; 46+ messages in thread
From: Ross Zwisler @ 2017-09-25 23:13 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm,
	linux-fsdevel, linux-xfs

This series does the work needed to safely re-enable the XFS per-inode DAX
flag.  This includes fixes to make use of the DAX inode flag more safe and
consistent, fixes to the read and write I/O path locking to make S_DAX
transitions safe, and some code that prevents the DAX inode flag from
transitioning when any mappings are set up.

This series has passed my fstests regression testing both with and without
DAX, and it also passes Christoph's regression test for the inode flag:

https://www.spinics.net/lists/linux-xfs/msg10124.html

My goal is to get feedback on this approach and on the XFS implementation,
and then to do a similar implementation for ext4 based on my previous ext4
DAX inode flag patches:

https://patchwork.kernel.org/patch/9939743/

These patches apply cleanly to v4.14-rc2.

Ross Zwisler (7):
  xfs: always use DAX if mount option is used
  xfs: validate bdev support for DAX inode flag
  xfs: protect S_DAX transitions in XFS read path
  xfs: protect S_DAX transitions in XFS write path
  xfs: introduce xfs_is_dax_state_changing
  mm, fs: introduce file_operations->post_mmap()
  xfs: re-enable XFS per-inode DAX

 fs/xfs/xfs_file.c  | 172 ++++++++++++++++++++++-------------------------------
 fs/xfs/xfs_ioctl.c |  47 ++++++++++++---
 include/linux/fs.h |   1 +
 mm/mmap.c          |   2 +
 4 files changed, 114 insertions(+), 108 deletions(-)

-- 
2.9.5

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-25 23:13 [PATCH 0/7] re-enable XFS per-inode DAX Ross Zwisler
@ 2017-09-25 23:13 ` Ross Zwisler
  2017-09-25 23:38   ` Dave Chinner
  2017-09-25 23:13 ` [PATCH 2/7] xfs: validate bdev support for DAX inode flag Ross Zwisler
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2017-09-25 23:13 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm,
	linux-fsdevel, linux-xfs

Before support for the per-inode DAX flag was disabled the XFS the code had
an issue where the user couldn't reliably tell whether or not DAX was being
used to service page faults and I/O when the DAX mount option was used.  In
this case each inode within the mounted filesystem started with S_DAX set
due to the mount option, but it could be cleared if someone touched the
individual inode flag.

For example (v4.13 and before):

  # mount | grep dax
  /dev/pmem0 on /mnt type xfs
  (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)

  # touch /mnt/a /mnt/b   # both files currently use DAX

  # xfs_io -c "lsattr" /mnt/*  # neither has the DAX inode option set
  ----------e----- /mnt/a
  ----------e----- /mnt/b

  # xfs_io -c "chattr -x" /mnt/a  # this clears S_DAX for /mnt/a

  # xfs_io -c "lsattr" /mnt/*
  ----------e----- /mnt/a
  ----------e----- /mnt/b

We end up with both /mnt/a and /mnt/b looking identical from the point of
view of the mount option and from lsattr, but one is using DAX and the
other is not.

Fix this by always doing DAX I/O when either the mount option is set or
when the DAX inode flag is set.  This means that DAX will always be used
for all inodes on a filesystem mounted with -o dax, making the usage
reliable and detectable.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_ioctl.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 5049e8a..26faeb9 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1013,7 +1013,7 @@ xfs_diflags_to_linux(
 	else
 		inode->i_flags &= ~S_NOATIME;
 #if 0	/* disabled until the flag switching races are sorted out */
-	if (xflags & FS_XFLAG_DAX)
+	if ((xflags & FS_XFLAG_DAX) || (ip->i_mount->m_flags & XFS_MOUNT_DAX))
 		inode->i_flags |= S_DAX;
 	else
 		inode->i_flags &= ~S_DAX;
@@ -1104,7 +1104,14 @@ xfs_ioctl_setattr_dax_invalidate(
 			return -EINVAL;
 	}
 
-	/* If the DAX state is not changing, we have nothing to do here. */
+	/*
+	 * If the DAX state is not changing, we have nothing to do here.  If
+	 * the DAX mount option was used we will update the DAX inode flag as
+	 * the user requested but we will continue to use DAX for I/O and page
+	 * faults regardless of how the inode flag is set.
+	 */
+	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
+		return 0;
 	if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
 		return 0;
 	if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
-- 
2.9.5

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 2/7] xfs: validate bdev support for DAX inode flag
  2017-09-25 23:13 [PATCH 0/7] re-enable XFS per-inode DAX Ross Zwisler
  2017-09-25 23:13 ` [PATCH 1/7] xfs: always use DAX if mount option is used Ross Zwisler
@ 2017-09-25 23:13 ` Ross Zwisler
  2017-09-26  6:36   ` Christoph Hellwig
  2017-09-25 23:14 ` [PATCH 3/7] xfs: protect S_DAX transitions in XFS read path Ross Zwisler
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2017-09-25 23:13 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm,
	linux-fsdevel, linux-xfs

Currently only the blocksize is checked, but we should really be calling
bdev_dax_supported() which also tests to make sure we can get a
struct dax_device and that the dax_direct_access() path is working.

This is the same check that we do for the "-o dax" mount option in
xfs_fs_fill_super().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 26faeb9..0433aef 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1088,6 +1088,7 @@ xfs_ioctl_setattr_dax_invalidate(
 	int			*join_flags)
 {
 	struct inode		*inode = VFS_I(ip);
+	struct super_block	*sb = inode->i_sb;
 	int			error;
 
 	*join_flags = 0;
@@ -1100,7 +1101,7 @@ xfs_ioctl_setattr_dax_invalidate(
 	if (fa->fsx_xflags & FS_XFLAG_DAX) {
 		if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
 			return -EINVAL;
-		if (ip->i_mount->m_sb.sb_blocksize != PAGE_SIZE)
+		if (bdev_dax_supported(sb, sb->s_blocksize) < 0)
 			return -EINVAL;
 	}
 
-- 
2.9.5

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 3/7] xfs: protect S_DAX transitions in XFS read path
  2017-09-25 23:13 [PATCH 0/7] re-enable XFS per-inode DAX Ross Zwisler
  2017-09-25 23:13 ` [PATCH 1/7] xfs: always use DAX if mount option is used Ross Zwisler
  2017-09-25 23:13 ` [PATCH 2/7] xfs: validate bdev support for DAX inode flag Ross Zwisler
@ 2017-09-25 23:14 ` Ross Zwisler
  2017-09-25 23:27   ` Dave Chinner
  2017-09-26  6:32   ` Christoph Hellwig
  2017-09-25 23:14 ` [PATCH 4/7] xfs: protect S_DAX transitions in XFS write path Ross Zwisler
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2017-09-25 23:14 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm,
	linux-fsdevel, linux-xfs

In the current XFS read I/O path we check IS_DAX() in xfs_file_read_iter()
to decide whether to do DAX I/O, direct I/O or buffered I/O.  This check is
done without holding the XFS_IOLOCK, though, which means that if we allow
S_DAX to be manipulated via the inode flag we can run into this race:

CPU 0				CPU 1
-----				-----
xfs_file_read_iter()
  IS_DAX() << returns false
  				xfs_ioctl_setattr()
				  xfs_ioctl_setattr_dax_invalidate()
				   xfs_ilock(XFS_MMAPLOCK|XFS_IOLOCK)
				  sets S_DAX
				  releases XFS_MMAPLOCK and XFS_IOLOCK
  xfs_file_buffered_aio_read()
  does buffered I/O to DAX inode, death

Fix this by ensuring that we only check S_DAX when we hold the XFS_IOLOCK
in the read path.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_file.c | 42 +++++++++++++-----------------------------
 1 file changed, 13 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ebdd0bd..ca4c8fd 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -207,7 +207,6 @@ xfs_file_dio_aio_read(
 {
 	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
 	size_t			count = iov_iter_count(to);
-	ssize_t			ret;
 
 	trace_xfs_file_direct_read(ip, count, iocb->ki_pos);
 
@@ -215,12 +214,7 @@ xfs_file_dio_aio_read(
 		return 0; /* skip atime */
 
 	file_accessed(iocb->ki_filp);
-
-	xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	ret = iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
-	return ret;
+	return iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);
 }
 
 static noinline ssize_t
@@ -230,23 +224,14 @@ xfs_file_dax_read(
 {
 	struct xfs_inode	*ip = XFS_I(iocb->ki_filp->f_mapping->host);
 	size_t			count = iov_iter_count(to);
-	ssize_t			ret = 0;
 
 	trace_xfs_file_dax_read(ip, count, iocb->ki_pos);
 
 	if (!count)
 		return 0; /* skip atime */
 
-	if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
-			return -EAGAIN;
-		xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	}
-	ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
 	file_accessed(iocb->ki_filp);
-	return ret;
+	return dax_iomap_rw(iocb, to, &xfs_iomap_ops);
 }
 
 STATIC ssize_t
@@ -255,19 +240,9 @@ xfs_file_buffered_aio_read(
 	struct iov_iter		*to)
 {
 	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
-	ssize_t			ret;
 
 	trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos);
-
-	if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
-			return -EAGAIN;
-		xfs_ilock(ip, XFS_IOLOCK_SHARED);
-	}
-	ret = generic_file_read_iter(iocb, to);
-	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
-	return ret;
+	return generic_file_read_iter(iocb, to);
 }
 
 STATIC ssize_t
@@ -276,7 +251,8 @@ xfs_file_read_iter(
 	struct iov_iter		*to)
 {
 	struct inode		*inode = file_inode(iocb->ki_filp);
-	struct xfs_mount	*mp = XFS_I(inode)->i_mount;
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct xfs_mount	*mp = ip->i_mount;
 	ssize_t			ret = 0;
 
 	XFS_STATS_INC(mp, xs_read_calls);
@@ -284,6 +260,12 @@ xfs_file_read_iter(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		xfs_ilock(ip, XFS_IOLOCK_SHARED);
+	}
+
 	if (IS_DAX(inode))
 		ret = xfs_file_dax_read(iocb, to);
 	else if (iocb->ki_flags & IOCB_DIRECT)
@@ -291,6 +273,8 @@ xfs_file_read_iter(
 	else
 		ret = xfs_file_buffered_aio_read(iocb, to);
 
+	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+
 	if (ret > 0)
 		XFS_STATS_ADD(mp, xs_read_bytes, ret);
 	return ret;
-- 
2.9.5

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 4/7] xfs: protect S_DAX transitions in XFS write path
  2017-09-25 23:13 [PATCH 0/7] re-enable XFS per-inode DAX Ross Zwisler
                   ` (2 preceding siblings ...)
  2017-09-25 23:14 ` [PATCH 3/7] xfs: protect S_DAX transitions in XFS read path Ross Zwisler
@ 2017-09-25 23:14 ` Ross Zwisler
  2017-09-25 23:29   ` Dave Chinner
  2017-09-25 23:14 ` [PATCH 5/7] xfs: introduce xfs_is_dax_state_changing Ross Zwisler
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2017-09-25 23:14 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm,
	linux-fsdevel, linux-xfs

In the current XFS write I/O path we check IS_DAX() in
xfs_file_write_iter() to decide whether to do DAX I/O, direct I/O or
buffered I/O.  This check is done without holding the XFS_IOLOCK, though,
which means that if we allow S_DAX to be manipulated via the inode flag we
can run into this race:

CPU 0                           CPU 1
-----                           -----
xfs_file_write_iter()
  IS_DAX() << returns false
			    xfs_ioctl_setattr()
			      xfs_ioctl_setattr_dax_invalidate()
			       xfs_ilock(XFS_MMAPLOCK|XFS_IOLOCK)
			      sets S_DAX
			      releases XFS_MMAPLOCK and XFS_IOLOCK
  xfs_file_buffered_aio_write()
  does buffered I/O to DAX inode, death

Fix this by ensuring that we only check S_DAX when we hold the XFS_IOLOCK
in the write path.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_file.c | 115 +++++++++++++++++++++---------------------------------
 1 file changed, 44 insertions(+), 71 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ca4c8fd..2816858 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -308,9 +308,7 @@ xfs_zero_eof(
 /*
  * Common pre-write limit and setup checks.
  *
- * Called with the iolocked held either shared and exclusive according to
- * @iolock, and returns with it held.  Might upgrade the iolock to exclusive
- * if called for a direct write beyond i_size.
+ * Called with the iolock held in exclusive mode.
  */
 STATIC ssize_t
 xfs_file_aio_write_checks(
@@ -322,7 +320,6 @@ xfs_file_aio_write_checks(
 	struct inode		*inode = file->f_mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			error = 0;
-	size_t			count = iov_iter_count(from);
 	bool			drained_dio = false;
 
 restart:
@@ -335,21 +332,9 @@ xfs_file_aio_write_checks(
 		return error;
 
 	/*
-	 * For changing security info in file_remove_privs() we need i_rwsem
-	 * exclusively.
-	 */
-	if (*iolock == XFS_IOLOCK_SHARED && !IS_NOSEC(inode)) {
-		xfs_iunlock(ip, *iolock);
-		*iolock = XFS_IOLOCK_EXCL;
-		xfs_ilock(ip, *iolock);
-		goto restart;
-	}
-	/*
 	 * If the offset is beyond the size of the file, we need to zero any
 	 * blocks that fall between the existing EOF and the start of this
-	 * write.  If zeroing is needed and we are currently holding the
-	 * iolock shared, we need to update it to exclusive which implies
-	 * having to redo all checks before.
+	 * write.
 	 *
 	 * We need to serialise against EOF updates that occur in IO
 	 * completions here. We want to make sure that nobody is changing the
@@ -365,12 +350,6 @@ xfs_file_aio_write_checks(
 
 		spin_unlock(&ip->i_flags_lock);
 		if (!drained_dio) {
-			if (*iolock == XFS_IOLOCK_SHARED) {
-				xfs_iunlock(ip, *iolock);
-				*iolock = XFS_IOLOCK_EXCL;
-				xfs_ilock(ip, *iolock);
-				iov_iter_reexpand(from, count);
-			}
 			/*
 			 * We now have an IO submission barrier in place, but
 			 * AIO can do EOF updates during IO completion and hence
@@ -491,7 +470,8 @@ xfs_dio_write_end_io(
 STATIC ssize_t
 xfs_file_dio_aio_write(
 	struct kiocb		*iocb,
-	struct iov_iter		*from)
+	struct iov_iter		*from,
+	int			*iolock)
 {
 	struct file		*file = iocb->ki_filp;
 	struct address_space	*mapping = file->f_mapping;
@@ -500,7 +480,6 @@ xfs_file_dio_aio_write(
 	struct xfs_mount	*mp = ip->i_mount;
 	ssize_t			ret = 0;
 	int			unaligned_io = 0;
-	int			iolock;
 	size_t			count = iov_iter_count(from);
 	struct xfs_buftarg      *target = XFS_IS_REALTIME_INODE(ip) ?
 					mp->m_rtdev_targp : mp->m_ddev_targp;
@@ -510,11 +489,12 @@ xfs_file_dio_aio_write(
 		return -EINVAL;
 
 	/*
-	 * Don't take the exclusive iolock here unless the I/O is unaligned to
-	 * the file system block size.  We don't need to consider the EOF
-	 * extension case here because xfs_file_aio_write_checks() will relock
-	 * the inode as necessary for EOF zeroing cases and fill out the new
-	 * inode size as appropriate.
+	 * We hold the exclusive iolock via our caller.  After the common
+	 * write checks we will demote it to a shared iolock unless the I/O is
+	 * unaligned to the file system block size.  We don't need to consider
+	 * the EOF extension case here because xfs_file_aio_write_checks()
+	 * will deal with EOF zeroing cases and fill out the new inode size as
+	 * appropriate.
 	 */
 	if ((iocb->ki_pos & mp->m_blockmask) ||
 	    ((iocb->ki_pos + count) & mp->m_blockmask)) {
@@ -528,26 +508,17 @@ xfs_file_dio_aio_write(
 			trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count);
 			return -EREMCHG;
 		}
-		iolock = XFS_IOLOCK_EXCL;
-	} else {
-		iolock = XFS_IOLOCK_SHARED;
 	}
 
-	if (!xfs_ilock_nowait(ip, iolock)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
-			return -EAGAIN;
-		xfs_ilock(ip, iolock);
-	}
 
-	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
+	ret = xfs_file_aio_write_checks(iocb, from, iolock);
 	if (ret)
 		goto out;
 	count = iov_iter_count(from);
 
 	/*
 	 * If we are doing unaligned IO, wait for all other IO to drain,
-	 * otherwise demote the lock if we had to take the exclusive lock
-	 * for other reasons in xfs_file_aio_write_checks.
+	 * otherwise demote the lock.
 	 */
 	if (unaligned_io) {
 		/* If we are going to wait for other DIO to finish, bail */
@@ -557,15 +528,14 @@ xfs_file_dio_aio_write(
 		} else {
 			inode_dio_wait(inode);
 		}
-	} else if (iolock == XFS_IOLOCK_EXCL) {
+	} else if (*iolock == XFS_IOLOCK_EXCL) {
 		xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
-		iolock = XFS_IOLOCK_SHARED;
+		*iolock = XFS_IOLOCK_SHARED;
 	}
 
 	trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
 	ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
 out:
-	xfs_iunlock(ip, iolock);
 
 	/*
 	 * No fallback to buffered IO on errors for XFS, direct IO will either
@@ -578,22 +548,16 @@ xfs_file_dio_aio_write(
 static noinline ssize_t
 xfs_file_dax_write(
 	struct kiocb		*iocb,
-	struct iov_iter		*from)
+	struct iov_iter		*from,
+	int			*iolock)
 {
 	struct inode		*inode = iocb->ki_filp->f_mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
-	int			iolock = XFS_IOLOCK_EXCL;
 	ssize_t			ret, error = 0;
 	size_t			count;
 	loff_t			pos;
 
-	if (!xfs_ilock_nowait(ip, iolock)) {
-		if (iocb->ki_flags & IOCB_NOWAIT)
-			return -EAGAIN;
-		xfs_ilock(ip, iolock);
-	}
-
-	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
+	ret = xfs_file_aio_write_checks(iocb, from, iolock);
 	if (ret)
 		goto out;
 
@@ -607,14 +571,14 @@ xfs_file_dax_write(
 		error = xfs_setfilesize(ip, pos, ret);
 	}
 out:
-	xfs_iunlock(ip, iolock);
 	return error ? error : ret;
 }
 
 STATIC ssize_t
 xfs_file_buffered_aio_write(
 	struct kiocb		*iocb,
-	struct iov_iter		*from)
+	struct iov_iter		*from,
+	int			*iolock)
 {
 	struct file		*file = iocb->ki_filp;
 	struct address_space	*mapping = file->f_mapping;
@@ -622,16 +586,12 @@ xfs_file_buffered_aio_write(
 	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			ret;
 	int			enospc = 0;
-	int			iolock;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		return -EOPNOTSUPP;
 
 write_retry:
-	iolock = XFS_IOLOCK_EXCL;
-	xfs_ilock(ip, iolock);
-
-	ret = xfs_file_aio_write_checks(iocb, from, &iolock);
+	ret = xfs_file_aio_write_checks(iocb, from, iolock);
 	if (ret)
 		goto out;
 
@@ -653,32 +613,35 @@ xfs_file_buffered_aio_write(
 	 * running at the same time.
 	 */
 	if (ret == -EDQUOT && !enospc) {
-		xfs_iunlock(ip, iolock);
+		xfs_iunlock(ip, *iolock);
 		enospc = xfs_inode_free_quota_eofblocks(ip);
 		if (enospc)
-			goto write_retry;
+			goto lock_retry;
 		enospc = xfs_inode_free_quota_cowblocks(ip);
 		if (enospc)
-			goto write_retry;
-		iolock = 0;
+			goto lock_retry;
+		*iolock = 0;
 	} else if (ret == -ENOSPC && !enospc) {
 		struct xfs_eofblocks eofb = {0};
 
 		enospc = 1;
 		xfs_flush_inodes(ip->i_mount);
 
-		xfs_iunlock(ip, iolock);
+		xfs_iunlock(ip, *iolock);
 		eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
 		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
 		xfs_icache_free_cowblocks(ip->i_mount, &eofb);
-		goto write_retry;
+		goto lock_retry;
 	}
 
 	current->backing_dev_info = NULL;
 out:
-	if (iolock)
-		xfs_iunlock(ip, iolock);
 	return ret;
+lock_retry:
+	xfs_ilock(ip, *iolock);
+	if (IS_DAX(inode))
+		return -EAGAIN;
+	goto write_retry;
 }
 
 STATIC ssize_t
@@ -692,6 +655,7 @@ xfs_file_write_iter(
 	struct xfs_inode	*ip = XFS_I(inode);
 	ssize_t			ret;
 	size_t			ocount = iov_iter_count(from);
+	int			iolock = XFS_IOLOCK_EXCL;
 
 	XFS_STATS_INC(ip->i_mount, xs_write_calls);
 
@@ -701,8 +665,14 @@ xfs_file_write_iter(
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return -EIO;
 
+	if (!xfs_ilock_nowait(ip, iolock)) {
+		if (iocb->ki_flags & IOCB_NOWAIT)
+			return -EAGAIN;
+		xfs_ilock(ip, iolock);
+	}
+
 	if (IS_DAX(inode))
-		ret = xfs_file_dax_write(iocb, from);
+		ret = xfs_file_dax_write(iocb, from, &iolock);
 	else if (iocb->ki_flags & IOCB_DIRECT) {
 		/*
 		 * Allow a directio write to fall back to a buffered
@@ -710,14 +680,17 @@ xfs_file_write_iter(
 		 * CoW.  In all other directio scenarios we do not
 		 * allow an operation to fall back to buffered mode.
 		 */
-		ret = xfs_file_dio_aio_write(iocb, from);
+		ret = xfs_file_dio_aio_write(iocb, from, &iolock);
 		if (ret == -EREMCHG)
 			goto buffered;
 	} else {
 buffered:
-		ret = xfs_file_buffered_aio_write(iocb, from);
+		ret = xfs_file_buffered_aio_write(iocb, from, &iolock);
 	}
 
+	if (iolock)
+		xfs_iunlock(ip, iolock);
+
 	if (ret > 0) {
 		XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
 
-- 
2.9.5

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 5/7] xfs: introduce xfs_is_dax_state_changing
  2017-09-25 23:13 [PATCH 0/7] re-enable XFS per-inode DAX Ross Zwisler
                   ` (3 preceding siblings ...)
  2017-09-25 23:14 ` [PATCH 4/7] xfs: protect S_DAX transitions in XFS write path Ross Zwisler
@ 2017-09-25 23:14 ` Ross Zwisler
  2017-09-26  6:33   ` Christoph Hellwig
  2017-09-25 23:14 ` [PATCH 6/7] mm, fs: introduce file_operations->post_mmap() Ross Zwisler
  2017-09-25 23:14 ` [PATCH 7/7] xfs: re-enable XFS per-inode DAX Ross Zwisler
  6 siblings, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2017-09-25 23:14 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm,
	linux-fsdevel, linux-xfs

Pull this code out of xfs_ioctl_setattr_dax_invalidate() as it will be used
in multiple places soon.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_ioctl.c | 34 +++++++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 0433aef..386b437 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1020,6 +1020,27 @@ xfs_diflags_to_linux(
 #endif
 }
 
+static bool
+xfs_is_dax_state_changing(
+	unsigned int		xflags,
+	struct xfs_inode	*ip)
+{
+	struct inode		*inode = VFS_I(ip);
+
+	/*
+	 * If the DAX mount option was used we will update the DAX inode flag
+	 * as the user requested but we will continue to use DAX for I/O and
+	 * page faults regardless of how the inode flag is set.
+	 */
+	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
+		return false;
+	if ((xflags & FS_XFLAG_DAX) && IS_DAX(inode))
+		return false;
+	if (!(xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
+		return false;
+	return true;
+}
+
 static int
 xfs_ioctl_setattr_xflags(
 	struct xfs_trans	*tp,
@@ -1105,17 +1126,8 @@ xfs_ioctl_setattr_dax_invalidate(
 			return -EINVAL;
 	}
 
-	/*
-	 * If the DAX state is not changing, we have nothing to do here.  If
-	 * the DAX mount option was used we will update the DAX inode flag as
-	 * the user requested but we will continue to use DAX for I/O and page
-	 * faults regardless of how the inode flag is set.
-	 */
-	if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
-		return 0;
-	if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
-		return 0;
-	if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
+	/* If the DAX state is not changing, we have nothing to do here. */
+	if (!xfs_is_dax_state_changing(fa->fsx_xflags, ip))
 		return 0;
 
 	/* lock, flush and invalidate mapping in preparation for flag change */
-- 
2.9.5

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-25 23:13 [PATCH 0/7] re-enable XFS per-inode DAX Ross Zwisler
                   ` (4 preceding siblings ...)
  2017-09-25 23:14 ` [PATCH 5/7] xfs: introduce xfs_is_dax_state_changing Ross Zwisler
@ 2017-09-25 23:14 ` Ross Zwisler
  2017-09-25 23:38   ` Dan Williams
  2017-09-26  6:34   ` Christoph Hellwig
  2017-09-25 23:14 ` [PATCH 7/7] xfs: re-enable XFS per-inode DAX Ross Zwisler
  6 siblings, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2017-09-25 23:14 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm,
	linux-fsdevel, linux-xfs

When mappings are created the vma->vm_flags that they use vary based on
whether the inode being mapped is using DAX or not.  This setup happens in
XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().

For us to be able to safely use the DAX per-inode flag we need to prevent
S_DAX transitions when any mappings are present, and we will do that by
looking at the address_space->i_mmap tree and returning -EBUSY if any
mappings are present.

Unfortunately at the time that the filesystem's file_operations->mmap()
entry point is called the mapping has not yet been added to the
address_space->i_mmap tree.  This means that at that point in time we
cannot determine whether or not the mapping will be set up to support DAX.

Fix this by adding a new file_operations entry called post_mmap() which is
called after the mapping has been added to the address_space->i_mmap tree.
This post_mmap() op now happens at a time when we can be sure whether the
mapping will use DAX or not, and we can set up the vma->vm_flags
appropriately.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_file.c  | 15 ++++++++++++++-
 include/linux/fs.h |  1 +
 mm/mmap.c          |  2 ++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2816858..9d66aaa 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1087,9 +1087,21 @@ xfs_file_mmap(
 {
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
+	return 0;
+}
+
+/* This call happens during mmap(), after the vma has been inserted into the
+ * inode->i_mapping->i_mmap tree.  At this point the decision on whether or
+ * not to use DAX for this mapping has been set and will not change for the
+ * duration of the mapping.
+ */
+STATIC void
+xfs_file_post_mmap(
+	struct file	*filp,
+	struct vm_area_struct *vma)
+{
 	if (IS_DAX(file_inode(filp)))
 		vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
-	return 0;
 }
 
 const struct file_operations xfs_file_operations = {
@@ -1103,6 +1115,7 @@ const struct file_operations xfs_file_operations = {
 	.compat_ioctl	= xfs_file_compat_ioctl,
 #endif
 	.mmap		= xfs_file_mmap,
+	.post_mmap	= xfs_file_post_mmap,
 	.open		= xfs_file_open,
 	.release	= xfs_file_release,
 	.fsync		= xfs_file_fsync,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e737..7c06838 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1701,6 +1701,7 @@ struct file_operations {
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
+	void (*post_mmap) (struct file *, struct vm_area_struct *);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
 	int (*release) (struct inode *, struct file *);
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506f..ee7458a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1711,6 +1711,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	/* Once vma denies write, undo our temporary denial count */
 	if (file) {
+		if (file->f_op->post_mmap)
+			file->f_op->post_mmap(file, vma);
 		if (vm_flags & VM_SHARED)
 			mapping_unmap_writable(file->f_mapping);
 		if (vm_flags & VM_DENYWRITE)
-- 
2.9.5

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 7/7] xfs: re-enable XFS per-inode DAX
  2017-09-25 23:13 [PATCH 0/7] re-enable XFS per-inode DAX Ross Zwisler
                   ` (5 preceding siblings ...)
  2017-09-25 23:14 ` [PATCH 6/7] mm, fs: introduce file_operations->post_mmap() Ross Zwisler
@ 2017-09-25 23:14 ` Ross Zwisler
  2017-09-26  0:31   ` Dave Chinner
  2017-09-26  6:36   ` Christoph Hellwig
  6 siblings, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2017-09-25 23:14 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, Christoph Hellwig, J. Bruce Fields, linux-mm,
	linux-fsdevel, linux-xfs

Re-enable the XFS per-inode DAX flag, preventing S_DAX from changing when
any mappings are present.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_ioctl.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 386b437..7a24dd5 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1012,12 +1012,10 @@ xfs_diflags_to_linux(
 		inode->i_flags |= S_NOATIME;
 	else
 		inode->i_flags &= ~S_NOATIME;
-#if 0	/* disabled until the flag switching races are sorted out */
 	if ((xflags & FS_XFLAG_DAX) || (ip->i_mount->m_flags & XFS_MOUNT_DAX))
 		inode->i_flags |= S_DAX;
 	else
 		inode->i_flags &= ~S_DAX;
-#endif
 }
 
 static bool
@@ -1049,6 +1047,8 @@ xfs_ioctl_setattr_xflags(
 {
 	struct xfs_mount	*mp = ip->i_mount;
 	uint64_t		di_flags2;
+	struct address_space	*mapping = VFS_I(ip)->i_mapping;
+	bool			dax_changing;
 
 	/* Can't change realtime flag if any extents are allocated. */
 	if ((ip->i_d.di_nextents || ip->i_delayed_blks) &&
@@ -1084,10 +1084,23 @@ xfs_ioctl_setattr_xflags(
 	if (di_flags2 && ip->i_d.di_version < 3)
 		return -EINVAL;
 
+	dax_changing = xfs_is_dax_state_changing(fa->fsx_xflags, ip);
+	if (dax_changing) {
+		i_mmap_lock_read(mapping);
+		if (mapping_mapped(mapping)) {
+			i_mmap_unlock_read(mapping);
+			return -EBUSY;
+		}
+	}
+
 	ip->i_d.di_flags = xfs_flags2diflags(ip, fa->fsx_xflags);
 	ip->i_d.di_flags2 = di_flags2;
 
 	xfs_diflags_to_linux(ip);
+
+	if (dax_changing)
+		i_mmap_unlock_read(mapping);
+
 	xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 	XFS_STATS_INC(mp, xs_ig_attrchg);
-- 
2.9.5

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/7] xfs: protect S_DAX transitions in XFS read path
  2017-09-25 23:14 ` [PATCH 3/7] xfs: protect S_DAX transitions in XFS read path Ross Zwisler
@ 2017-09-25 23:27   ` Dave Chinner
  2017-09-26  6:32   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2017-09-25 23:27 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	linux-kernel, J. Bruce Fields, linux-mm, linux-fsdevel,
	Andrew Morton, linux-xfs, Christoph Hellwig

On Mon, Sep 25, 2017 at 05:14:00PM -0600, Ross Zwisler wrote:
> In the current XFS read I/O path we check IS_DAX() in xfs_file_read_iter()
> to decide whether to do DAX I/O, direct I/O or buffered I/O.  This check is
> done without holding the XFS_IOLOCK, though, which means that if we allow
> S_DAX to be manipulated via the inode flag we can run into this race:
> 
> CPU 0				CPU 1
> -----				-----
> xfs_file_read_iter()
>   IS_DAX() << returns false
>   				xfs_ioctl_setattr()
> 				  xfs_ioctl_setattr_dax_invalidate()
> 				   xfs_ilock(XFS_MMAPLOCK|XFS_IOLOCK)
> 				  sets S_DAX
> 				  releases XFS_MMAPLOCK and XFS_IOLOCK
>   xfs_file_buffered_aio_read()
>   does buffered I/O to DAX inode, death
> 
> Fix this by ensuring that we only check S_DAX when we hold the XFS_IOLOCK
> in the read path.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/xfs/xfs_file.c | 42 +++++++++++++-----------------------------
>  1 file changed, 13 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index ebdd0bd..ca4c8fd 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -207,7 +207,6 @@ xfs_file_dio_aio_read(
>  {
>  	struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
>  	size_t			count = iov_iter_count(to);
> -	ssize_t			ret;
>  
>  	trace_xfs_file_direct_read(ip, count, iocb->ki_pos);
>  
> @@ -215,12 +214,7 @@ xfs_file_dio_aio_read(
>  		return 0; /* skip atime */
>  
>  	file_accessed(iocb->ki_filp);
> -
> -	xfs_ilock(ip, XFS_IOLOCK_SHARED);
> -	ret = iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);
> -	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
> -
> -	return ret;
> +	return iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);

This puts file_accessed under the XFS_IOLOCK_SHARED now. Is that a
safe/sane thing to do for DIO?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 4/7] xfs: protect S_DAX transitions in XFS write path
  2017-09-25 23:14 ` [PATCH 4/7] xfs: protect S_DAX transitions in XFS write path Ross Zwisler
@ 2017-09-25 23:29   ` Dave Chinner
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2017-09-25 23:29 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	linux-kernel, J. Bruce Fields, linux-mm, linux-fsdevel,
	Andrew Morton, linux-xfs, Christoph Hellwig

On Mon, Sep 25, 2017 at 05:14:01PM -0600, Ross Zwisler wrote:
> In the current XFS write I/O path we check IS_DAX() in
> xfs_file_write_iter() to decide whether to do DAX I/O, direct I/O or
> buffered I/O.  This check is done without holding the XFS_IOLOCK, though,
> which means that if we allow S_DAX to be manipulated via the inode flag we
> can run into this race:
> 
> CPU 0                           CPU 1
> -----                           -----
> xfs_file_write_iter()
>   IS_DAX() << returns false
> 			    xfs_ioctl_setattr()
> 			      xfs_ioctl_setattr_dax_invalidate()
> 			       xfs_ilock(XFS_MMAPLOCK|XFS_IOLOCK)
> 			      sets S_DAX
> 			      releases XFS_MMAPLOCK and XFS_IOLOCK
>   xfs_file_buffered_aio_write()
>   does buffered I/O to DAX inode, death
> 
> Fix this by ensuring that we only check S_DAX when we hold the XFS_IOLOCK
> in the write path.

NACK. This breaks concurrent direct IO write semantics. We must not
take XFS_IOLOCK_EXCL on direct IO writes unless it is absolutely
necessary - there are lots of applications out there that rely on
these semantics for performance.

CHeers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-25 23:13 ` [PATCH 1/7] xfs: always use DAX if mount option is used Ross Zwisler
@ 2017-09-25 23:38   ` Dave Chinner
  2017-09-26  9:35     ` Jan Kara
  2017-09-26 18:50     ` Ross Zwisler
  0 siblings, 2 replies; 46+ messages in thread
From: Dave Chinner @ 2017-09-25 23:38 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	linux-kernel, J. Bruce Fields, linux-mm, linux-fsdevel,
	Andrew Morton, linux-xfs, Christoph Hellwig

On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
> Before support for the per-inode DAX flag was disabled the XFS the code had
> an issue where the user couldn't reliably tell whether or not DAX was being
> used to service page faults and I/O when the DAX mount option was used.  In
> this case each inode within the mounted filesystem started with S_DAX set
> due to the mount option, but it could be cleared if someone touched the
> individual inode flag.
> 
> For example (v4.13 and before):
> 
>   # mount | grep dax
>   /dev/pmem0 on /mnt type xfs
>   (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
> 
>   # touch /mnt/a /mnt/b   # both files currently use DAX
> 
>   # xfs_io -c "lsattr" /mnt/*  # neither has the DAX inode option set
>   ----------e----- /mnt/a
>   ----------e----- /mnt/b
> 
>   # xfs_io -c "chattr -x" /mnt/a  # this clears S_DAX for /mnt/a
> 
>   # xfs_io -c "lsattr" /mnt/*
>   ----------e----- /mnt/a
>   ----------e----- /mnt/b

That's really a bug in the lsattr code, yes? If we've cleared the
S_DAX flag for the inode, then why is it being reported in lsattr?
Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
then isn't that the bug that needs fixing?

Remember, the whole point of the dax inode flag was to be able to
override the mount option setting so that admins could turn off/on
dax for the things that didn't/did work with DAX correctly so they
didn't need multiple filesystems on pmem to segregate the apps that
did/didn't work with DAX...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-25 23:14 ` [PATCH 6/7] mm, fs: introduce file_operations->post_mmap() Ross Zwisler
@ 2017-09-25 23:38   ` Dan Williams
  2017-09-26 18:57     ` Ross Zwisler
  2017-09-26  6:34   ` Christoph Hellwig
  1 sibling, 1 reply; 46+ messages in thread
From: Dan Williams @ 2017-09-25 23:38 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, linux-kernel, Darrick J. Wong, J. Bruce Fields,
	Christoph Hellwig, Dave Chinner, Jan Kara, Jeff Layton,
	linux-fsdevel, Linux MM, linux-nvdimm, linux-xfs

On Mon, Sep 25, 2017 at 4:14 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> When mappings are created the vma->vm_flags that they use vary based on
> whether the inode being mapped is using DAX or not.  This setup happens in
> XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
>
> For us to be able to safely use the DAX per-inode flag we need to prevent
> S_DAX transitions when any mappings are present, and we will do that by
> looking at the address_space->i_mmap tree and returning -EBUSY if any
> mappings are present.
>
> Unfortunately at the time that the filesystem's file_operations->mmap()
> entry point is called the mapping has not yet been added to the
> address_space->i_mmap tree.  This means that at that point in time we
> cannot determine whether or not the mapping will be set up to support DAX.
>
> Fix this by adding a new file_operations entry called post_mmap() which is
> called after the mapping has been added to the address_space->i_mmap tree.
> This post_mmap() op now happens at a time when we can be sure whether the
> mapping will use DAX or not, and we can set up the vma->vm_flags
> appropriately.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/xfs/xfs_file.c  | 15 ++++++++++++++-
>  include/linux/fs.h |  1 +
>  mm/mmap.c          |  2 ++
>  3 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 2816858..9d66aaa 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1087,9 +1087,21 @@ xfs_file_mmap(
>  {
>         file_accessed(filp);
>         vma->vm_ops = &xfs_file_vm_ops;
> +       return 0;
> +}
> +
> +/* This call happens during mmap(), after the vma has been inserted into the
> + * inode->i_mapping->i_mmap tree.  At this point the decision on whether or
> + * not to use DAX for this mapping has been set and will not change for the
> + * duration of the mapping.
> + */
> +STATIC void
> +xfs_file_post_mmap(
> +       struct file     *filp,
> +       struct vm_area_struct *vma)
> +{
>         if (IS_DAX(file_inode(filp)))
>                 vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;

It's not clear to me what this is actually protecting? vma_is_dax()
returns true regardless of the vm_flags state , so what is the benefit
to delaying the vm_flags setting to ->post_mmap()?

Also, why is this a file_operation and not a vm_operation?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/7] xfs: re-enable XFS per-inode DAX
  2017-09-25 23:14 ` [PATCH 7/7] xfs: re-enable XFS per-inode DAX Ross Zwisler
@ 2017-09-26  0:31   ` Dave Chinner
  2017-09-26  6:36   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2017-09-26  0:31 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	linux-kernel, J. Bruce Fields, linux-mm, linux-fsdevel,
	Andrew Morton, linux-xfs, Christoph Hellwig

On Mon, Sep 25, 2017 at 05:14:04PM -0600, Ross Zwisler wrote:
> Re-enable the XFS per-inode DAX flag, preventing S_DAX from changing when
> any mappings are present.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/xfs/xfs_ioctl.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 386b437..7a24dd5 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1012,12 +1012,10 @@ xfs_diflags_to_linux(
>  		inode->i_flags |= S_NOATIME;
>  	else
>  		inode->i_flags &= ~S_NOATIME;
> -#if 0	/* disabled until the flag switching races are sorted out */
>  	if ((xflags & FS_XFLAG_DAX) || (ip->i_mount->m_flags & XFS_MOUNT_DAX))
>  		inode->i_flags |= S_DAX;
>  	else
>  		inode->i_flags &= ~S_DAX;
> -#endif
>  }
>  
>  static bool
> @@ -1049,6 +1047,8 @@ xfs_ioctl_setattr_xflags(
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
>  	uint64_t		di_flags2;
> +	struct address_space	*mapping = VFS_I(ip)->i_mapping;
> +	bool			dax_changing;
>  
>  	/* Can't change realtime flag if any extents are allocated. */
>  	if ((ip->i_d.di_nextents || ip->i_delayed_blks) &&
> @@ -1084,10 +1084,23 @@ xfs_ioctl_setattr_xflags(
>  	if (di_flags2 && ip->i_d.di_version < 3)
>  		return -EINVAL;
>  
> +	dax_changing = xfs_is_dax_state_changing(fa->fsx_xflags, ip);
> +	if (dax_changing) {
> +		i_mmap_lock_read(mapping);
> +		if (mapping_mapped(mapping)) {
> +			i_mmap_unlock_read(mapping);
> +			return -EBUSY;
> +		}
> +	}
> +
>  	ip->i_d.di_flags = xfs_flags2diflags(ip, fa->fsx_xflags);
>  	ip->i_d.di_flags2 = di_flags2;
>  
>  	xfs_diflags_to_linux(ip);
> +
> +	if (dax_changing)
> +		i_mmap_unlock_read(mapping);

Is this safe to be taking here under the ILOCK_EXCL? i.e. this is
the lock order here:

IOLOCK_EXCL -> MMAPLOCK_EXCL -> ILOCK_EXCL -> i_mmap_rwsem

The truncate path must run outside the ILOCK
context, and it does this order via unmap_mapping_range:

IOLOCK_EXCL -> MMAPLOCK_EXCL -> i_mmap_rwsem

On a page fault, we do:

	mmap_sem -> MMAPLOCK_EXCL -> page lock -> ILOCK_EXCL

Which gives the order

IOLOCK_EXCL
  -> mmap_sem
     -> MMAPLOCK_EXCL
	-> page lock
	   -> ILOCK_EXCL
	      -> i_mmap_rwsem

What I'm not clear on is what the orders between page locks and
pte locks and i_mapping_tree_lock and i_mmap_rwsem. If there's any
locks that the filesystem can take above the ILOCK that are also
taken under the i_mmap_rwsem, then we have a deadlock vector.

Historically we've avoided any mm/ level interactions under the
ILOCK_EXCL because of it's location in the page fault path locking
order (e.g. lockdep will go nuts if we take a page fault with the
ILOCK held). Hence I'm extremely wary of putting any other mm/ level
locks under the ILOCK like this without a clear explanation of the
locking orders and why it won't deadlock....

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/7] xfs: protect S_DAX transitions in XFS read path
  2017-09-25 23:14 ` [PATCH 3/7] xfs: protect S_DAX transitions in XFS read path Ross Zwisler
  2017-09-25 23:27   ` Dave Chinner
@ 2017-09-26  6:32   ` Christoph Hellwig
  2017-09-26 13:59     ` Dan Williams
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2017-09-26  6:32 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, Dave Chinner,
	linux-nvdimm, linux-kernel, J. Bruce Fields, linux-mm,
	linux-fsdevel, Andrew Morton, linux-xfs, Christoph Hellwig

We can't just take locking one level up, as we need differnet locking
for different kinds of I/O.

I think you probably want an IOCB_DAX flag to check IS_DAX once and
then stick to it, similar to what we do for direct I/O.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 5/7] xfs: introduce xfs_is_dax_state_changing
  2017-09-25 23:14 ` [PATCH 5/7] xfs: introduce xfs_is_dax_state_changing Ross Zwisler
@ 2017-09-26  6:33   ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2017-09-26  6:33 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, Dave Chinner,
	linux-nvdimm, linux-kernel, J. Bruce Fields, linux-mm,
	linux-fsdevel, Andrew Morton, linux-xfs, Christoph Hellwig

> +static bool
> +xfs_is_dax_state_changing(
> +	unsigned int		xflags,
> +	struct xfs_inode	*ip)

And I have no fricking idea what 'is_dax_state_changing' is supposed
to mean for the caller.  This needs a better name and/or a comment
explaining the function.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-25 23:14 ` [PATCH 6/7] mm, fs: introduce file_operations->post_mmap() Ross Zwisler
  2017-09-25 23:38   ` Dan Williams
@ 2017-09-26  6:34   ` Christoph Hellwig
  1 sibling, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2017-09-26  6:34 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, Dave Chinner,
	linux-nvdimm, linux-kernel, J. Bruce Fields, linux-mm,
	linux-fsdevel, Andrew Morton, linux-xfs, Christoph Hellwig

On Mon, Sep 25, 2017 at 05:14:03PM -0600, Ross Zwisler wrote:
> When mappings are created the vma->vm_flags that they use vary based on
> whether the inode being mapped is using DAX or not.  This setup happens in
> XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
> 
> For us to be able to safely use the DAX per-inode flag we need to prevent
> S_DAX transitions when any mappings are present, and we will do that by
> looking at the address_space->i_mmap tree and returning -EBUSY if any
> mappings are present.
> 
> Unfortunately at the time that the filesystem's file_operations->mmap()
> entry point is called the mapping has not yet been added to the
> address_space->i_mmap tree.  This means that at that point in time we
> cannot determine whether or not the mapping will be set up to support DAX.
> 
> Fix this by adding a new file_operations entry called post_mmap() which is
> called after the mapping has been added to the address_space->i_mmap tree.
> This post_mmap() op now happens at a time when we can be sure whether the
> mapping will use DAX or not, and we can set up the vma->vm_flags
> appropriately.

Just like in the read/write path we'll need a flag that is passed down
from the VM based on checking IS_DAX once and exactly once instead.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/7] xfs: re-enable XFS per-inode DAX
  2017-09-25 23:14 ` [PATCH 7/7] xfs: re-enable XFS per-inode DAX Ross Zwisler
  2017-09-26  0:31   ` Dave Chinner
@ 2017-09-26  6:36   ` Christoph Hellwig
  2017-09-26 19:01     ` Ross Zwisler
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2017-09-26  6:36 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Andrew Morton, linux-kernel, Darrick J. Wong, J. Bruce Fields,
	Christoph Hellwig, Dan Williams, Dave Chinner, Jan Kara,
	Jeff Layton, linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Mon, Sep 25, 2017 at 05:14:04PM -0600, Ross Zwisler wrote:
> Re-enable the XFS per-inode DAX flag, preventing S_DAX from changing when
> any mappings are present.

Before we re-enable it please come up with a coherent description
of the per-inode DAX flag that makes sense to a user.  We'll also need
to find a good place to document it, e.g. a new ioctl_setflags man
page.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/7] xfs: validate bdev support for DAX inode flag
  2017-09-25 23:13 ` [PATCH 2/7] xfs: validate bdev support for DAX inode flag Ross Zwisler
@ 2017-09-26  6:36   ` Christoph Hellwig
  2017-09-26 17:16     ` Ross Zwisler
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2017-09-26  6:36 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, Dave Chinner,
	linux-nvdimm, linux-kernel, J. Bruce Fields, linux-mm,
	linux-fsdevel, Andrew Morton, linux-xfs, Christoph Hellwig

On Mon, Sep 25, 2017 at 05:13:59PM -0600, Ross Zwisler wrote:
> Currently only the blocksize is checked, but we should really be calling
> bdev_dax_supported() which also tests to make sure we can get a
> struct dax_device and that the dax_direct_access() path is working.
> 
> This is the same check that we do for the "-o dax" mount option in
> xfs_fs_fill_super().
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

I think we just want to pick this up ASAP.  And between my vague
memoried and that reviewed-by tag it already was part of a different
series, wasn't it?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-25 23:38   ` Dave Chinner
@ 2017-09-26  9:35     ` Jan Kara
  2017-09-26 11:09       ` Dave Chinner
  2017-09-26 18:50     ` Ross Zwisler
  1 sibling, 1 reply; 46+ messages in thread
From: Jan Kara @ 2017-09-26  9:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jeff Layton, Jan Kara, Andrew Morton, Darrick J. Wong,
	linux-nvdimm, linux-kernel, J. Bruce Fields, linux-mm,
	linux-fsdevel, linux-xfs, Christoph Hellwig

On Tue 26-09-17 09:38:12, Dave Chinner wrote:
> On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
> > Before support for the per-inode DAX flag was disabled the XFS the code had
> > an issue where the user couldn't reliably tell whether or not DAX was being
> > used to service page faults and I/O when the DAX mount option was used.  In
> > this case each inode within the mounted filesystem started with S_DAX set
> > due to the mount option, but it could be cleared if someone touched the
> > individual inode flag.
> > 
> > For example (v4.13 and before):
> > 
> >   # mount | grep dax
> >   /dev/pmem0 on /mnt type xfs
> >   (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
> > 
> >   # touch /mnt/a /mnt/b   # both files currently use DAX
> > 
> >   # xfs_io -c "lsattr" /mnt/*  # neither has the DAX inode option set
> >   ----------e----- /mnt/a
> >   ----------e----- /mnt/b
> > 
> >   # xfs_io -c "chattr -x" /mnt/a  # this clears S_DAX for /mnt/a
> > 
> >   # xfs_io -c "lsattr" /mnt/*
> >   ----------e----- /mnt/a
> >   ----------e----- /mnt/b
> 
> That's really a bug in the lsattr code, yes? If we've cleared the
> S_DAX flag for the inode, then why is it being reported in lsattr?
> Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
> then isn't that the bug that needs fixing?
> 
> Remember, the whole point of the dax inode flag was to be able to
> override the mount option setting so that admins could turn off/on
> dax for the things that didn't/did work with DAX correctly so they
> didn't need multiple filesystems on pmem to segregate the apps that
> did/didn't work with DAX...

So I think there is some confusion that is created by the fact that whether
DAX is used or not is controlled by both a mount option and an inode flag.
We could define that "Inode flag always wins" which is what you seem to
suggest above but then mount option has no practical effect since on-disk
S_DAX flag will always overrule it.

Ross suggests that DAX should be used if "Inode flag or mount option is
set". Which is similar to how e.g. noatime inode flag works but does not
allow to selectively disable DAX.

So if we wanted both mount option to work and selective disabling of DAX,
we would need three states of inode setting - force DAX, disable DAX,
inherit from mount option.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-26  9:35     ` Jan Kara
@ 2017-09-26 11:09       ` Dave Chinner
  2017-09-26 14:37         ` Christoph Hellwig
  2017-09-26 18:02         ` Eric Sandeen
  0 siblings, 2 replies; 46+ messages in thread
From: Dave Chinner @ 2017-09-26 11:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Layton, Andrew Morton, Darrick J. Wong, linux-nvdimm,
	linux-kernel, J. Bruce Fields, linux-mm, linux-fsdevel,
	linux-xfs, Christoph Hellwig

On Tue, Sep 26, 2017 at 11:35:48AM +0200, Jan Kara wrote:
> On Tue 26-09-17 09:38:12, Dave Chinner wrote:
> > On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
> > > Before support for the per-inode DAX flag was disabled the XFS the code had
> > > an issue where the user couldn't reliably tell whether or not DAX was being
> > > used to service page faults and I/O when the DAX mount option was used.  In
> > > this case each inode within the mounted filesystem started with S_DAX set
> > > due to the mount option, but it could be cleared if someone touched the
> > > individual inode flag.
> > > 
> > > For example (v4.13 and before):
> > > 
> > >   # mount | grep dax
> > >   /dev/pmem0 on /mnt type xfs
> > >   (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
> > > 
> > >   # touch /mnt/a /mnt/b   # both files currently use DAX
> > > 
> > >   # xfs_io -c "lsattr" /mnt/*  # neither has the DAX inode option set
> > >   ----------e----- /mnt/a
> > >   ----------e----- /mnt/b
> > > 
> > >   # xfs_io -c "chattr -x" /mnt/a  # this clears S_DAX for /mnt/a
> > > 
> > >   # xfs_io -c "lsattr" /mnt/*
> > >   ----------e----- /mnt/a
> > >   ----------e----- /mnt/b
> > 
> > That's really a bug in the lsattr code, yes? If we've cleared the
> > S_DAX flag for the inode, then why is it being reported in lsattr?
> > Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
> > then isn't that the bug that needs fixing?
> > 
> > Remember, the whole point of the dax inode flag was to be able to
> > override the mount option setting so that admins could turn off/on
> > dax for the things that didn't/did work with DAX correctly so they
> > didn't need multiple filesystems on pmem to segregate the apps that
> > did/didn't work with DAX...
> 
> So I think there is some confusion that is created by the fact that whether
> DAX is used or not is controlled by both a mount option and an inode flag.
> We could define that "Inode flag always wins" which is what you seem to
> suggest above but then mount option has no practical effect since on-disk
> S_DAX flag will always overrule it.

Well, quite frankly, I never wanted the mount option for XFS. It was
supposed to be for initial testing only, then we'd /always/ use the
the inode flags. For a filesystem to default to using DAX, we
set the DAX flag on the root inode at mkfs time, and then everything
inode flag based just works.

But it seems that we're now stuck with the stupid, blunt, brute
force mount option because that's what the first commit on ext4
used.  Now we're just about stuck with this silly "but we can't turn
it off" problem because of the mount option overriding everything.

If we have to keep the mount option, then lets fix it to mean "mount
option sets inheritable inode flag on directory creation" and
/maybe/ "mount option sets inode flag on file creation".

This then allows the inode flag to control everything else. i.e the
mount option sets the initial flag value rather than the behaviour
of the inode. The behaviour of the inode should be entirely
controlled by the inode flag, hence after initial creation the
chattr +/-x commands do what they advertise regardless of the mount
option value.

Yes, it means that existing users are going to have to run chattr -R
+x on their pmem filesystems to get the inode flags on disk, but
this is all tagged with EXPERIMENTAL and this is the sort of change
that is expected from experimental functionality.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/7] xfs: protect S_DAX transitions in XFS read path
  2017-09-26  6:32   ` Christoph Hellwig
@ 2017-09-26 13:59     ` Dan Williams
  2017-09-26 14:33       ` Christoph Hellwig
  0 siblings, 1 reply; 46+ messages in thread
From: Dan Williams @ 2017-09-26 13:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, linux-kernel, J. Bruce Fields, Linux MM,
	linux-fsdevel, linux-xfs, Andrew Morton

On Mon, Sep 25, 2017 at 11:32 PM, Christoph Hellwig <hch@lst.de> wrote:
> We can't just take locking one level up, as we need differnet locking
> for different kinds of I/O.
>
> I think you probably want an IOCB_DAX flag to check IS_DAX once and
> then stick to it, similar to what we do for direct I/O.

I wonder if this works better with a reference count mechanism
per-file so that we don't need a hold a lock over the whole
transition. Similar to request_queue reference counting, when DAX is
being turned off we block new references and drain the in-flight ones.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/7] xfs: protect S_DAX transitions in XFS read path
  2017-09-26 13:59     ` Dan Williams
@ 2017-09-26 14:33       ` Christoph Hellwig
  2017-09-26 18:11         ` Dan Williams
  0 siblings, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2017-09-26 14:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Layton, Jan Kara, Andrew Morton, Darrick J. Wong,
	linux-nvdimm, Dave Chinner, linux-kernel, J. Bruce Fields,
	Linux MM, linux-fsdevel, linux-xfs, Christoph Hellwig

On Tue, Sep 26, 2017 at 06:59:37AM -0700, Dan Williams wrote:
> > I think you probably want an IOCB_DAX flag to check IS_DAX once and
> > then stick to it, similar to what we do for direct I/O.
> 
> I wonder if this works better with a reference count mechanism
> per-file so that we don't need a hold a lock over the whole
> transition. Similar to request_queue reference counting, when DAX is
> being turned off we block new references and drain the in-flight ones.

Maybe.  But that assumes we want to be stuck in a perpetual binary
DAX on/off state on a given file.  Which makes not only for an awkward
interface (inode or mount flag), but also might be fundamentally the
wrong thing to do for some media where you'd happily read directly
from it but rather buffer writes in DRAM.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-26 11:09       ` Dave Chinner
@ 2017-09-26 14:37         ` Christoph Hellwig
  2017-09-26 17:30           ` Ross Zwisler
  2017-09-26 18:02         ` Eric Sandeen
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2017-09-26 14:37 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Ross Zwisler, Andrew Morton, linux-kernel,
	Darrick J. Wong, J. Bruce Fields, Christoph Hellwig,
	Dan Williams, Jeff Layton, linux-fsdevel, linux-mm, linux-nvdimm,
	linux-xfs

On Tue, Sep 26, 2017 at 09:09:57PM +1000, Dave Chinner wrote:
> Well, quite frankly, I never wanted the mount option for XFS. It was
> supposed to be for initial testing only, then we'd /always/ use the
> the inode flags. For a filesystem to default to using DAX, we
> set the DAX flag on the root inode at mkfs time, and then everything
> inode flag based just works.

And I deeply fundamentally disagree.  The mount option is a nice
enough big hammer to try a mode without encoding nitty gritty details
into the application ABI.

The per-inode persistent flag is the biggest nightmare ever, as we see
in all these discussions about it.

What does it even mean?  Right now it forces direct addressing as long
as the underlying media supports that.  But what about media that
you directly access but you really don't want to because it's really slow?
Or media that is so god damn fast that you never want to buffer?  Or
media where you want to buffer for writes (or at least some of them)
but not for reads?

It encodes a very specific mechanism for an early direct access
implementation into the ABI.  What we really need is for applications
to declare an intent, not specify a particular mechanism.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/7] xfs: validate bdev support for DAX inode flag
  2017-09-26  6:36   ` Christoph Hellwig
@ 2017-09-26 17:16     ` Ross Zwisler
  2017-09-26 17:57       ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2017-09-26 17:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Layton, Jan Kara, Andrew Morton, Darrick J. Wong,
	linux-nvdimm, Dave Chinner, linux-kernel, J. Bruce Fields,
	linux-mm, linux-fsdevel, linux-xfs

On Tue, Sep 26, 2017 at 08:36:50AM +0200, Christoph Hellwig wrote:
> On Mon, Sep 25, 2017 at 05:13:59PM -0600, Ross Zwisler wrote:
> > Currently only the blocksize is checked, but we should really be calling
> > bdev_dax_supported() which also tests to make sure we can get a
> > struct dax_device and that the dax_direct_access() path is working.
> > 
> > This is the same check that we do for the "-o dax" mount option in
> > xfs_fs_fill_super().
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> I think we just want to pick this up ASAP.  And between my vague
> memoried and that reviewed-by tag it already was part of a different
> series, wasn't it?

Yep, the first 2 patches were part of this series:

https://lkml.org/lkml/2017/9/7/552

which you reviewed.  I included them in this series because the later patches
needed to build on them.  It looks like they are now in Darrick's
xfs-4.14-fixes branch, but haven't yet made it upstream.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-26 14:37         ` Christoph Hellwig
@ 2017-09-26 17:30           ` Ross Zwisler
  2017-09-26 19:48             ` Darrick J. Wong
  2017-09-27  6:40             ` Christoph Hellwig
  0 siblings, 2 replies; 46+ messages in thread
From: Ross Zwisler @ 2017-09-26 17:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, linux-kernel, J. Bruce Fields, linux-mm,
	linux-fsdevel, linux-xfs, Andrew Morton

On Tue, Sep 26, 2017 at 04:37:43PM +0200, Christoph Hellwig wrote:
> On Tue, Sep 26, 2017 at 09:09:57PM +1000, Dave Chinner wrote:
> > Well, quite frankly, I never wanted the mount option for XFS. It was
> > supposed to be for initial testing only, then we'd /always/ use the
> > the inode flags. For a filesystem to default to using DAX, we
> > set the DAX flag on the root inode at mkfs time, and then everything
> > inode flag based just works.
> 
> And I deeply fundamentally disagree.  The mount option is a nice
> enough big hammer to try a mode without encoding nitty gritty details
> into the application ABI.
> 
> The per-inode persistent flag is the biggest nightmare ever, as we see
> in all these discussions about it.
> 
> What does it even mean?  Right now it forces direct addressing as long
> as the underlying media supports that.  But what about media that
> you directly access but you really don't want to because it's really slow?
> Or media that is so god damn fast that you never want to buffer?  Or
> media where you want to buffer for writes (or at least some of them)
> but not for reads?
> 
> It encodes a very specific mechanism for an early direct access
> implementation into the ABI.  What we really need is for applications
> to declare an intent, not specify a particular mechanism.

I agree that Christoph's idea about having the system intelligently adjust to
use DAX based on performance information it gathers about the underlying
persistent memory (probably via the HMAT on x86_64 systems) is interesting,
but I think we're still a ways away from that.

FWIW, as my patches suggest and Jan observed I think that we should allow
users to turn on DAX by treating the inode flag and the mount flag as an 'or'
operation.  i.e. you get DAX if either the mount option is specified or if the
inode flag is set, and you can continue to manipulate the per-inode flag as
you want regardless of the mount option.  I think this provides maximum
flexibility of the mechanism to select DAX without enforcing policy.

In the end, though, I think what's really important is that we figure out what
the various options mean, have the same story for both XFS and ext4, and
document it as hch suggested in response to my patch 7 in this series.

Does it make sense at this point to just start a "dax" man page that can
contain info about the mount options, inode flags, kernel config options, how
to get PMDs, etc?  Or does this documentation need to be sprinkled around more
in existing man pages?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/7] xfs: validate bdev support for DAX inode flag
  2017-09-26 17:16     ` Ross Zwisler
@ 2017-09-26 17:57       ` Darrick J. Wong
  0 siblings, 0 replies; 46+ messages in thread
From: Darrick J. Wong @ 2017-09-26 17:57 UTC (permalink / raw)
  To: Ross Zwisler, Christoph Hellwig, Andrew Morton, linux-kernel,
	J. Bruce Fields, Dan Williams, Dave Chinner, Jan Kara,
	Jeff Layton, linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue, Sep 26, 2017 at 11:16:38AM -0600, Ross Zwisler wrote:
> On Tue, Sep 26, 2017 at 08:36:50AM +0200, Christoph Hellwig wrote:
> > On Mon, Sep 25, 2017 at 05:13:59PM -0600, Ross Zwisler wrote:
> > > Currently only the blocksize is checked, but we should really be calling
> > > bdev_dax_supported() which also tests to make sure we can get a
> > > struct dax_device and that the dax_direct_access() path is working.
> > > 
> > > This is the same check that we do for the "-o dax" mount option in
> > > xfs_fs_fill_super().
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > 
> > I think we just want to pick this up ASAP.  And between my vague
> > memoried and that reviewed-by tag it already was part of a different
> > series, wasn't it?
> 
> Yep, the first 2 patches were part of this series:
> 
> https://lkml.org/lkml/2017/9/7/552
> 
> which you reviewed.  I included them in this series because the later patches
> needed to build on them.  It looks like they are now in Darrick's
> xfs-4.14-fixes branch, but haven't yet made it upstream.

I'm pulling that first patch from -fixes because Dave & Christoph
started discussing it again in this new thread after I'd pushed the
patch from September 7th to korg.

The second patch looks fine, it'll stay.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-26 11:09       ` Dave Chinner
  2017-09-26 14:37         ` Christoph Hellwig
@ 2017-09-26 18:02         ` Eric Sandeen
  1 sibling, 0 replies; 46+ messages in thread
From: Eric Sandeen @ 2017-09-26 18:02 UTC (permalink / raw)
  To: Dave Chinner, Jan Kara
  Cc: Jeff Layton, Andrew Morton, Darrick J. Wong, linux-nvdimm,
	linux-kernel, J. Bruce Fields, linux-mm, linux-fsdevel,
	linux-xfs, Christoph Hellwig



On 9/26/17 6:09 AM, Dave Chinner wrote:
> On Tue, Sep 26, 2017 at 11:35:48AM +0200, Jan Kara wrote:
>> On Tue 26-09-17 09:38:12, Dave Chinner wrote:
>>> On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
>>>> Before support for the per-inode DAX flag was disabled the XFS the code had
>>>> an issue where the user couldn't reliably tell whether or not DAX was being
>>>> used to service page faults and I/O when the DAX mount option was used.  In
>>>> this case each inode within the mounted filesystem started with S_DAX set
>>>> due to the mount option, but it could be cleared if someone touched the
>>>> individual inode flag.
>>>>
>>>> For example (v4.13 and before):
>>>>
>>>>   # mount | grep dax
>>>>   /dev/pmem0 on /mnt type xfs
>>>>   (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
>>>>
>>>>   # touch /mnt/a /mnt/b   # both files currently use DAX
>>>>
>>>>   # xfs_io -c "lsattr" /mnt/*  # neither has the DAX inode option set
>>>>   ----------e----- /mnt/a
>>>>   ----------e----- /mnt/b
>>>>
>>>>   # xfs_io -c "chattr -x" /mnt/a  # this clears S_DAX for /mnt/a
>>>>
>>>>   # xfs_io -c "lsattr" /mnt/*
>>>>   ----------e----- /mnt/a
>>>>   ----------e----- /mnt/b
>>>
>>> That's really a bug in the lsattr code, yes? If we've cleared the
>>> S_DAX flag for the inode, then why is it being reported in lsattr?
>>> Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
>>> then isn't that the bug that needs fixing?
>>>
>>> Remember, the whole point of the dax inode flag was to be able to
>>> override the mount option setting so that admins could turn off/on
>>> dax for the things that didn't/did work with DAX correctly so they
>>> didn't need multiple filesystems on pmem to segregate the apps that
>>> did/didn't work with DAX...
>>
>> So I think there is some confusion that is created by the fact that whether
>> DAX is used or not is controlled by both a mount option and an inode flag.
>> We could define that "Inode flag always wins" which is what you seem to
>> suggest above but then mount option has no practical effect since on-disk
>> S_DAX flag will always overrule it.
> 
> Well, quite frankly, I never wanted the mount option for XFS. It was
> supposed to be for initial testing only, then we'd /always/ use the
> the inode flags. For a filesystem to default to using DAX, we
> set the DAX flag on the root inode at mkfs time, and then everything
> inode flag based just works.
> 
> But it seems that we're now stuck with the stupid, blunt, brute
> force mount option because that's what the first commit on ext4
> used.  Now we're just about stuck with this silly "but we can't turn
> it off" problem because of the mount option overriding everything.

I don't think the existence of a mount option in ext4 makes us any
more "stuck" than the mount option in xfs does.

fs/xfs/xfs_super.c:		"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
fs/ext4/super.c:		"DAX enabled. Warning: EXPERIMENTAL, use at your own risk");

so when^wif this argument ever gets settled, I think there is plenty
of latitude to do the right thing, potentially breaking the old thing.

> If we have to keep the mount option, then lets fix it to mean "mount
> option sets inheritable inode flag on directory creation" and
> /maybe/ "mount option sets inode flag on file creation".
> 
> This then allows the inode flag to control everything else. i.e the
> mount option sets the initial flag value rather than the behaviour
> of the inode. The behaviour of the inode should be entirely
> controlled by the inode flag, hence after initial creation the
> chattr +/-x commands do what they advertise regardless of the mount
> option value.
> 
> Yes, it means that existing users are going to have to run chattr -R
> +x on their pmem filesystems to get the inode flags on disk, but
> this is all tagged with EXPERIMENTAL and this is the sort of change
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> that is expected from experimental functionality.

Right.

-Eric

> Cheers,
> 
> Dave.
> 
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/7] xfs: protect S_DAX transitions in XFS read path
  2017-09-26 14:33       ` Christoph Hellwig
@ 2017-09-26 18:11         ` Dan Williams
  2017-10-01  8:17           ` Christoph Hellwig
  0 siblings, 1 reply; 46+ messages in thread
From: Dan Williams @ 2017-09-26 18:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, Andrew Morton, linux-kernel, Darrick J. Wong,
	J. Bruce Fields, Dave Chinner, Jan Kara, Jeff Layton,
	linux-fsdevel, Linux MM, linux-nvdimm, linux-xfs

On Tue, Sep 26, 2017 at 7:33 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Sep 26, 2017 at 06:59:37AM -0700, Dan Williams wrote:
>> > I think you probably want an IOCB_DAX flag to check IS_DAX once and
>> > then stick to it, similar to what we do for direct I/O.
>>
>> I wonder if this works better with a reference count mechanism
>> per-file so that we don't need a hold a lock over the whole
>> transition. Similar to request_queue reference counting, when DAX is
>> being turned off we block new references and drain the in-flight ones.
>
> Maybe.  But that assumes we want to be stuck in a perpetual binary
> DAX on/off state on a given file.  Which makes not only for an awkward
> interface (inode or mount flag), but also might be fundamentally the
> wrong thing to do for some media where you'd happily read directly
> from it but rather buffer writes in DRAM.

I think we'll always need an explicit override available, but yes we
need to think about what the override looks like in the context of a
kernel that is able to automatically pick the right I/O policy
relative to the media type. A potential mixed policy for reads vs
writes makes sense. Where would this finer grained I/O policy
selection go other than more inode flags?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-25 23:38   ` Dave Chinner
  2017-09-26  9:35     ` Jan Kara
@ 2017-09-26 18:50     ` Ross Zwisler
  1 sibling, 0 replies; 46+ messages in thread
From: Ross Zwisler @ 2017-09-26 18:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jeff Layton, Jan Kara, Andrew Morton, Darrick J. Wong,
	linux-nvdimm, linux-kernel, J. Bruce Fields, linux-mm,
	linux-fsdevel, linux-xfs, Christoph Hellwig

On Tue, Sep 26, 2017 at 09:38:12AM +1000, Dave Chinner wrote:
> On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
> > Before support for the per-inode DAX flag was disabled the XFS the code had
> > an issue where the user couldn't reliably tell whether or not DAX was being
> > used to service page faults and I/O when the DAX mount option was used.  In
> > this case each inode within the mounted filesystem started with S_DAX set
> > due to the mount option, but it could be cleared if someone touched the
> > individual inode flag.
> > 
> > For example (v4.13 and before):
> > 
> >   # mount | grep dax
> >   /dev/pmem0 on /mnt type xfs
> >   (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
> > 
> >   # touch /mnt/a /mnt/b   # both files currently use DAX
> > 
> >   # xfs_io -c "lsattr" /mnt/*  # neither has the DAX inode option set
> >   ----------e----- /mnt/a
> >   ----------e----- /mnt/b
> > 
> >   # xfs_io -c "chattr -x" /mnt/a  # this clears S_DAX for /mnt/a
> > 
> >   # xfs_io -c "lsattr" /mnt/*
> >   ----------e----- /mnt/a
> >   ----------e----- /mnt/b
> 
> That's really a bug in the lsattr code, yes? If we've cleared the
> S_DAX flag for the inode, then why is it being reported in lsattr?
> Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
> then isn't that the bug that needs fixing?

No, I think lsattr/chattr are working correctly.  In both the examples above
the DAX inode flag (which is represeted by an 'x') is never set.  S_DAX is the
in-memory inode flag (not the on-media inode flag) which is not manipulated
directly by lsattr/chattr, but instead reflects whether the inode is actually
using DAX or not.

Manipulating and displaying the on-media inode flag works as expected with
lsattr/chattr:

  # xfs_io -c "lsattr" ./a 
  ---------------- ./a 
  
  # xfs_io -c "chattr +x" ./a 
  
  # xfs_io -c "lsattr" ./a 
  --------------x- ./a 

- Ross
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-25 23:38   ` Dan Williams
@ 2017-09-26 18:57     ` Ross Zwisler
  2017-09-26 19:19       ` Dan Williams
  0 siblings, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2017-09-26 18:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Layton, Jan Kara, Andrew Morton, Darrick J. Wong,
	linux-nvdimm, Dave Chinner, linux-kernel, J. Bruce Fields,
	Linux MM, linux-fsdevel, linux-xfs, Christoph Hellwig

On Mon, Sep 25, 2017 at 04:38:45PM -0700, Dan Williams wrote:
> On Mon, Sep 25, 2017 at 4:14 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > When mappings are created the vma->vm_flags that they use vary based on
> > whether the inode being mapped is using DAX or not.  This setup happens in
> > XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
> >
> > For us to be able to safely use the DAX per-inode flag we need to prevent
> > S_DAX transitions when any mappings are present, and we will do that by
> > looking at the address_space->i_mmap tree and returning -EBUSY if any
> > mappings are present.
> >
> > Unfortunately at the time that the filesystem's file_operations->mmap()
> > entry point is called the mapping has not yet been added to the
> > address_space->i_mmap tree.  This means that at that point in time we
> > cannot determine whether or not the mapping will be set up to support DAX.
> >
> > Fix this by adding a new file_operations entry called post_mmap() which is
> > called after the mapping has been added to the address_space->i_mmap tree.
> > This post_mmap() op now happens at a time when we can be sure whether the
> > mapping will use DAX or not, and we can set up the vma->vm_flags
> > appropriately.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/xfs/xfs_file.c  | 15 ++++++++++++++-
> >  include/linux/fs.h |  1 +
> >  mm/mmap.c          |  2 ++
> >  3 files changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 2816858..9d66aaa 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1087,9 +1087,21 @@ xfs_file_mmap(
> >  {
> >         file_accessed(filp);
> >         vma->vm_ops = &xfs_file_vm_ops;
> > +       return 0;
> > +}
> > +
> > +/* This call happens during mmap(), after the vma has been inserted into the
> > + * inode->i_mapping->i_mmap tree.  At this point the decision on whether or
> > + * not to use DAX for this mapping has been set and will not change for the
> > + * duration of the mapping.
> > + */
> > +STATIC void
> > +xfs_file_post_mmap(
> > +       struct file     *filp,
> > +       struct vm_area_struct *vma)
> > +{
> >         if (IS_DAX(file_inode(filp)))
> >                 vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
> 
> It's not clear to me what this is actually protecting? vma_is_dax()
> returns true regardless of the vm_flags state , so what is the benefit
> to delaying the vm_flags setting to ->post_mmap()?

Right, but the point is that until the vma has been inserted into the
inode->i_mapping->i_mmap tree, the results of IS_DAX() don't matter because it
can still change.  Until this insertion happens we cannot know whether or not
we should set up the vma->vm_flags to support DAX mappings (i.e. have
VM_MIXEDMAP and VM_HUGEPAGE set).  This decision can only be made (in this
proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
populated, which means we need another call into the filesystem after this
insertion has happened.

We don't want to mess with the existing file_operations->mmap() call because
in many filesystems that does sanity checking and setup that you really want
to have happen *before* the mapping is completed and inserted into the
inode->i_mapping->i_mmap tree.

> Also, why is this a file_operation and not a vm_operation?

Because ->mmap() is also a file_operation, and this is an analogous call from
the mmap code that needs to happen at a different time.  Or are you suggesting
that file_operations->mmap() should be moved to be a vm_operation?  If not,
why would one be in one operations table and one in another?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 7/7] xfs: re-enable XFS per-inode DAX
  2017-09-26  6:36   ` Christoph Hellwig
@ 2017-09-26 19:01     ` Ross Zwisler
  0 siblings, 0 replies; 46+ messages in thread
From: Ross Zwisler @ 2017-09-26 19:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeff Layton, Jan Kara, Andrew Morton, Darrick J. Wong,
	linux-nvdimm, Dave Chinner, linux-kernel, J. Bruce Fields,
	linux-mm, linux-fsdevel, linux-xfs

On Tue, Sep 26, 2017 at 08:36:11AM +0200, Christoph Hellwig wrote:
> On Mon, Sep 25, 2017 at 05:14:04PM -0600, Ross Zwisler wrote:
> > Re-enable the XFS per-inode DAX flag, preventing S_DAX from changing when
> > any mappings are present.
> 
> Before we re-enable it please come up with a coherent description
> of the per-inode DAX flag that makes sense to a user.  We'll also need
> to find a good place to document it, e.g. a new ioctl_setflags man
> page.

I agree that documentation is a great place to start, if we can just agree on
what we want the functionality to be. :)
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-26 18:57     ` Ross Zwisler
@ 2017-09-26 19:19       ` Dan Williams
  2017-09-26 21:06         ` Ross Zwisler
  0 siblings, 1 reply; 46+ messages in thread
From: Dan Williams @ 2017-09-26 19:19 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Andrew Morton, linux-kernel,
	Darrick J. Wong, J. Bruce Fields, Christoph Hellwig,
	Dave Chinner, Jan Kara, Jeff Layton, linux-fsdevel, Linux MM,
	linux-nvdimm, linux-xfs

On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Mon, Sep 25, 2017 at 04:38:45PM -0700, Dan Williams wrote:
>> On Mon, Sep 25, 2017 at 4:14 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > When mappings are created the vma->vm_flags that they use vary based on
>> > whether the inode being mapped is using DAX or not.  This setup happens in
>> > XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
>> >
>> > For us to be able to safely use the DAX per-inode flag we need to prevent
>> > S_DAX transitions when any mappings are present, and we will do that by
>> > looking at the address_space->i_mmap tree and returning -EBUSY if any
>> > mappings are present.
>> >
>> > Unfortunately at the time that the filesystem's file_operations->mmap()
>> > entry point is called the mapping has not yet been added to the
>> > address_space->i_mmap tree.  This means that at that point in time we
>> > cannot determine whether or not the mapping will be set up to support DAX.
>> >
>> > Fix this by adding a new file_operations entry called post_mmap() which is
>> > called after the mapping has been added to the address_space->i_mmap tree.
>> > This post_mmap() op now happens at a time when we can be sure whether the
>> > mapping will use DAX or not, and we can set up the vma->vm_flags
>> > appropriately.
>> >
>> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> > ---
>> >  fs/xfs/xfs_file.c  | 15 ++++++++++++++-
>> >  include/linux/fs.h |  1 +
>> >  mm/mmap.c          |  2 ++
>> >  3 files changed, 17 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> > index 2816858..9d66aaa 100644
>> > --- a/fs/xfs/xfs_file.c
>> > +++ b/fs/xfs/xfs_file.c
>> > @@ -1087,9 +1087,21 @@ xfs_file_mmap(
>> >  {
>> >         file_accessed(filp);
>> >         vma->vm_ops = &xfs_file_vm_ops;
>> > +       return 0;
>> > +}
>> > +
>> > +/* This call happens during mmap(), after the vma has been inserted into the
>> > + * inode->i_mapping->i_mmap tree.  At this point the decision on whether or
>> > + * not to use DAX for this mapping has been set and will not change for the
>> > + * duration of the mapping.
>> > + */
>> > +STATIC void
>> > +xfs_file_post_mmap(
>> > +       struct file     *filp,
>> > +       struct vm_area_struct *vma)
>> > +{
>> >         if (IS_DAX(file_inode(filp)))
>> >                 vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
>>
>> It's not clear to me what this is actually protecting? vma_is_dax()
>> returns true regardless of the vm_flags state , so what is the benefit
>> to delaying the vm_flags setting to ->post_mmap()?
>
> Right, but the point is that until the vma has been inserted into the
> inode->i_mapping->i_mmap tree, the results of IS_DAX() don't matter because it
> can still change.  Until this insertion happens we cannot know whether or not
> we should set up the vma->vm_flags to support DAX mappings (i.e. have
> VM_MIXEDMAP and VM_HUGEPAGE set).

Those flags are not DAX flags. The side effect of these being set on
non-DAX mappings is that we effectively auto madvise(MADV_HUGEPAGE)
and enable some page-less insertion paths. Both of those are
effectively no-ops for normal mappings since normal mappings always
have an associated struct page and the THP policy these days is
usually "always".

> This decision can only be made (in this
> proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
> populated, which means we need another call into the filesystem after this
> insertion has happened.

I get that, but it seems over-engineered and something that can also
be safely cleaned up after the fact by the code path that is disabling
DAX.

> We don't want to mess with the existing file_operations->mmap() call because
> in many filesystems that does sanity checking and setup that you really want
> to have happen *before* the mapping is completed and inserted into the
> inode->i_mapping->i_mmap tree.
>
>> Also, why is this a file_operation and not a vm_operation?
>
> Because ->mmap() is also a file_operation, and this is an analogous call from
> the mmap code that needs to happen at a different time.  Or are you suggesting
> that file_operations->mmap() should be moved to be a vm_operation?  If not,
> why would one be in one operations table and one in another?

Growing something as widely used as file_operations for this one-off
fixup feels like overkill. vm_operations is not much better, but it at
least constrains the data structure growth to something closer to the
problem space.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-26 17:30           ` Ross Zwisler
@ 2017-09-26 19:48             ` Darrick J. Wong
  2017-09-26 22:00               ` Dave Chinner
  2017-09-27  6:40             ` Christoph Hellwig
  1 sibling, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2017-09-26 19:48 UTC (permalink / raw)
  To: Ross Zwisler, Christoph Hellwig, Dave Chinner, Jan Kara,
	Andrew Morton, linux-kernel, J. Bruce Fields, Dan Williams,
	Jeff Layton, linux-fsdevel, linux-mm, linux-nvdimm, linux-xfs

On Tue, Sep 26, 2017 at 11:30:57AM -0600, Ross Zwisler wrote:
> On Tue, Sep 26, 2017 at 04:37:43PM +0200, Christoph Hellwig wrote:
> > On Tue, Sep 26, 2017 at 09:09:57PM +1000, Dave Chinner wrote:
> > > Well, quite frankly, I never wanted the mount option for XFS. It was
> > > supposed to be for initial testing only, then we'd /always/ use the
> > > the inode flags. For a filesystem to default to using DAX, we
> > > set the DAX flag on the root inode at mkfs time, and then everything
> > > inode flag based just works.
> > 
> > And I deeply fundamentally disagree.  The mount option is a nice
> > enough big hammer to try a mode without encoding nitty gritty details
> > into the application ABI.
> > 
> > The per-inode persistent flag is the biggest nightmare ever, as we see
> > in all these discussions about it.
> > 
> > What does it even mean?  Right now it forces direct addressing as long
> > as the underlying media supports that.  But what about media that
> > you directly access but you really don't want to because it's really slow?
> > Or media that is so god damn fast that you never want to buffer?  Or
> > media where you want to buffer for writes (or at least some of them)
> > but not for reads?
> > 
> > It encodes a very specific mechanism for an early direct access
> > implementation into the ABI.  What we really need is for applications
> > to declare an intent, not specify a particular mechanism.
> 
> I agree that Christoph's idea about having the system intelligently adjust to
> use DAX based on performance information it gathers about the underlying
> persistent memory (probably via the HMAT on x86_64 systems) is interesting,
> but I think we're still a ways away from that.
> 
> FWIW, as my patches suggest and Jan observed I think that we should allow
> users to turn on DAX by treating the inode flag and the mount flag as an 'or'
> operation.  i.e. you get DAX if either the mount option is specified or if the
> inode flag is set, and you can continue to manipulate the per-inode flag as
> you want regardless of the mount option.  I think this provides maximum
> flexibility of the mechanism to select DAX without enforcing policy.
> 
> In the end, though, I think what's really important is that we figure out what
> the various options mean, have the same story for both XFS and ext4, and
> document it as hch suggested in response to my patch 7 in this series.

Agreed.  We have a fundamental conflict between letting the sysadmin or
user decide how they want an inode to behave vs. letting the kernel make
all the decisions based on whatever information it gathers.

I'm pulled this patch out of -fixes and for-next because I feel strongly
discouraged about taking any more patches that change the user-control
parts of the DAX implementation until we reach a consensus on what to
do.

Given that DAX and pmem support in filesystems is still experimental,
I'm open to changing the interface as needed.  Where do we think we'll
be in a few years once ACPI or whatever reaches the point of being able
to tell the kernel about the general performance characteristics of the
pmem?  What choices about the interface do we need to make now so that
we can get there while minimizing the number of insufficient interfaces
to deprecate?

My personal guess is that most programs will not care enough to want to
make a syscall so we might as well give them the most performant option
available.

Roughly speaking, here are the use cases I can think of:

 * Regular buffered read/write: we can let the kernel decide if it wants
   to push the IO through the page cache, directly access the pmem, or
   some future combination of the two.

 * O_DIRECT read/write: Straight to pmem.

 * Regular mmap: This seems fairly agnostic to how we actually make the
   memory mapping work, right?

 * MAP_DIRECT/MAP_SYNC mmap: If userspace actually goes to the trouble
   of making sure the whole range is allocated and pre-written and uses
   these flags then they get direct access.

I've wondered off and on if an acceptable solution is to define a number
of things surrounding an inode for which XFS /could/ optimize, and let
the user tell us which one thing matters most to them: total manual
control over everything like we do now, sequential io, random io,
fastest mmap access possible, most direct access to storage, etc.
If you set a hint other than full manual control then XFS reserves the
right to change inode flags at any time to satisfy the hint.

For the most part I'm in favor of Christoph's suggestion to let the
kernel decide on its own, and I don't see the point in encoding details
of the storage medium access strategy on the disk, particularly since
filesystems are supposed to be fairly independent of storage.  But
frankly, so many people have asked me over the years if there's some way
to influence the decision-making that I won't quite let go of file hints
as a way to influence the decisions XFS makes around storage media.

> Does it make sense at this point to just start a "dax" man page that can
> contain info about the mount options, inode flags, kernel config options, how
> to get PMDs, etc?  Or does this documentation need to be sprinkled around more
> in existing man pages?

Personally it'd be a lot easier to tell internal groups to go look at a
single documentation page that discusses everything you'd want to know
about enabling dax -- how to control it, how to make large page table
entries work, etc.  Some of those things will get into fs internals,
however, which probably belong in the ext4/xfs manpages.  I suggest
laying out the general details in a single dax manpage and pointing
people at each fs's documentation for specific details.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-26 19:19       ` Dan Williams
@ 2017-09-26 21:06         ` Ross Zwisler
  2017-09-26 21:41           ` Dan Williams
  0 siblings, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2017-09-26 21:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Layton, Jan Kara, Andrew Morton, Darrick J. Wong,
	linux-nvdimm, Dave Chinner, linux-kernel, J. Bruce Fields,
	Linux MM, linux-fsdevel, linux-xfs, Christoph Hellwig

On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
<>
> > This decision can only be made (in this
> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
> > populated, which means we need another call into the filesystem after this
> > insertion has happened.
> 
> I get that, but it seems over-engineered and something that can also
> be safely cleaned up after the fact by the code path that is disabling
> DAX.

I don't think you can safely clean it up after the fact because some thread
might have already called ->mmap() to set up the vma->vm_flags for their new
mapping, but they haven't added it to inode->i_mapping->i_mmap.

The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
that the filesystem has any idea about about the mapping.  This is the method
by which we would try and clean up mapping flags, if we were to do so, and
it's the only way that the filesystem can know whether or not mappings exist.

The only way that I could think of to make this safely work is to have the
insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
that the filesystem and the mapping code can communicate on the state of DAX,
but before that I think it's basically indeterminate.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-26 21:06         ` Ross Zwisler
@ 2017-09-26 21:41           ` Dan Williams
  2017-09-27 11:35             ` Jan Kara
  0 siblings, 1 reply; 46+ messages in thread
From: Dan Williams @ 2017-09-26 21:41 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Andrew Morton, linux-kernel,
	Darrick J. Wong, J. Bruce Fields, Christoph Hellwig,
	Dave Chinner, Jan Kara, Jeff Layton, linux-fsdevel, Linux MM,
	linux-nvdimm, linux-xfs

On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
>> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
> <>
>> > This decision can only be made (in this
>> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
>> > populated, which means we need another call into the filesystem after this
>> > insertion has happened.
>>
>> I get that, but it seems over-engineered and something that can also
>> be safely cleaned up after the fact by the code path that is disabling
>> DAX.
>
> I don't think you can safely clean it up after the fact because some thread
> might have already called ->mmap() to set up the vma->vm_flags for their new
> mapping, but they haven't added it to inode->i_mapping->i_mmap.

If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
memory mappings.

> The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
> that the filesystem has any idea about about the mapping.  This is the method
> by which we would try and clean up mapping flags, if we were to do so, and
> it's the only way that the filesystem can know whether or not mappings exist.
>
> The only way that I could think of to make this safely work is to have the
> insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
> that the filesystem and the mapping code can communicate on the state of DAX,
> but before that I think it's basically indeterminate.

If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
the ->mmap() handler and let the default THP policy take over. In
fact, see transparent_hugepage_enabled() we already auto-enable huge
page support for dax mappings regardless of VM_HUGEPAGE.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-26 19:48             ` Darrick J. Wong
@ 2017-09-26 22:00               ` Dave Chinner
  0 siblings, 0 replies; 46+ messages in thread
From: Dave Chinner @ 2017-09-26 22:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jeff Layton, Jan Kara, Andrew Morton, linux-nvdimm, linux-kernel,
	J. Bruce Fields, linux-mm, linux-fsdevel, linux-xfs,
	Christoph Hellwig

On Tue, Sep 26, 2017 at 12:48:30PM -0700, Darrick J. Wong wrote:
> For the most part I'm in favor of Christoph's suggestion to let the
> kernel decide on its own, and I don't see the point in encoding details
> of the storage medium access strategy on the disk, particularly since
> filesystems are supposed to be fairly independent of storage.  But
> frankly, so many people have asked me over the years if there's some way
> to influence the decision-making that I won't quite let go of file hints
> as a way to influence the decisions XFS makes around storage media.

And that's pretty much it. The discussion here is not about whether
there should be a flag, but what semantics it should have when the
flag is not set. If "flag not set" means "kernel selects
automatically", then that's fine by me.

But history tells us that users and admins want a way to be able to
override the kernel's automatic behaviours because they are /never
100% correct/ for everyone. There are always exceptions, otherwise
we wouldn't have the great plethora of mkfs, mount, proc and sysfs
options for our filesystems or storage. Anyone who says "the kernel
will always do the right thing for everyone automatically" is living
in a dream world.

Note: I agree that the kernel should do the right thing w.r.t. DAX
automatically. We don't need a mount option for that - we can probe
for dax support automatically and use it automatically already.
However, in a world where the kernel automatically uses that
functionality when it is present, admins and users need a way to
solve the "default behaviour is bad for me, let me control this
manually" problem. That's where the inode flags come in....

i.e. What I'm advocating is a model DAX gets enabled automatically
if the underlying device supports is using whatever the kernel
thinks is optimal at the time the access is made, but the user can
override/direct behvaiour on a case by case basis via persistent
inode flags/xattrs/whatever.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-26 17:30           ` Ross Zwisler
  2017-09-26 19:48             ` Darrick J. Wong
@ 2017-09-27  6:40             ` Christoph Hellwig
  2017-09-27 16:15               ` Ross Zwisler
  1 sibling, 1 reply; 46+ messages in thread
From: Christoph Hellwig @ 2017-09-27  6:40 UTC (permalink / raw)
  To: Ross Zwisler, Christoph Hellwig, Dave Chinner, Jan Kara,
	Andrew Morton, linux-kernel, Darrick J. Wong, J. Bruce Fields,
	Dan Williams, Jeff Layton, linux-fsdevel, linux-mm, linux-nvdimm,
	linux-xfs

On Tue, Sep 26, 2017 at 11:30:57AM -0600, Ross Zwisler wrote:
> I agree that Christoph's idea about having the system intelligently adjust to
> use DAX based on performance information it gathers about the underlying
> persistent memory (probably via the HMAT on x86_64 systems) is interesting,
> but I think we're still a ways away from that.

So what are the missing blockers for a getting started?

> FWIW, as my patches suggest and Jan observed I think that we should allow
> users to turn on DAX by treating the inode flag and the mount flag as an 'or'
> operation.  i.e. you get DAX if either the mount option is specified or if the
> inode flag is set, and you can continue to manipulate the per-inode flag as
> you want regardless of the mount option.  I think this provides maximum
> flexibility of the mechanism to select DAX without enforcing policy.

IFF we stick to the dax flag that's the only workable way.  The only
major issue I still see with that is that this allows unprivilegued
users to enable DAX on a any file they own / have write access to.
So there isn't really any way to effectively disable the DAX path
by the sysadmin.

> Does it make sense at this point to just start a "dax" man page that can
> contain info about the mount options, inode flags, kernel config options, how
> to get PMDs, etc?  Or does this documentation need to be sprinkled around more
> in existing man pages?

A dax manpage would be good.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-26 21:41           ` Dan Williams
@ 2017-09-27 11:35             ` Jan Kara
  2017-09-27 14:00               ` Dan Williams
  2017-09-27 15:39               ` Ross Zwisler
  0 siblings, 2 replies; 46+ messages in thread
From: Jan Kara @ 2017-09-27 11:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Layton, Jan Kara, Andrew Morton, Darrick J. Wong,
	linux-nvdimm, Dave Chinner, linux-kernel, J. Bruce Fields,
	Linux MM, linux-fsdevel, linux-xfs, Christoph Hellwig

On Tue 26-09-17 14:41:53, Dan Williams wrote:
> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
> > <>
> >> > This decision can only be made (in this
> >> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
> >> > populated, which means we need another call into the filesystem after this
> >> > insertion has happened.
> >>
> >> I get that, but it seems over-engineered and something that can also
> >> be safely cleaned up after the fact by the code path that is disabling
> >> DAX.
> >
> > I don't think you can safely clean it up after the fact because some thread
> > might have already called ->mmap() to set up the vma->vm_flags for their new
> > mapping, but they haven't added it to inode->i_mapping->i_mmap.
> 
> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
> memory mappings.
> 
> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
> > that the filesystem has any idea about about the mapping.  This is the method
> > by which we would try and clean up mapping flags, if we were to do so, and
> > it's the only way that the filesystem can know whether or not mappings exist.
> >
> > The only way that I could think of to make this safely work is to have the
> > insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
> > that the filesystem and the mapping code can communicate on the state of DAX,
> > but before that I think it's basically indeterminate.
> 
> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
> the ->mmap() handler and let the default THP policy take over. In
> fact, see transparent_hugepage_enabled() we already auto-enable huge
> page support for dax mappings regardless of VM_HUGEPAGE.

Hum, this is an interesting option. So do you suggest that filesystems
supporting DAX would always setup mappings with VM_MIXEDMAP and without
VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
That could actually work. The only possible issue I can see is that
VM_MIXEDMAP is still slightly different from normal page mappings and it
could have some performance implications - e.g. copy_page_range() does more
work on VM_MIXEDMAP mappings but not on normal page mappings.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-27 11:35             ` Jan Kara
@ 2017-09-27 14:00               ` Dan Williams
  2017-09-27 15:07                 ` Jan Kara
  2017-09-27 15:39               ` Ross Zwisler
  1 sibling, 1 reply; 46+ messages in thread
From: Dan Williams @ 2017-09-27 14:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Layton, Andrew Morton, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, linux-kernel, J. Bruce Fields, Linux MM,
	linux-fsdevel, linux-xfs, Christoph Hellwig

On Wed, Sep 27, 2017 at 4:35 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 26-09-17 14:41:53, Dan Williams wrote:
>> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
>> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
>> > <>
>> >> > This decision can only be made (in this
>> >> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
>> >> > populated, which means we need another call into the filesystem after this
>> >> > insertion has happened.
>> >>
>> >> I get that, but it seems over-engineered and something that can also
>> >> be safely cleaned up after the fact by the code path that is disabling
>> >> DAX.
>> >
>> > I don't think you can safely clean it up after the fact because some thread
>> > might have already called ->mmap() to set up the vma->vm_flags for their new
>> > mapping, but they haven't added it to inode->i_mapping->i_mmap.
>>
>> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
>> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
>> memory mappings.
>>
>> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
>> > that the filesystem has any idea about about the mapping.  This is the method
>> > by which we would try and clean up mapping flags, if we were to do so, and
>> > it's the only way that the filesystem can know whether or not mappings exist.
>> >
>> > The only way that I could think of to make this safely work is to have the
>> > insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
>> > that the filesystem and the mapping code can communicate on the state of DAX,
>> > but before that I think it's basically indeterminate.
>>
>> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
>> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
>> the ->mmap() handler and let the default THP policy take over. In
>> fact, see transparent_hugepage_enabled() we already auto-enable huge
>> page support for dax mappings regardless of VM_HUGEPAGE.
>
> Hum, this is an interesting option. So do you suggest that filesystems
> supporting DAX would always setup mappings with VM_MIXEDMAP and without
> VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
> That could actually work. The only possible issue I can see is that
> VM_MIXEDMAP is still slightly different from normal page mappings and it
> could have some performance implications - e.g. copy_page_range() does more
> work on VM_MIXEDMAP mappings but not on normal page mappings.

We can also get rid of VM_MIXEDMAP if we disable DAX in the
!pfn_t_has_page() case.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-27 14:00               ` Dan Williams
@ 2017-09-27 15:07                 ` Jan Kara
  2017-09-27 15:36                   ` Dan Williams
  0 siblings, 1 reply; 46+ messages in thread
From: Jan Kara @ 2017-09-27 15:07 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Layton, Jan Kara, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, linux-kernel, Christoph Hellwig, J. Bruce Fields,
	Linux MM, linux-fsdevel, linux-xfs, Andrew Morton

On Wed 27-09-17 07:00:53, Dan Williams wrote:
> On Wed, Sep 27, 2017 at 4:35 AM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 26-09-17 14:41:53, Dan Williams wrote:
> >> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
> >> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
> >> > <>
> >> >> > This decision can only be made (in this
> >> >> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
> >> >> > populated, which means we need another call into the filesystem after this
> >> >> > insertion has happened.
> >> >>
> >> >> I get that, but it seems over-engineered and something that can also
> >> >> be safely cleaned up after the fact by the code path that is disabling
> >> >> DAX.
> >> >
> >> > I don't think you can safely clean it up after the fact because some thread
> >> > might have already called ->mmap() to set up the vma->vm_flags for their new
> >> > mapping, but they haven't added it to inode->i_mapping->i_mmap.
> >>
> >> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
> >> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
> >> memory mappings.
> >>
> >> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
> >> > that the filesystem has any idea about about the mapping.  This is the method
> >> > by which we would try and clean up mapping flags, if we were to do so, and
> >> > it's the only way that the filesystem can know whether or not mappings exist.
> >> >
> >> > The only way that I could think of to make this safely work is to have the
> >> > insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
> >> > that the filesystem and the mapping code can communicate on the state of DAX,
> >> > but before that I think it's basically indeterminate.
> >>
> >> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
> >> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
> >> the ->mmap() handler and let the default THP policy take over. In
> >> fact, see transparent_hugepage_enabled() we already auto-enable huge
> >> page support for dax mappings regardless of VM_HUGEPAGE.
> >
> > Hum, this is an interesting option. So do you suggest that filesystems
> > supporting DAX would always setup mappings with VM_MIXEDMAP and without
> > VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
> > That could actually work. The only possible issue I can see is that
> > VM_MIXEDMAP is still slightly different from normal page mappings and it
> > could have some performance implications - e.g. copy_page_range() does more
> > work on VM_MIXEDMAP mappings but not on normal page mappings.
> 
> We can also get rid of VM_MIXEDMAP if we disable DAX in the
> !pfn_t_has_page() case.

Yeah, although it would be a pity to require struct page just to avoid
having to set VM_MIXEDMAP flag...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-27 15:07                 ` Jan Kara
@ 2017-09-27 15:36                   ` Dan Williams
  0 siblings, 0 replies; 46+ messages in thread
From: Dan Williams @ 2017-09-27 15:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Andrew Morton, linux-kernel, Darrick J. Wong,
	J. Bruce Fields, Christoph Hellwig, Dave Chinner, Jeff Layton,
	linux-fsdevel, Linux MM, linux-nvdimm, linux-xfs

On Wed, Sep 27, 2017 at 8:07 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 27-09-17 07:00:53, Dan Williams wrote:
>> On Wed, Sep 27, 2017 at 4:35 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Tue 26-09-17 14:41:53, Dan Williams wrote:
>> >> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
>> >> <ross.zwisler@linux.intel.com> wrote:
>> >> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
>> >> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
>> >> > <>
>> >> >> > This decision can only be made (in this
>> >> >> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
>> >> >> > populated, which means we need another call into the filesystem after this
>> >> >> > insertion has happened.
>> >> >>
>> >> >> I get that, but it seems over-engineered and something that can also
>> >> >> be safely cleaned up after the fact by the code path that is disabling
>> >> >> DAX.
>> >> >
>> >> > I don't think you can safely clean it up after the fact because some thread
>> >> > might have already called ->mmap() to set up the vma->vm_flags for their new
>> >> > mapping, but they haven't added it to inode->i_mapping->i_mmap.
>> >>
>> >> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
>> >> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
>> >> memory mappings.
>> >>
>> >> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
>> >> > that the filesystem has any idea about about the mapping.  This is the method
>> >> > by which we would try and clean up mapping flags, if we were to do so, and
>> >> > it's the only way that the filesystem can know whether or not mappings exist.
>> >> >
>> >> > The only way that I could think of to make this safely work is to have the
>> >> > insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
>> >> > that the filesystem and the mapping code can communicate on the state of DAX,
>> >> > but before that I think it's basically indeterminate.
>> >>
>> >> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
>> >> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
>> >> the ->mmap() handler and let the default THP policy take over. In
>> >> fact, see transparent_hugepage_enabled() we already auto-enable huge
>> >> page support for dax mappings regardless of VM_HUGEPAGE.
>> >
>> > Hum, this is an interesting option. So do you suggest that filesystems
>> > supporting DAX would always setup mappings with VM_MIXEDMAP and without
>> > VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
>> > That could actually work. The only possible issue I can see is that
>> > VM_MIXEDMAP is still slightly different from normal page mappings and it
>> > could have some performance implications - e.g. copy_page_range() does more
>> > work on VM_MIXEDMAP mappings but not on normal page mappings.
>>
>> We can also get rid of VM_MIXEDMAP if we disable DAX in the
>> !pfn_t_has_page() case.
>
> Yeah, although it would be a pity to require struct page just to avoid
> having to set VM_MIXEDMAP flag...

Yes, but the real motivation is fixing all the basic things that break
without struct page, like ptrace and direct-i/o. The removal of
needing to set VM_MIXEDMAP is just a nice side effect. I'll send a
patch because DAX without pages has too many surprise failure cases.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-27 11:35             ` Jan Kara
  2017-09-27 14:00               ` Dan Williams
@ 2017-09-27 15:39               ` Ross Zwisler
  2017-09-27 15:54                 ` Dan Williams
  1 sibling, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2017-09-27 15:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Layton, Andrew Morton, Darrick J. Wong, linux-nvdimm,
	Dave Chinner, linux-kernel, Christoph Hellwig, J. Bruce Fields,
	Linux MM, linux-fsdevel, linux-xfs

On Wed, Sep 27, 2017 at 01:35:27PM +0200, Jan Kara wrote:
> On Tue 26-09-17 14:41:53, Dan Williams wrote:
> > On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
> > <ross.zwisler@linux.intel.com> wrote:
> > > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
> > >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
> > > <>
> > >> > This decision can only be made (in this
> > >> > proposed scheme) *after* the inode->i_mapping->i_mmap  tree has been
> > >> > populated, which means we need another call into the filesystem after this
> > >> > insertion has happened.
> > >>
> > >> I get that, but it seems over-engineered and something that can also
> > >> be safely cleaned up after the fact by the code path that is disabling
> > >> DAX.
> > >
> > > I don't think you can safely clean it up after the fact because some thread
> > > might have already called ->mmap() to set up the vma->vm_flags for their new
> > > mapping, but they haven't added it to inode->i_mapping->i_mmap.
> > 
> > If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
> > DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
> > memory mappings.
> > 
> > > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
> > > that the filesystem has any idea about about the mapping.  This is the method
> > > by which we would try and clean up mapping flags, if we were to do so, and
> > > it's the only way that the filesystem can know whether or not mappings exist.
> > >
> > > The only way that I could think of to make this safely work is to have the
> > > insertion into the inode->i_mapping->i_mmap tree be our sync point.  After
> > > that the filesystem and the mapping code can communicate on the state of DAX,
> > > but before that I think it's basically indeterminate.
> > 
> > If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
> > breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
> > the ->mmap() handler and let the default THP policy take over. In
> > fact, see transparent_hugepage_enabled() we already auto-enable huge
> > page support for dax mappings regardless of VM_HUGEPAGE.
> 
> Hum, this is an interesting option. So do you suggest that filesystems
> supporting DAX would always setup mappings with VM_MIXEDMAP and without
> VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
> That could actually work. The only possible issue I can see is that
> VM_MIXEDMAP is still slightly different from normal page mappings and it
> could have some performance implications - e.g. copy_page_range() does more
> work on VM_MIXEDMAP mappings but not on normal page mappings.

It looks like having VM_MIXEDMAP always set for filesystems that support DAX
might affect their memory's NUMA migration in the non-DAX case? 

8e76d4e sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 6/7] mm, fs: introduce file_operations->post_mmap()
  2017-09-27 15:39               ` Ross Zwisler
@ 2017-09-27 15:54                 ` Dan Williams
  0 siblings, 0 replies; 46+ messages in thread
From: Dan Williams @ 2017-09-27 15:54 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara, Dan Williams, Andrew Morton,
	linux-kernel, Darrick J. Wong, J. Bruce Fields,
	Christoph Hellwig, Dave Chinner, Jeff Layton, linux-fsdevel,
	Linux MM, linux-nvdimm, linux-xfs

On Wed, Sep 27, 2017 at 8:39 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Wed, Sep 27, 2017 at 01:35:27PM +0200, Jan Kara wrote:
[..]
>> Hum, this is an interesting option. So do you suggest that filesystems
>> supporting DAX would always setup mappings with VM_MIXEDMAP and without
>> VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
>> That could actually work. The only possible issue I can see is that
>> VM_MIXEDMAP is still slightly different from normal page mappings and it
>> could have some performance implications - e.g. copy_page_range() does more
>> work on VM_MIXEDMAP mappings but not on normal page mappings.
>
> It looks like having VM_MIXEDMAP always set for filesystems that support DAX
> might affect their memory's NUMA migration in the non-DAX case?
>
> 8e76d4e sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings

Addressed separately here:

c1ef8e2c0235 mm: disable numa migration faults for dax vmas
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-27  6:40             ` Christoph Hellwig
@ 2017-09-27 16:15               ` Ross Zwisler
  2017-10-01  8:17                 ` Christoph Hellwig
  0 siblings, 1 reply; 46+ messages in thread
From: Ross Zwisler @ 2017-09-27 16:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ross Zwisler, Christoph Hellwig, Dave Chinner, Jan Kara,
	Andrew Morton, linux-kernel, Darrick J. Wong, J. Bruce Fields,
	Dan Williams, Jeff Layton, linux-fsdevel, linux-mm, linux-nvdimm,
	linux-xfs

On Tue, Sep 26, 2017 at 11:40:01PM -0700, Christoph Hellwig wrote:
> On Tue, Sep 26, 2017 at 11:30:57AM -0600, Ross Zwisler wrote:
> > I agree that Christoph's idea about having the system intelligently adjust to
> > use DAX based on performance information it gathers about the underlying
> > persistent memory (probably via the HMAT on x86_64 systems) is interesting,
> > but I think we're still a ways away from that.
> 
> So what are the missing blockers for a getting started?

Well, I don't know if platforms that support HMAT + PMEM are widely available,
but we have all the details in the ACPI spec, so we could begin to code it up
and things will "just work" when platforms arrive.

> > FWIW, as my patches suggest and Jan observed I think that we should allow
> > users to turn on DAX by treating the inode flag and the mount flag as an 'or'
> > operation.  i.e. you get DAX if either the mount option is specified or if the
> > inode flag is set, and you can continue to manipulate the per-inode flag as
> > you want regardless of the mount option.  I think this provides maximum
> > flexibility of the mechanism to select DAX without enforcing policy.
> 
> IFF we stick to the dax flag that's the only workable way.  The only
> major issue I still see with that is that this allows unprivilegued
> users to enable DAX on a any file they own / have write access to.
> So there isn't really any way to effectively disable the DAX path
> by the sysadmin.

Hum, I wonder if maybe we need/want three different mount modes?  What about:

autodax (the default): the filesystem is free to use DAX or not, as it sees
fit and thinks is optimal.  For the time being we can make this mean "don't
use DAX", and phase in DAX usage as we add support for the HMAT, etc.

Users can manually turn on DAX for a given inode by setting the DAX inode
flag, but there is no way for the user to *prevent* DAX for an inode - the
kernel can always choose to turn it on.

MAP_DIRECT and MAP_SYNC work.

nodax: Don't use DAX.  The kernel won't choose to use DAX, and any DAX inode
flags will be ignored.  This gives the sysadmin the override that I think
you're looking for.  The user can still manipulate the inode flags as they see
fit.

MAP_DIRECT and MAP_SYNC both fail.

dax: Use DAX for all inodes in the filesystem.  Again the inode flags are
essentially ignored, but the user can manipulate the inode flags as they see
fit.  This is basically unchanged from how it works today, modulo the bug
where DAX can get turned off if you unset the inode flag where it wasn't even
set (patch 1 in my series).

MAP_DIRECT and MAP_SYNC work.

> > Does it make sense at this point to just start a "dax" man page that can
> > contain info about the mount options, inode flags, kernel config options, how
> > to get PMDs, etc?  Or does this documentation need to be sprinkled around more
> > in existing man pages?
> 
> A dax manpage would be good.

Okay, I'll start with a manpage, and once we agree on whats in there we can
start working on code again. :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 1/7] xfs: always use DAX if mount option is used
  2017-09-27 16:15               ` Ross Zwisler
@ 2017-10-01  8:17                 ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2017-10-01  8:17 UTC (permalink / raw)
  To: Ross Zwisler, Christoph Hellwig, Christoph Hellwig, Dave Chinner,
	Jan Kara, Andrew Morton, linux-kernel, Darrick J. Wong,
	J. Bruce Fields, Dan Williams, Jeff Layton, linux-fsdevel,
	linux-mm, linux-nvdimm, linux-xfs

On Wed, Sep 27, 2017 at 10:15:10AM -0600, Ross Zwisler wrote:
> Well, I don't know if platforms that support HMAT + PMEM are widely available,
> but we have all the details in the ACPI spec, so we could begin to code it up
> and things will "just work" when platforms arrive.

Then again currently all actually shipping NVDIMMs are battery backed
dram and DAX mode should work just fine for them.  Things will get
interesting once companies start shipping actually persistent technologies
that will be significantly slower than DRAM.  And we sould make sure
we have the infrastruture for that in place.

> Hum, I wonder if maybe we need/want three different mount modes?  What about:
> 
> autodax (the default): the filesystem is free to use DAX or not, as it sees
> fit and thinks is optimal.  For the time being we can make this mean "don't
> use DAX", and phase in DAX usage as we add support for the HMAT, etc.

What does "use DAX" really mean anyway?

I think we are conflating a few things:

 a) use a block device or use a dax_device for accessing the device
 b) use the pagecache for caching data in DRAM or not.

Now we actually have a really nice way to control a) already, it's
called O_DIRECT.  Currently O_DIRECT only works with read/write I/O,
but with a byte addressable scheme we now can implement it for mmap
as well, which is what the DAX mmap path does.

b) right now is implied by a), but it's really an implementation
detail.

So the modes would be more like two options to:

 a) disallow any byte-level access.  The right way to do that would
    be to mount the /dev/dax* device instead of the block device
    to allow byte access, and disallow any DAXish operation if you
    mount the block device in the long run.
 b) have a mode to always force an O_DIRECT-like mode for devices
    that are fast enough.  We should always do that with the right
    HMAT entries if mounting the /dev/dax devices, and maybe have
    a mount option to force it.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 3/7] xfs: protect S_DAX transitions in XFS read path
  2017-09-26 18:11         ` Dan Williams
@ 2017-10-01  8:17           ` Christoph Hellwig
  0 siblings, 0 replies; 46+ messages in thread
From: Christoph Hellwig @ 2017-10-01  8:17 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jeff Layton, Jan Kara, Andrew Morton, Darrick J. Wong,
	linux-nvdimm, Dave Chinner, linux-kernel, J. Bruce Fields,
	Linux MM, linux-fsdevel, linux-xfs, Christoph Hellwig

On Tue, Sep 26, 2017 at 11:11:55AM -0700, Dan Williams wrote:
> I think we'll always need an explicit override available, but yes we
> need to think about what the override looks like in the context of a
> kernel that is able to automatically pick the right I/O policy
> relative to the media type. A potential mixed policy for reads vs
> writes makes sense. Where would this finer grained I/O policy
> selection go other than more inode flags?

fadvise?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2017-10-01  8:14 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-25 23:13 [PATCH 0/7] re-enable XFS per-inode DAX Ross Zwisler
2017-09-25 23:13 ` [PATCH 1/7] xfs: always use DAX if mount option is used Ross Zwisler
2017-09-25 23:38   ` Dave Chinner
2017-09-26  9:35     ` Jan Kara
2017-09-26 11:09       ` Dave Chinner
2017-09-26 14:37         ` Christoph Hellwig
2017-09-26 17:30           ` Ross Zwisler
2017-09-26 19:48             ` Darrick J. Wong
2017-09-26 22:00               ` Dave Chinner
2017-09-27  6:40             ` Christoph Hellwig
2017-09-27 16:15               ` Ross Zwisler
2017-10-01  8:17                 ` Christoph Hellwig
2017-09-26 18:02         ` Eric Sandeen
2017-09-26 18:50     ` Ross Zwisler
2017-09-25 23:13 ` [PATCH 2/7] xfs: validate bdev support for DAX inode flag Ross Zwisler
2017-09-26  6:36   ` Christoph Hellwig
2017-09-26 17:16     ` Ross Zwisler
2017-09-26 17:57       ` Darrick J. Wong
2017-09-25 23:14 ` [PATCH 3/7] xfs: protect S_DAX transitions in XFS read path Ross Zwisler
2017-09-25 23:27   ` Dave Chinner
2017-09-26  6:32   ` Christoph Hellwig
2017-09-26 13:59     ` Dan Williams
2017-09-26 14:33       ` Christoph Hellwig
2017-09-26 18:11         ` Dan Williams
2017-10-01  8:17           ` Christoph Hellwig
2017-09-25 23:14 ` [PATCH 4/7] xfs: protect S_DAX transitions in XFS write path Ross Zwisler
2017-09-25 23:29   ` Dave Chinner
2017-09-25 23:14 ` [PATCH 5/7] xfs: introduce xfs_is_dax_state_changing Ross Zwisler
2017-09-26  6:33   ` Christoph Hellwig
2017-09-25 23:14 ` [PATCH 6/7] mm, fs: introduce file_operations->post_mmap() Ross Zwisler
2017-09-25 23:38   ` Dan Williams
2017-09-26 18:57     ` Ross Zwisler
2017-09-26 19:19       ` Dan Williams
2017-09-26 21:06         ` Ross Zwisler
2017-09-26 21:41           ` Dan Williams
2017-09-27 11:35             ` Jan Kara
2017-09-27 14:00               ` Dan Williams
2017-09-27 15:07                 ` Jan Kara
2017-09-27 15:36                   ` Dan Williams
2017-09-27 15:39               ` Ross Zwisler
2017-09-27 15:54                 ` Dan Williams
2017-09-26  6:34   ` Christoph Hellwig
2017-09-25 23:14 ` [PATCH 7/7] xfs: re-enable XFS per-inode DAX Ross Zwisler
2017-09-26  0:31   ` Dave Chinner
2017-09-26  6:36   ` Christoph Hellwig
2017-09-26 19:01     ` Ross Zwisler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).