* [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
@ 2017-08-04  2:28 ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04  2:28 UTC (permalink / raw)
  To: darrick.wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, luto, linux-fsdevel, Christoph Hellwig

Changes since v1 [1]:
* Add IS_IOMAP_IMMUTABLE() checks to xfs ioctl paths that perform block
  map changes (xfs_alloc_file_space and xfs_free_file_space) (Darrick)

* Rather than complete a partial write, fail all writes that would
  attempt to extend the file size (Darrick)

* Introduce FALLOC_FL_UNSEAL_BLOCK_MAP as an explicit operation type for
  clearing S_IOMAP_IMMUTABLE (Dave)

* Rework xfs_seal_file_space() to first complete hole-fill and unshare
  operations and then check the file for suitability under
  XFS_ILOCK_EXCL. (Darrick)

* Add an FS_XFLAG_IOMAP_IMMUTABLE flag so the immutable state can be
  seen by xfs_io. (Dave)

* Move the setting of S_IOMAP_IMMUTABLE to be atomic with respect to the
  successful transaction that records XFS_DIFLAG2_IOMAP_IMMUTABLE.
  (Darrick, Dave)

* Switch to a 'goto out_unlock' style in xfs_seal_file_space() to
  clean up the 'if / else' tree, and use the mapping_mapped() helper. (Dave)

* Rely on XFS_MMAPLOCK_EXCL for reading a stable state of
  mapping->i_mmap. (Dave)

[1]: http://marc.info/?l=linux-fsdevel&m=150135785712967&w=2

---

The daxfile proposal a few weeks back [2] sought to piggyback on the
swapfile implementation to approximate a block map immutable file. This
is an idea Dave originated last year to solve the dax "flush from
userspace" problem [3].

The discussion yielded several results. First, Christoph pointed out
that swapfiles are subtly broken [4].  Second, Darrick [5] and Dave [6]
proposed how to properly implement a block map immutable file.  Finally,
Dave identified some improvements to swapfiles that can be built on the
block-map-immutable mechanism. These patches seek to implement the first
part of the proposal and save the swapfile work to build on top once the
base mechanism is complete.

While the initial motivation for this feature is support for
byte-addressable updates of persistent memory and managing cache
maintenance from userspace, the applications of the feature are broader.
In addition to being the start of a better swapfile mechanism it can
also support a DMA-to-storage use case.  This use case enables
data-acquisition hardware to DMA directly to a storage device address
while being safe in the knowledge that storage mappings will not change.

[2]: https://lkml.org/lkml/2017/6/16/790
[3]: https://lkml.org/lkml/2016/9/11/159
[4]: https://lkml.org/lkml/2017/6/18/31
[5]: https://lkml.org/lkml/2017/6/20/49
[6]: https://www.spinics.net/lists/linux-xfs/msg07871.html

---

Dan Williams (5):
      fs, xfs: introduce S_IOMAP_IMMUTABLE
      fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP
      fs, xfs: introduce FALLOC_FL_UNSEAL_BLOCK_MAP
      xfs: introduce XFS_DIFLAG2_IOMAP_IMMUTABLE
      xfs: toggle XFS_DIFLAG2_IOMAP_IMMUTABLE in response to fallocate


 fs/attr.c                   |   10 ++
 fs/open.c                   |   22 +++++
 fs/read_write.c             |    3 +
 fs/xfs/libxfs/xfs_format.h  |    5 +
 fs/xfs/xfs_bmap_util.c      |  181 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_util.h      |    5 +
 fs/xfs/xfs_file.c           |   16 +++-
 fs/xfs/xfs_inode.c          |    2 
 fs/xfs/xfs_ioctl.c          |    7 ++
 fs/xfs/xfs_iops.c           |    8 +-
 include/linux/falloc.h      |    4 +
 include/linux/fs.h          |    2 
 include/uapi/linux/falloc.h |   20 +++++
 include/uapi/linux/fs.h     |    1 
 mm/filemap.c                |    5 +
 15 files changed, 282 insertions(+), 9 deletions(-)
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 108+ messages in thread

* [PATCH v2 1/5] fs, xfs: introduce S_IOMAP_IMMUTABLE
  2017-08-04  2:28 ` Dan Williams
@ 2017-08-04  2:28   ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04  2:28 UTC (permalink / raw)
  To: darrick.wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, luto, linux-fsdevel, Christoph Hellwig

An inode with this flag set indicates that the file's block map cannot
be changed from the currently allocated set.

The implementation of toggling the flag and sealing the state of the
extent map is saved for a later patch. The functionality provided by
S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
that provided by S_SWAPFILE, and it is targeted to replace it.

For now, only xfs and the core vfs are updated to consider the new flag.

The additional checks that are added for this flag, beyond what we are
already doing for swapfiles, are:
* fail writes that try to extend the file size
* fail attempts to directly change the allocation map via fallocate or
  xfs ioctls. This can be done centrally by blocking
  xfs_alloc_file_space and xfs_free_file_space when the flag is set.
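
As a rough illustration of the first check, the userspace sketch below
models the size-extension test that this patch adds to
generic_write_checks(). The names model_inode and model_write_check are
hypothetical stand-ins and not kernel APIs; only the flag value and the
comparison are taken from the diff that follows.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Flag value copied from the include/linux/fs.h hunk in this patch. */
#define S_IOMAP_IMMUTABLE 16384

/* Hypothetical stand-in for the two struct inode fields the check reads. */
struct model_inode {
	unsigned int	i_flags;
	int64_t		i_size;
};

/*
 * Model of the check added to generic_write_checks(): a write to a
 * block-map-immutable file fails with -ETXTBSY when it would end past
 * the current file size. Otherwise the byte count is returned as-is.
 */
static int64_t model_write_check(const struct model_inode *inode,
				 int64_t pos, int64_t count)
{
	assert(pos >= 0 && count >= 0);

	if ((inode->i_flags & S_IOMAP_IMMUTABLE) &&
	    pos + count > inode->i_size)
		return -ETXTBSY;

	return count;
}
```

A write that ends exactly at the current size is still allowed in this
model, matching the strict '>' comparison in the patch. The whole write
is rejected rather than shortened.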

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/attr.c              |   10 ++++++++++
 fs/open.c              |    6 ++++++
 fs/read_write.c        |    3 +++
 fs/xfs/xfs_bmap_util.c |    6 ++++++
 fs/xfs/xfs_ioctl.c     |    6 ++++++
 include/linux/fs.h     |    2 ++
 mm/filemap.c           |    5 +++++
 7 files changed, 38 insertions(+)

diff --git a/fs/attr.c b/fs/attr.c
index 135304146120..8573e364bd06 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare);
  */
 int inode_newsize_ok(const struct inode *inode, loff_t offset)
 {
+	if (IS_IOMAP_IMMUTABLE(inode)) {
+		/*
+		 * Any size change is disallowed. Size increases may
+		 * dirty metadata that an application is not prepared to
+		 * sync, and a size decrease may expose free blocks to
+		 * in-flight DMA.
+		 */
+		return -ETXTBSY;
+	}
+
 	if (inode->i_size < offset) {
 		unsigned long limit;
 
diff --git a/fs/open.c b/fs/open.c
index 35bb784763a4..7395860d7164 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -ETXTBSY;
 
 	/*
+	 * We cannot allow any allocation changes on an iomap immutable file
+	 */
+	if (IS_IOMAP_IMMUTABLE(inode))
+		return -ETXTBSY;
+
+	/*
 	 * Revalidate the write permissions, in case security policy has
 	 * changed since the files were opened.
 	 */
diff --git a/fs/read_write.c b/fs/read_write.c
index 0cc7033aa413..dc673be7c7cb 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
 	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
 		return -ETXTBSY;
 
+	if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out))
+		return -ETXTBSY;
+
 	/* Don't reflink dirs, pipes, sockets... */
 	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
 		return -EISDIR;
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 93e955262d07..fe0f8f7f4bb7 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1044,6 +1044,9 @@ xfs_alloc_file_space(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
+	if (IS_IOMAP_IMMUTABLE(VFS_I(ip)))
+		return -ETXTBSY;
+
 	error = xfs_qm_dqattach(ip, 0);
 	if (error)
 		return error;
@@ -1294,6 +1297,9 @@ xfs_free_file_space(
 
 	trace_xfs_free_file_space(ip);
 
+	if (IS_IOMAP_IMMUTABLE(VFS_I(ip)))
+		return -ETXTBSY;
+
 	error = xfs_qm_dqattach(ip, 0);
 	if (error)
 		return error;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index e75c40a47b7d..2e64488bc4de 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1755,6 +1755,12 @@ xfs_ioc_swapext(
 		goto out_put_tmp_file;
 	}
 
+	if (IS_IOMAP_IMMUTABLE(file_inode(f.file)) ||
+	    IS_IOMAP_IMMUTABLE(file_inode(tmp.file))) {
+		error = -EINVAL;
+		goto out_put_tmp_file;
+	}
+
 	/*
 	 * We need to ensure that the fds passed in point to XFS inodes
 	 * before we cast and access them as XFS structures as we have no
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6e1fd5d21248..0a254b768855 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1829,6 +1829,7 @@ struct super_operations {
 #else
 #define S_DAX		0	/* Make all the DAX code disappear */
 #endif
+#define S_IOMAP_IMMUTABLE 16384 /* logical-to-physical extent map is fixed */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1867,6 +1868,7 @@ struct super_operations {
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
 #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
+#define IS_IOMAP_IMMUTABLE(inode) ((inode)->i_flags & S_IOMAP_IMMUTABLE)
 
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
diff --git a/mm/filemap.c b/mm/filemap.c
index a49702445ce0..a4105a4c1d69 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2806,6 +2806,11 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	if (unlikely(pos >= inode->i_sb->s_maxbytes))
 		return -EFBIG;
 
+	/* Are we about to mutate the block map on an immutable file? */
+	if (IS_IOMAP_IMMUTABLE(inode)
+			&& (pos + iov_iter_count(from) > i_size_read(inode)))
+		return -ETXTBSY;
+
 	iov_iter_truncate(from, inode->i_sb->s_maxbytes - pos);
 	return iov_iter_count(from);
 }

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH v2 2/5] fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP
  2017-08-04  2:28 ` Dan Williams
  (?)
@ 2017-08-04  2:28   ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04  2:28 UTC (permalink / raw)
  To: darrick.wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, luto, linux-fsdevel, Christoph Hellwig

From falloc.h:

    FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
    file logical-to-physical extent offset mappings in the file. The
    purpose is to allow an application to assume that there are no holes
    or shared extents in the file and that the metadata needed to find
    all the physical extents of the file is stable and can never be
    dirtied.
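
For illustration only, a userspace caller might request the seal roughly
as sketched below. This assumes a kernel built with this series: the
value 0x080 is the one proposed in this patch, and on other kernels that
bit may be rejected or may be assigned to an unrelated mode, so the
sketch should only be run against a kernel carrying these patches. The
helper name seal_block_map is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * ASSUMPTION: value proposed by this patch, not a released uapi
 * definition. Kernels without this series may treat the bit differently.
 */
#ifndef FALLOC_FL_SEAL_BLOCK_MAP
#define FALLOC_FL_SEAL_BLOCK_MAP 0x080
#endif

/*
 * Hypothetical helper: ask the kernel to seal the block map of fd.
 * Returns 0 on success or a negative errno value on failure.
 *
 * Per the patch, the caller needs CAP_LINUX_IMMUTABLE, fd must be open
 * for writing, the range must start at offset 0, and len must cover at
 * least the current file size.
 */
static int seal_block_map(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -errno;
	assert(st.st_size >= 0);

	/* The flag is exclusive: no other mode bits may be combined. */
	if (fallocate(fd, FALLOC_FL_SEAL_BLOCK_MAP, 0, st.st_size) < 0)
		return -errno;

	return 0;
}
```

With the checks in this patch, a caller could expect -EPERM without the
capability, -EINVAL for a short range or extra mode bits, and -EBUSY when
the file is currently mapped.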

For now this patch only permits setting the in-memory state of
S_IOMAP_IMMUTABLE. Support for clearing and persisting the state is
saved for later patches.

The implementation is careful not to allow the immutable state to change
while any process might have any established mappings. It reuses the
existing xfs_reflink_unshare() and xfs_alloc_file_space() to unshare
extents and fill all holes in the file. It then holds XFS_ILOCK_EXCL
while it validates the file is in the proper state and sets
S_IOMAP_IMMUTABLE.
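
The order of the validation steps can be sketched as a small userspace
model. The struct model_file and the function model_seal are hypothetical
and exist only for this illustration. The boolean fields stand for the
state that xfs_seal_file_space() observes under XFS_ILOCK_EXCL after the
unshare and hole-fill steps have already run, so a true value represents
a lost race rather than the file's initial state.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag value copied from patch 1 of this series. */
#define S_IOMAP_IMMUTABLE 16384

/* Hypothetical snapshot of the state checked while the lock is held. */
struct model_file {
	int64_t		size;		/* current file size in bytes */
	bool		mapped;		/* some process has the file mmap'd */
	bool		reflinked;	/* extents were shared again */
	bool		has_holes;	/* a hole was punched again */
	unsigned int	i_flags;
};

/*
 * Model of the checks in xfs_seal_file_space(), in the same order.
 * Returns 0 and sets S_IOMAP_IMMUTABLE on success, otherwise a negative
 * errno value and the flags are left untouched.
 */
static int model_seal(struct model_file *f, int64_t offset, int64_t len)
{
	assert(f != NULL);

	/* The seal applies to the whole inode, so it must start at 0. */
	if (offset != 0)
		return -EINVAL;

	/* Real code: unshare extents and fill holes here, then lock. */

	/* Size grew meanwhile, or the request was too small. */
	if (len < f->size)
		return -EINVAL;

	/* The state may not change while any mapping is established. */
	if (f->mapped)
		return -EBUSY;

	/* Raced with someone sharing extents or punching a hole. */
	if (f->reflinked || f->has_holes)
		return -EBUSY;

	f->i_flags |= S_IOMAP_IMMUTABLE;
	return 0;
}
```

The model leaves out the case where reading the block map itself fails,
which the patch propagates as a negative error from xfs_file_has_holes().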

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/open.c                   |   11 +++++
 fs/xfs/xfs_bmap_util.c      |  101 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_util.h      |    2 +
 fs/xfs/xfs_file.c           |   14 ++++--
 include/linux/falloc.h      |    3 +
 include/uapi/linux/falloc.h |   19 ++++++++
 6 files changed, 145 insertions(+), 5 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 7395860d7164..e3aae59785ae 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -273,6 +273,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
 		return -EINVAL;
 
+	/*
+	 * Seal block map operation should only be used exclusively, and
+	 * with the IMMUTABLE capability.
+	 */
+	if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
+		if (!capable(CAP_LINUX_IMMUTABLE))
+			return -EPERM;
+		if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP)
+			return -EINVAL;
+	}
+
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index fe0f8f7f4bb7..46d8eb9e19fc 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1393,6 +1393,107 @@ xfs_zero_file_space(
 
 }
 
+/* Return 1 if hole detected, 0 if not, and < 0 if fail to determine */
+STATIC int
+xfs_file_has_holes(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_bmbt_irec	*map;
+	const int		map_size = 10;	/* constrain memory overhead */
+	int			i, nmaps;
+	int			error = 0;
+	xfs_fileoff_t		lblkno = 0;
+	xfs_filblks_t		maxlblkcnt;
+
+	map = kmem_alloc(map_size * sizeof(*map), KM_SLEEP);
+
+	maxlblkcnt = XFS_B_TO_FSB(mp, i_size_read(VFS_I(ip)));
+	do {
+		nmaps = map_size;
+		error = xfs_bmapi_read(ip, lblkno, maxlblkcnt - lblkno,
+				       map, &nmaps, 0);
+		if (error)
+			break;
+
+		ASSERT(nmaps <= map_size);
+		for (i = 0; i < nmaps; i++) {
+			lblkno += map[i].br_blockcount;
+			if (map[i].br_startblock == HOLESTARTBLOCK) {
+				error = 1;
+				break;
+			}
+		}
+	} while (nmaps > 0 && error == 0);
+
+	kmem_free(map);
+	return error;
+}
+
+int
+xfs_seal_file_space(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset,
+	xfs_off_t		len)
+{
+	struct inode		*inode = VFS_I(ip);
+	struct address_space	*mapping = inode->i_mapping;
+	int			error;
+
+	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
+
+	if (offset)
+		return -EINVAL;
+
+	error = xfs_reflink_unshare(ip, offset, len);
+	if (error)
+		return error;
+
+	error = xfs_alloc_file_space(ip, offset, len,
+			XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	/*
+	 * Either the size changed after we performed allocation /
+	 * unsharing, or the request was too small to begin with.
+	 */
+	error = -EINVAL;
+	if (len < i_size_read(inode))
+		goto out_unlock;
+
+	/*
+	 * Allow DAX path to assume that the state of S_IOMAP_IMMUTABLE
+	 * will never change while any mapping is established.
+	 */
+	error = -EBUSY;
+	if (mapping_mapped(mapping))
+		goto out_unlock;
+
+	/* Did we race someone attempting to share extents? */
+	if (xfs_is_reflink_inode(ip))
+		goto out_unlock;
+
+	/* Did we race a hole punch? */
+	error = xfs_file_has_holes(ip);
+	if (error == 1) {
+		error = -EBUSY;
+		goto out_unlock;
+	}
+
+	/* Abort on an error reading the block map */
+	if (error < 0)
+		goto out_unlock;
+
+	inode->i_flags |= S_IOMAP_IMMUTABLE;
+
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	return error;
+}
+
 /*
  * @next_fsb will keep track of the extent currently undergoing shift.
  * @stop_fsb will keep track of the extent at which we have to stop.
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 0cede1043571..5115a32a2483 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -60,6 +60,8 @@ int	xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
 				xfs_off_t len);
 int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 				xfs_off_t len);
+int	xfs_seal_file_space(struct xfs_inode *, xfs_off_t offset,
+				xfs_off_t len);
 
 /* EOF block manipulation functions */
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c4893e226fd8..e21121530a90 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -739,7 +739,8 @@ xfs_file_write_iter(
 #define	XFS_FALLOC_FL_SUPPORTED						\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
 		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
-		 FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE)
+		 FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE |	\
+		 FALLOC_FL_SEAL_BLOCK_MAP)
 
 STATIC long
 xfs_file_fallocate(
@@ -834,9 +835,14 @@ xfs_file_fallocate(
 				error = xfs_reflink_unshare(ip, offset, len);
 				if (error)
 					goto out_unlock;
-			}
-			error = xfs_alloc_file_space(ip, offset, len,
-						     XFS_BMAPI_PREALLOC);
+
+				error = xfs_alloc_file_space(ip, offset, len,
+						XFS_BMAPI_PREALLOC);
+			} else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
+				error = xfs_seal_file_space(ip, offset, len);
+			} else
+				error = xfs_alloc_file_space(ip, offset, len,
+						XFS_BMAPI_PREALLOC);
 		}
 		if (error)
 			goto out_unlock;
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 7494dc67c66f..48546c6fbec7 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -26,6 +26,7 @@ struct space_resv {
 					 FALLOC_FL_COLLAPSE_RANGE |	\
 					 FALLOC_FL_ZERO_RANGE |		\
 					 FALLOC_FL_INSERT_RANGE |	\
-					 FALLOC_FL_UNSHARE_RANGE)
+					 FALLOC_FL_UNSHARE_RANGE |	\
+					 FALLOC_FL_SEAL_BLOCK_MAP)
 
 #endif /* _FALLOC_H_ */
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index b075f601919b..39076975bf6f 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -76,4 +76,23 @@
  */
 #define FALLOC_FL_UNSHARE_RANGE		0x40
 
+/*
+ * FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
+ * file logical-to-physical extent offset mappings in the file. The
+ * purpose is to allow an application to assume that there are no holes
+ * or shared extents in the file and that the metadata needed to find
+ * all the physical extents of the file is stable and can never be
+ * dirtied.
+ *
+ * The immutable property is in effect for the entire inode, so the
+ * range for this operation must start at offset 0 and len must be
+ * greater than or equal to the current size of the file. If greater,
+ * this operation allocates, unshares, hole fills, and seals in one
+ * atomic step. If len is zero then the immutable state is cleared for
+ * the inode.
+ *
+ * This flag implies FALLOC_FL_UNSHARE_RANGE and as such cannot be used
+ * with the punch, zero, collapse, or insert range modes.
+ */
+#define FALLOC_FL_SEAL_BLOCK_MAP	0x080
 #endif /* _UAPI_FALLOC_H_ */

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH v2 2/5] fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP
@ 2017-08-04  2:28   ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04  2:28 UTC (permalink / raw)
  To: darrick.wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Jeff Moyer, Alexander Viro, luto, linux-fsdevel, Ross Zwisler,
	Christoph Hellwig

From falloc.h:

    FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
    file logical-to-physical extent offset mappings in the file. The
    purpose is to allow an application to assume that there are no holes
    or shared extents in the file and that the metadata needed to find
    all the physical extents of the file is stable and can never be
    dirtied.

For now this patch only permits setting the in-memory state of
S_IOMAP_IMMUTABLE. Support for clearing and persisting the state is
saved for later patches.

The implementation is careful not to allow the immutable state to change
while any process might have any established mappings. It reuses the
existing xfs_reflink_unshare() and xfs_alloc_file_space() to unshare
extents and fill all holes in the file. It then holds XFS_ILOCK_EXCL
while it validates the file is in the proper state and sets
S_IOMAP_IMMUTABLE.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/open.c                   |   11 +++++
 fs/xfs/xfs_bmap_util.c      |  101 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_util.h      |    2 +
 fs/xfs/xfs_file.c           |   14 ++++--
 include/linux/falloc.h      |    3 +
 include/uapi/linux/falloc.h |   19 ++++++++
 6 files changed, 145 insertions(+), 5 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 7395860d7164..e3aae59785ae 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -273,6 +273,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
 		return -EINVAL;
 
+	/*
+	 * Seal block map operation should only be used exclusively, and
+	 * with the IMMUTABLE capability.
+	 */
+	if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
+		if (!capable(CAP_LINUX_IMMUTABLE))
+			return -EPERM;
+		if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP)
+			return -EINVAL;
+	}
+
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index fe0f8f7f4bb7..46d8eb9e19fc 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1393,6 +1393,107 @@ xfs_zero_file_space(
 
 }
 
+/* Return 1 if hole detected, 0 if not, and < 0 if fail to determine */
+STATIC int
+xfs_file_has_holes(
+	struct xfs_inode	*ip)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_bmbt_irec	*map;
+	const int		map_size = 10;	/* constrain memory overhead */
+	int			i, nmaps;
+	int			error = 0;
+	xfs_fileoff_t		lblkno = 0;
+	xfs_filblks_t		maxlblkcnt;
+
+	map = kmem_alloc(map_size * sizeof(*map), KM_SLEEP);
+
+	maxlblkcnt = XFS_B_TO_FSB(mp, i_size_read(VFS_I(ip)));
+	do {
+		nmaps = map_size;
+		error = xfs_bmapi_read(ip, lblkno, maxlblkcnt - lblkno,
+				       map, &nmaps, 0);
+		if (error)
+			break;
+
+		ASSERT(nmaps <= map_size);
+		for (i = 0; i < nmaps; i++) {
+			lblkno += map[i].br_blockcount;
+			if (map[i].br_startblock == HOLESTARTBLOCK) {
+				error = 1;
+				break;
+			}
+		}
+	} while (nmaps > 0 && error == 0);
+
+	kmem_free(map);
+	return error;
+}
+
+int
+xfs_seal_file_space(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset,
+	xfs_off_t		len)
+{
+	struct inode		*inode = VFS_I(ip);
+	struct address_space	*mapping = inode->i_mapping;
+	int			error;
+
+	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
+
+	if (offset)
+		return -EINVAL;
+
+	error = xfs_reflink_unshare(ip, offset, len);
+	if (error)
+		return error;
+
+	error = xfs_alloc_file_space(ip, offset, len,
+			XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO);
+	if (error)
+		return error;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	/*
+	 * Either the size changed after we performed allocation /
+	 * unsharing, or the request was too small to begin with.
+	 */
+	error = -EINVAL;
+	if (len < i_size_read(inode))
+		goto out_unlock;
+
+	/*
+	 * Allow DAX path to assume that the state of S_IOMAP_IMMUTABLE
+	 * will never change while any mapping is established.
+	 */
+	error = -EBUSY;
+	if (mapping_mapped(mapping))
+		goto out_unlock;
+
+	/* Did we race someone attempting to share extents? */
+	if (xfs_is_reflink_inode(ip))
+		goto out_unlock;
+
+	/* Did we race a hole punch? */
+	error = xfs_file_has_holes(ip);
+	if (error == 1) {
+		error = -EBUSY;
+		goto out_unlock;
+	}
+
+	/* Abort on an error reading the block map */
+	if (error < 0)
+		goto out_unlock;
+
+	inode->i_flags |= S_IOMAP_IMMUTABLE;
+
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	return error;
+}
+
 /*
  * @next_fsb will keep track of the extent currently undergoing shift.
  * @stop_fsb will keep track of the extent at which we have to stop.
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 0cede1043571..5115a32a2483 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -60,6 +60,8 @@ int	xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
 				xfs_off_t len);
 int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 				xfs_off_t len);
+int	xfs_seal_file_space(struct xfs_inode *, xfs_off_t offset,
+				xfs_off_t len);
 
 /* EOF block manipulation functions */
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c4893e226fd8..e21121530a90 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -739,7 +739,8 @@ xfs_file_write_iter(
 #define	XFS_FALLOC_FL_SUPPORTED						\
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
 		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
-		 FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE)
+		 FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE |	\
+		 FALLOC_FL_SEAL_BLOCK_MAP)
 
 STATIC long
 xfs_file_fallocate(
@@ -834,9 +835,14 @@ xfs_file_fallocate(
 				error = xfs_reflink_unshare(ip, offset, len);
 				if (error)
 					goto out_unlock;
-			}
-			error = xfs_alloc_file_space(ip, offset, len,
-						     XFS_BMAPI_PREALLOC);
+
+				error = xfs_alloc_file_space(ip, offset, len,
+						XFS_BMAPI_PREALLOC);
+			} else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
+				error = xfs_seal_file_space(ip, offset, len);
+			} else
+				error = xfs_alloc_file_space(ip, offset, len,
+						XFS_BMAPI_PREALLOC);
 		}
 		if (error)
 			goto out_unlock;
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 7494dc67c66f..48546c6fbec7 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -26,6 +26,7 @@ struct space_resv {
 					 FALLOC_FL_COLLAPSE_RANGE |	\
 					 FALLOC_FL_ZERO_RANGE |		\
 					 FALLOC_FL_INSERT_RANGE |	\
-					 FALLOC_FL_UNSHARE_RANGE)
+					 FALLOC_FL_UNSHARE_RANGE |	\
+					 FALLOC_FL_SEAL_BLOCK_MAP)
 
 #endif /* _FALLOC_H_ */
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index b075f601919b..39076975bf6f 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -76,4 +76,23 @@
  */
 #define FALLOC_FL_UNSHARE_RANGE		0x40
 
+/*
+ * FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
+ * file logical-to-physical extent offset mappings in the file. The
+ * purpose is to allow an application to assume that there are no holes
+ * or shared extents in the file and that the metadata needed to find
+ * all the physical extents of the file is stable and can never be
+ * dirtied.
+ *
+ * The immutable property is in effect for the entire inode, so the
+ * range for this operation must start at offset 0 and len must be
+ * greater than or equal to the current size of the file. If greater,
+ * this operation allocates, unshares, hole fills, and seals in one
+ * atomic step. If len is zero then the immutable state is cleared for
+ * the inode.
+ *
+ * This flag implies FALLOC_FL_UNSHARE_RANGE and as such cannot be used
+ * with the punch, zero, collapse, or insert range modes.
+ */
+#define FALLOC_FL_SEAL_BLOCK_MAP	0x080
 #endif /* _UAPI_FALLOC_H_ */
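
The order of checks in xfs_seal_file_space() is easier to follow outside the locking context. What follows is a minimal userspace sketch, not kernel code: struct extent, struct model_file, model_has_holes() and model_seal() are hypothetical stand-ins invented for illustration, and the unshare and allocation steps that precede validation in the patch are left out. It only models the batched hole scan and the sequence of -EINVAL / -EBUSY checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures in the patch. */
struct extent {
	bool hole;	/* true models br_startblock == HOLESTARTBLOCK */
};

struct model_file {
	const struct extent *extents;	/* block map, in file order */
	size_t nextents;
	long long size;			/* models i_size_read() */
	bool mapped;			/* models mapping_mapped() */
	bool reflinked;			/* models xfs_is_reflink_inode() */
	bool sealed;			/* models S_IOMAP_IMMUTABLE */
};

enum { MAP_SIZE = 10 };	/* batch size, as in xfs_file_has_holes() */

/* Return 1 if any extent is a hole, 0 otherwise; scans MAP_SIZE at a time. */
static int model_has_holes(const struct model_file *f)
{
	size_t next = 0;

	assert(f->extents != NULL || f->nextents == 0);
	while (next < f->nextents) {
		size_t end = next + MAP_SIZE;

		if (end > f->nextents)
			end = f->nextents;
		for (; next < end; next++)
			if (f->extents[next].hole)
				return 1;
	}
	return 0;
}

/* Apply the validation order used once allocation has completed. */
static int model_seal(struct model_file *f, long long offset, long long len)
{
	if (offset)
		return -EINVAL;	/* the seal always covers the whole inode */
	if (len < f->size)
		return -EINVAL;	/* request too small, or the size changed */
	if (f->mapped)
		return -EBUSY;	/* state must not change under a mapping */
	if (f->reflinked)
		return -EBUSY;	/* raced with extent sharing */
	if (model_has_holes(f))
		return -EBUSY;	/* raced with a hole punch */
	f->sealed = true;
	return 0;
}
```

For a 12-extent file whose last extent is a hole, model_seal() returns -EBUSY and leaves the sealed flag clear; the hole is only found in the second batch of the scan.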

* [PATCH v2 3/5] fs, xfs: introduce FALLOC_FL_UNSEAL_BLOCK_MAP
  2017-08-04  2:28 ` Dan Williams
@ 2017-08-04  2:28   ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04  2:28 UTC (permalink / raw)
  To: darrick.wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, luto, linux-fsdevel, Christoph Hellwig

Provide an explicit fallocate operation type for clearing the
S_IOMAP_IMMUTABLE flag. Like the seal case, it requires
CAP_LINUX_IMMUTABLE and can only be performed while no process has the
file mapped.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/open.c                   |   17 +++++++++++------
 fs/xfs/xfs_bmap_util.c      |   42 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_util.h      |    3 +++
 fs/xfs/xfs_file.c           |    4 +++-
 include/linux/falloc.h      |    3 ++-
 include/uapi/linux/falloc.h |    1 +
 6 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index e3aae59785ae..ccfd8d3becc8 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -274,13 +274,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -EINVAL;
 
 	/*
-	 * Seal block map operation should only be used exclusively, and
-	 * with the IMMUTABLE capability.
+	 * Seal/unseal block map operations should only be used
+	 * exclusively, and with the IMMUTABLE capability.
 	 */
-	if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
+	if (mode & (FALLOC_FL_SEAL_BLOCK_MAP | FALLOC_FL_UNSEAL_BLOCK_MAP)) {
 		if (!capable(CAP_LINUX_IMMUTABLE))
 			return -EPERM;
-		if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP)
+		if (mode == (FALLOC_FL_SEAL_BLOCK_MAP
+					| FALLOC_FL_UNSEAL_BLOCK_MAP))
+			return -EINVAL;
+		if (mode & ~(FALLOC_FL_SEAL_BLOCK_MAP
+					| FALLOC_FL_UNSEAL_BLOCK_MAP))
 			return -EINVAL;
 	}
 
@@ -303,9 +307,10 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		return -ETXTBSY;
 
 	/*
-	 * We cannot allow any allocation changes on an iomap immutable file
+	 * We cannot allow any allocation changes on an iomap immutable
+	 * file, but we can allow clearing the immutable state.
 	 */
-	if (IS_IOMAP_IMMUTABLE(inode))
+	if (IS_IOMAP_IMMUTABLE(inode) && !(mode & FALLOC_FL_UNSEAL_BLOCK_MAP))
 		return -ETXTBSY;
 
 	/*
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 46d8eb9e19fc..70ac2d33ab27 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1494,6 +1494,48 @@ xfs_seal_file_space(
 	return error;
 }
 
+int
+xfs_unseal_file_space(
+	struct xfs_inode	*ip,
+	xfs_off_t		offset,
+	xfs_off_t		len)
+{
+	struct inode		*inode = VFS_I(ip);
+	struct address_space	*mapping = inode->i_mapping;
+	int			error;
+
+	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
+
+	if (offset)
+		return -EINVAL;
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	/*
+	 * It does not make sense to unseal less than the full range of
+	 * the file.
+	 */
+	error = -EINVAL;
+	if (len < i_size_read(inode))
+		goto out_unlock;
+
+	/*
+	 * Provide safety against one thread changing the policy of not
+	 * requiring fsync/msync (for block allocations) behind another
+	 * thread's back.
+	 */
+	error = -EBUSY;
+	if (mapping_mapped(mapping))
+		goto out_unlock;
+
+	inode->i_flags &= ~S_IOMAP_IMMUTABLE;
+	error = 0;
+
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+	return error;
+}
+
 /*
  * @next_fsb will keep track of the extent currently undergoing shift.
  * @stop_fsb will keep track of the extent at which we have to stop.
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 5115a32a2483..b64653a75942 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -62,6 +62,9 @@ int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 				xfs_off_t len);
 int	xfs_seal_file_space(struct xfs_inode *, xfs_off_t offset,
 				xfs_off_t len);
+int	xfs_unseal_file_space(struct xfs_inode *, xfs_off_t offset,
+				xfs_off_t len);
+
 
 /* EOF block manipulation functions */
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e21121530a90..833f77700be2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -740,7 +740,7 @@ xfs_file_write_iter(
 		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
 		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
 		 FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE |	\
-		 FALLOC_FL_SEAL_BLOCK_MAP)
+		 FALLOC_FL_SEAL_BLOCK_MAP | FALLOC_FL_UNSEAL_BLOCK_MAP)
 
 STATIC long
 xfs_file_fallocate(
@@ -840,6 +840,8 @@ xfs_file_fallocate(
 						XFS_BMAPI_PREALLOC);
 			} else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
 				error = xfs_seal_file_space(ip, offset, len);
+			} else if (mode & FALLOC_FL_UNSEAL_BLOCK_MAP) {
+				error = xfs_unseal_file_space(ip, offset, len);
 			} else
 				error = xfs_alloc_file_space(ip, offset, len,
 						XFS_BMAPI_PREALLOC);
diff --git a/include/linux/falloc.h b/include/linux/falloc.h
index 48546c6fbec7..b22c1368ed1e 100644
--- a/include/linux/falloc.h
+++ b/include/linux/falloc.h
@@ -27,6 +27,7 @@ struct space_resv {
 					 FALLOC_FL_ZERO_RANGE |		\
 					 FALLOC_FL_INSERT_RANGE |	\
 					 FALLOC_FL_UNSHARE_RANGE |	\
-					 FALLOC_FL_SEAL_BLOCK_MAP)
+					 FALLOC_FL_SEAL_BLOCK_MAP |	\
+					 FALLOC_FL_UNSEAL_BLOCK_MAP)
 
 #endif /* _FALLOC_H_ */
diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
index 39076975bf6f..a4949e1a2dae 100644
--- a/include/uapi/linux/falloc.h
+++ b/include/uapi/linux/falloc.h
@@ -95,4 +95,5 @@
  * with the punch, zero, collapse, or insert range modes.
  */
 #define FALLOC_FL_SEAL_BLOCK_MAP	0x080
+#define FALLOC_FL_UNSEAL_BLOCK_MAP	0x100
 #endif /* _UAPI_FALLOC_H_ */
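
The mode handling this patch adds to vfs_fallocate() can be checked in isolation. The sketch below is a userspace model under stated assumptions: model_fallocate_check() is a hypothetical name, has_cap stands in for capable(CAP_LINUX_IMMUTABLE), sealed stands in for IS_IOMAP_IMMUTABLE(inode), and the flag values are the ones proposed in this series rather than anything in a released uapi header.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Values proposed in this series; not present in released uapi headers. */
#define MODEL_SEAL_BLOCK_MAP	0x080	/* FALLOC_FL_SEAL_BLOCK_MAP */
#define MODEL_UNSEAL_BLOCK_MAP	0x100	/* FALLOC_FL_UNSEAL_BLOCK_MAP */
#define MODEL_KEEP_SIZE		0x01	/* FALLOC_FL_KEEP_SIZE */

/* Mirror the seal/unseal checks in the order the patch performs them. */
static int model_fallocate_check(int mode, bool has_cap, bool sealed)
{
	const int seal_ops = MODEL_SEAL_BLOCK_MAP | MODEL_UNSEAL_BLOCK_MAP;

	if (mode & seal_ops) {
		if (!has_cap)
			return -EPERM;
		/* seal and unseal are mutually exclusive */
		if (mode == seal_ops)
			return -EINVAL;
		/* neither may be combined with any other mode bit */
		if (mode & ~seal_ops)
			return -EINVAL;
	}

	/* a sealed file accepts nothing except an unseal request */
	if (sealed && !(mode & MODEL_UNSEAL_BLOCK_MAP))
		return -ETXTBSY;

	return 0;
}
```

In this model an unseal request on a sealed file passes, while any other mode on a sealed file, including a repeated seal, gets -ETXTBSY.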

* [PATCH v2 4/5] xfs: introduce XFS_DIFLAG2_IOMAP_IMMUTABLE
  2017-08-04  2:28 ` Dan Williams
@ 2017-08-04  2:28   ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04  2:28 UTC (permalink / raw)
  To: darrick.wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	luto, linux-fsdevel, Christoph Hellwig

Add an on-disk inode flag to record the state of the in-memory
S_IOMAP_IMMUTABLE vfs inode flag. This allows the protections against
reflink and hole punch to be restored automatically on a subsequent
boot, when the in-memory inode is established.

FS_XFLAG_IOMAP_IMMUTABLE is introduced to allow xfs_io to read the
state of the flag, but toggling the flag requires going through
fallocate(FALLOC_FL_[UN]SEAL_BLOCK_MAP). Support for toggling the
on-disk state is saved for a later patch.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/libxfs/xfs_format.h |    5 ++++-
 fs/xfs/xfs_inode.c         |    2 ++
 fs/xfs/xfs_ioctl.c         |    1 +
 fs/xfs/xfs_iops.c          |    8 +++++---
 include/uapi/linux/fs.h    |    1 +
 5 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index d4d9bef20c3a..9e720e55776b 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1063,12 +1063,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_DAX_BIT	0	/* use DAX for this inode */
 #define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
 #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
+#define XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT 3 /* set S_IOMAP_IMMUTABLE for this inode */
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
 #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
+#define XFS_DIFLAG2_IOMAP_IMMUTABLE (1 << XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT)
 
 #define XFS_DIFLAG2_ANY \
-	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)
+	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
+	 XFS_DIFLAG2_IOMAP_IMMUTABLE)
 
 /*
  * Inode number format:
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ceef77c0416a..4ca22e272ce6 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -674,6 +674,8 @@ _xfs_dic2xflags(
 			flags |= FS_XFLAG_DAX;
 		if (di_flags2 & XFS_DIFLAG2_COWEXTSIZE)
 			flags |= FS_XFLAG_COWEXTSIZE;
+		if (di_flags2 & XFS_DIFLAG2_IOMAP_IMMUTABLE)
+			flags |= FS_XFLAG_IOMAP_IMMUTABLE;
 	}
 
 	if (has_attr)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 2e64488bc4de..df2eef0f9d45 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -978,6 +978,7 @@ xfs_set_diflags(
 		return;
 
 	di_flags2 = (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK);
+	di_flags2 |= (ip->i_d.di_flags2 & XFS_DIFLAG2_IOMAP_IMMUTABLE);
 	if (xflags & FS_XFLAG_DAX)
 		di_flags2 |= XFS_DIFLAG2_DAX;
 	if (xflags & FS_XFLAG_COWEXTSIZE)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 469c9fa4c178..174ef95453f5 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1186,9 +1186,10 @@ xfs_diflags_to_iflags(
 	struct xfs_inode	*ip)
 {
 	uint16_t		flags = ip->i_d.di_flags;
+	uint64_t		flags2 = ip->i_d.di_flags2;
 
 	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
-			    S_NOATIME | S_DAX);
+			    S_NOATIME | S_DAX | S_IOMAP_IMMUTABLE);
 
 	if (flags & XFS_DIFLAG_IMMUTABLE)
 		inode->i_flags |= S_IMMUTABLE;
@@ -1201,9 +1202,10 @@ xfs_diflags_to_iflags(
 	if (S_ISREG(inode->i_mode) &&
 	    ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
 	    !xfs_is_reflink_inode(ip) &&
-	    (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
-	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
+	    (ip->i_mount->m_flags & XFS_MOUNT_DAX || flags2 & XFS_DIFLAG2_DAX))
 		inode->i_flags |= S_DAX;
+	if (flags2 & XFS_DIFLAG2_IOMAP_IMMUTABLE)
+		inode->i_flags |= S_IOMAP_IMMUTABLE;
 }
 
 /*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7495d05e8de..4765e024ad74 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -182,6 +182,7 @@ struct fsxattr {
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
+#define FS_XFLAG_IOMAP_IMMUTABLE 0x00020000	/* block map immutable */
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is
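
The read-only nature of the new xflag follows from how the two translation paths treat it. Below is a minimal userspace model of that asymmetry; model_dic2xflags() and model_set_diflags2() are hypothetical names loosely patterned on _xfs_dic2xflags() and xfs_set_diflags(), the bit values are copied from this patch and the existing headers, and only the di_flags2 half of each function is modeled.

```c
#include <assert.h>
#include <stdint.h>

/* On-disk di_flags2 bits, as defined in xfs_format.h by this patch. */
#define DIFLAG2_DAX		(1u << 0)
#define DIFLAG2_REFLINK		(1u << 1)
#define DIFLAG2_COWEXTSIZE	(1u << 2)
#define DIFLAG2_IOMAP_IMMUTABLE	(1u << 3)

/* fsxattr flags from include/uapi/linux/fs.h; the last is new here. */
#define XFLAG_DAX		0x00008000u
#define XFLAG_COWEXTSIZE	0x00010000u
#define XFLAG_IOMAP_IMMUTABLE	0x00020000u

/* Report path: on-disk bits are translated to xflags for userspace. */
static uint32_t model_dic2xflags(uint64_t di_flags2)
{
	uint32_t xflags = 0;

	if (di_flags2 & DIFLAG2_DAX)
		xflags |= XFLAG_DAX;
	if (di_flags2 & DIFLAG2_COWEXTSIZE)
		xflags |= XFLAG_COWEXTSIZE;
	if (di_flags2 & DIFLAG2_IOMAP_IMMUTABLE)
		xflags |= XFLAG_IOMAP_IMMUTABLE;
	return xflags;
}

/*
 * Set path: the reflink and iomap-immutable bits are carried over from
 * the inode and never taken from the caller's xflags.
 */
static uint64_t model_set_diflags2(uint64_t old_di_flags2, uint32_t xflags)
{
	uint64_t di_flags2;

	di_flags2 = old_di_flags2 & DIFLAG2_REFLINK;
	di_flags2 |= old_di_flags2 & DIFLAG2_IOMAP_IMMUTABLE;
	if (xflags & XFLAG_DAX)
		di_flags2 |= DIFLAG2_DAX;
	if (xflags & XFLAG_COWEXTSIZE)
		di_flags2 |= DIFLAG2_COWEXTSIZE;
	return di_flags2;
}
```

Passing XFLAG_IOMAP_IMMUTABLE to model_set_diflags2() on an unsealed inode leaves the on-disk bit clear, and omitting it on a sealed inode leaves the bit set, which matches the commit message's statement that toggling has to go through fallocate().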

* [PATCH v2 4/5] xfs: introduce XFS_DIFLAG2_IOMAP_IMMUTABLE
@ 2017-08-04  2:28   ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04  2:28 UTC (permalink / raw)
  To: darrick.wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Jeff Moyer, luto, linux-fsdevel, Ross Zwisler, Christoph Hellwig

Add an on-disk inode flag to record the state of the in-memory
S_IOMAP_IMMUTABLE vfs inode flag. This allows the protections against
reflink and hole punch to be restored automatically on a subsequent
boot, when the in-memory inode is established.

FS_XFLAG_IOMAP_IMMUTABLE is introduced to allow xfs_io to read the
state of the flag, but toggling the flag requires going through
fallocate(FALLOC_FL_[UN]SEAL_BLOCK_MAP). Support for toggling the
on-disk state is saved for a later patch.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/libxfs/xfs_format.h |    5 ++++-
 fs/xfs/xfs_inode.c         |    2 ++
 fs/xfs/xfs_ioctl.c         |    1 +
 fs/xfs/xfs_iops.c          |    8 +++++---
 include/uapi/linux/fs.h    |    1 +
 5 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index d4d9bef20c3a..9e720e55776b 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1063,12 +1063,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_DAX_BIT	0	/* use DAX for this inode */
 #define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
 #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
+#define XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT 3 /* set S_IOMAP_IMMUTABLE for this inode */
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
 #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
+#define XFS_DIFLAG2_IOMAP_IMMUTABLE (1 << XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT)
 
 #define XFS_DIFLAG2_ANY \
-	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)
+	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
+	 XFS_DIFLAG2_IOMAP_IMMUTABLE)
 
 /*
  * Inode number format:
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ceef77c0416a..4ca22e272ce6 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -674,6 +674,8 @@ _xfs_dic2xflags(
 			flags |= FS_XFLAG_DAX;
 		if (di_flags2 & XFS_DIFLAG2_COWEXTSIZE)
 			flags |= FS_XFLAG_COWEXTSIZE;
+		if (di_flags2 & XFS_DIFLAG2_IOMAP_IMMUTABLE)
+			flags |= FS_XFLAG_IOMAP_IMMUTABLE;
 	}
 
 	if (has_attr)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 2e64488bc4de..df2eef0f9d45 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -978,6 +978,7 @@ xfs_set_diflags(
 		return;
 
 	di_flags2 = (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK);
+	di_flags2 |= (ip->i_d.di_flags2 & XFS_DIFLAG2_IOMAP_IMMUTABLE);
 	if (xflags & FS_XFLAG_DAX)
 		di_flags2 |= XFS_DIFLAG2_DAX;
 	if (xflags & FS_XFLAG_COWEXTSIZE)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 469c9fa4c178..174ef95453f5 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1186,9 +1186,10 @@ xfs_diflags_to_iflags(
 	struct xfs_inode	*ip)
 {
 	uint16_t		flags = ip->i_d.di_flags;
+	uint64_t		flags2 = ip->i_d.di_flags2;
 
 	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
-			    S_NOATIME | S_DAX);
+			    S_NOATIME | S_DAX | S_IOMAP_IMMUTABLE);
 
 	if (flags & XFS_DIFLAG_IMMUTABLE)
 		inode->i_flags |= S_IMMUTABLE;
@@ -1201,9 +1202,10 @@ xfs_diflags_to_iflags(
 	if (S_ISREG(inode->i_mode) &&
 	    ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
 	    !xfs_is_reflink_inode(ip) &&
-	    (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
-	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
+	    (ip->i_mount->m_flags & XFS_MOUNT_DAX || flags2 & XFS_DIFLAG2_DAX))
 		inode->i_flags |= S_DAX;
+	if (flags2 & XFS_DIFLAG2_IOMAP_IMMUTABLE)
+		inode->i_flags |= S_IOMAP_IMMUTABLE;
 }
 
 /*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7495d05e8de..4765e024ad74 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -182,6 +182,7 @@ struct fsxattr {
 #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
 #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
 #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
+#define FS_XFLAG_IOMAP_IMMUTABLE 0x00020000	/* block map immutable */
 #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
 
 /* the read-only stuff doesn't really belong here, but any other place is
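
The three translations this patch touches can be modeled in user space. What
follows is a minimal sketch, not the kernel code: the bit position and xflag
values are copied from the hunks above, every DEMO_/demo_ name is
hypothetical, and DEMO_S_IOMAP_IMMUTABLE is a placeholder because the real
S_IOMAP_IMMUTABLE value is defined in patch 1/5 and is not shown here.

```c
#include <stdint.h>

/* Values copied from the hunks above (bit 3, 0x00020000). */
#define DEMO_DIFLAG2_DAX		(1 << 0)
#define DEMO_DIFLAG2_REFLINK		(1 << 1)
#define DEMO_DIFLAG2_IOMAP_IMMUTABLE	(1 << 3)
#define DEMO_XFLAG_DAX			0x00008000
#define DEMO_XFLAG_IOMAP_IMMUTABLE	0x00020000

/* ASSUMPTION: placeholder value, the real flag comes from patch 1/5. */
#define DEMO_S_IOMAP_IMMUTABLE		0x4000u

/* On-disk flags2 -> xflags reported to userspace (cf. _xfs_dic2xflags()). */
static uint32_t demo_dic2xflags(uint64_t di_flags2)
{
	uint32_t flags = 0;

	if (di_flags2 & DEMO_DIFLAG2_DAX)
		flags |= DEMO_XFLAG_DAX;
	if (di_flags2 & DEMO_DIFLAG2_IOMAP_IMMUTABLE)
		flags |= DEMO_XFLAG_IOMAP_IMMUTABLE;
	return flags;
}

/* On-disk flags2 -> in-memory inode flags: clear first, then re-derive. */
static unsigned int demo_diflags_to_iflags(unsigned int i_flags,
					   uint64_t di_flags2)
{
	i_flags &= ~DEMO_S_IOMAP_IMMUTABLE;
	if (di_flags2 & DEMO_DIFLAG2_IOMAP_IMMUTABLE)
		i_flags |= DEMO_S_IOMAP_IMMUTABLE;
	return i_flags;
}

/*
 * xflags from userspace -> new on-disk flags2 (cf. xfs_set_diflags()).
 * The sealed bit is carried over from the old value and cannot be
 * set or cleared through xflags.
 */
static uint64_t demo_set_diflags2(uint64_t old, uint32_t xflags)
{
	uint64_t di_flags2 = old & DEMO_DIFLAG2_REFLINK;

	di_flags2 |= old & DEMO_DIFLAG2_IOMAP_IMMUTABLE;
	if (xflags & DEMO_XFLAG_DAX)
		di_flags2 |= DEMO_DIFLAG2_DAX;
	return di_flags2;
}
```

The last helper is the reason the flag is read-only from the xflags side:
passing DEMO_XFLAG_IOMAP_IMMUTABLE in xflags has no effect, while an already
sealed inode stays sealed across an unrelated xflags update.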

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* [PATCH v2 5/5] xfs: toggle XFS_DIFLAG2_IOMAP_IMMUTABLE in response to fallocate
  2017-08-04  2:28 ` Dan Williams
@ 2017-08-04  2:28   ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04  2:28 UTC (permalink / raw)
  To: darrick.wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	luto, linux-fsdevel, Christoph Hellwig

After validating that the file has no holes, shared extents, or active
mappings, try to commit the XFS_DIFLAG2_IOMAP_IMMUTABLE flag to the
on-disk inode metadata. If that succeeds, allow S_IOMAP_IMMUTABLE to be
set on the vfs inode.

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/xfs/xfs_bmap_util.c |   32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 70ac2d33ab27..8464c25a2403 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1436,9 +1436,11 @@ xfs_seal_file_space(
 	xfs_off_t		offset,
 	xfs_off_t		len)
 {
+	struct xfs_mount	*mp = ip->i_mount;
 	struct inode		*inode = VFS_I(ip);
 	struct address_space	*mapping = inode->i_mapping;
 	int			error;
+	struct xfs_trans	*tp;
 
 	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
 
@@ -1454,6 +1456,10 @@ xfs_seal_file_space(
 	if (error)
 		return error;
 
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
+	if (error)
+		return error;
+
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	/*
 	 * Either the size changed after we performed allocation /
@@ -1486,10 +1492,20 @@ xfs_seal_file_space(
 	if (error < 0)
 		goto out_unlock;
 
+	xfs_trans_ijoin(tp, ip, 0);
+	ip->i_d.di_flags2 |= XFS_DIFLAG2_IOMAP_IMMUTABLE;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	error = xfs_trans_commit(tp);
+	tp = NULL; /* nothing to cancel */
+	if (error)
+		goto out_unlock;
+
 	inode->i_flags |= S_IOMAP_IMMUTABLE;
 
 out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	if (tp)
+		xfs_trans_cancel(tp);
 
 	return error;
 }
@@ -1500,15 +1516,21 @@ xfs_unseal_file_space(
 	xfs_off_t		offset,
 	xfs_off_t		len)
 {
+	struct xfs_mount	*mp = ip->i_mount;
 	struct inode		*inode = VFS_I(ip);
 	struct address_space	*mapping = inode->i_mapping;
 	int			error;
+	struct xfs_trans	*tp;
 
 	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
 
 	if (offset)
 		return -EINVAL;
 
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
+	if (error)
+		return error;
+
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	/*
 	 * It does not make sense to unseal less than the full range of
@@ -1527,11 +1549,21 @@ xfs_unseal_file_space(
 	if (mapping_mapped(mapping))
 		goto out_unlock;
 
+	xfs_trans_ijoin(tp, ip, 0);
+	ip->i_d.di_flags2 &= ~XFS_DIFLAG2_IOMAP_IMMUTABLE;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	error = xfs_trans_commit(tp);
+	tp = NULL; /* nothing to cancel */
+	if (error)
+		goto out_unlock;
+
 	inode->i_flags &= ~S_IOMAP_IMMUTABLE;
 	error = 0;
 
 out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	if (tp)
+		xfs_trans_cancel(tp);
 
 	return error;
 }
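
The ordering this patch relies on (commit the on-disk flag first, set the
in-memory flag only if the commit succeeds, and never cancel a transaction
that commit has already consumed) can be modeled in user space. A minimal
sketch, assuming hypothetical demo_* stand-ins for the XFS transaction calls,
an injected commit result, and a placeholder value for S_IOMAP_IMMUTABLE:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_DIFLAG2_IOMAP_IMMUTABLE	(1u << 3)	/* bit 3, as in patch 4/5 */
#define DEMO_S_IOMAP_IMMUTABLE		0x4000u		/* ASSUMPTION: placeholder */

struct demo_trans { int unused; };

struct demo_inode {
	uint64_t	di_flags2;	/* in-core copy of the on-disk flags */
	unsigned int	i_flags;	/* vfs inode flags */
	bool		mapped;		/* models mapping_mapped() */
	int		commit_error;	/* injected result of the commit */
	int		cancels;	/* counts demo_trans_cancel() calls */
};

static struct demo_trans demo_tp_storage;

static int demo_trans_alloc(struct demo_trans **tpp)
{
	*tpp = &demo_tp_storage;
	return 0;
}

static int demo_trans_commit(struct demo_inode *ip, struct demo_trans *tp)
{
	(void)tp;
	return ip->commit_error;
}

static void demo_trans_cancel(struct demo_inode *ip, struct demo_trans *tp)
{
	(void)tp;
	ip->cancels++;
}

static int demo_seal(struct demo_inode *ip)
{
	struct demo_trans *tp;
	int error;

	error = demo_trans_alloc(&tp);
	if (error)
		return error;

	/* Validation failures leave with tp still set, so it gets cancelled. */
	error = -EBUSY;
	if (ip->mapped)
		goto out;

	ip->di_flags2 |= DEMO_DIFLAG2_IOMAP_IMMUTABLE;
	error = demo_trans_commit(ip, tp);
	tp = NULL;	/* commit consumed the transaction, pass or fail */
	if (error)
		goto out;

	/* Only a successful commit makes the in-memory flag visible. */
	ip->i_flags |= DEMO_S_IOMAP_IMMUTABLE;
out:
	if (tp)
		demo_trans_cancel(ip, tp);
	return error;
}
```

The model does not cover locking or what happens to the in-core di_flags2
after a failed commit; it only shows which exits cancel the transaction and
when the vfs flag is set.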

^ permalink raw reply related	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-04  2:28 ` Dan Williams
  (?)
@ 2017-08-04  2:38   ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04  2:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, Linux API, Dave Chinner, linux-kernel,
	linux-xfs, Alexander Viro, Andy Lutomirski, linux-fsdevel,
	Christoph Hellwig

[ adding linux-api to the cover letter for notification, will send the
full set to linux-api for v3 ]

On Thu, Aug 3, 2017 at 7:28 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> Changes since v1 [1]:
> * Add IS_IOMAP_IMMUTABLE() checks to xfs ioctl paths that perform block
>   map changes (xfs_alloc_file_space and xfs_free_file_space) (Darrick)
>
> * Rather than complete a partial write, fail all writes that would
>   attempt to extend the file size (Darrick)
>
> * Introduce FALLOC_FL_UNSEAL_BLOCK_MAP as an explicit operation type for
>   clearing S_IOMAP_IMMUTABLE (Dave)
>
> * Rework xfs_seal_file_space() to first complete hole-fill and unshare
>   operations and then check the file for suitability under
>   XFS_ILOCK_EXCL. (Darrick)
>
> * Add an FS_XFLAG_IOMAP_IMMUTABLE flag so the immutable state can be
>   seen by xfs_io. (Dave)
>
> * Move the setting of S_IOMAP_IMMUTABLE to be atomic with respect to the
>   successful transaction that records XFS_DIFLAG2_IOMAP_IMMUTABLE.
>   (Darrick, Dave)
>
> * Switch to a 'goto out_unlock' style in xfs_seal_file_space() to
>   cleanup 'if / else' tree, and use the mapping_mapped() helper. (Dave)
>
> * Rely on XFS_MMAPLOCK_EXCL for reading a stable state of
>   mapping->i_mmap. (Dave)
>
> [1]: http://marc.info/?l=linux-fsdevel&m=150135785712967&w=2
>
> ---
>
> The daxfile proposal a few weeks back [2] sought to piggy back on the
> swapfile implementation to approximate a block map immutable file. This
> is an idea Dave originated last year to solve the dax "flush from
> userspace" problem [3].
>
> The discussion yielded several results. First, Christoph pointed out
> that swapfiles are subtly broken [4].  Second, Darrick [5] and Dave [6]
> proposed how to properly implement a block map immutable file.  Finally,
> Dave identified some improvements to swapfiles that can be built on the
> block-map-immutable mechanism. These patches seek to implement the first
> part of the proposal and save the swapfile work to build on top once the
> base mechanism is complete.
>
> While the initial motivation for this feature is support for
> byte-addressable updates of persistent memory and managing cache
> maintenance from userspace, the applications of the feature are broader.
> In addition to being the start of a better swapfile mechanism it can
> also support a DMA-to-storage use case.  This use case enables
> data-acquisition hardware to DMA directly to a storage device address
> while being safe in the knowledge that storage mappings will not change.
>
> [2]: https://lkml.org/lkml/2017/6/16/790
> [3]: https://lkml.org/lkml/2016/9/11/159
> [4]: https://lkml.org/lkml/2017/6/18/31
> [5]: https://lkml.org/lkml/2017/6/20/49
> [6]: https://www.spinics.net/lists/linux-xfs/msg07871.html
>
> ---
>
> Dan Williams (5):
>       fs, xfs: introduce S_IOMAP_IMMUTABLE
>       fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP
>       fs, xfs: introduce FALLOC_FL_UNSEAL_BLOCK_MAP
>       xfs: introduce XFS_DIFLAG2_IOMAP_IMMUTABLE
>       xfs: toggle XFS_DIFLAG2_IOMAP_IMMUTABLE in response to fallocate
>
>
>  fs/attr.c                   |   10 ++
>  fs/open.c                   |   22 +++++
>  fs/read_write.c             |    3 +
>  fs/xfs/libxfs/xfs_format.h  |    5 +
>  fs/xfs/xfs_bmap_util.c      |  181 +++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_bmap_util.h      |    5 +
>  fs/xfs/xfs_file.c           |   16 +++-
>  fs/xfs/xfs_inode.c          |    2
>  fs/xfs/xfs_ioctl.c          |    7 ++
>  fs/xfs/xfs_iops.c           |    8 +-
>  include/linux/falloc.h      |    4 +
>  include/linux/fs.h          |    2
>  include/uapi/linux/falloc.h |   20 +++++
>  include/uapi/linux/fs.h     |    1
>  mm/filemap.c                |    5 +
>  15 files changed, 282 insertions(+), 9 deletions(-)

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 2/5] fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP
  2017-08-04  2:28   ` Dan Williams
@ 2017-08-04 19:46     ` Darrick J. Wong
  -1 siblings, 0 replies; 108+ messages in thread
From: Darrick J. Wong @ 2017-08-04 19:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, luto, linux-fsdevel, Christoph Hellwig

On Thu, Aug 03, 2017 at 07:28:17PM -0700, Dan Williams wrote:
> From falloc.h:
> 
>     FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
>     file logical-to-physical extent offset mappings in the file. The
>     purpose is to allow an application to assume that there are no holes
>     or shared extents in the file and that the metadata needed to find
>     all the physical extents of the file is stable and can never be
>     dirtied.
> 
> For now this patch only permits setting the in-memory state of
> S_IOMAP_IMMUTABLE. Support for clearing and persisting the state is
> saved for later patches.
> 
> The implementation is careful to not allow the immutable state to change
> while any process might have any established mappings. It reuses the
> existing xfs_reflink_unshare() and xfs_alloc_file_space() to unshare
> extents and fill all holes in the file. It then holds XFS_ILOCK_EXCL
> while it validates the file is in the proper state and sets
> S_IOMAP_IMMUTABLE.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/open.c                   |   11 +++++
>  fs/xfs/xfs_bmap_util.c      |  101 +++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_bmap_util.h      |    2 +
>  fs/xfs/xfs_file.c           |   14 ++++--
>  include/linux/falloc.h      |    3 +
>  include/uapi/linux/falloc.h |   19 ++++++++
>  6 files changed, 145 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index 7395860d7164..e3aae59785ae 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -273,6 +273,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
>  		return -EINVAL;
>  
> +	/*
> +	 * Seal block map operation should only be used exclusively, and
> +	 * with the IMMUTABLE capability.
> +	 */
> +	if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
> +		if (!capable(CAP_LINUX_IMMUTABLE))
> +			return -EPERM;
> +		if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP)
> +			return -EINVAL;
> +	}
> +
>  	if (!(file->f_mode & FMODE_WRITE))
>  		return -EBADF;
>  
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index fe0f8f7f4bb7..46d8eb9e19fc 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1393,6 +1393,107 @@ xfs_zero_file_space(
>  
>  }
>  
> +/* Return 1 if hole detected, 0 if not, and < 0 if fail to determine */
> +STATIC int
> +xfs_file_has_holes(
> +	struct xfs_inode	*ip)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_bmbt_irec	*map;
> +	const int		map_size = 10;	/* constrain memory overhead */
> +	int			i, nmaps;
> +	int			error = 0;
> +	xfs_fileoff_t		lblkno = 0;
> +	xfs_filblks_t		maxlblkcnt;
> +
> +	map = kmem_alloc(map_size * sizeof(*map), KM_SLEEP);

Sleeping with an inode fully locked and (eventually) a running
transaction?  Yikes.

Just allocate one xfs_bmbt_irec on the stack and pass in nmaps=1.

This method might fit better in libxfs/xfs_bmap.c where it'll be
able to scan the extent list more quickly with the iext helpers.
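
A minimal user-space sketch of that suggestion: walk the file one mapping at a
time with a single record on the stack instead of a heap-allocated array. The
demo_read_one() helper is a hypothetical stand-in for a one-record
xfs_bmapi_read() call, the extent list is a plain sorted array, and
DEMO_HOLESTARTBLOCK is a sentinel chosen for the demo.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_HOLESTARTBLOCK	((uint64_t)-2)	/* demo sentinel for "no block" */

struct demo_irec {
	uint64_t br_startoff;	/* logical file offset, in blocks */
	uint64_t br_startblock;	/* physical start block, or the hole sentinel */
	uint64_t br_blockcount;	/* length in blocks */
};

/*
 * Fill *got with the mapping that starts at @off, clipped to @end.
 * Gaps between the sorted extents in @ext are reported as hole records.
 */
static void demo_read_one(const struct demo_irec *ext, size_t n,
			  uint64_t off, uint64_t end, struct demo_irec *got)
{
	size_t i;

	/* Default: everything from @off to @end is a hole. */
	got->br_startoff = off;
	got->br_startblock = DEMO_HOLESTARTBLOCK;
	got->br_blockcount = end - off;

	for (i = 0; i < n; i++) {
		uint64_t e_end = ext[i].br_startoff + ext[i].br_blockcount;

		if (e_end <= off)
			continue;	/* extent lies entirely before @off */
		if (ext[i].br_startoff > off) {
			/* Hole that runs up to the next extent (or @end). */
			if (ext[i].br_startoff < end)
				got->br_blockcount = ext[i].br_startoff - off;
			return;
		}
		/* This extent covers @off. */
		got->br_startblock = ext[i].br_startblock +
				     (off - ext[i].br_startoff);
		got->br_blockcount = (e_end < end ? e_end : end) - off;
		return;
	}
}

/* Return 1 if [0, end) contains a hole, 0 otherwise. */
static int demo_file_has_holes(const struct demo_irec *ext, size_t n,
			       uint64_t end)
{
	struct demo_irec map;	/* single record on the stack, nmaps = 1 */
	uint64_t off = 0;

	while (off < end) {
		demo_read_one(ext, n, off, end, &map);
		if (map.br_startblock == DEMO_HOLESTARTBLOCK)
			return 1;
		off += map.br_blockcount;
	}
	return 0;
}
```

Each iteration advances by the length of the record just read, so the loop
needs no allocation and stops at the first hole.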

> +
> +	maxlblkcnt = XFS_B_TO_FSB(mp, i_size_read(VFS_I(ip)));
> +	do {
> +		nmaps = map_size;
> +		error = xfs_bmapi_read(ip, lblkno, maxlblkcnt - lblkno,
> +				       map, &nmaps, 0);
> +		if (error)
> +			break;
> +
> +		ASSERT(nmaps <= map_size);
> +		for (i = 0; i < nmaps; i++) {
> +			lblkno += map[i].br_blockcount;
> +			if (map[i].br_startblock == HOLESTARTBLOCK) {

I think we also need to check for unwritten extents here, because a
write to an unwritten block requires some zeroing and a mapping metadata
update.
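
A minimal sketch of the widened check, assuming a per-record state field like
the one XFS mappings carry; the demo_* names and the sentinel value are
hypothetical stand-ins, not kernel definitions:

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_HOLESTARTBLOCK	((uint64_t)-2)	/* demo sentinel for a hole */

enum demo_exntst {
	DEMO_EXT_NORM,		/* written extent */
	DEMO_EXT_UNWRITTEN,	/* preallocated, not yet written */
};

struct demo_mapping {
	uint64_t		br_startblock;
	uint64_t		br_blockcount;
	enum demo_exntst	br_state;
};

/*
 * A mapping blocks sealing if a later write to it would have to change
 * block-map metadata: either there is no block (hole) or the block is
 * still marked unwritten.
 */
static bool demo_mapping_blocks_seal(const struct demo_mapping *m)
{
	if (m->br_startblock == DEMO_HOLESTARTBLOCK)
		return true;
	return m->br_state == DEMO_EXT_UNWRITTEN;
}
```

In the quoted loop this predicate would take the place of the bare
HOLESTARTBLOCK comparison.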

> +				error = 1;
> +				break;
> +			}
> +		}
> +	} while (nmaps > 0 && error == 0);
> +
> +	kmem_free(map);
> +	return error;
> +}
> +
> +int
> +xfs_seal_file_space(
> +	struct xfs_inode	*ip,
> +	xfs_off_t		offset,
> +	xfs_off_t		len)
> +{
> +	struct inode		*inode = VFS_I(ip);
> +	struct address_space	*mapping = inode->i_mapping;
> +	int			error;
> +
> +	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));

The IOLOCK must be held here too.

> +
> +	if (offset)
> +		return -EINVAL;
> +
> +	error = xfs_reflink_unshare(ip, offset, len);
> +	if (error)
> +		return error;
> +
> +	error = xfs_alloc_file_space(ip, offset, len,
> +			XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO);
> +	if (error)
> +		return error;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	/*
> +	 * Either the size changed after we performed allocation /
> +	 * unsharing, or the request was too small to begin with.
> +	 */
> +	error = -EINVAL;
> +	if (len < i_size_read(inode))
> +		goto out_unlock;
> +
> +	/*
> +	 * Allow DAX path to assume that the state of S_IOMAP_IMMUTABLE
> +	 * will never change while any mapping is established.
> +	 */
> +	error = -EBUSY;
> +	if (mapping_mapped(mapping))
> +		goto out_unlock;
> +
> +	/* Did we race someone attempting to share extents? */
> +	if (xfs_is_reflink_inode(ip))
> +		goto out_unlock;
> +
> +	/* Did we race a hole punch? */
> +	error = xfs_file_has_holes(ip);
> +	if (error == 1) {
> +		error = -EBUSY;
> +		goto out_unlock;
> +	}
> +
> +	/* Abort on an error reading the block map */
> +	if (error < 0)
> +		goto out_unlock;
> +
> +	inode->i_flags |= S_IOMAP_IMMUTABLE;
> +
> +out_unlock:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +	return error;
> +}
> +
>  /*
>   * @next_fsb will keep track of the extent currently undergoing shift.
>   * @stop_fsb will keep track of the extent at which we have to stop.
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 0cede1043571..5115a32a2483 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -60,6 +60,8 @@ int	xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
>  				xfs_off_t len);
>  int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
>  				xfs_off_t len);
> +int	xfs_seal_file_space(struct xfs_inode *, xfs_off_t offset,
> +				xfs_off_t len);
>  
>  /* EOF block manipulation functions */
>  bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c4893e226fd8..e21121530a90 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -739,7 +739,8 @@ xfs_file_write_iter(
>  #define	XFS_FALLOC_FL_SUPPORTED						\
>  		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
>  		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
> -		 FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE)
> +		 FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE |	\
> +		 FALLOC_FL_SEAL_BLOCK_MAP)
>  
>  STATIC long
>  xfs_file_fallocate(
> @@ -834,9 +835,14 @@ xfs_file_fallocate(
>  				error = xfs_reflink_unshare(ip, offset, len);
>  				if (error)
>  					goto out_unlock;
> -			}
> -			error = xfs_alloc_file_space(ip, offset, len,
> -						     XFS_BMAPI_PREALLOC);
> +
> +				error = xfs_alloc_file_space(ip, offset, len,
> +						XFS_BMAPI_PREALLOC);
> +			} else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
> +				error = xfs_seal_file_space(ip, offset, len);
> +			} else
> +				error = xfs_alloc_file_space(ip, offset, len,
> +						XFS_BMAPI_PREALLOC);
>  		}
>  		if (error)
>  			goto out_unlock;
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index 7494dc67c66f..48546c6fbec7 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -26,6 +26,7 @@ struct space_resv {
>  					 FALLOC_FL_COLLAPSE_RANGE |	\
>  					 FALLOC_FL_ZERO_RANGE |		\
>  					 FALLOC_FL_INSERT_RANGE |	\
> -					 FALLOC_FL_UNSHARE_RANGE)
> +					 FALLOC_FL_UNSHARE_RANGE |	\
> +					 FALLOC_FL_SEAL_BLOCK_MAP)
>  
>  #endif /* _FALLOC_H_ */
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index b075f601919b..39076975bf6f 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -76,4 +76,23 @@
>   */
>  #define FALLOC_FL_UNSHARE_RANGE		0x40
>  
> +/*
> + * FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
> + * file logical-to-physical extent offset mappings in the file. The
> + * purpose is to allow an application to assume that there are no holes
> + * or shared extents in the file and that the metadata needed to find
> + * all the physical extents of the file is stable and can never be
> + * dirtied.
> + *
> + * The immutable property is in effect for the entire inode, so the
> + * range for this operation must start at offset 0 and len must be
> + * greater than or equal to the current size of the file. If greater,
> + * this operation allocates, unshares, hole fills, and seals in one

'allocates' is the same as 'hole fills', I think.

This converts unwritten extents to zeroed written extents too, correct?

> + * atomic step. If len is zero then the immutable state is cleared for
> + * the inode.

It's cleared if len == 0?  I thought that was what FL_UNSEAL is for?

> + * This flag implies FALLOC_FL_UNSHARE_RANGE and as such cannot be used
> + * with the punch, zero, collapse, or insert range modes.
> + */
> +#define FALLOC_FL_SEAL_BLOCK_MAP	0x080
>  #endif /* _UAPI_FALLOC_H_ */

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 2/5] fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP
  2017-08-04 19:46     ` Darrick J. Wong
@ 2017-08-04 19:52       ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04 19:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, Andy Lutomirski, linux-fsdevel,
	Christoph Hellwig

On Fri, Aug 4, 2017 at 12:46 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Thu, Aug 03, 2017 at 07:28:17PM -0700, Dan Williams wrote:
>> From falloc.h:
>>
>>     FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
>>     file logical-to-physical extent offset mappings in the file. The
>>     purpose is to allow an application to assume that there are no holes
>>     or shared extents in the file and that the metadata needed to find
>>     all the physical extents of the file is stable and can never be
>>     dirtied.
>>
>> For now this patch only permits setting the in-memory state of
>> S_IOMAP_IMMUTABLE. Support for clearing and persisting the state is
>> saved for later patches.
>>
>> The implementation is careful to not allow the immutable state to change
>> while any process might have any established mappings. It reuses the
>> existing xfs_reflink_unshare() and xfs_alloc_file_space() to unshare
>> extents and fill all holes in the file. It then holds XFS_ILOCK_EXCL
>> while it validates the file is in the proper state and sets
>> S_IOMAP_IMMUTABLE.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/open.c                   |   11 +++++
>>  fs/xfs/xfs_bmap_util.c      |  101 +++++++++++++++++++++++++++++++++++++++++++
>>  fs/xfs/xfs_bmap_util.h      |    2 +
>>  fs/xfs/xfs_file.c           |   14 ++++--
>>  include/linux/falloc.h      |    3 +
>>  include/uapi/linux/falloc.h |   19 ++++++++
>>  6 files changed, 145 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/open.c b/fs/open.c
>> index 7395860d7164..e3aae59785ae 100644
>> --- a/fs/open.c
>> +++ b/fs/open.c
>> @@ -273,6 +273,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>>           (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
>>               return -EINVAL;
>>
>> +     /*
>> +      * Seal block map operation should only be used exclusively, and
>> +      * with the IMMUTABLE capability.
>> +      */
>> +     if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
>> +             if (!capable(CAP_LINUX_IMMUTABLE))
>> +                     return -EPERM;
>> +             if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP)
>> +                     return -EINVAL;
>> +     }
>> +
>>       if (!(file->f_mode & FMODE_WRITE))
>>               return -EBADF;
>>
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index fe0f8f7f4bb7..46d8eb9e19fc 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -1393,6 +1393,107 @@ xfs_zero_file_space(
>>
>>  }
>>
>> +/* Return 1 if hole detected, 0 if not, and < 0 if fail to determine */
>> +STATIC int
>> +xfs_file_has_holes(
>> +     struct xfs_inode        *ip)
>> +{
>> +     struct xfs_mount        *mp = ip->i_mount;
>> +     struct xfs_bmbt_irec    *map;
>> +     const int               map_size = 10;  /* constrain memory overhead */
>> +     int                     i, nmaps;
>> +     int                     error = 0;
>> +     xfs_fileoff_t           lblkno = 0;
>> +     xfs_filblks_t           maxlblkcnt;
>> +
>> +     map = kmem_alloc(map_size * sizeof(*map), KM_SLEEP);
>
> Sleeping with an inode fully locked and (eventually) a running
> transaction?  Yikes.
>
> Just allocate one xfs_bmbt_irec on the stack and pass in nmaps=1.
>
> This method might fit better in libxfs/xfs_bmap.c where it'll be
> able to scan the extent list more quickly with the iext helpers.

Ok, I'll take a look.
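
For reference, the single-record walk suggested above could look roughly
like the following user-space sketch. It is only a model under stated
assumptions: struct extent_rec and lookup_one() are hypothetical stand-ins
for struct xfs_bmbt_irec and a one-mapping (nmaps == 1) lookup, not the
kernel interfaces, and an unmapped range is simply treated as a hole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xfs_bmbt_irec; not the kernel type. */
struct extent_rec {
	uint64_t start_off;	/* first file block covered */
	uint64_t block_count;	/* length in blocks, always > 0 */
	int	 is_hole;	/* 1 if no physical blocks back this range */
};

/*
 * Hypothetical single-mapping lookup (the nmaps == 1 case): copy the
 * record covering file block 'off' into *out. Returns 1 if a record
 * covers 'off', 0 if nothing does.
 */
static int lookup_one(const struct extent_rec *recs, size_t nrecs,
		      uint64_t off, struct extent_rec *out)
{
	for (size_t i = 0; i < nrecs; i++) {
		if (off >= recs[i].start_off &&
		    off < recs[i].start_off + recs[i].block_count) {
			*out = recs[i];
			return 1;
		}
	}
	return 0;
}

/* Return 1 if a hole exists in [0, end_blk), 0 otherwise. */
static int file_has_holes(const struct extent_rec *recs, size_t nrecs,
			  uint64_t end_blk)
{
	struct extent_rec map;	/* one record on the stack, no allocation */
	uint64_t off = 0;

	while (off < end_blk) {
		if (!lookup_one(recs, nrecs, off, &map))
			return 1;	/* nothing mapped here: count it as a hole */
		assert(map.block_count > 0);
		if (map.is_hole)
			return 1;
		off = map.start_off + map.block_count;	/* next unmapped block */
	}
	return 0;
}
```

The point of the shape is that the loop needs no memory allocation at all,
so nothing can sleep in an allocator while the inode is locked; the cost is
one lookup per extent instead of one per batch of ten.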

>
>> +
>> +     maxlblkcnt = XFS_B_TO_FSB(mp, i_size_read(VFS_I(ip)));
>> +     do {
>> +             nmaps = map_size;
>> +             error = xfs_bmapi_read(ip, lblkno, maxlblkcnt - lblkno,
>> +                                    map, &nmaps, 0);
>> +             if (error)
>> +                     break;
>> +
>> +             ASSERT(nmaps <= map_size);
>> +             for (i = 0; i < nmaps; i++) {
>> +                     lblkno += map[i].br_blockcount;
>> +                     if (map[i].br_startblock == HOLESTARTBLOCK) {
>
> I think we also need to check for unwritten extents here, because a
> write to an unwritten block requires some zeroing and a mapping metadata
> update.

Will do.
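
As a rough illustration of the extended check, here is a minimal
user-space sketch in which both holes and unwritten extents disqualify a
file from sealing. The enum and function names are hypothetical; in the
kernel the unwritten case would presumably be a test of the mapping's
br_state against XFS_EXT_UNWRITTEN next to the existing HOLESTARTBLOCK
comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent states for illustration; not the kernel's types. */
enum ext_state { EXT_WRITTEN, EXT_UNWRITTEN, EXT_HOLE };

/*
 * Return the index of the first extent whose first write would still
 * need a block map update, or -1 if every extent is already written.
 * A hole needs an allocation; an unwritten extent needs zeroing plus a
 * conversion to written. Either one breaks the "never dirtied" promise.
 */
static long first_unsealable(const enum ext_state *states, size_t n)
{
	assert(states != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (states[i] != EXT_WRITTEN)
			return (long)i;
	}
	return -1;
}
```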

>
>> +                             error = 1;
>> +                             break;
>> +                     }
>> +             }
>> +     } while (nmaps > 0 && error == 0);
>> +
>> +     kmem_free(map);
>> +     return error;
>> +}
>> +
>> +int
>> +xfs_seal_file_space(
>> +     struct xfs_inode        *ip,
>> +     xfs_off_t               offset,
>> +     xfs_off_t               len)
>> +{
>> +     struct inode            *inode = VFS_I(ip);
>> +     struct address_space    *mapping = inode->i_mapping;
>> +     int                     error;
>> +
>> +     ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
>
> The IOLOCK must be held here too.

Ok, I can add that. I had this here for the mapping_mapped() check.

>
>> +
>> +     if (offset)
>> +             return -EINVAL;
>> +
>> +     error = xfs_reflink_unshare(ip, offset, len);
>> +     if (error)
>> +             return error;
>> +
>> +     error = xfs_alloc_file_space(ip, offset, len,
>> +                     XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO);
>> +     if (error)
>> +             return error;
>> +
>> +     xfs_ilock(ip, XFS_ILOCK_EXCL);
>> +     /*
>> +      * Either the size changed after we performed allocation /
>> +      * unsharing, or the request was too small to begin with.
>> +      */
>> +     error = -EINVAL;
>> +     if (len < i_size_read(inode))
>> +             goto out_unlock;
>> +
>> +     /*
>> +      * Allow DAX path to assume that the state of S_IOMAP_IMMUTABLE
>> +      * will never change while any mapping is established.
>> +      */
>> +     error = -EBUSY;
>> +     if (mapping_mapped(mapping))
>> +             goto out_unlock;
>> +
>> +     /* Did we race someone attempting to share extents? */
>> +     if (xfs_is_reflink_inode(ip))
>> +             goto out_unlock;
>> +
>> +     /* Did we race a hole punch? */
>> +     error = xfs_file_has_holes(ip);
>> +     if (error == 1) {
>> +             error = -EBUSY;
>> +             goto out_unlock;
>> +     }
>> +
>> +     /* Abort on an error reading the block map */
>> +     if (error < 0)
>> +             goto out_unlock;
>> +
>> +     inode->i_flags |= S_IOMAP_IMMUTABLE;
>> +
>> +out_unlock:
>> +     xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +
>> +     return error;
>> +}
>> +
>>  /*
>>   * @next_fsb will keep track of the extent currently undergoing shift.
>>   * @stop_fsb will keep track of the extent at which we have to stop.
>> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
>> index 0cede1043571..5115a32a2483 100644
>> --- a/fs/xfs/xfs_bmap_util.h
>> +++ b/fs/xfs/xfs_bmap_util.h
>> @@ -60,6 +60,8 @@ int xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
>>                               xfs_off_t len);
>>  int  xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
>>                               xfs_off_t len);
>> +int  xfs_seal_file_space(struct xfs_inode *, xfs_off_t offset,
>> +                             xfs_off_t len);
>>
>>  /* EOF block manipulation functions */
>>  bool xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index c4893e226fd8..e21121530a90 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -739,7 +739,8 @@ xfs_file_write_iter(
>>  #define      XFS_FALLOC_FL_SUPPORTED                                         \
>>               (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |           \
>>                FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |      \
>> -              FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE)
>> +              FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE |     \
>> +              FALLOC_FL_SEAL_BLOCK_MAP)
>>
>>  STATIC long
>>  xfs_file_fallocate(
>> @@ -834,9 +835,14 @@ xfs_file_fallocate(
>>                               error = xfs_reflink_unshare(ip, offset, len);
>>                               if (error)
>>                                       goto out_unlock;
>> -                     }
>> -                     error = xfs_alloc_file_space(ip, offset, len,
>> -                                                  XFS_BMAPI_PREALLOC);
>> +
>> +                             error = xfs_alloc_file_space(ip, offset, len,
>> +                                             XFS_BMAPI_PREALLOC);
>> +                     } else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
>> +                             error = xfs_seal_file_space(ip, offset, len);
>> +                     } else
>> +                             error = xfs_alloc_file_space(ip, offset, len,
>> +                                             XFS_BMAPI_PREALLOC);
>>               }
>>               if (error)
>>                       goto out_unlock;
>> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
>> index 7494dc67c66f..48546c6fbec7 100644
>> --- a/include/linux/falloc.h
>> +++ b/include/linux/falloc.h
>> @@ -26,6 +26,7 @@ struct space_resv {
>>                                        FALLOC_FL_COLLAPSE_RANGE |     \
>>                                        FALLOC_FL_ZERO_RANGE |         \
>>                                        FALLOC_FL_INSERT_RANGE |       \
>> -                                      FALLOC_FL_UNSHARE_RANGE)
>> +                                      FALLOC_FL_UNSHARE_RANGE |      \
>> +                                      FALLOC_FL_SEAL_BLOCK_MAP)
>>
>>  #endif /* _FALLOC_H_ */
>> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
>> index b075f601919b..39076975bf6f 100644
>> --- a/include/uapi/linux/falloc.h
>> +++ b/include/uapi/linux/falloc.h
>> @@ -76,4 +76,23 @@
>>   */
>>  #define FALLOC_FL_UNSHARE_RANGE              0x40
>>
>> +/*
>> + * FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
>> + * file logical-to-physical extent offset mappings in the file. The
>> + * purpose is to allow an application to assume that there are no holes
>> + * or shared extents in the file and that the metadata needed to find
>> + * all the physical extents of the file is stable and can never be
>> + * dirtied.
>> + *
>> + * The immutable property is in effect for the entire inode, so the
>> + * range for this operation must start at offset 0 and len must be
>> + * greater than or equal to the current size of the file. If greater,
>> + * this operation allocates, unshares, hole fills, and seals in one
>
> 'allocates' is the same as 'hole fills', I think.
>
> This converts unwritten extents to zeroed written extents too, correct?

Yes, I'll add that and also include that in the man page patch that
I'm working on...

>
>> + * atomic step. If len is zero then the immutable state is cleared for
>> + * the inode.
>
> It's cleared if len == 0?  I thought that was what FL_UNSEAL is for?

Whoops, stale holdover from v1.
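
To make the intended argument rules concrete, here is a small user-space
model of the checks described in the quoted patch and header comment. It
is a sketch under assumptions, not the kernel code: seal_args_check() and
SEAL_BLOCK_MAP_FLAG are hypothetical names, the capability test is reduced
to a bool, and the size check that the patch performs later under
XFS_ILOCK_EXCL is folded into the same function. The len <= 0 rejection
reflects the generic fallocate argument check as far as I recall it, which
is another reason a "len == 0 clears the flag" rule could not work here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Value taken from the quoted patch; a proposal, not a released ABI. */
#define SEAL_BLOCK_MAP_FLAG	0x080

/*
 * Hypothetical model of the seal argument checks:
 *  - the caller needs CAP_LINUX_IMMUTABLE (modeled as has_cap),
 *  - the seal flag cannot be combined with any other mode bit,
 *  - len must be positive,
 *  - the range must start at 0 and cover at least the current file size.
 * Returns 0 or a negative errno, following kernel convention.
 */
static int seal_args_check(int mode, int64_t offset, int64_t len,
			   int64_t file_size, bool has_cap)
{
	assert(mode & SEAL_BLOCK_MAP_FLAG);
	if (!has_cap)
		return -EPERM;
	if (mode & ~SEAL_BLOCK_MAP_FLAG)
		return -EINVAL;
	if (len <= 0)
		return -EINVAL;
	if (offset != 0 || len < file_size)
		return -EINVAL;
	return 0;
}
```

Under this model, a caller that wants to seal an existing file would pass
offset 0 and a len equal to (or larger than) the size reported by fstat().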

>> + * This flag implies FALLOC_FL_UNSHARE_RANGE and as such cannot be used
>> + * with the punch, zero, collapse, or insert range modes.
>> + */
>> +#define FALLOC_FL_SEAL_BLOCK_MAP     0x080
>>  #endif /* _UAPI_FALLOC_H_ */

Thanks Darrick!

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 2/5] fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP
@ 2017-08-04 19:52       ` Dan Williams
  0 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04 19:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Jeff Moyer, Alexander Viro, Andy Lutomirski, linux-fsdevel,
	Ross Zwisler, Christoph Hellwig

On Fri, Aug 4, 2017 at 12:46 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Thu, Aug 03, 2017 at 07:28:17PM -0700, Dan Williams wrote:
>> From falloc.h:
>>
>>     FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
>>     file logical-to-physical extent offset mappings in the file. The
>>     purpose is to allow an application to assume that there are no holes
>>     or shared extents in the file and that the metadata needed to find
>>     all the physical extents of the file is stable and can never be
>>     dirtied.
>>
>> For now this patch only permits setting the in-memory state of
>> S_IOMAP_IMMUTABLE. Support for clearing and persisting the state is
>> saved for later patches.
>>
>> The implementation is careful to not allow the immutable state to change
>> while any process might have any established mappings. It reuses the
>> existing xfs_reflink_unshare() and xfs_alloc_file_space() to unshare
>> extents and fill all holes in the file. It then holds XFS_ILOCK_EXCL
>> while it validates the file is in the proper state and sets
>> S_IOMAP_IMMUTABLE.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/open.c                   |   11 +++++
>>  fs/xfs/xfs_bmap_util.c      |  101 +++++++++++++++++++++++++++++++++++++++++++
>>  fs/xfs/xfs_bmap_util.h      |    2 +
>>  fs/xfs/xfs_file.c           |   14 ++++--
>>  include/linux/falloc.h      |    3 +
>>  include/uapi/linux/falloc.h |   19 ++++++++
>>  6 files changed, 145 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/open.c b/fs/open.c
>> index 7395860d7164..e3aae59785ae 100644
>> --- a/fs/open.c
>> +++ b/fs/open.c
>> @@ -273,6 +273,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>>           (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
>>               return -EINVAL;
>>
>> +     /*
>> +      * Seal block map operation should only be used exclusively, and
>> +      * with the IMMUTABLE capability.
>> +      */
>> +     if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
>> +             if (!capable(CAP_LINUX_IMMUTABLE))
>> +                     return -EPERM;
>> +             if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP)
>> +                     return -EINVAL;
>> +     }
>> +
>>       if (!(file->f_mode & FMODE_WRITE))
>>               return -EBADF;
>>
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index fe0f8f7f4bb7..46d8eb9e19fc 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -1393,6 +1393,107 @@ xfs_zero_file_space(
>>
>>  }
>>
>> +/* Return 1 if hole detected, 0 if not, and < 0 if fail to determine */
>> +STATIC int
>> +xfs_file_has_holes(
>> +     struct xfs_inode        *ip)
>> +{
>> +     struct xfs_mount        *mp = ip->i_mount;
>> +     struct xfs_bmbt_irec    *map;
>> +     const int               map_size = 10;  /* constrain memory overhead */
>> +     int                     i, nmaps;
>> +     int                     error = 0;
>> +     xfs_fileoff_t           lblkno = 0;
>> +     xfs_filblks_t           maxlblkcnt;
>> +
>> +     map = kmem_alloc(map_size * sizeof(*map), KM_SLEEP);
>
> Sleeping with an inode fully locked and (eventually) a running
> transaction?  Yikes.
>
> Just allocate one xfs_bmbt_irec on the stack and pass in nmaps=1.
>
> This method might fit better in libxfs/xfs_bmap.c where it'll be
> able to scan the extent list more quickly with the iext helpers.

Ok, I'll take a look.

>
>> +
>> +     maxlblkcnt = XFS_B_TO_FSB(mp, i_size_read(VFS_I(ip)));
>> +     do {
>> +             nmaps = map_size;
>> +             error = xfs_bmapi_read(ip, lblkno, maxlblkcnt - lblkno,
>> +                                    map, &nmaps, 0);
>> +             if (error)
>> +                     break;
>> +
>> +             ASSERT(nmaps <= map_size);
>> +             for (i = 0; i < nmaps; i++) {
>> +                     lblkno += map[i].br_blockcount;
>> +                     if (map[i].br_startblock == HOLESTARTBLOCK) {
>
> I think we also need to check for unwritten extents here, because a
> write to an unwritten block requires some zeroing and a mapping metadata
> update.

Will do.

>
>> +                             error = 1;
>> +                             break;
>> +                     }
>> +             }
>> +     } while (nmaps > 0 && error == 0);
>> +
>> +     kmem_free(map);
>> +     return error;
>> +}
>> +
>> +int
>> +xfs_seal_file_space(
>> +     struct xfs_inode        *ip,
>> +     xfs_off_t               offset,
>> +     xfs_off_t               len)
>> +{
>> +     struct inode            *inode = VFS_I(ip);
>> +     struct address_space    *mapping = inode->i_mapping;
>> +     int                     error;
>> +
>> +     ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
>
> The IOLOCK must be held here too.

Ok, I can add that. I had this here for the mapping_mapped() check.

>
>> +
>> +     if (offset)
>> +             return -EINVAL;
>> +
>> +     error = xfs_reflink_unshare(ip, offset, len);
>> +     if (error)
>> +             return error;
>> +
>> +     error = xfs_alloc_file_space(ip, offset, len,
>> +                     XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO);
>> +     if (error)
>> +             return error;
>> +
>> +     xfs_ilock(ip, XFS_ILOCK_EXCL);
>> +     /*
>> +      * Either the size changed after we performed allocation /
>> +      * unsharing, or the request was too small to begin with.
>> +      */
>> +     error = -EINVAL;
>> +     if (len < i_size_read(inode))
>> +             goto out_unlock;
>> +
>> +     /*
>> +      * Allow DAX path to assume that the state of S_IOMAP_IMMUTABLE
>> +      * will never change while any mapping is established.
>> +      */
>> +     error = -EBUSY;
>> +     if (mapping_mapped(mapping))
>> +             goto out_unlock;
>> +
>> +     /* Did we race someone attempting to share extents? */
>> +     if (xfs_is_reflink_inode(ip))
>> +             goto out_unlock;
>> +
>> +     /* Did we race a hole punch? */
>> +     error = xfs_file_has_holes(ip);
>> +     if (error == 1) {
>> +             error = -EBUSY;
>> +             goto out_unlock;
>> +     }
>> +
>> +     /* Abort on an error reading the block map */
>> +     if (error < 0)
>> +             goto out_unlock;
>> +
>> +     inode->i_flags |= S_IOMAP_IMMUTABLE;
>> +
>> +out_unlock:
>> +     xfs_iunlock(ip, XFS_ILOCK_EXCL);
>> +
>> +     return error;
>> +}
>> +
>>  /*
>>   * @next_fsb will keep track of the extent currently undergoing shift.
>>   * @stop_fsb will keep track of the extent at which we have to stop.
>> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
>> index 0cede1043571..5115a32a2483 100644
>> --- a/fs/xfs/xfs_bmap_util.h
>> +++ b/fs/xfs/xfs_bmap_util.h
>> @@ -60,6 +60,8 @@ int xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
>>                               xfs_off_t len);
>>  int  xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
>>                               xfs_off_t len);
>> +int  xfs_seal_file_space(struct xfs_inode *, xfs_off_t offset,
>> +                             xfs_off_t len);
>>
>>  /* EOF block manipulation functions */
>>  bool xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index c4893e226fd8..e21121530a90 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -739,7 +739,8 @@ xfs_file_write_iter(
>>  #define      XFS_FALLOC_FL_SUPPORTED                                         \
>>               (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |           \
>>                FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |      \
>> -              FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE)
>> +              FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE |     \
>> +              FALLOC_FL_SEAL_BLOCK_MAP)
>>
>>  STATIC long
>>  xfs_file_fallocate(
>> @@ -834,9 +835,14 @@ xfs_file_fallocate(
>>                               error = xfs_reflink_unshare(ip, offset, len);
>>                               if (error)
>>                                       goto out_unlock;
>> -                     }
>> -                     error = xfs_alloc_file_space(ip, offset, len,
>> -                                                  XFS_BMAPI_PREALLOC);
>> +
>> +                             error = xfs_alloc_file_space(ip, offset, len,
>> +                                             XFS_BMAPI_PREALLOC);
>> +                     } else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
>> +                             error = xfs_seal_file_space(ip, offset, len);
>> +                     } else
>> +                             error = xfs_alloc_file_space(ip, offset, len,
>> +                                             XFS_BMAPI_PREALLOC);
>>               }
>>               if (error)
>>                       goto out_unlock;
>> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
>> index 7494dc67c66f..48546c6fbec7 100644
>> --- a/include/linux/falloc.h
>> +++ b/include/linux/falloc.h
>> @@ -26,6 +26,7 @@ struct space_resv {
>>                                        FALLOC_FL_COLLAPSE_RANGE |     \
>>                                        FALLOC_FL_ZERO_RANGE |         \
>>                                        FALLOC_FL_INSERT_RANGE |       \
>> -                                      FALLOC_FL_UNSHARE_RANGE)
>> +                                      FALLOC_FL_UNSHARE_RANGE |      \
>> +                                      FALLOC_FL_SEAL_BLOCK_MAP)
>>
>>  #endif /* _FALLOC_H_ */
>> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
>> index b075f601919b..39076975bf6f 100644
>> --- a/include/uapi/linux/falloc.h
>> +++ b/include/uapi/linux/falloc.h
>> @@ -76,4 +76,23 @@
>>   */
>>  #define FALLOC_FL_UNSHARE_RANGE              0x40
>>
>> +/*
>> + * FALLOC_FL_SEAL_BLOCK_MAP is used to seal (make immutable) all of the
>> + * file logical-to-physical extent offset mappings in the file. The
>> + * purpose is to allow an application to assume that there are no holes
>> + * or shared extents in the file and that the metadata needed to find
>> + * all the physical extents of the file is stable and can never be
>> + * dirtied.
>> + *
>> + * The immutable property is in effect for the entire inode, so the
>> + * range for this operation must start at offset 0 and len must be
>> + * greater than or equal to the current size of the file. If greater,
>> + * this operation allocates, unshares, hole fills, and seals in one
>
> 'allocates' is the same as 'hole fills', I think.
>
> This converts unwritten extents to zeroed written extents too, correct?

Yes, I'll add that and also include that in the man page patch that
I'm working on...

>
>> + * atomic step. If len is zero then the immutable state is cleared for
>> + * the inode.
>
> It's cleared if len == 0?  I thought that was what FL_UNSEAL is for?

Whoops, stale holdover from v1.

>> + * This flag implies FALLOC_FL_UNSHARE_RANGE and as such cannot be used
>> + * with the punch, zero, collapse, or insert range modes.
>> + */
>> +#define FALLOC_FL_SEAL_BLOCK_MAP     0x080
>>  #endif /* _UAPI_FALLOC_H_ */

Thanks Darrick!

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 1/5] fs, xfs: introduce S_IOMAP_IMMUTABLE
  2017-08-04  2:28   ` Dan Williams
@ 2017-08-04 20:00     ` Darrick J. Wong
  -1 siblings, 0 replies; 108+ messages in thread
From: Darrick J. Wong @ 2017-08-04 20:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, luto, linux-fsdevel, Christoph Hellwig

On Thu, Aug 03, 2017 at 07:28:10PM -0700, Dan Williams wrote:
> An inode with this flag set indicates that the file's block map cannot
> be changed from the currently allocated set.
> 
> The implementation of toggling the flag and sealing the state of the
> extent map is saved for a later patch. The functionality provided by
> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
> that provided by S_SWAPFILE, and it is targeted to replace it.
> 
> For now, only xfs and the core vfs are updated to consider the new flag.
> 
> The additional checks that are added for this flag, beyond what we are
> already doing for swapfiles, are:
> * fail writes that try to extend the file size
> * fail attempts to directly change the allocation map via fallocate or
>   xfs ioctls. This can be done centrally by blocking
>   xfs_alloc_file_space and xfs_free_file_space when the flag is set.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/attr.c              |   10 ++++++++++
>  fs/open.c              |    6 ++++++
>  fs/read_write.c        |    3 +++
>  fs/xfs/xfs_bmap_util.c |    6 ++++++
>  fs/xfs/xfs_ioctl.c     |    6 ++++++
>  include/linux/fs.h     |    2 ++
>  mm/filemap.c           |    5 +++++
>  7 files changed, 38 insertions(+)
> 
> diff --git a/fs/attr.c b/fs/attr.c
> index 135304146120..8573e364bd06 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare);
>   */
>  int inode_newsize_ok(const struct inode *inode, loff_t offset)
>  {
> +	if (IS_IOMAP_IMMUTABLE(inode)) {
> +		/*
> +		 * Any size change is disallowed. Size increases may
> +		 * dirty metadata that an application is not prepared to
> +		 * sync, and a size decrease may expose free blocks to
> +		 * in-flight DMA.
> +		 */
> +		return -ETXTBSY;
> +	}
> +
>  	if (inode->i_size < offset) {
>  		unsigned long limit;
>  
> diff --git a/fs/open.c b/fs/open.c
> index 35bb784763a4..7395860d7164 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -ETXTBSY;
>  
>  	/*
> +	 * We cannot allow any allocation changes on an iomap immutable file
> +	 */
> +	if (IS_IOMAP_IMMUTABLE(inode))
> +		return -ETXTBSY;
> +
> +	/*
>  	 * Revalidate the write permissions, in case security policy has
>  	 * changed since the files were opened.
>  	 */
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 0cc7033aa413..dc673be7c7cb 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
>  	if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
>  		return -ETXTBSY;
>  
> +	if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out))
> +		return -ETXTBSY;
> +
>  	/* Don't reflink dirs, pipes, sockets... */
>  	if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>  		return -EISDIR;
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 93e955262d07..fe0f8f7f4bb7 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1044,6 +1044,9 @@ xfs_alloc_file_space(
>  	if (XFS_FORCED_SHUTDOWN(mp))
>  		return -EIO;
>  
> +	if (IS_IOMAP_IMMUTABLE(VFS_I(ip)))
> +		return -ETXTBSY;
> +

Hm.  The 'seal this up' caller in the next patch doesn't check for
ETXTBSY (or if it does I missed that), so if you try to seal an already
sealed file you'll get an error code even though you actually got the
state you wanted.
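
One way to avoid the spurious error, sketched as a userspace model rather than kernel code: return success early from the seal path when the flag is already set, before reaching the allocation helper that now rejects sealed inodes. The `model_inode` type and the `model_*` functions are hypothetical stand-ins; only the S_IOMAP_IMMUTABLE value and the -ETXTBSY return are taken from the patch.

```c
#include <errno.h>

#define S_IOMAP_IMMUTABLE 16384 /* value from the patch */

/* Hypothetical stand-in for struct inode; only i_flags is modelled. */
struct model_inode {
	unsigned int i_flags;
};

/* Models the new guard at the top of xfs_alloc_file_space(). */
int model_alloc_file_space(const struct model_inode *inode)
{
	if (inode->i_flags & S_IOMAP_IMMUTABLE)
		return -ETXTBSY;
	return 0;
}

/* Seal without an early return: re-sealing trips the guard. */
int model_seal_naive(struct model_inode *inode)
{
	int error = model_alloc_file_space(inode);

	if (error)
		return error;
	inode->i_flags |= S_IOMAP_IMMUTABLE;
	return 0;
}

/* Seal with an early return: re-sealing is a successful no-op. */
int model_seal_idempotent(struct model_inode *inode)
{
	if (inode->i_flags & S_IOMAP_IMMUTABLE)
		return 0;
	return model_seal_naive(inode);
}
```

The same consideration would apply to the IS_IOMAP_IMMUTABLE() check this patch adds to vfs_fallocate(), since that check runs before the filesystem is called.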

Second question: How might we handle the situation where a filesystem
/has/ to alter a block mapping?  Hypothetically, if the block layer
tells the fs that some range of storage has gone bad and the fs decides
to punch out that part of the file (or mark it unwritten or whatever) to
avoid a machine check, can we lock out file IO, forcibly remove the
mapping from memory, make whatever block map updates we want, and then
unlock?

(Conceptually, the bmbt rebuilder in the online fsck patchset operates
in a similar manner...)

--D

>  	error = xfs_qm_dqattach(ip, 0);
>  	if (error)
>  		return error;
> @@ -1294,6 +1297,9 @@ xfs_free_file_space(
>  
>  	trace_xfs_free_file_space(ip);
>  
> +	if (IS_IOMAP_IMMUTABLE(VFS_I(ip)))
> +		return -ETXTBSY;
> +
>  	error = xfs_qm_dqattach(ip, 0);
>  	if (error)
>  		return error;
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index e75c40a47b7d..2e64488bc4de 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1755,6 +1755,12 @@ xfs_ioc_swapext(
>  		goto out_put_tmp_file;
>  	}
>  
> +	if (IS_IOMAP_IMMUTABLE(file_inode(f.file)) ||
> +	    IS_IOMAP_IMMUTABLE(file_inode(tmp.file))) {
> +		error = -EINVAL;
> +		goto out_put_tmp_file;
> +	}
> +
>  	/*
>  	 * We need to ensure that the fds passed in point to XFS inodes
>  	 * before we cast and access them as XFS structures as we have no
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6e1fd5d21248..0a254b768855 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1829,6 +1829,7 @@ struct super_operations {
>  #else
>  #define S_DAX		0	/* Make all the DAX code disappear */
>  #endif
> +#define S_IOMAP_IMMUTABLE 16384 /* logical-to-physical extent map is fixed */
>  
>  /*
>   * Note that nosuid etc flags are inode-specific: setting some file-system
> @@ -1867,6 +1868,7 @@ struct super_operations {
>  #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
>  #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
>  #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
> +#define IS_IOMAP_IMMUTABLE(inode) ((inode)->i_flags & S_IOMAP_IMMUTABLE)
>  
>  #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
>  				 (inode)->i_rdev == WHITEOUT_DEV)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a49702445ce0..a4105a4c1d69 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2806,6 +2806,11 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
>  	if (unlikely(pos >= inode->i_sb->s_maxbytes))
>  		return -EFBIG;
>  
> +	/* Are we about to mutate the block map on an immutable file? */
> +	if (IS_IOMAP_IMMUTABLE(inode)
> +			&& (pos + iov_iter_count(from) > i_size_read(inode)))
> +		return -ETXTBSY;
> +
>  	iov_iter_truncate(from, inode->i_sb->s_maxbytes - pos);
>  	return iov_iter_count(from);
>  }
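
The generic_write_checks() hunk above boils down to one comparison: a write is rejected only if it would end past the current file size. A minimal userspace sketch of that rule follows, with `model_write_check` and its parameters as hypothetical names and the kernel's iov_iter and inode plumbing left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the check added to generic_write_checks().
 * Returns the byte count that may be written, or -ETXTBSY if the write
 * would extend a block-map-immutable file.
 */
long long model_write_check(bool immutable, long long pos, size_t count,
			    long long i_size)
{
	if (immutable && pos + (long long)count > i_size)
		return -ETXTBSY;
	return (long long)count;
}
```

Overwrites that stay within i_size pass, including one that ends exactly at EOF; anything that would grow the file fails up front rather than being completed as a partial write.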
> 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 3/5] fs, xfs: introduce FALLOC_FL_UNSEAL_BLOCK_MAP
  2017-08-04  2:28   ` Dan Williams
@ 2017-08-04 20:04     ` Darrick J. Wong
  -1 siblings, 0 replies; 108+ messages in thread
From: Darrick J. Wong @ 2017-08-04 20:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, luto, linux-fsdevel, Christoph Hellwig

On Thu, Aug 03, 2017 at 07:28:23PM -0700, Dan Williams wrote:
> Provide an explicit fallocate operation type for clearing the
> S_IOMAP_IMMUTABLE flag. Like the enable case it requires
> CAP_LINUX_IMMUTABLE and it can only be performed while no process has
> the file mapped.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/open.c                   |   17 +++++++++++------
>  fs/xfs/xfs_bmap_util.c      |   42 ++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_bmap_util.h      |    3 +++
>  fs/xfs/xfs_file.c           |    4 +++-
>  include/linux/falloc.h      |    3 ++-
>  include/uapi/linux/falloc.h |    1 +
>  6 files changed, 62 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index e3aae59785ae..ccfd8d3becc8 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -274,13 +274,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -EINVAL;
>  
>  	/*
> -	 * Seal block map operation should only be used exclusively, and
> -	 * with the IMMUTABLE capability.
> +	 * Seal/unseal block map operations should only be used
> +	 * exclusively, and with the IMMUTABLE capability.
>  	 */
> -	if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
> +	if (mode & (FALLOC_FL_SEAL_BLOCK_MAP | FALLOC_FL_UNSEAL_BLOCK_MAP)) {
>  		if (!capable(CAP_LINUX_IMMUTABLE))
>  			return -EPERM;
> -		if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP)
> +		if (mode == (FALLOC_FL_SEAL_BLOCK_MAP
> +					| FALLOC_FL_UNSEAL_BLOCK_MAP))
> +			return -EINVAL;
> +		if (mode & ~(FALLOC_FL_SEAL_BLOCK_MAP
> +					| FALLOC_FL_UNSEAL_BLOCK_MAP))
>  			return -EINVAL;
>  	}
>  
> @@ -303,9 +307,10 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -ETXTBSY;
>  
>  	/*
> -	 * We cannot allow any allocation changes on an iomap immutable file
> +	 * We cannot allow any allocation changes on an iomap immutable
> +	 * file, but we can allow clearing the immutable state.
>  	 */
> -	if (IS_IOMAP_IMMUTABLE(inode))
> +	if (IS_IOMAP_IMMUTABLE(inode) && !(mode & FALLOC_FL_UNSEAL_BLOCK_MAP))
>  		return -ETXTBSY;
>  
>  	/*
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 46d8eb9e19fc..70ac2d33ab27 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1494,6 +1494,48 @@ xfs_seal_file_space(
>  	return error;
>  }
>  
> +int
> +xfs_unseal_file_space(
> +	struct xfs_inode	*ip,
> +	xfs_off_t		offset,
> +	xfs_off_t		len)
> +{
> +	struct inode		*inode = VFS_I(ip);
> +	struct address_space	*mapping = inode->i_mapping;
> +	int			error;
> +
> +	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));

Same assert-on-the-iolock comment as the previous patch.

> +
> +	if (offset)
> +		return -EINVAL;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	/*
> +	 * It does not make sense to unseal less than the full range of
> +	 * the file.
> +	 */
> +	error = -EINVAL;
> +	if (len < i_size_read(inode))
> +		goto out_unlock;

Hmm, should we be picky and require len == i_size_read() here?
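
The two candidate policies differ only in the comparison operator. A small sketch, assuming a hypothetical `model_unseal_range_ok` helper with a `strict` switch that does not exist in the patch:

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the range validation in xfs_unseal_file_space().
 * strict == false mirrors the patch (any len >= i_size is accepted);
 * strict == true is the pickier variant that requires len == i_size.
 */
int model_unseal_range_ok(long long offset, long long len,
			  long long i_size, bool strict)
{
	if (offset != 0)
		return -EINVAL;
	if (strict ? len != i_size : len < i_size)
		return -EINVAL;
	return 0;
}
```

With the strict variant, a caller passing an oversized len gets -EINVAL instead of a silent success.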

> +	/*
> +	 * Provide safety against one thread changing the policy of not
> +	 * requiring fsync/msync (for block allocations) behind another
> +	 * thread's back.
> +	 */
> +	error = -EBUSY;
> +	if (mapping_mapped(mapping))
> +		goto out_unlock;
> +
> +	inode->i_flags &= ~S_IOMAP_IMMUTABLE;

It occurred to me, should we jump out early from the seal/unseal
operations if the flag state matches whatever the user is asking for?
This is perhaps not necessary for unseal since we don't do a lot of
work.

--D

> +	error = 0;
> +
> +out_unlock:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +	return error;
> +}
> +
>  /*
>   * @next_fsb will keep track of the extent currently undergoing shift.
>   * @stop_fsb will keep track of the extent at which we have to stop.
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 5115a32a2483..b64653a75942 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -62,6 +62,9 @@ int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
>  				xfs_off_t len);
>  int	xfs_seal_file_space(struct xfs_inode *, xfs_off_t offset,
>  				xfs_off_t len);
> +int	xfs_unseal_file_space(struct xfs_inode *, xfs_off_t offset,
> +				xfs_off_t len);
> +
>  
>  /* EOF block manipulation functions */
>  bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e21121530a90..833f77700be2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -740,7 +740,7 @@ xfs_file_write_iter(
>  		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
>  		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
>  		 FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE |	\
> -		 FALLOC_FL_SEAL_BLOCK_MAP)
> +		 FALLOC_FL_SEAL_BLOCK_MAP | FALLOC_FL_UNSEAL_BLOCK_MAP)
>  
>  STATIC long
>  xfs_file_fallocate(
> @@ -840,6 +840,8 @@ xfs_file_fallocate(
>  						XFS_BMAPI_PREALLOC);
>  			} else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
>  				error = xfs_seal_file_space(ip, offset, len);
> +			} else if (mode & FALLOC_FL_UNSEAL_BLOCK_MAP) {
> +				error = xfs_unseal_file_space(ip, offset, len);
>  			} else
>  				error = xfs_alloc_file_space(ip, offset, len,
>  						XFS_BMAPI_PREALLOC);
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index 48546c6fbec7..b22c1368ed1e 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -27,6 +27,7 @@ struct space_resv {
>  					 FALLOC_FL_ZERO_RANGE |		\
>  					 FALLOC_FL_INSERT_RANGE |	\
>  					 FALLOC_FL_UNSHARE_RANGE |	\
> -					 FALLOC_FL_SEAL_BLOCK_MAP)
> +					 FALLOC_FL_SEAL_BLOCK_MAP |	\
> +					 FALLOC_FL_UNSEAL_BLOCK_MAP)
>  
>  #endif /* _FALLOC_H_ */
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index 39076975bf6f..a4949e1a2dae 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -95,4 +95,5 @@
>   * with the punch, zero, collapse, or insert range modes.
>   */
>  #define FALLOC_FL_SEAL_BLOCK_MAP	0x080
> +#define FALLOC_FL_UNSEAL_BLOCK_MAP	0x100
>  #endif /* _UAPI_FALLOC_H_ */
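
The vfs_fallocate() hunk in this patch makes seal and unseal mutually exclusive with each other and with every other mode bit. A userspace sketch of that validation, with the capability check reduced to a boolean parameter; `model_check_seal_mode`, `has_cap` and the `MODEL_FL_*` names are hypothetical, while the flag values are those in the uapi hunk:

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_FL_SEAL_BLOCK_MAP		0x080 /* FALLOC_FL_SEAL_BLOCK_MAP in the patch */
#define MODEL_FL_UNSEAL_BLOCK_MAP	0x100 /* FALLOC_FL_UNSEAL_BLOCK_MAP in the patch */

/* Models the seal/unseal mode validation, in the same order as the patch. */
int model_check_seal_mode(int mode, bool has_cap)
{
	const int seal_bits = MODEL_FL_SEAL_BLOCK_MAP |
			      MODEL_FL_UNSEAL_BLOCK_MAP;

	if (!(mode & seal_bits))
		return 0;		/* not a seal/unseal request */
	if (!has_cap)
		return -EPERM;		/* models !capable(CAP_LINUX_IMMUTABLE) */
	if (mode == seal_bits)
		return -EINVAL;		/* seal and unseal together */
	if (mode & ~seal_bits)
		return -EINVAL;		/* combined with another mode bit */
	return 0;
}
```

Note that the capability check comes first, so an unprivileged caller sees -EPERM even for an otherwise invalid flag combination.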
> 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 3/5] fs, xfs: introduce FALLOC_FL_UNSEAL_BLOCK_MAP
@ 2017-08-04 20:04     ` Darrick J. Wong
  0 siblings, 0 replies; 108+ messages in thread
From: Darrick J. Wong @ 2017-08-04 20:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Jeff Moyer, Alexander Viro, luto, linux-fsdevel, Ross Zwisler,
	Christoph Hellwig

On Thu, Aug 03, 2017 at 07:28:23PM -0700, Dan Williams wrote:
> Provide an explicit fallocate operation type for clearing the
> S_IOMAP_IMMUTABLE flag. Like the enable case it requires CAP_IMMUTABLE
> and it can only be performed while no process has the file mapped.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/open.c                   |   17 +++++++++++------
>  fs/xfs/xfs_bmap_util.c      |   42 ++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_bmap_util.h      |    3 +++
>  fs/xfs/xfs_file.c           |    4 +++-
>  include/linux/falloc.h      |    3 ++-
>  include/uapi/linux/falloc.h |    1 +
>  6 files changed, 62 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index e3aae59785ae..ccfd8d3becc8 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -274,13 +274,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -EINVAL;
>  
>  	/*
> -	 * Seal block map operation should only be used exclusively, and
> -	 * with the IMMUTABLE capability.
> +	 * Seal/unseal block map operations should only be used
> +	 * exclusively, and with the IMMUTABLE capability.
>  	 */
> -	if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
> +	if (mode & (FALLOC_FL_SEAL_BLOCK_MAP | FALLOC_FL_UNSEAL_BLOCK_MAP)) {
>  		if (!capable(CAP_LINUX_IMMUTABLE))
>  			return -EPERM;
> -		if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP)
> +		if (mode == (FALLOC_FL_SEAL_BLOCK_MAP
> +					| FALLOC_FL_UNSEAL_BLOCK_MAP))
> +			return -EINVAL;
> +		if (mode & ~(FALLOC_FL_SEAL_BLOCK_MAP
> +					| FALLOC_FL_UNSEAL_BLOCK_MAP))
>  			return -EINVAL;
>  	}
>  
> @@ -303,9 +307,10 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		return -ETXTBSY;
>  
>  	/*
> -	 * We cannot allow any allocation changes on an iomap immutable file
> +	 * We cannot allow any allocation changes on an iomap immutable
> +	 * file, but we can allow clearing the immutable state.
>  	 */
> -	if (IS_IOMAP_IMMUTABLE(inode))
> +	if (IS_IOMAP_IMMUTABLE(inode) && !(mode & FALLOC_FL_UNSEAL_BLOCK_MAP))
>  		return -ETXTBSY;
>  
>  	/*
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 46d8eb9e19fc..70ac2d33ab27 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1494,6 +1494,48 @@ xfs_seal_file_space(
>  	return error;
>  }
>  
> +int
> +xfs_unseal_file_space(
> +	struct xfs_inode	*ip,
> +	xfs_off_t		offset,
> +	xfs_off_t		len)
> +{
> +	struct inode		*inode = VFS_I(ip);
> +	struct address_space	*mapping = inode->i_mapping;
> +	int			error;
> +
> +	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));

Same assert-on-the-iolock comment as the previous patch.

> +
> +	if (offset)
> +		return -EINVAL;
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	/*
> +	 * It does not make sense to unseal less than the full range of
> +	 * the file.
> +	 */
> +	error = -EINVAL;
> +	if (len < i_size_read(inode))
> +		goto out_unlock;

Hmm, should we be picky and require len == i_size_read() here?

> +	/*
> +	 * Provide safety against one thread changing the policy of not
> +	 * requiring fsync/msync (for block allocations) behind another
> +	 * thread's back.
> +	 */
> +	error = -EBUSY;
> +	if (mapping_mapped(mapping))
> +		goto out_unlock;
> +
> +	inode->i_flags &= ~S_IOMAP_IMMUTABLE;

It occurred to me, should we jump out early from the seal/unseal
operations if the flag state matches whatever the user is asking for?
This is perhaps not necessary for unseal since we don't do a lot of
work.

--D

> +	error = 0;
> +
> +out_unlock:
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +	return error;
> +}
> +
>  /*
>   * @next_fsb will keep track of the extent currently undergoing shift.
>   * @stop_fsb will keep track of the extent at which we have to stop.
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 5115a32a2483..b64653a75942 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -62,6 +62,9 @@ int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
>  				xfs_off_t len);
>  int	xfs_seal_file_space(struct xfs_inode *, xfs_off_t offset,
>  				xfs_off_t len);
> +int	xfs_unseal_file_space(struct xfs_inode *, xfs_off_t offset,
> +				xfs_off_t len);
> +
>  
>  /* EOF block manipulation functions */
>  bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e21121530a90..833f77700be2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -740,7 +740,7 @@ xfs_file_write_iter(
>  		(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |		\
>  		 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |	\
>  		 FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE |	\
> -		 FALLOC_FL_SEAL_BLOCK_MAP)
> +		 FALLOC_FL_SEAL_BLOCK_MAP | FALLOC_FL_UNSEAL_BLOCK_MAP)
>  
>  STATIC long
>  xfs_file_fallocate(
> @@ -840,6 +840,8 @@ xfs_file_fallocate(
>  						XFS_BMAPI_PREALLOC);
>  			} else if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
>  				error = xfs_seal_file_space(ip, offset, len);
> +			} else if (mode & FALLOC_FL_UNSEAL_BLOCK_MAP) {
> +				error = xfs_unseal_file_space(ip, offset, len);
>  			} else
>  				error = xfs_alloc_file_space(ip, offset, len,
>  						XFS_BMAPI_PREALLOC);
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index 48546c6fbec7..b22c1368ed1e 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -27,6 +27,7 @@ struct space_resv {
>  					 FALLOC_FL_ZERO_RANGE |		\
>  					 FALLOC_FL_INSERT_RANGE |	\
>  					 FALLOC_FL_UNSHARE_RANGE |	\
> -					 FALLOC_FL_SEAL_BLOCK_MAP)
> +					 FALLOC_FL_SEAL_BLOCK_MAP |	\
> +					 FALLOC_FL_UNSEAL_BLOCK_MAP)
>  
>  #endif /* _FALLOC_H_ */
> diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> index 39076975bf6f..a4949e1a2dae 100644
> --- a/include/uapi/linux/falloc.h
> +++ b/include/uapi/linux/falloc.h
> @@ -95,4 +95,5 @@
>   * with the punch, zero, collapse, or insert range modes.
>   */
>  #define FALLOC_FL_SEAL_BLOCK_MAP	0x080
> +#define FALLOC_FL_UNSEAL_BLOCK_MAP	0x100
>  #endif /* _UAPI_FALLOC_H_ */
> 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 5/5] xfs: toggle XFS_DIFLAG2_IOMAP_IMMUTABLE in response to fallocate
  2017-08-04  2:28   ` Dan Williams
@ 2017-08-04 20:14     ` Darrick J. Wong
  -1 siblings, 0 replies; 108+ messages in thread
From: Darrick J. Wong @ 2017-08-04 20:14 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	luto, linux-fsdevel, Christoph Hellwig

On Thu, Aug 03, 2017 at 07:28:35PM -0700, Dan Williams wrote:
> After validating the state of the file as not having holes, shared
> extents, or active mappings try to commit the
> XFS_DIFLAG2_IOMAP_IMMUTABLE flag to the on-disk inode metadata. If that
> succeeds then allow the S_IOMAP_IMMUTABLE to be set on the vfs inode.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/xfs/xfs_bmap_util.c |   32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 70ac2d33ab27..8464c25a2403 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1436,9 +1436,11 @@ xfs_seal_file_space(
>  	xfs_off_t		offset,
>  	xfs_off_t		len)
>  {
> +	struct xfs_mount	*mp = ip->i_mount;
>  	struct inode		*inode = VFS_I(ip);
>  	struct address_space	*mapping = inode->i_mapping;
>  	int			error;
> +	struct xfs_trans	*tp;
>
>  	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
>  
> @@ -1454,6 +1456,10 @@ xfs_seal_file_space(
>  	if (error)
>  		return error;
>  
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
> +	if (error)
> +		return error;
> +
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	/*
>  	 * Either the size changed after we performed allocation /
> @@ -1486,10 +1492,20 @@ xfs_seal_file_space(
>  	if (error < 0)
>  		goto out_unlock;
>  
> +	xfs_trans_ijoin(tp, ip, 0);

FWIW if you change that third parameter to XFS_ILOCK_EXCL then
xfs_trans_commit will do the xfs_iunlock(ip, XFS_ILOCK_EXCL) for you if
the commit succeeds...

> +	ip->i_d.di_flags2 |= XFS_DIFLAG2_IOMAP_IMMUTABLE;
> +	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> +	error = xfs_trans_commit(tp);
> +	tp = NULL; /* nothing to cancel */
> +	if (error)
> +		goto out_unlock;
> +
>  	inode->i_flags |= S_IOMAP_IMMUTABLE;

...and then you can just return out here.

--D
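
A toy userspace model of the lock handoff suggested above, offered only as
a sketch. Every toy_* name is a hypothetical stand-in (for xfs_ilock,
xfs_trans_ijoin and xfs_trans_commit), and only the successful-commit case
described in the reply is modeled; commit-failure handling is deliberately
left out.

```c
#include <stdbool.h>

#define TOY_ILOCK_EXCL 0x1u	/* hypothetical stand-in for XFS_ILOCK_EXCL */

struct toy_inode {
	unsigned int	locked;		/* lock bits currently held */
	bool		sealed;
};

struct toy_trans {
	struct toy_inode	*ip;
	unsigned int		lock_flags;	/* locks handed to the transaction */
};

static void toy_ilock(struct toy_inode *ip, unsigned int flags)
{
	ip->locked |= flags;
}

static void toy_iunlock(struct toy_inode *ip, unsigned int flags)
{
	ip->locked &= ~flags;
}

/* Join the inode; a non-zero lock_flags transfers lock ownership. */
static void toy_trans_ijoin(struct toy_trans *tp, struct toy_inode *ip,
			    unsigned int lock_flags)
{
	tp->ip = ip;
	tp->lock_flags = lock_flags;
}

/* Model of a successful commit: drop the locks handed over at join time. */
static int toy_trans_commit(struct toy_trans *tp)
{
	toy_iunlock(tp->ip, tp->lock_flags);
	return 0;
}

/* Tail of a seal operation; join_flags is 0 or TOY_ILOCK_EXCL. */
static int toy_seal_tail(struct toy_inode *ip, unsigned int join_flags)
{
	struct toy_trans tp = { 0 };
	int error;

	toy_ilock(ip, TOY_ILOCK_EXCL);
	toy_trans_ijoin(&tp, ip, join_flags);
	error = toy_trans_commit(&tp);
	if (!error)
		ip->sealed = true;
	/* Only unlock by hand when the transaction did not take ownership. */
	if (!(join_flags & TOY_ILOCK_EXCL))
		toy_iunlock(ip, TOY_ILOCK_EXCL);
	return error;
}
```

With join_flags set to TOY_ILOCK_EXCL the explicit unlock disappears from
the success path, which is the simplification the reply is pointing at.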

>  out_unlock:
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	if (tp)
> +		xfs_trans_cancel(tp);
>  
>  	return error;
>  }
> @@ -1500,15 +1516,21 @@ xfs_unseal_file_space(
>  	xfs_off_t		offset,
>  	xfs_off_t		len)
>  {
> +	struct xfs_mount	*mp = ip->i_mount;
>  	struct inode		*inode = VFS_I(ip);
>  	struct address_space	*mapping = inode->i_mapping;
>  	int			error;
> +	struct xfs_trans	*tp;
>  
>  	ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
>  
>  	if (offset)
>  		return -EINVAL;
>  
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
> +	if (error)
> +		return error;
> +
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	/*
>  	 * It does not make sense to unseal less than the full range of
> @@ -1527,11 +1549,21 @@ xfs_unseal_file_space(
>  	if (mapping_mapped(mapping))
>  		goto out_unlock;
>  
> +	xfs_trans_ijoin(tp, ip, 0);
> +	ip->i_d.di_flags2 &= ~XFS_DIFLAG2_IOMAP_IMMUTABLE;
> +	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> +	error = xfs_trans_commit(tp);
> +	tp = NULL; /* nothing to cancel */
> +	if (error)
> +		goto out_unlock;
> +
>  	inode->i_flags &= ~S_IOMAP_IMMUTABLE;
>  	error = 0;
>  
>  out_unlock:
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	if (tp)
> +		xfs_trans_cancel(tp);
>  
>  	return error;
>  }
> 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 1/5] fs, xfs: introduce S_IOMAP_IMMUTABLE
  2017-08-04 20:00     ` Darrick J. Wong
@ 2017-08-04 20:31       ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04 20:31 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, Andy Lutomirski, linux-fsdevel,
	Christoph Hellwig

On Fri, Aug 4, 2017 at 1:00 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Thu, Aug 03, 2017 at 07:28:10PM -0700, Dan Williams wrote:
>> An inode with this flag set indicates that the file's block map cannot
>> be changed from the currently allocated set.
>>
>> The implementation of toggling the flag and sealing the state of the
>> extent map is saved for a later patch. The functionality provided by
>> S_IOMAP_IMMUTABLE, once toggle support is added, will be a superset of
>> that provided by S_SWAPFILE, and it is targeted to replace it.
>>
>> For now, only xfs and the core vfs are updated to consider the new flag.
>>
>> The additional checks that are added for this flag, beyond what we are
>> already doing for swapfiles, are:
>> * fail writes that try to extend the file size
>> * fail attempts to directly change the allocation map via fallocate or
>>   xfs ioctls. This can be done centrally by blocking
>>   xfs_alloc_file_space and xfs_free_file_space when the flag is set.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/attr.c              |   10 ++++++++++
>>  fs/open.c              |    6 ++++++
>>  fs/read_write.c        |    3 +++
>>  fs/xfs/xfs_bmap_util.c |    6 ++++++
>>  fs/xfs/xfs_ioctl.c     |    6 ++++++
>>  include/linux/fs.h     |    2 ++
>>  mm/filemap.c           |    5 +++++
>>  7 files changed, 38 insertions(+)
>>
>> diff --git a/fs/attr.c b/fs/attr.c
>> index 135304146120..8573e364bd06 100644
>> --- a/fs/attr.c
>> +++ b/fs/attr.c
>> @@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare);
>>   */
>>  int inode_newsize_ok(const struct inode *inode, loff_t offset)
>>  {
>> +     if (IS_IOMAP_IMMUTABLE(inode)) {
>> +             /*
>> +              * Any size change is disallowed. Size increases may
>> +              * dirty metadata that an application is not prepared to
>> +              * sync, and a size decrease may expose free blocks to
>> +              * in-flight DMA.
>> +              */
>> +             return -ETXTBSY;
>> +     }
>> +
>>       if (inode->i_size < offset) {
>>               unsigned long limit;
>>
>> diff --git a/fs/open.c b/fs/open.c
>> index 35bb784763a4..7395860d7164 100644
>> --- a/fs/open.c
>> +++ b/fs/open.c
>> @@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>>               return -ETXTBSY;
>>
>>       /*
>> +      * We cannot allow any allocation changes on an iomap immutable file
>> +      */
>> +     if (IS_IOMAP_IMMUTABLE(inode))
>> +             return -ETXTBSY;
>> +
>> +     /*
>>        * Revalidate the write permissions, in case security policy has
>>        * changed since the files were opened.
>>        */
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 0cc7033aa413..dc673be7c7cb 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, loff_t pos_in,
>>       if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
>>               return -ETXTBSY;
>>
>> +     if (IS_IOMAP_IMMUTABLE(inode_in) || IS_IOMAP_IMMUTABLE(inode_out))
>> +             return -ETXTBSY;
>> +
>>       /* Don't reflink dirs, pipes, sockets... */
>>       if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
>>               return -EISDIR;
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index 93e955262d07..fe0f8f7f4bb7 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -1044,6 +1044,9 @@ xfs_alloc_file_space(
>>       if (XFS_FORCED_SHUTDOWN(mp))
>>               return -EIO;
>>
>> +     if (IS_IOMAP_IMMUTABLE(VFS_I(ip)))
>> +             return -ETXTBSY;
>> +
>
> Hm.  The 'seal this up' caller in the next patch doesn't check for
> ETXTBSY (or if it does I missed that), so if you try to seal an already
> sealed file you'll get an error code even though you actually got the
> state you wanted.

That's a good point, I'll fix that up.
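
One possible shape for that fix, as a toy userspace model rather than the
actual follow-up patch (which is not shown in this thread): treat a request
to seal an already-sealed file as success before reaching the allocation
helper that now returns -ETXTBSY. The toy_* names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

struct toy_file {
	bool	sealed;		/* stands in for S_IOMAP_IMMUTABLE */
};

/* Stand-in for an allocation helper that refuses sealed files. */
static int toy_alloc_file_space(struct toy_file *f)
{
	if (f->sealed)
		return -ETXTBSY;
	return 0;
}

/* Seal the file; sealing twice is not an error. */
static int toy_seal(struct toy_file *f)
{
	int error;

	if (f->sealed)
		return 0;	/* already in the requested state */

	error = toy_alloc_file_space(f);	/* hole-fill step */
	if (error)
		return error;

	f->sealed = true;
	return 0;
}
```

The early return keeps the caller from seeing an error when the file is
already in the state it asked for.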

>
> Second question: How might we handle the situation where a filesystem
> /has/ to alter a block mapping?  Hypothetically, if the block layer
> tells the fs that some range of storage has gone bad and the fs decides
> to punch out that part of the file (or mark it unwritten or whatever) to
> avoid a machine check, can we lock out file IO, forcibly remove the
> mapping from memory, make whatever block map updates we want, and then
> unlock?

It's not clear that the filesystem /has/ to change the block mappings
when the backing media supports error clearing. Unlike bad DRAM ranges
where the address is permanently mapped out, we can clear pmem and
disk errors by writing new data. The bad block can be repaired or
remapped internal to the hardware device.

As far as I can see no amount of fs locking will keep in-flight DMA
from assuming it can continue to write to the storage address it
thought was immutable. So, I think that means that the only error
management that can be expected while the file is immutable is
blkdev_issue_zeroout() to clear the error, or otherwise hope that the
DMA operation can generate the properly sized/aligned write request
that can clear the error.
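
To illustrate what "properly sized/aligned" means here, a small sketch of
the arithmetic only: expanding a damaged byte range outward to whole blocks
so that a single write covers every affected block. The helper name and the
assumption that the error-clearing granularity is a power of two (for
example 512 bytes) are illustrative, not taken from the patch series.

```c
#include <stdint.h>

struct toy_range {
	uint64_t	start;	/* first byte to rewrite */
	uint64_t	len;	/* number of bytes to rewrite */
};

/*
 * Expand [off, off + len) to the smallest enclosing range whose start and
 * end are multiples of blk.  blk must be a non-zero power of two.
 */
static struct toy_range toy_align_range(uint64_t off, uint64_t len,
					uint64_t blk)
{
	uint64_t start = off & ~(blk - 1);
	uint64_t end = (off + len + blk - 1) & ~(blk - 1);
	struct toy_range r = { start, end - start };

	return r;
}
```

For example, a 100-byte bad range at offset 1000 with 512-byte blocks
becomes a 1024-byte write starting at offset 512.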

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 4/5] xfs: introduce XFS_DIFLAG2_IOMAP_IMMUTABLE
  2017-08-04  2:28   ` Dan Williams
@ 2017-08-04 20:33     ` Darrick J. Wong
  -1 siblings, 0 replies; 108+ messages in thread
From: Darrick J. Wong @ 2017-08-04 20:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	luto, linux-fsdevel, Christoph Hellwig

On Thu, Aug 03, 2017 at 07:28:30PM -0700, Dan Williams wrote:
> Add an on-disk inode flag to record the state of the S_IOMAP_IMMUTABLE
> in-memory vfs inode flags. This allows the protections against reflink
> and hole punch to be automatically restored on a subsequent boot when
> the in-memory inode is established.
> 
> The FS_XFLAG_IOMAP_IMMUTABLE is introduced to allow xfs_io to read the
> state of the flag, but toggling the flag requires going through
> fallocate(FALLOC_FL_[UN]SEAL_BLOCK_MAP). Support for toggling this
> on-disk state is saved for a later patch.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/xfs/libxfs/xfs_format.h |    5 ++++-
>  fs/xfs/xfs_inode.c         |    2 ++
>  fs/xfs/xfs_ioctl.c         |    1 +
>  fs/xfs/xfs_iops.c          |    8 +++++---
>  include/uapi/linux/fs.h    |    1 +
>  5 files changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index d4d9bef20c3a..9e720e55776b 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -1063,12 +1063,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
>  #define XFS_DIFLAG2_DAX_BIT	0	/* use DAX for this inode */
>  #define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
>  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
> +#define XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT 3 /* set S_IOMAP_IMMUTABLE for this inode */

So... the greedy part of my brain that doesn't want to give out flags2
bits has been wondering, what if we just didn't have an on-disk
IOMAP_IMMUTABLE bit, and set FS_XFLAG based only on the in-core
S_IOMAP_IMMUTABLE bit?  If a program wants the immutable iomap
semantics, it will have to code some variant of the following:

fd = open(...);
ret = fallocate(fd, FALLOC_FL_SEAL_BLOCK_MAP, 0, len...);
if (ret) {
	printf("couldn't seal block map\n");
	close(fd);
	return;
}

addr = mmap(..., fd, ...);
/* do sensitive io operations here */
munmap(addr, ...);

close(fd);

Therefore the cost of not having the on-disk flag is that we'll have to
do more unshare/alloc/test/set cycles than we would if we could remember
the iomap-immutable state across unmounts and inode reclaiming.
However, if the data map is already ready to go, this shouldn't have a
lot of overhead since we only have to iterate the in-core extents.

Just trying to make sure we /need/ the inode flag bit. :)

--D
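
A rough sketch of why re-validating on every seal could be cheap when the
data map is already in shape: one pass over the in-core extent list, looking
for holes and shared extents. This is a toy userspace model; the struct and
function names are hypothetical, the extents are assumed to be sorted by
file offset, and any other conditions the real seal path checks are ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_extent {
	uint64_t	off;	/* file offset of the extent */
	uint64_t	len;	/* length in bytes */
	bool		shared;	/* blocks also referenced by another file */
};

/*
 * Return true if the sorted extent list covers [0, size) with no holes
 * and no shared extents in that range.
 */
static bool toy_can_seal(const struct toy_extent *ext, size_t n,
			 uint64_t size)
{
	uint64_t next = 0;	/* first byte not yet known to be mapped */
	size_t i;

	for (i = 0; i < n && next < size; i++) {
		if (ext[i].off > next)
			return false;	/* hole before this extent */
		if (ext[i].shared)
			return false;	/* would need an unshare first */
		if (ext[i].off + ext[i].len > next)
			next = ext[i].off + ext[i].len;
	}
	return next >= size;
}
```

The cost is linear in the number of extents, with no allocation or I/O when
nothing needs fixing.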

>  #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
>  #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
>  #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
> +#define XFS_DIFLAG2_IOMAP_IMMUTABLE (1 << XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT)
>  
>  #define XFS_DIFLAG2_ANY \
> -	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)
> +	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> +	 XFS_DIFLAG2_IOMAP_IMMUTABLE)
>  
>  /*
>   * Inode number format:
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index ceef77c0416a..4ca22e272ce6 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -674,6 +674,8 @@ _xfs_dic2xflags(
>  			flags |= FS_XFLAG_DAX;
>  		if (di_flags2 & XFS_DIFLAG2_COWEXTSIZE)
>  			flags |= FS_XFLAG_COWEXTSIZE;
> +		if (di_flags2 & XFS_DIFLAG2_IOMAP_IMMUTABLE)
> +			flags |= FS_XFLAG_IOMAP_IMMUTABLE;
>  	}
>  
>  	if (has_attr)
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 2e64488bc4de..df2eef0f9d45 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -978,6 +978,7 @@ xfs_set_diflags(
>  		return;
>  
>  	di_flags2 = (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK);
> +	di_flags2 |= (ip->i_d.di_flags2 & XFS_DIFLAG2_IOMAP_IMMUTABLE);
>  	if (xflags & FS_XFLAG_DAX)
>  		di_flags2 |= XFS_DIFLAG2_DAX;
>  	if (xflags & FS_XFLAG_COWEXTSIZE)
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 469c9fa4c178..174ef95453f5 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1186,9 +1186,10 @@ xfs_diflags_to_iflags(
>  	struct xfs_inode	*ip)
>  {
>  	uint16_t		flags = ip->i_d.di_flags;
> +	uint64_t		flags2 = ip->i_d.di_flags2;
>  
>  	inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
> -			    S_NOATIME | S_DAX);
> +			    S_NOATIME | S_DAX | S_IOMAP_IMMUTABLE);
>  
>  	if (flags & XFS_DIFLAG_IMMUTABLE)
>  		inode->i_flags |= S_IMMUTABLE;
> @@ -1201,9 +1202,10 @@ xfs_diflags_to_iflags(
>  	if (S_ISREG(inode->i_mode) &&
>  	    ip->i_mount->m_sb.sb_blocksize == PAGE_SIZE &&
>  	    !xfs_is_reflink_inode(ip) &&
> -	    (ip->i_mount->m_flags & XFS_MOUNT_DAX ||
> -	     ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> +	    (ip->i_mount->m_flags & XFS_MOUNT_DAX || flags2 & XFS_DIFLAG2_DAX))
>  		inode->i_flags |= S_DAX;
> +	if (flags2 & XFS_DIFLAG2_IOMAP_IMMUTABLE)
> +		inode->i_flags |= S_IOMAP_IMMUTABLE;
>  }
>  
>  /*
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7495d05e8de..4765e024ad74 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -182,6 +182,7 @@ struct fsxattr {
>  #define FS_XFLAG_FILESTREAM	0x00004000	/* use filestream allocator */
>  #define FS_XFLAG_DAX		0x00008000	/* use DAX for IO */
>  #define FS_XFLAG_COWEXTSIZE	0x00010000	/* CoW extent size allocator hint */
> +#define FS_XFLAG_IOMAP_IMMUTABLE 0x00020000	/* block map immutable */
>  #define FS_XFLAG_HASATTR	0x80000000	/* no DIFLAG for this	*/
>  
>  /* the read-only stuff doesn't really belong here, but any other place is
> 

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 3/5] fs, xfs: introduce FALLOC_FL_UNSEAL_BLOCK_MAP
  2017-08-04 20:04     ` Darrick J. Wong
@ 2017-08-04 20:36       ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04 20:36 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, Andy Lutomirski, linux-fsdevel,
	Christoph Hellwig

On Fri, Aug 4, 2017 at 1:04 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Thu, Aug 03, 2017 at 07:28:23PM -0700, Dan Williams wrote:
>> Provide an explicit fallocate operation type for clearing the
>> S_IOMAP_IMMUTABLE flag. Like the enable case, it requires
>> CAP_LINUX_IMMUTABLE and can only be performed while no process has the
>> file mapped.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/open.c                   |   17 +++++++++++------
>>  fs/xfs/xfs_bmap_util.c      |   42 ++++++++++++++++++++++++++++++++++++++++++
>>  fs/xfs/xfs_bmap_util.h      |    3 +++
>>  fs/xfs/xfs_file.c           |    4 +++-
>>  include/linux/falloc.h      |    3 ++-
>>  include/uapi/linux/falloc.h |    1 +
>>  6 files changed, 62 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/open.c b/fs/open.c
>> index e3aae59785ae..ccfd8d3becc8 100644
>> --- a/fs/open.c
>> +++ b/fs/open.c
>> @@ -274,13 +274,17 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>>               return -EINVAL;
>>
>>       /*
>> -      * Seal block map operation should only be used exclusively, and
>> -      * with the IMMUTABLE capability.
>> +      * Seal/unseal block map operations should only be used
>> +      * exclusively, and with the IMMUTABLE capability.
>>        */
>> -     if (mode & FALLOC_FL_SEAL_BLOCK_MAP) {
>> +     if (mode & (FALLOC_FL_SEAL_BLOCK_MAP | FALLOC_FL_UNSEAL_BLOCK_MAP)) {
>>               if (!capable(CAP_LINUX_IMMUTABLE))
>>                       return -EPERM;
>> -             if (mode & ~FALLOC_FL_SEAL_BLOCK_MAP)
>> +             if (mode == (FALLOC_FL_SEAL_BLOCK_MAP
>> +                                     | FALLOC_FL_UNSEAL_BLOCK_MAP))
>> +                     return -EINVAL;
>> +             if (mode & ~(FALLOC_FL_SEAL_BLOCK_MAP
>> +                                     | FALLOC_FL_UNSEAL_BLOCK_MAP))
>>                       return -EINVAL;
>>       }
>>
>> @@ -303,9 +307,10 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>>               return -ETXTBSY;
>>
>>       /*
>> -      * We cannot allow any allocation changes on an iomap immutable file
>> +      * We cannot allow any allocation changes on an iomap immutable
>> +      * file, but we can allow clearing the immutable state.
>>        */
>> -     if (IS_IOMAP_IMMUTABLE(inode))
>> +     if (IS_IOMAP_IMMUTABLE(inode) && !(mode & FALLOC_FL_UNSEAL_BLOCK_MAP))
>>               return -ETXTBSY;
>>
>>       /*
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index 46d8eb9e19fc..70ac2d33ab27 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -1494,6 +1494,48 @@ xfs_seal_file_space(
>>       return error;
>>  }
>>
>> +int
>> +xfs_unseal_file_space(
>> +     struct xfs_inode        *ip,
>> +     xfs_off_t               offset,
>> +     xfs_off_t               len)
>> +{
>> +     struct inode            *inode = VFS_I(ip);
>> +     struct address_space    *mapping = inode->i_mapping;
>> +     int                     error;
>> +
>> +     ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
>
> Same assert-on-the-iolock comment as the previous patch.

Ok.

>
>> +
>> +     if (offset)
>> +             return -EINVAL;
>> +
>> +     xfs_ilock(ip, XFS_ILOCK_EXCL);
>> +     /*
>> +      * It does not make sense to unseal less than the full range of
>> +      * the file.
>> +      */
>> +     error = -EINVAL;
>> +     if (len < i_size_read(inode))
>> +             goto out_unlock;
>
> Hmm, should we be picky and require len == i_size_read() here?

Yes, I think so; otherwise we may have raced with someone who increased
the file size with unwritten extents.

>
>> +     /*
>> +      * Provide safety against one thread changing the policy of not
>> +      * requiring fsync/msync (for block allocations) behind another
>> +      * thread's back.
>> +      */
>> +     error = -EBUSY;
>> +     if (mapping_mapped(mapping))
>> +             goto out_unlock;
>> +
>> +     inode->i_flags &= ~S_IOMAP_IMMUTABLE;
>
> It occurred to me, should we jump out early from the seal/unseal
> operations if the flag state matches whatever the user is asking for?
> This is perhaps not necessary for unseal since we don't do a lot of
> work.
>

Yes, I think I had that semantic in v1, but lost it in the cleanups. I
will bring it back.
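
The two review points above (an exact length match, and an early exit
when the flag already has the requested state) can be sketched as a
user-space model. This is only an illustration under stated assumptions:
struct toy_inode and toy_unseal() are hypothetical stand-ins, not the
kernel's xfs_unseal_file_space(), and all locking is omitted.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode state the unseal path inspects. */
struct toy_inode {
	int64_t	size;		/* models i_size_read(inode) */
	bool	sealed;		/* models S_IOMAP_IMMUTABLE */
	bool	mapped;		/* models mapping_mapped(mapping) */
};

/*
 * Model of the unseal checks with both review suggestions applied:
 * return early if the file is already unsealed, and require the range
 * to cover exactly [0, size).
 */
int toy_unseal(struct toy_inode *ip, int64_t offset, int64_t len)
{
	if (offset)
		return -EINVAL;
	if (!ip->sealed)
		return 0;	/* already in the requested state */
	if (len != ip->size)
		return -EINVAL;	/* the patch as posted used: len < size */
	if (ip->mapped)
		return -EBUSY;	/* a mapping still relies on the sealed map */
	ip->sealed = false;
	return 0;
}
```

Whether the early exit belongs before or after the range validation is a
design choice that this sketch does not settle.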
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 4/5] xfs: introduce XFS_DIFLAG2_IOMAP_IMMUTABLE
  2017-08-04 20:33     ` Darrick J. Wong
@ 2017-08-04 20:45       ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04 20:45 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Andy Lutomirski, linux-fsdevel, Christoph Hellwig

On Fri, Aug 4, 2017 at 1:33 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Thu, Aug 03, 2017 at 07:28:30PM -0700, Dan Williams wrote:
>> Add an on-disk inode flag to record the state of the S_IOMAP_IMMUTABLE
>> in-memory vfs inode flag. This allows the protections against reflink
>> and hole punch to be automatically restored on a subsequent boot when
>> the in-memory inode is established.
>>
>> The FS_XFLAG_IOMAP_IMMUTABLE flag is introduced to allow xfs_io to read
>> the state of the flag, but toggling the flag requires going through
>> fallocate(FALLOC_FL_[UN]SEAL_BLOCK_MAP). Support for toggling this
>> on-disk state is saved for a later patch.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/xfs/libxfs/xfs_format.h |    5 ++++-
>>  fs/xfs/xfs_inode.c         |    2 ++
>>  fs/xfs/xfs_ioctl.c         |    1 +
>>  fs/xfs/xfs_iops.c          |    8 +++++---
>>  include/uapi/linux/fs.h    |    1 +
>>  5 files changed, 13 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
>> index d4d9bef20c3a..9e720e55776b 100644
>> --- a/fs/xfs/libxfs/xfs_format.h
>> +++ b/fs/xfs/libxfs/xfs_format.h
>> @@ -1063,12 +1063,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
>>  #define XFS_DIFLAG2_DAX_BIT  0       /* use DAX for this inode */
>>  #define XFS_DIFLAG2_REFLINK_BIT      1       /* file's blocks may be shared */
>>  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
>> +#define XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT 3 /* set S_IOMAP_IMMUTABLE for this inode */
>
> So... the greedy part of my brain that doesn't want to give out flags2
> bits has been wondering, what if we just didn't have an on-disk
> IOMAP_IMMUTABLE bit, and set FS_XFLAG based only on the in-core
> S_IOMAP_IMMUTABLE bit?  If a program wants the immutable iomap
> semantics, they will have to code some variant on the following:
>
> fd = open(...);
> ret = fallocate(fd, FALLOC_FL_SEAL_BLOCK_MAP, 0, len...)
> if (ret) {
>         printf("couldn't seal block map");
>         close(fd);
>         return;
> }
>
> mmap(fd...);
> /* do sensitive io operations here */
> munmap(fd...);
>
> close(fd);
>
> Therefore the cost of not having the on-disk flag is that we'll have to
> do more unshare/alloc/test/set cycles than we would if we could remember
> the iomap-immutable state across unmounts and inode reclaiming.
> However, if the data map is already ready to go, this shouldn't have a
> lot of overhead since we only have to iterate the in-core extents.
>
> Just trying to make sure we /need/ the inode flag bit. :)

A fair point.

The use case I imagine is a privileged (CAP_LINUX_IMMUTABLE) process
setting up the data file space at server provisioning time, and then
unprivileged processes using the immutable semantic thereafter.
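
In that use case an unprivileged consumer would presumably want to
confirm that the file is already sealed before mapping it. A minimal
sketch, assuming FS_IOC_FSGETXATTR reports the FS_XFLAG_IOMAP_IMMUTABLE
bit with the value proposed in this patch (0x00020000, which is not in
any released header); xflags_sealed() and block_map_sealed() are
hypothetical helper names:

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FSGETXATTR */

/* Value proposed by this patch; an assumption, not a released uapi flag. */
#ifndef FS_XFLAG_IOMAP_IMMUTABLE
#define FS_XFLAG_IOMAP_IMMUTABLE 0x00020000
#endif

/* Pure helper: does this fsx_xflags word carry the sealed bit? */
int xflags_sealed(uint32_t xflags)
{
	return (xflags & FS_XFLAG_IOMAP_IMMUTABLE) != 0;
}

/*
 * Returns 1 if fd refers to a sealed file, 0 if it does not, and -1
 * (with errno set by ioctl) if the flags cannot be read.
 */
int block_map_sealed(int fd)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;
	return xflags_sealed(fsx.fsx_xflags);
}
```

A consumer would call block_map_sealed(fd) after open() and fall back to
its normal fsync/msync path when the result is not 1.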

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 5/5] xfs: toggle XFS_DIFLAG2_IOMAP_IMMUTABLE in response to fallocate
  2017-08-04 20:14     ` Darrick J. Wong
@ 2017-08-04 20:47       ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04 20:47 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Andy Lutomirski, linux-fsdevel, Christoph Hellwig

On Fri, Aug 4, 2017 at 1:14 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Thu, Aug 03, 2017 at 07:28:35PM -0700, Dan Williams wrote:
>> After validating the state of the file as not having holes, shared
>> extents, or active mappings, try to commit the
>> XFS_DIFLAG2_IOMAP_IMMUTABLE flag to the on-disk inode metadata. If that
>> succeeds, then allow S_IOMAP_IMMUTABLE to be set on the vfs inode.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Jeff Moyer <jmoyer@redhat.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/xfs/xfs_bmap_util.c |   32 ++++++++++++++++++++++++++++++++
>>  1 file changed, 32 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index 70ac2d33ab27..8464c25a2403 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -1436,9 +1436,11 @@ xfs_seal_file_space(
>>       xfs_off_t               offset,
>>       xfs_off_t               len)
>>  {
>> +     struct xfs_mount        *mp = ip->i_mount;
>>       struct inode            *inode = VFS_I(ip);
>>       struct address_space    *mapping = inode->i_mapping;
>>       int                     error;
>> +     struct xfs_trans        *tp;
>>
>>       ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
>>
>> @@ -1454,6 +1456,10 @@ xfs_seal_file_space(
>>       if (error)
>>               return error;
>>
>> +     error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
>> +     if (error)
>> +             return error;
>> +
>>       xfs_ilock(ip, XFS_ILOCK_EXCL);
>>       /*
>>        * Either the size changed after we performed allocation /
>> @@ -1486,10 +1492,20 @@ xfs_seal_file_space(
>>       if (error < 0)
>>               goto out_unlock;
>>
>> +     xfs_trans_ijoin(tp, ip, 0);
>
> FWIW if you change that third parameter to XFS_ILOCK_EXCL then
> xfs_trans_commit will do the xfs_iunlock(ip, XFS_ILOCK_EXCL) for you if
> the commit succeeds...
>
>> +     ip->i_d.di_flags2 |= XFS_DIFLAG2_IOMAP_IMMUTABLE;
>> +     xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>> +     error = xfs_trans_commit(tp);
>> +     tp = NULL; /* nothing to cancel */
>> +     if (error)
>> +             goto out_unlock;
>> +
>>       inode->i_flags |= S_IOMAP_IMMUTABLE;
>
> ...and then you can just return out here.

Do we not need to hold XFS_ILOCK_EXCL over ->i_flags changes, or is
XFS_IOLOCK_EXCL enough?

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 5/5] xfs: toggle XFS_DIFLAG2_IOMAP_IMMUTABLE in response to fallocate
  2017-08-04 20:47       ` Dan Williams
@ 2017-08-04 20:53         ` Darrick J. Wong
  -1 siblings, 0 replies; 108+ messages in thread
From: Darrick J. Wong @ 2017-08-04 20:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Andy Lutomirski, linux-fsdevel, Christoph Hellwig

On Fri, Aug 04, 2017 at 01:47:32PM -0700, Dan Williams wrote:
> On Fri, Aug 4, 2017 at 1:14 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Thu, Aug 03, 2017 at 07:28:35PM -0700, Dan Williams wrote:
> >> After validating the state of the file as not having holes, shared
> >> extents, or active mappings, try to commit the
> >> XFS_DIFLAG2_IOMAP_IMMUTABLE flag to the on-disk inode metadata. If that
> >> succeeds, then allow S_IOMAP_IMMUTABLE to be set on the vfs inode.
> >>
> >> Cc: Jan Kara <jack@suse.cz>
> >> Cc: Jeff Moyer <jmoyer@redhat.com>
> >> Cc: Christoph Hellwig <hch@lst.de>
> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> Suggested-by: Dave Chinner <david@fromorbit.com>
> >> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >> ---
> >>  fs/xfs/xfs_bmap_util.c |   32 ++++++++++++++++++++++++++++++++
> >>  1 file changed, 32 insertions(+)
> >>
> >> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> >> index 70ac2d33ab27..8464c25a2403 100644
> >> --- a/fs/xfs/xfs_bmap_util.c
> >> +++ b/fs/xfs/xfs_bmap_util.c
> >> @@ -1436,9 +1436,11 @@ xfs_seal_file_space(
> >>       xfs_off_t               offset,
> >>       xfs_off_t               len)
> >>  {
> >> +     struct xfs_mount        *mp = ip->i_mount;
> >>       struct inode            *inode = VFS_I(ip);
> >>       struct address_space    *mapping = inode->i_mapping;
> >>       int                     error;
> >> +     struct xfs_trans        *tp;
> >>
> >>       ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
> >>
> >> @@ -1454,6 +1456,10 @@ xfs_seal_file_space(
> >>       if (error)
> >>               return error;
> >>
> >> +     error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
> >> +     if (error)
> >> +             return error;
> >> +
> >>       xfs_ilock(ip, XFS_ILOCK_EXCL);
> >>       /*
> >>        * Either the size changed after we performed allocation /
> >> @@ -1486,10 +1492,20 @@ xfs_seal_file_space(
> >>       if (error < 0)
> >>               goto out_unlock;
> >>
> >> +     xfs_trans_ijoin(tp, ip, 0);
> >
> > FWIW if you change that third parameter to XFS_ILOCK_EXCL then
> > xfs_trans_commit will do the xfs_iunlock(ip, XFS_ILOCK_EXCL) for you if
> > the commit succeeds...
> >
> >> +     ip->i_d.di_flags2 |= XFS_DIFLAG2_IOMAP_IMMUTABLE;
> >> +     xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> >> +     error = xfs_trans_commit(tp);
> >> +     tp = NULL; /* nothing to cancel */
> >> +     if (error)
> >> +             goto out_unlock;
> >> +
> >>       inode->i_flags |= S_IOMAP_IMMUTABLE;
> >
> > ...and then you can just return out here.
> 
> Do we not need to hold XFS_ILOCK_EXCL over ->i_flags changes, or is
> XFS_IOLOCK_EXCL enough?

Oh, heh, I missed a piece.  Set the flag before the transaction commit,
because if the commit fails the fs is shut down anyway. :)
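
The ordering suggested here can be illustrated with a small user-space
model. Everything in it is a hypothetical stand-in (struct toy_fs_inode,
toy_commit(), toy_seal_tail()); it is not XFS code and it leaves out the
ILOCK handling. It only shows why setting the in-core flag first is
harmless, on the stated premise that a failed commit shuts the
filesystem down:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the pieces of state involved. */
struct toy_fs_inode {
	bool	disk_flag;	/* models XFS_DIFLAG2_IOMAP_IMMUTABLE */
	bool	vfs_flag;	/* models S_IOMAP_IMMUTABLE */
	bool	fs_shutdown;	/* models a filesystem shutdown */
};

/* Models a transaction commit; a failed commit shuts the fs down. */
int toy_commit(struct toy_fs_inode *ip, bool commit_ok)
{
	if (!commit_ok) {
		ip->fs_shutdown = true;
		return -EIO;
	}
	return 0;
}

/* Tail of the seal path with both flags set before the commit. */
int toy_seal_tail(struct toy_fs_inode *ip, bool commit_ok)
{
	ip->disk_flag = true;
	ip->vfs_flag = true;
	return toy_commit(ip, commit_ok);
}
```

In the model, the in-core flag can be set without a matching committed
on-disk flag only when the filesystem is already shut down, so no later
operation can act on the mismatch.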

--D


^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 5/5] xfs: toggle XFS_DIFLAG2_IOMAP_IMMUTABLE in response to fallocate
@ 2017-08-04 20:53         ` Darrick J. Wong
  0 siblings, 0 replies; 108+ messages in thread
From: Darrick J. Wong @ 2017-08-04 20:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Jeff Moyer, Andy Lutomirski, linux-fsdevel, Ross Zwisler,
	Christoph Hellwig

On Fri, Aug 04, 2017 at 01:47:32PM -0700, Dan Williams wrote:
> On Fri, Aug 4, 2017 at 1:14 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > On Thu, Aug 03, 2017 at 07:28:35PM -0700, Dan Williams wrote:
> >> After validating the state of the file as not having holes, shared
> >> extents, or active mappings try to commit the
> >> XFS_DIFLAG2_IOMAP_IMMUTABLE flag to the on-disk inode metadata. If that
> >> succeeds then allow the S_IOMAP_IMMUTABLE to be set on the vfs inode.
> >>
> >> Cc: Jan Kara <jack@suse.cz>
> >> Cc: Jeff Moyer <jmoyer@redhat.com>
> >> Cc: Christoph Hellwig <hch@lst.de>
> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> Suggested-by: Dave Chinner <david@fromorbit.com>
> >> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >> ---
> >>  fs/xfs/xfs_bmap_util.c |   32 ++++++++++++++++++++++++++++++++
> >>  1 file changed, 32 insertions(+)
> >>
> >> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> >> index 70ac2d33ab27..8464c25a2403 100644
> >> --- a/fs/xfs/xfs_bmap_util.c
> >> +++ b/fs/xfs/xfs_bmap_util.c
> >> @@ -1436,9 +1436,11 @@ xfs_seal_file_space(
> >>       xfs_off_t               offset,
> >>       xfs_off_t               len)
> >>  {
> >> +     struct xfs_mount        *mp = ip->i_mount;
> >>       struct inode            *inode = VFS_I(ip);
> >>       struct address_space    *mapping = inode->i_mapping;
> >>       int                     error;
> >> +     struct xfs_trans        *tp;
> >>
> >>       ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
> >>
> >> @@ -1454,6 +1456,10 @@ xfs_seal_file_space(
> >>       if (error)
> >>               return error;
> >>
> >> +     error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
> >> +     if (error)
> >> +             return error;
> >> +
> >>       xfs_ilock(ip, XFS_ILOCK_EXCL);
> >>       /*
> >>        * Either the size changed after we performed allocation /
> >> @@ -1486,10 +1492,20 @@ xfs_seal_file_space(
> >>       if (error < 0)
> >>               goto out_unlock;
> >>
> >> +     xfs_trans_ijoin(tp, ip, 0);
> >
> > FWIW if you change that third parameter to XFS_ILOCK_EXCL then
> > xfs_trans_commit will do the xfs_iunlock(ip, XFS_ILOCK_EXCL) for you if
> > the commit succeeds...
> >
> >> +     ip->i_d.di_flags2 |= XFS_DIFLAG2_IOMAP_IMMUTABLE;
> >> +     xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
> >> +     error = xfs_trans_commit(tp);
> >> +     tp = NULL; /* nothing to cancel */
> >> +     if (error)
> >> +             goto out_unlock;
> >> +
> >>       inode->i_flags |= S_IOMAP_IMMUTABLE;
> >
> > ...and then you can just return out here.
> 
> Do we not need to hold XFS_ILOCK_EXCL over ->i_flags changes, or is
> XFS_IOLOCK_EXCL enough?

Oh, heh, I missed a piece.  Set the flag before the transaction commit,
because if the commit fails the fs is shut down anyway. :)
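
To make the suggested ordering concrete, here is a minimal userspace
sketch in C. It is a toy model, not XFS code: every toy_* function and
TOY_* constant is made up for illustration. The toy drops the lock on
both commit outcomes; that is a simplification of the model, not a claim
about what xfs_trans_commit() does on failure.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_DIFLAG2_IOMAP_IMMUTABLE	(1u << 3)	/* stands in for the on-disk bit */
#define TOY_S_IOMAP_IMMUTABLE		(1u << 0)	/* stands in for the vfs inode bit */

/* Hypothetical stand-in for the inode state discussed in the thread. */
struct toy_inode {
	bool		ilocked;	/* models XFS_ILOCK_EXCL being held */
	unsigned int	disk_flags;	/* models ip->i_d.di_flags2 */
	unsigned int	vfs_flags;	/* models inode->i_flags */
	bool		fs_shutdown;	/* models a shut-down filesystem */
};

struct toy_trans {
	struct toy_inode	*ip;
	bool			unlock_on_commit;
};

/* Models joining the inode with lock flags: the commit now owns the unlock. */
static void toy_trans_ijoin(struct toy_trans *tp, struct toy_inode *ip,
			    bool unlock_on_commit)
{
	tp->ip = ip;
	tp->unlock_on_commit = unlock_on_commit;
}

/* Models a commit; a failed commit leaves the toy filesystem shut down. */
static int toy_trans_commit(struct toy_trans *tp, bool fail)
{
	if (tp->unlock_on_commit)
		tp->ip->ilocked = false;
	if (fail) {
		tp->ip->fs_shutdown = true;
		return -1;
	}
	return 0;
}

/* Tail of a seal operation with the flag set before the commit. */
static int toy_seal(struct toy_inode *ip, bool commit_fails)
{
	struct toy_trans tp = { 0 };

	ip->ilocked = true;
	toy_trans_ijoin(&tp, ip, true);
	ip->disk_flags |= TOY_DIFLAG2_IOMAP_IMMUTABLE;
	ip->vfs_flags |= TOY_S_IOMAP_IMMUTABLE;
	assert(ip->ilocked);	/* both flags change under the lock */
	return toy_trans_commit(&tp, commit_fails);
}
```

With this ordering the caller needs no unlock label after the commit:
the in-core flag is written while the lock is still held, and a failed
commit leaves a shut-down filesystem where a stale in-core flag no
longer matters.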

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 5/5] xfs: toggle XFS_DIFLAG2_IOMAP_IMMUTABLE in response to fallocate
  2017-08-04 20:53         ` Darrick J. Wong
@ 2017-08-04 20:55           ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04 20:55 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-kernel, linux-xfs,
	Andy Lutomirski, linux-fsdevel, Christoph Hellwig

On Fri, Aug 4, 2017 at 1:53 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Fri, Aug 04, 2017 at 01:47:32PM -0700, Dan Williams wrote:
>> On Fri, Aug 4, 2017 at 1:14 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
>> > On Thu, Aug 03, 2017 at 07:28:35PM -0700, Dan Williams wrote:
>> >> After validating the state of the file as not having holes, shared
>> >> extents, or active mappings try to commit the
>> >> XFS_DIFLAG2_IOMAP_IMMUTABLE flag to the on-disk inode metadata. If that
>> >> succeeds then allow the S_IOMAP_IMMUTABLE to be set on the vfs inode.
>> >>
>> >> Cc: Jan Kara <jack@suse.cz>
>> >> Cc: Jeff Moyer <jmoyer@redhat.com>
>> >> Cc: Christoph Hellwig <hch@lst.de>
>> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> >> Suggested-by: Dave Chinner <david@fromorbit.com>
>> >> Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
>> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> >> ---
>> >>  fs/xfs/xfs_bmap_util.c |   32 ++++++++++++++++++++++++++++++++
>> >>  1 file changed, 32 insertions(+)
>> >>
>> >> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> >> index 70ac2d33ab27..8464c25a2403 100644
>> >> --- a/fs/xfs/xfs_bmap_util.c
>> >> +++ b/fs/xfs/xfs_bmap_util.c
>> >> @@ -1436,9 +1436,11 @@ xfs_seal_file_space(
>> >>       xfs_off_t               offset,
>> >>       xfs_off_t               len)
>> >>  {
>> >> +     struct xfs_mount        *mp = ip->i_mount;
>> >>       struct inode            *inode = VFS_I(ip);
>> >>       struct address_space    *mapping = inode->i_mapping;
>> >>       int                     error;
>> >> +     struct xfs_trans        *tp;
>> >>
>> >>       ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));
>> >>
>> >> @@ -1454,6 +1456,10 @@ xfs_seal_file_space(
>> >>       if (error)
>> >>               return error;
>> >>
>> >> +     error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
>> >> +     if (error)
>> >> +             return error;
>> >> +
>> >>       xfs_ilock(ip, XFS_ILOCK_EXCL);
>> >>       /*
>> >>        * Either the size changed after we performed allocation /
>> >> @@ -1486,10 +1492,20 @@ xfs_seal_file_space(
>> >>       if (error < 0)
>> >>               goto out_unlock;
>> >>
>> >> +     xfs_trans_ijoin(tp, ip, 0);
>> >
>> > FWIW if you change that third parameter to XFS_ILOCK_EXCL then
>> > xfs_trans_commit will do the xfs_iunlock(ip, XFS_ILOCK_EXCL) for you if
>> > the commit succeeds...
>> >
>> >> +     ip->i_d.di_flags2 |= XFS_DIFLAG2_IOMAP_IMMUTABLE;
>> >> +     xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>> >> +     error = xfs_trans_commit(tp);
>> >> +     tp = NULL; /* nothing to cancel */
>> >> +     if (error)
>> >> +             goto out_unlock;
>> >> +
>> >>       inode->i_flags |= S_IOMAP_IMMUTABLE;
>> >
>> > ...and then you can just return out here.
>>
>> Do we not need to hold XFS_ILOCK_EXCL over ->i_flags changes, or is
>> XFS_IOLOCK_EXCL enough?
>
> Oh, heh, I missed a piece.  Set the flag before the transaction commit,
> because if the commit fails the fs is shut down anyway. :)

Ah, got it, thanks.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 2/5] fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP
  2017-08-04  2:28   ` Dan Williams
@ 2017-08-04 23:31     ` Dave Chinner
  -1 siblings, 0 replies; 108+ messages in thread
From: Dave Chinner @ 2017-08-04 23:31 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, darrick.wong, linux-kernel, linux-xfs,
	Alexander Viro, luto, linux-fsdevel, Christoph Hellwig

On Thu, Aug 03, 2017 at 07:28:17PM -0700, Dan Williams wrote:
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index fe0f8f7f4bb7..46d8eb9e19fc 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1393,6 +1393,107 @@ xfs_zero_file_space(
>  
>  }
>  
> +/* Return 1 if hole detected, 0 if not, and < 0 if fail to determine */
> +STATIC int
> +xfs_file_has_holes(
> +	struct xfs_inode	*ip)
> +{

Why do we need this function?

We've just run xfs_alloc_file_space() across the entire range we
are sealing, so we've already guaranteed that it won't have holes
in it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 2/5] fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP
  2017-08-04 23:31     ` Dave Chinner
@ 2017-08-04 23:43       ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-04 23:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, linux-kernel, linux-xfs,
	Alexander Viro, Andy Lutomirski, linux-fsdevel,
	Christoph Hellwig

On Fri, Aug 4, 2017 at 4:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Aug 03, 2017 at 07:28:17PM -0700, Dan Williams wrote:
>> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
>> index fe0f8f7f4bb7..46d8eb9e19fc 100644
>> --- a/fs/xfs/xfs_bmap_util.c
>> +++ b/fs/xfs/xfs_bmap_util.c
>> @@ -1393,6 +1393,107 @@ xfs_zero_file_space(
>>
>>  }
>>
>> +/* Return 1 if hole detected, 0 if not, and < 0 if fail to determine */
>> +STATIC int
>> +xfs_file_has_holes(
>> +     struct xfs_inode        *ip)
>> +{
>
> Why do we need this function?
>
> We've just run xfs_alloc_file_space() across the entire range we
> are sealing, so we've already guaranteed that it won't have holes
> in it.

I'm sure this is due to my ignorance of the scope of XFS_IOLOCK_EXCL
vs XFS_ILOCK_EXCL. I had assumed that since we drop and retake
XFS_ILOCK_EXCL that we need to re-validate the block map before
setting S_IOMAP_IMMUTABLE.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 4/5] xfs: introduce XFS_DIFLAG2_IOMAP_IMMUTABLE
  2017-08-04 20:33     ` Darrick J. Wong
@ 2017-08-04 23:46       ` Dave Chinner
  -1 siblings, 0 replies; 108+ messages in thread
From: Dave Chinner @ 2017-08-04 23:46 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, linux-nvdimm, linux-kernel, Christoph Hellwig,
	linux-xfs, luto, linux-fsdevel

On Fri, Aug 04, 2017 at 01:33:12PM -0700, Darrick J. Wong wrote:
> On Thu, Aug 03, 2017 at 07:28:30PM -0700, Dan Williams wrote:
> > Add an on-disk inode flag to record the state of the S_IOMAP_IMMUTABLE
> > in-memory vfs inode flags. This allows the protections against reflink
> > and hole punch to be automatically restored on a subsequent boot when
> > the in-memory inode is established.
> > 
> > The FS_XFLAG_IOMAP_IMMUTABLE is introduced to allow xfs_io to read the
> > state of the flag, but toggling the flag requires going through
> > fallocate(FALLOC_FL_[UN]SEAL_BLOCK_MAP). Support for toggling this
> > on-disk state is saved for a later patch.
> > 
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Jeff Moyer <jmoyer@redhat.com>
> > Cc: Christoph Hellwig <hch@lst.de>
> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Suggested-by: Dave Chinner <david@fromorbit.com>
> > Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
> > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h |    5 ++++-
> >  fs/xfs/xfs_inode.c         |    2 ++
> >  fs/xfs/xfs_ioctl.c         |    1 +
> >  fs/xfs/xfs_iops.c          |    8 +++++---
> >  include/uapi/linux/fs.h    |    1 +
> >  5 files changed, 13 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index d4d9bef20c3a..9e720e55776b 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -1063,12 +1063,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> >  #define XFS_DIFLAG2_DAX_BIT	0	/* use DAX for this inode */
> >  #define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
> >  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
> > +#define XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT 3 /* set S_IOMAP_IMMUTABLE for this inode */
> 
> So... the greedy part of my brain that doesn't want to give out flags2
> bits has been wondering,

FWIW, I made di_flags2 a 64 bit value in the first place precisely
so we didn't have a scarcity problem and can just give out flag
bits for enabling new functionality like this...

> what if we just didn't have an on-disk
> IOMAP_IMMUTABLE bit, and set FS_XFLAG based only on the in-core
> S_IOMAP_IMMUTABLE bit?  If a program wants the immutable iomap
> semantics, they will have to code some variant on the following:
> 
> fd = open(...);
> ret = fallocate(fd, FALLOC_FL_SEAL_BLOCK_MAP, 0, len...)
> if (ret) {
> 	printf("couldn't seal block map");
> 	close(fd);
> 	return;
> }
> 
> mmap(fd...);
> /* do sensitive io operations here */
> munmap(fd...);
> 
> close(fd);
> 
> Therefore the cost of not having the on-disk flag is that we'll have to
> do more unshare/alloc/test/set cycles than we would if we could remember
> the iomap-immutable state across unmounts and inode reclaiming.
> However, if the data map is already ready to go, this shouldn't have a
> lot of overhead since we only have to iterate the in-core extents.
> 
> Just trying to make sure we /need/ the inode flag bit. :)

IMO, fallocate() is for making permanent changes to file extents. If
this is not going to be a permanent state change but only a
runtime-while-the-inode-is-in-cache flag, then it's probably not the
right interface to use.

This also seems problematic for applications other than DAX where
the block map may be sealed, the fd closed and access handed off to
another entity for remote storage access. If the inode gets
reclaimed due to memory pressure, the system loses the fact that
the inode has been sealed. Hence another process can come
along, re-read the inode and modify the block map because it hasn't
been sealed in this new cache life cycle.....
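
The reclaim scenario above can be sketched as a small userspace C toy
model. Nothing here is XFS code: the toy_* names and TOY_* constants are
invented for illustration, and "reclaim" is modelled simply as
rebuilding the in-core inode from the on-disk one.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_DIFLAG2_IOMAP_IMMUTABLE	(1u << 3)	/* on-disk bit, as in the patch */
#define TOY_S_IOMAP_IMMUTABLE		(1u << 0)	/* in-core bit, value made up */

struct toy_dinode {
	unsigned int	di_flags2;	/* survives reclaim and remount */
};

struct toy_vfs_inode {
	unsigned int	i_flags;	/* rebuilt every time the inode is loaded */
};

/* Models inode cache population: in-core flags come only from disk. */
static struct toy_vfs_inode toy_iget(const struct toy_dinode *dip)
{
	struct toy_vfs_inode inode = { 0 };

	if (dip->di_flags2 & TOY_DIFLAG2_IOMAP_IMMUTABLE)
		inode.i_flags |= TOY_S_IOMAP_IMMUTABLE;
	return inode;
}

/* Seal the file; 'persist' selects whether the on-disk bit is written too. */
static void toy_seal(struct toy_dinode *dip, struct toy_vfs_inode *inode,
		     bool persist)
{
	inode->i_flags |= TOY_S_IOMAP_IMMUTABLE;
	if (persist)
		dip->di_flags2 |= TOY_DIFLAG2_IOMAP_IMMUTABLE;
}

/* Returns true if the seal is still visible after reclaim and reload. */
static bool toy_seal_survives_reclaim(bool persist)
{
	struct toy_dinode disk = { 0 };
	struct toy_vfs_inode inode = toy_iget(&disk);

	toy_seal(&disk, &inode, persist);
	inode = toy_iget(&disk);	/* reclaim, then another process re-reads */
	return (inode.i_flags & TOY_S_IOMAP_IMMUTABLE) != 0;
}
```

In this model toy_seal_survives_reclaim(false) reports the seal as lost,
which is the in-core-only case being argued against, while the
persistent variant keeps it.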

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 4/5] xfs: introduce XFS_DIFLAG2_IOMAP_IMMUTABLE
  2017-08-04 23:46       ` Dave Chinner
@ 2017-08-04 23:57         ` Darrick J. Wong
  -1 siblings, 0 replies; 108+ messages in thread
From: Darrick J. Wong @ 2017-08-04 23:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-nvdimm, linux-kernel, Christoph Hellwig,
	linux-xfs, luto, linux-fsdevel

On Sat, Aug 05, 2017 at 09:46:15AM +1000, Dave Chinner wrote:
> On Fri, Aug 04, 2017 at 01:33:12PM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 03, 2017 at 07:28:30PM -0700, Dan Williams wrote:
> > > Add an on-disk inode flag to record the state of the S_IOMAP_IMMUTABLE
> > > in-memory vfs inode flags. This allows the protections against reflink
> > > > and hole punch to be automatically restored on a subsequent boot when
> > > the in-memory inode is established.
> > > 
> > > The FS_XFLAG_IOMAP_IMMUTABLE is introduced to allow xfs_io to read the
> > > state of the flag, but toggling the flag requires going through
> > > fallocate(FALLOC_FL_[UN]SEAL_BLOCK_MAP). Support for toggling this
> > > on-disk state is saved for a later patch.
> > > 
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Jeff Moyer <jmoyer@redhat.com>
> > > Cc: Christoph Hellwig <hch@lst.de>
> > > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Suggested-by: Dave Chinner <david@fromorbit.com>
> > > Suggested-by: "Darrick J. Wong" <darrick.wong@oracle.com>
> > > Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_format.h |    5 ++++-
> > >  fs/xfs/xfs_inode.c         |    2 ++
> > >  fs/xfs/xfs_ioctl.c         |    1 +
> > >  fs/xfs/xfs_iops.c          |    8 +++++---
> > >  include/uapi/linux/fs.h    |    1 +
> > >  5 files changed, 13 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index d4d9bef20c3a..9e720e55776b 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -1063,12 +1063,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> > >  #define XFS_DIFLAG2_DAX_BIT	0	/* use DAX for this inode */
> > >  #define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
> > >  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
> > > +#define XFS_DIFLAG2_IOMAP_IMMUTABLE_BIT 3 /* set S_IOMAP_IMMUTABLE for this inode */
> > 
> > So... the greedy part of my brain that doesn't want to give out flags2
> > bits has been wondering,
> 
> FWIW, I made di_flags2 a 64 bit value in the first place precisely
> so we didn't have a scarcity problem and can just give out flag
> bits for enabling new functionality like this...

Ok.  That's what I thought.

> > what if we just didn't have an on-disk
> > IOMAP_IMMUTABLE bit, and set FS_XFLAG based only on the in-core
> > S_IOMAP_IMMUTABLE bit?  If a program wants the immutable iomap
> > semantics, they will have to code some variant on the following:
> > 
> > fd = open(...);
> > ret = fallocate(fd, FALLOC_FL_SEAL_BLOCK_MAP, 0, len...)
> > if (ret) {
> > 	printf("couldn't seal block map");
> > 	close(fd);
> > 	return;
> > }
> > 
> > mmap(fd...);
> > /* do sensitive io operations here */
> > munmap(fd...);
> > 
> > close(fd);
> > 
> > Therefore the cost of not having the on-disk flag is that we'll have to
> > do more unshare/alloc/test/set cycles than we would if we could remember
> > the iomap-immutable state across unmounts and inode reclaiming.
> > However, if the data map is already ready to go, this shouldn't have a
> > lot of overhead since we only have to iterate the in-core extents.
> > 
> > Just trying to make sure we /need/ the inode flag bit. :)
> 
> IMO, fallocate() is for making permanent changes to file extents. If
> this is not going to be a permanent state change but only a
> runtime-while-the-inode-is-in-cache flag, then it's probably not the
> right interface to use.
> 
> This also seems problematic for applications other than DAX where
> the block map may be sealed, the fd closed and access handed off to
> another entity for remote storage access. If the inode gets
> reclaimed due to memory pressure, the system loses the fact that
> the inode has been sealed. Hence another process can come
> along, re-read the inode and modify the block map because it hasn't
> been sealed in this new cache life cycle.....

<nod> Ok, I'm convinced. :)

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 2/5] fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP
  2017-08-04 23:43       ` Dan Williams
@ 2017-08-05  0:04         ` Dave Chinner
  -1 siblings, 0 replies; 108+ messages in thread
From: Dave Chinner @ 2017-08-05  0:04 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, linux-kernel, linux-xfs,
	Alexander Viro, Andy Lutomirski, linux-fsdevel,
	Christoph Hellwig

On Fri, Aug 04, 2017 at 04:43:50PM -0700, Dan Williams wrote:
> On Fri, Aug 4, 2017 at 4:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Aug 03, 2017 at 07:28:17PM -0700, Dan Williams wrote:
> >> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> >> index fe0f8f7f4bb7..46d8eb9e19fc 100644
> >> --- a/fs/xfs/xfs_bmap_util.c
> >> +++ b/fs/xfs/xfs_bmap_util.c
> >> @@ -1393,6 +1393,107 @@ xfs_zero_file_space(
> >>
> >>  }
> >>
> >> +/* Return 1 if hole detected, 0 if not, and < 0 if fail to determine */
> >> +STATIC int
> >> +xfs_file_has_holes(
> >> +     struct xfs_inode        *ip)
> >> +{
> >
> > Why do we need this function?
> >
> > We've just run xfs_alloc_file_space() across the entire range we
> > are sealing, so we've already guaranteed that it won't have holes
> > in it.
> 
> I'm sure this is due to my ignorance of the scope of XFS_IOLOCK_EXCL
> vs XFS_ILOCK_EXCL. I had assumed that since we drop and retake
> XFS_ILOCK_EXCL that we need to re-validate the block map before
> setting S_IOMAP_IMMUTABLE.

The ILOCK is there to protect the inode metadata when there is
concurrent access through the IO/MMAP lock paths.  However, if we
hold the IOLOCK_EXCL and the MMAPLOCK_EXCL, then nothing can get
through the IO interfaces to modify the data in the file.  This is
required because APIs that directly modify the extent map (e.g.
fallocate, truncate, etc) have to lock out the IO path to ensure
there are no IOs in flight across the range we are manipulating.

Holding these locks also locks out other APIs that modify the extent
map and so effectively nothing else can be accessing or modifying
the extent map while a fallocate or truncate operation is in
progress.
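
The lock-ordering argument above can be sketched as a toy userspace
model. This is not kernel code: struct toy_inode and the toy_*() helpers
are hypothetical names invented for illustration, C11 atomic_flag stands
in for the real XFS locks, and shared versus exclusive modes are
collapsed into a single exclusive mode. The only point it makes is that
a modifier which must take the IOLOCK first cannot slip into the window
where the sealer has dropped the ILOCK.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-ins for the XFS inode locks; hypothetical, not kernel code. */
struct toy_inode {
	atomic_flag iolock;	/* models XFS_IOLOCK_EXCL */
	atomic_flag mmaplock;	/* models XFS_MMAPLOCK_EXCL */
	atomic_flag ilock;	/* models XFS_ILOCK_EXCL */
	int nr_extents;		/* stands in for the extent map */
};

static void toy_inode_init(struct toy_inode *ip)
{
	atomic_flag_clear(&ip->iolock);
	atomic_flag_clear(&ip->mmaplock);
	atomic_flag_clear(&ip->ilock);
	ip->nr_extents = 0;
}

static bool toy_trylock(atomic_flag *l)
{
	return !atomic_flag_test_and_set(l);
}

static void toy_unlock(atomic_flag *l)
{
	atomic_flag_clear(l);
}

/* An extent-map modifier: IOLOCK, then MMAPLOCK, then ILOCK. */
static bool toy_modify_extents(struct toy_inode *ip)
{
	if (!toy_trylock(&ip->iolock))
		return false;
	if (!toy_trylock(&ip->mmaplock)) {
		toy_unlock(&ip->iolock);
		return false;
	}
	if (!toy_trylock(&ip->ilock)) {
		toy_unlock(&ip->mmaplock);
		toy_unlock(&ip->iolock);
		return false;
	}
	ip->nr_extents++;
	toy_unlock(&ip->ilock);
	toy_unlock(&ip->mmaplock);
	toy_unlock(&ip->iolock);
	return true;
}

/* Returns true if the map was unchanged across the ILOCK drop/retake. */
static bool toy_seal(struct toy_inode *ip)
{
	int before, after;
	bool raced;

	if (!toy_trylock(&ip->iolock))
		return false;
	if (!toy_trylock(&ip->mmaplock)) {
		toy_unlock(&ip->iolock);
		return false;
	}

	/* First ILOCK cycle: sample the map. */
	while (!toy_trylock(&ip->ilock))
		;
	before = ip->nr_extents;
	toy_unlock(&ip->ilock);

	/* ILOCK is dropped; a modifier tries to get in during the window. */
	raced = toy_modify_extents(ip);

	/* Second ILOCK cycle: sample the map again. */
	while (!toy_trylock(&ip->ilock))
		;
	after = ip->nr_extents;
	toy_unlock(&ip->ilock);

	toy_unlock(&ip->mmaplock);
	toy_unlock(&ip->iolock);
	return !raced && before == after;
}
```

In this model toy_seal() reports an unchanged map because
toy_modify_extents() fails at the IOLOCK, which is the reason a second
hole check after retaking the ILOCK would be redundant.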

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 1/5] fs, xfs: introduce S_IOMAP_IMMUTABLE
  2017-08-04  2:28   ` Dan Williams
@ 2017-08-05  9:47     ` Christoph Hellwig
  -1 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2017-08-05  9:47 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, darrick.wong, Dave Chinner, linux-kernel,
	linux-xfs, Alexander Viro, luto, linux-fsdevel,
	Christoph Hellwig

NAK^4.

We should not allow users to create immutable files.  We have
proper ways to synchronize I/O, and this is just an invitation
for horrible abuses that should not be allowed, and which we've
always told people not to do.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-04  2:38   ` Dan Williams
  (?)
@ 2017-08-05  9:50     ` Christoph Hellwig
  -1 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2017-08-05  9:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel, linux-xfs, Alexander Viro, Andy Lutomirski,
	linux-fsdevel, Christoph Hellwig

On Thu, Aug 03, 2017 at 07:38:11PM -0700, Dan Williams wrote:
> [ adding linux-api to the cover letter for notification, will send the
> full set to linux-api for v3 ]

Just don't send this crap ever again.  All the so called use cases in the
earlier thread were incorrect and highly dangerous.

Promising that the block map is stable is not a useful userspace API,
as the block map is a complete internal implementation detail.

We've been through this a few times but let me repeat it:  The only
sensible API guarantee is one that is observable and usable.

So Jan's synchronous page fault flag in one form or another makes
perfect sense as it is a clear recipe for the user:  you don't
have to call msync to persist your mmap writes.  This API is not:
it guarantees that the block map does not change, but the application
has absolutely no reason to even know about the block map.
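
As a rough illustration of what such a recipe looks like from userspace,
here is a minimal sketch. It assumes the MAP_SYNC and
MAP_SHARED_VALIDATE mmap flags as found in current Linux headers; the
fallback #define values are the asm-generic ones and are an assumption
for architectures that differ. map_persistent() and its needs_msync
out-parameter are hypothetical names. The helper tries a synchronous
mapping first and reports whether the caller still has to fall back to
msync().

```c
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for older headers; values assumed from asm-generic. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Map len bytes of fd for writing.  On return *needs_msync is 0 if the
 * synchronous mapping was granted and 1 if we fell back to a plain
 * MAP_SHARED mapping.  Returns MAP_FAILED if both attempts fail.
 */
static void *map_persistent(int fd, size_t len, int *needs_msync)
{
	void *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p != MAP_FAILED) {
		*needs_msync = 0;
		return p;
	}

	/* Not a DAX file, or the kernel does not know the flag. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	*needs_msync = 1;
	return p;
}
```

The application only ever sees "msync needed or not"; nothing about the
block map leaks into the interface.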

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-05  9:50     ` Christoph Hellwig
  (?)
@ 2017-08-06 18:51       ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-06 18:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel, linux-xfs, Alexander Viro, Andy Lutomirski,
	linux-fsdevel

On Sat, Aug 5, 2017 at 2:50 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Aug 03, 2017 at 07:38:11PM -0700, Dan Williams wrote:
>> [ adding linux-api to the cover letter for notification, will send the
>> full set to linux-api for v3 ]
>
> Just don't send this crap ever again.  All the so called use cases in the
> earlier thread were incorrect and highly dangerous.

I usually end up coming around to your position on these types of
debates because you almost always put forward unassailable technical
arguments. So far, you have not in this case.

> Promising that the block map is stable is not a useful userspace API,
> as the block map is a complete internal implementation detail.

Of course it's a useful API. An application already needs to worry
about the block map, that's why we have fallocate, msync, fiemap
and...

> We've been through this a few times but let me repeat it:  The only
> sensible API guarantee is one that is observable and usable.

I'm missing how block-map immutable files violate this observable and
usable constraint?

> So Jan's synchronous page fault flag in one form or another makes
> perfect sense as it is a clear recipe for the user:  you don't
> have to call msync to persist your mmap writes.  This API is not:
> it guarantees that the block map does not change, but the application
> has absolutely no reason to even know about the block map.

Jan's approach is great, it should go in, it solves a long-standing
problem with dax with the only drawback being potentially
unpredictable latency spikes.

This immutable approach should also go in, it solves the same problem
without the latency drawback, but yes, with the administrative
overhead of CAP_LINUX_IMMUTABLE. Beyond flush from userspace it also
can be used to solve the swapfile problems you highlighted and it
allows safe ongoing dma to a filesystem-dax mapping beyond what we can
already do with direct-I/O. There is demand for these capabilities
that cannot be satisfied by just hand waving them away as invalid.

The magnitude of opposition to this approach is out of step with the
actual risk.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 1/5] fs, xfs: introduce S_IOMAP_IMMUTABLE
  2017-08-05  9:47     ` Christoph Hellwig
@ 2017-08-07  0:25       ` Dave Chinner
  -1 siblings, 0 replies; 108+ messages in thread
From: Dave Chinner @ 2017-08-07  0:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, darrick.wong, linux-kernel, linux-xfs,
	Alexander Viro, luto, linux-fsdevel

On Sat, Aug 05, 2017 at 11:47:08AM +0200, Christoph Hellwig wrote:
> NAK^4.
> 
> We should not allow users to create immutable files.  We have
> proper ways to synchronize I/O, and this is just an invitation
> for horrible abuses that should not be allowed, and which we've
> always told people not to do.

We've always told people not to do those "horrible abuses" because
of the TOCTOU race conditions inherent in getting accurate
BMAP/FIEMAP information to userspace. However, immutable extent maps
solve the TOCTOU problem and so remove the only *technical* barrier
in the way of using extent maps to implement functionality such as
userspace pNFS servers.

The core requirement for a userspace pNFS block server to be able to
safely export the block map of a file to remote clients is that the
extent map is allocated and will not change while the client has
been granted access to it. Immutable extent maps provide that
functionality to userspace.  However, for this to work, we
filesystem developers have to give up the idea that only the
filesystem can access the storage underlying the filesystem.

I'm not writing this for your benefit, Christoph, but for everyone
else who doesn't know about existing direct remote storage access
protocols and implementations. That is, I'm letting everyone know
we've already had to give up the exclusive storage device access
model...

.... when you implemented the kernel pNFS server code that provides
unknown third parties with the *remote direct access* to the storage
underlying the XFS filesystem.

Yup, we already allow third parties to arbitrate and directly access
the XFS block device map. That "horrible abuse" was allowed
because it could be done safely via NFSv4 delegations and a new API
that provided a "blocks will always be allocated before a write and
won't change while the remote client has access" guarantee from XFS
to the kernel pNFS server (i.e. ->map_blocks()/->commit_blocks()
export ops and the break_layouts() API).

Immutable extent maps provide userspace with this same guarantee, so
what used to be considered a "horrible abuse" can now be done safely
and without risking data and/or filesystem corruption.  So, really,
calling this an "invitation to horrible abuses that should not be
allowed" ignores the reality that you were the architect that
introduced this "safe remote direct access" model to convert a
"horrible abuse" into a set of safe, supportable operations.

In the end, all I care about is that everyone understands the
technical merits of the proposals being considered rather than
discussion and review being shut down because "Christoph shouted
nasty words at me but I still don't understand why?".....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 1/5] fs, xfs: introduce S_IOMAP_IMMUTABLE
  2017-08-07  0:25       ` Dave Chinner
@ 2017-08-11 10:34         ` Christoph Hellwig
  -1 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2017-08-11 10:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-nvdimm, darrick.wong, linux-kernel, linux-xfs,
	Alexander Viro, luto, linux-fsdevel, Christoph Hellwig

On Mon, Aug 07, 2017 at 10:25:02AM +1000, Dave Chinner wrote:
> We've always told people not to do those "horrible abuses" because
> of the TOCTOU race conditions inherent in getting accurate
> BMAP/FIEMAP information to userspace. However, immutable extent maps
> solve the TOCTOU problem and so remove the only *technical* barrier
> in the way of using extent maps to implement functionality such as
> userspace pNFS servers.

For pNFS block/scsi and my upcoming RDMA persistent memory layout?
Hell no - we'll need concepts we can't expose to userspace for them,
and to expose the advanced functionality people are asking for
(reflinks, atomic updates, no stale data exposure) immutable extent
maps won't work at all.

> The core requirement for a userspace pNFS block server to be able to
> safely export the block map of a file to remote clients is that the
> extent map is allocated and will not change while the client has
> been granted access to it.

No.  The core feature for the block layout is to create an unwritten
extent that we can expose to the client for writing, and to mark it
as written only after commit by converting the extent list.

Now I know you're going to argue that this could work with pre-zeroing
the extents, but for an actual SCSI or NVMe device that will suck
badly.  And for RDMA-like layouts we don't even need the zeroing as
we can control client behavior a lot better because memory registrations
allow much more fine-grained control.

Either way we need a good notification from the file system to the
server when the extent map changes.

But for either the block or the RDMA layout, an implementation with the
filesystem in kernel space and the server in userspace is stupid, as they
need to interact closely.  There is a good reason why all successful NFS
products have the server very tightly coupled to the file system, and a
userspace <-> kernel barrier does not help with that.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-06 18:51       ` Dan Williams
  (?)
@ 2017-08-11 10:44         ` Christoph Hellwig
  -1 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2017-08-11 10:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel, linux-xfs, Alexander Viro, Andy Lutomirski,
	linux-fsdevel, Christoph Hellwig

On Sun, Aug 06, 2017 at 11:51:50AM -0700, Dan Williams wrote:
> Of course it's a useful API. An application already needs to worry
> about the block map, that's why we have fallocate, msync, fiemap
> and...

Fallocate and msync do not expose the block map in any way.  Proof:
they work just fine over say nfs.

fiemap does indeed expose the block map, which is the whole point.
But it's a debug tool that we don't even have a man page for.  And
it's not usable for anything else, if only for the fact that it doesn't
tell you what device your returned extents are relative to.
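
The point about fiemap can be read straight off the ioctl's own
structures. A minimal sketch, assuming only FS_IOC_FIEMAP from
<linux/fs.h> and struct fiemap / struct fiemap_extent from
<linux/fiemap.h>; dump_extents() and MAX_EXTENTS are names made up for
this example. Each extent carries fe_logical, fe_physical, fe_length
and fe_flags, and there is no field naming the device that fe_physical
is relative to.

```c
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

#define MAX_EXTENTS 32	/* arbitrary cap for this sketch */

/* Print up to MAX_EXTENTS extents of fd; return the count or -1. */
static int dump_extents(int fd)
{
	size_t sz = sizeof(struct fiemap) +
		    MAX_EXTENTS * sizeof(struct fiemap_extent);
	struct fiemap *fm = calloc(1, sz);
	unsigned int i;
	int n;

	if (!fm)
		return -1;

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush before mapping */
	fm->fm_extent_count = MAX_EXTENTS;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		free(fm);
		return -1;
	}

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *e = &fm->fm_extents[i];

		/* fe_physical is a bare byte offset: on which device? */
		printf("logical %llu physical %llu length %llu flags 0x%x\n",
		       (unsigned long long)e->fe_logical,
		       (unsigned long long)e->fe_physical,
		       (unsigned long long)e->fe_length,
		       (unsigned int)e->fe_flags);
	}

	n = (int)fm->fm_mapped_extents;
	free(fm);
	return n;
}
```

Nothing stops the filesystem from changing the map between this call
returning and the caller acting on the output.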

> > We've been through this a few times but let me repeat it:  The only
> > sensible API guarantee is one that is observable and usable.
> 
> I'm missing how block-map immutable files violate this observable and
> usable constraint?

What is the observable behavior of an extent map change?  How can you
describe your immutable extent map behavior so that when I violate
it by, e.g., moving one extent to a different place on disk you can
observe that in userspace?

> This immutable approach should also go in, it solves the same problem
> without the latency drawback,

How is your latency going to be any different from MAP_SYNC on
a fully allocated and pre-zeroed file?

> Beyond flush from userspace it also
> can be used to solve the swapfile problems you highlighted

Which swapfile problem?

> and it
> allows safe ongoing dma to a filesystem-dax mapping beyond what we can
> already do with direct-I/O.

Please explain how this interface allows for any sort of safe userspace
DMA.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-11 10:44         ` Christoph Hellwig
  (?)
@ 2017-08-11 22:26           ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-11 22:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel, linux-xfs, Alexander Viro, Andy Lutomirski,
	linux-fsdevel

On Fri, Aug 11, 2017 at 3:44 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Sun, Aug 06, 2017 at 11:51:50AM -0700, Dan Williams wrote:
>> Of course it's a useful API. An application already needs to worry
>> about the block map, that's why we have fallocate, msync, fiemap
>> and...
>
> Fallocate and msync do not expose the block map in any way.  Proof:
> they work just fine over say nfs.

Right, but they let userspace make inferences about the state of
metadata relative to I/O to a given storage address. In this regard
S_IOMAP_IMMUTABLE is no different than MAP_SYNC, but 'immutable' goes
a step further to let an application infer that the storage address is
stable. This enables applications that MAP_SYNC does not, see below.

> fiemap does indeed expose the block map, which is the whole point.
> But it's a debug tool that we don't even have a man page for.  And
> it's not usable for anything else, if only for the fact that it doesn't
> tell you what device your returned extents are relative to.

True, one couldn't just use immutable + fiemap and expect to have the
right storage device.
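
To make the limitation concrete, here is a rough sketch (Python for brevity;
the helper names are illustrative, and the struct layouts are written from
the fiemap UAPI header as I understand it) that decodes what FS_IOC_FIEMAP
returns. Each extent carries a logical offset, a physical offset, a length
and flags, with no field naming the device the physical offset refers to.

```python
import fcntl
import struct
import sys

FS_IOC_FIEMAP = 0xC020660B  # _IOWR('f', 11, struct fiemap)
# struct fiemap header: fm_start, fm_length, fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved
FIEMAP_HDR = struct.Struct("=QQIIII")
# struct fiemap_extent: fe_logical, fe_physical, fe_length,
# fe_reserved64[2], fe_flags, fe_reserved[3]
FIEMAP_EXTENT = struct.Struct("=QQQ2QI3I")


def parse_fiemap(buf):
    """Decode a filled struct fiemap buffer into extent tuples."""
    mapped = FIEMAP_HDR.unpack_from(buf, 0)[3]
    extents = []
    for i in range(mapped):
        off = FIEMAP_HDR.size + i * FIEMAP_EXTENT.size
        fields = FIEMAP_EXTENT.unpack_from(buf, off)
        logical, physical, length, flags = fields[0], fields[1], fields[2], fields[5]
        # Note what is missing: nothing here identifies a block device.
        extents.append((logical, physical, length, flags))
    return extents


def file_extents(path, max_extents=32):
    """Ask the filesystem for up to max_extents extents of the whole file."""
    buf = bytearray(FIEMAP_HDR.size + max_extents * FIEMAP_EXTENT.size)
    FIEMAP_HDR.pack_into(buf, 0, 0, 0xFFFFFFFFFFFFFFFF, 0, 0, max_extents, 0)
    with open(path, "rb") as f:
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)  # fills buf in place
    return parse_fiemap(buf)


if __name__ == "__main__" and len(sys.argv) > 1:
    for extent in file_extents(sys.argv[1]):
        print(extent)
```

Whether the ioctl succeeds at all depends on the filesystem backing the path.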

>
>> > We've been through this a few times but let me repeat it:  The only
>> > sensible API guarantee is one that is observable and usable.
>>
>> I'm missing how block-map immutable files violate this observable and
>> usable constraint?
>
> What is the observable behavior of an extent map change?  How can you
> describe your immutable extent map behavior so that when I violate
> them by e.g. moving one extent to a different place on disk you can
> observe that in userspace?

The violation is blocked, it's immutable. Using this feature means the
application is taking away some of the kernel's freedom. That is a
valid / safe tradeoff for the set of applications that would otherwise
resort to raw device access.

>
>> This immutable approach should also go in, it solves the same problem
>> without the latency drawback,
>
> How is your latency going to be any different from MAP_SYNC on
> a fully allocated and pre-zeroed file?

So, I went back and read Jan's patches, and in the pre-allocated case
I don't think we can get stuck behind a backlog of dirty metadata
flushing since the implementation only seems to take the synchronous
fault path if the fault dirtied the block map.

>> Beyond flush from userspace it also
>> can be used to solve the swapfile problems you highlighted
>
> Which swapfile problem?

The TOCTOU problem of enabling swap vs reflink that you mentioned in
your criticism of the daxctl syscall, but now that I look, your
comments were based on the *general* case use of bmap(). However, xfs
in particular as of commits:

   eb5e248d502b xfs: don't allow bmap on rt files
   db1327b16c2b xfs: report shared extent mappings to userspace correctly

...doesn't appear to have this problem. That said Dave's idea to use
immutable + unwritten extents for swap makes sense to me. That's a
feature, not a bug fix, but I went ahead and appended a
proof-of-concept implementation to the v3 posting.

>> and it
>> allows safe ongoing dma to a filesystem-dax mapping beyond what we can
>> already do with direct-I/O.
>
> Please explain how this interface allows for any sort of safe userspace
> DMA.

So this is where I continue to see S_IOMAP_IMMUTABLE being able to
support applications that MAP_SYNC does not. Dave mentioned userspace
pNFS4 servers, but there's also Samba and other protocols that want to
negotiate a direct path to pmem outside the kernel. Xen support has
thus far not been able to follow in the footsteps of KVM enabling due
to a dependence on static M2P tables that assume a static
guest-physical to host-physical relationship [1]. Immutable files
would allow Xen to follow the same "mmap a file" semantic as KVM.

Applications that just want flush from userspace can use MAP_SYNC,
those that need to temporarily pin the block for RDMA can use the
in-kernel pNFS server, and those that need to coordinate both from
userspace can use S_IOMAP_IMMUTABLE. It's a continuum, not a
competition.

[1]: https://lists.xen.org/archives/html/xen-devel/2017-04/msg00427.html

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-11 22:26           ` Dan Williams
@ 2017-08-12  3:57             ` Andy Lutomirski
  -1 siblings, 0 replies; 108+ messages in thread
From: Andy Lutomirski @ 2017-08-12  3:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel, linux-xfs, Alexander Viro, Andy Lutomirski,
	linux-fsdevel, Christoph Hellwig

On Fri, Aug 11, 2017 at 3:26 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Fri, Aug 11, 2017 at 3:44 AM, Christoph Hellwig <hch@lst.de> wrote:
>> Please explain how this interface allows for any sort of safe userspace
>> DMA.
>
> So this is where I continue to see S_IOMAP_IMMUTABLE being able to
> support applications that MAP_SYNC does not. Dave mentioned userspace
> pNFS4 servers, but there's also Samba and other protocols that want to
> negotiate a direct path to pmem outside the kernel. Xen support has
> thus far not been able to follow in the footsteps of KVM enabling due
> to a dependence on static M2P tables that assume a static
> guest-physical to host-physical relationship [1]. Immutable files
> would allow Xen to follow the same "mmap a file" semantic as KVM.

One thing that makes me quite nervous about S_IOMAP_IMMUTABLE is the
degree to which things go badly if one program relies on it while
another program clears the flag: you risk corrupting unrelated
filesystem metadata.  I think a userspace interface to pin the extent
mapping of a file really wants a way to reliably keep it pinned (or to
reliably zap the userspace application if it gets unpinned).

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-12  3:57             ` Andy Lutomirski
  (?)
@ 2017-08-12  4:44               ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-12  4:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel, linux-xfs, Alexander Viro, linux-fsdevel,
	Christoph Hellwig

On Fri, Aug 11, 2017 at 8:57 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Aug 11, 2017 at 3:26 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Fri, Aug 11, 2017 at 3:44 AM, Christoph Hellwig <hch@lst.de> wrote:
>>> Please explain how this interface allows for any sort of safe userspace
>>> DMA.
>>
>> So this is where I continue to see S_IOMAP_IMMUTABLE being able to
>> support applications that MAP_SYNC does not. Dave mentioned userspace
>> pNFS4 servers, but there's also Samba and other protocols that want to
>> negotiate a direct path to pmem outside the kernel. Xen support has
>> thus far not been able to follow in the footsteps of KVM enabling due
>> to a dependence on static M2P tables that assume a static
>> guest-physical to host-physical relationship [1]. Immutable files
>> would allow Xen to follow the same "mmap a file" semantic as KVM.
>
> One thing that makes me quite nervous about S_IOMAP_IMMUTABLE is the
> degree to which things go badly if one program relies on it while
> another program clears the flag: you risk corrupting unrelated
> filesystem metadata.  I think a userspace interface to pin the extent
> mapping of a file really wants a way to reliably keep it pinned (or to
> reliably zap the userspace application if it gets unpinned).

In the current patches, mapping_mapped() pins the immutable state.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-11 22:26           ` Dan Williams
  (?)
@ 2017-08-12  7:33             ` Christoph Hellwig
  -1 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2017-08-12  7:33 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel, linux-xfs, Alexander Viro, Andy Lutomirski,
	linux-fsdevel, Christoph Hellwig

On Fri, Aug 11, 2017 at 03:26:05PM -0700, Dan Williams wrote:
> Right, but they let userspace make inferences about the state of
> metadata relative to I/O to a given storage address. In this regard
> S_IOMAP_IMMUTABLE is no different than MAP_SYNC, but 'immutable' goes
> a step further to let an application infer that the storage address is
> stable. This enables applications that MAP_SYNC does not, see below.

But the application must not know (and cannot know) the storage address,
so it doesn't matter.

> > What is the observable behavior of an extent map change?  How can you
> > describe your immutable extent map behavior so that when I violate
> > them by e.g. moving one extent to a different place on disk you can
> > observe that in userspace?
> 
> The violation is blocked, it's immutable. Using this feature means the
> application is taking away some of the kernel's freedom. That is a
> valid / safe tradeoff for the set of applications that would otherwise
> resort to raw device access.

What can the application do with it safely that it can't otherwise do?
Short answer: nothing.

> >
> > Please explain how this interface allows for any sort of safe userspace
> > DMA.
> 
> So this is where I continue to see S_IOMAP_IMMUTABLE being able to
> support applications that MAP_SYNC does not. Dave mentioned userspace
> pNFS4 servers, but there's also Samba and other protocols that want to
> negotiate a direct path to pmem outside the kernel.

Userspace pNFS servers must use a userspace file system.  Everything
else is just braindead stupid due to the amount of communication they
need to do.  Also note that the only pNFS layouts that would even cause
direct block access are pNFS block/scsi and for those the
S_IOMAP_IMMUTABLE semantics are not very useful (background: I wrote
the Linux implementation for those, and authored the scsi layout spec)


> Applications that just want flush from userspace can use MAP_SYNC,
> those that need to temporarily pin the block for RDMA can use the
> in-kernel pNFS server, and those that need to coordinate both from
> userspace can use S_IOMAP_IMMUTABLE. It's a continuum, not a
> competition.

Again - how does your application even know that I moved your block
around with your S_IOMAP_IMMUTABLE?  We should never add interfaces
that mandate implementations - we should base interfaces on
user observable behavior - and debug tools like fiemap don't count.

Before going any further please write a man page that describes your
intended semantics in a way that an application programmer understands.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-12  3:57             ` Andy Lutomirski
  (?)
@ 2017-08-12  7:34               ` Christoph Hellwig
  -1 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2017-08-12  7:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel, linux-xfs, Alexander Viro, linux-fsdevel,
	Christoph Hellwig

On Fri, Aug 11, 2017 at 08:57:18PM -0700, Andy Lutomirski wrote:
> One thing that makes me quite nervous about S_IOMAP_IMMUTABLE is the
> degree to which things go badly if one program relies on it while
> another program clears the flag: you risk corrupting unrelated
> filesystem metadata.  I think a userspace interface to pin the extent
> mapping of a file really wants a way to reliably keep it pinned (or to
> reliably zap the userspace application if it gets unpinned).

The nice thing is that no application can rely on it anyway..

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-12  7:33             ` Christoph Hellwig
  (?)
@ 2017-08-12 19:19               ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-12 19:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel, linux-xfs, Alexander Viro, Andy Lutomirski,
	linux-fsdevel

On Sat, Aug 12, 2017 at 12:33 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Fri, Aug 11, 2017 at 03:26:05PM -0700, Dan Williams wrote:
>> Right, but they let userspace make inferences about the state of
>> metadata relative to I/O to a given storage address. In this regard
>> S_IOMAP_IMMUTABLE is no different than MAP_SYNC, but 'immutable' goes
>> a step further to let an application infer that the storage address is
>> stable. This enables applications that MAP_SYNC does not, see below.
>
> But the application must not know (and cannot know) the storage address,
> so it doesn't matter.
>
>> > What is the observable behavior of an extent map change?  How can you
>> > describe your immutable extent map behavior so that when I violate
>> > them by e.g. moving one extent to a different place on disk you can
>> > observe that in userspace?
>>
>> The violation is blocked, it's immutable. Using this feature means the
>> application is taking away some of the kernel's freedom. That is a
>> valid / safe tradeoff for the set of applications that would otherwise
>> resort to raw device access.
>
> What can the application do with it safely that it can't otherwise do?
> Short answer: nothing.

The application does not need to know the storage address; it needs to
know that the storage address to file offset mapping is fixed. With this
information it can make assumptions about the permanence of results it
gets from the kernel.

For example get_user_pages() today makes no guarantees outside of
"page will not be freed", but with immutable files and dax you now
have a mechanism for userspace to coordinate direct access to storage
addresses. Those raw storage addresses need not be exposed to the
application, as you say it doesn't need to know that detail. MAP_SYNC
does not fully satisfy this case because it requires agents that can
generate MMU faults to coordinate with the filesystem.

>> >
>> > Please explain how this interface allows for any sort of safe userspace
>> > DMA.
>>
>> So this is where I continue to see S_IOMAP_IMMUTABLE being able to
>> support applications that MAP_SYNC does not. Dave mentioned userspace
>> pNFS4 servers, but there's also Samba and other protocols that want to
>> negotiate a direct path to pmem outside the kernel.
>
> Userspace pNFS servers must use a userspace file system.  Everything
> else is just braindead stupid due to the amount of communication they
> need to do.  Also note that the only pNFS layouts that would even cause
> direct block access are pNFS block/scsi and for those the
> S_IOMAP_IMMUTABLE semantics are not very useful (background: I wrote
> the Linux implementation for those, and authored the scsi layout spec)
>

Understood.

All I know is that SMB Direct for persistent memory seems like a
potential consumer. I know they're not going to use a userspace
filesystem or put an SMB server in the kernel.

>
>> Applications that just want flush from userspace can use MAP_SYNC,
>> those that need to temporarily pin the block for RDMA can use the
>> in-kernel pNFS server, and those that need to coordinate both from
>> userspace can use S_IOMAP_IMMUTABLE. It's a continuum, not a
>> competition.
>
> Again - how does your application even know that I moved your block
> around with your S_IOMAP_IMMUTABLE?  We should never add interfaces
> that mandate implementations - we should base interfaces on
> user observable behavior - and debug tools like fiemap don't count.

I'm still not grokking this "I moved your block" example. What agent
is moving blocks while the file is immutable?

> Before going any further please write a man page that describes your
> intended semantics in a way that an application programmer understands.

Sure, I'll try to write this up in terms of the use cases I know about
that can immediately consume it and switch away from device-dax.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-12 19:19               ` Dan Williams
@ 2017-08-13  9:24                 ` Christoph Hellwig
  -1 siblings, 0 replies; 108+ messages in thread
From: Christoph Hellwig @ 2017-08-13  9:24 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel, linux-xfs, Alexander Viro, Andy Lutomirski,
	linux-fsdevel, Christoph Hellwig

On Sat, Aug 12, 2017 at 12:19:50PM -0700, Dan Williams wrote:
> The application does not need to know the storage address, it needs to
> know that the storage address to file offset is fixed. With this
> information it can make assumptions about the permanence of results it
> gets from the kernel.

Only if we clearly document that fact - and documenting the permanence
is different from saying the block map won't change.

> For example get_user_pages() today makes no guarantees outside of
> "page will not be freed",

It also makes the extremely important guarantee that the page won't
_move_ - e.g. that we won't do a memory migration for compaction or
other reasons.  That's why for example RDMA can use it to register
memory and then we can later set up memory windows that point to this
registration from userspace and implement userspace RDMA.

> but with immutable files and dax you now
> have a mechanism for userspace to coordinate direct access to storage
> addresses. Those raw storage addresses need not be exposed to the
> application, as you say it doesn't need to know that detail. MAP_SYNC
> does not fully satisfy this case because it requires agents that can
> generate MMU faults to coordinate with the filesystem.

The file system is always in the fault path, can you explain what other
agents you are talking about?

> All I know is that SMB Direct for persistent memory seems like a
> potential consumer. I know they're not going to use a userspace
> filesystem or put an SMB server in the kernel.

Last I talked to the Samba folks they didn't expect a userspace
SMB direct implementation to work anyway due to the fact that
libibverbs memory registrations interact badly with their fork()ing
daemon model.  That being said during the recent submission of the
RDMA client code some comments were made about userspace versions of
it, so I'm not sure if that opinion has changed in one way or another.

That being said I think we absolutely should support RDMA memory
registrations for DAX mappings.  I'm just not sure how S_IOMAP_IMMUTABLE
helps with that.  We'll want a MAP_SYNC | MAP_POPULATE to make sure
all the blocks are populated and all ptes are set up.  Second we need
to make sure get_user_pages works, which for now means we'll need a
struct page mapping for the region (which will be really annoying
for PCIe mappings, like the upcoming NVMe persistent memory region),
and we need to guarantee that the extent mapping won't change while
the get_user_pages holds the pages inside it.  I think that is true
due to side effects even with the current DAX code, but we'll need to
make it explicit.  And maybe that's where we need to converge -
"sealing" the extent map makes sense as such a temporary measure
that is not persisted on disk, which automatically gets released
when the holding process exits, because we sort of already do this
implicitly.  It might also make sense to have explicitly breakable
seals similar to what I do for the pNFS blocks kernel server, as
any userspace RDMA file server would also need those semantics.
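
As a rough illustration of the first step only, here is a sketch of a
MAP_SYNC | MAP_POPULATE mapping.  It assumes the MAP_SYNC and
MAP_SHARED_VALIDATE flags in the form that was later merged in Linux 4.15
(at the time of this thread MAP_SYNC was still a proposal).
map_for_registration() is a hypothetical helper name, and the fallback to a
plain MAP_SHARED mapping is an assumption about what a caller might want,
not something proposed here.  It does not pin pages or seal the extent map.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Fallbacks for older libc headers; values as on x86 / asm-generic. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Map len bytes of fd read/write, asking for synchronous faults and
 * pre-populated page tables.  Sets *synced to 1 if MAP_SYNC was granted,
 * 0 if we fell back to a plain shared mapping.  Returns MAP_FAILED on
 * any other error.
 */
void *map_for_registration(int fd, size_t len, int *synced)
{
	void *p;

	assert(len > 0);

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC | MAP_POPULATE, fd, 0);
	if (p != MAP_FAILED) {
		*synced = 1;
		return p;
	}

	/*
	 * EOPNOTSUPP: the file is not on a DAX mapping that supports
	 * MAP_SYNC.  EINVAL: the kernel predates MAP_SHARED_VALIDATE.
	 * Anything else is a real error.
	 */
	if (errno != EOPNOTSUPP && errno != EINVAL)
		return MAP_FAILED;

	*synced = 0;
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_POPULATE, fd, 0);
}
```

A caller would check *synced before relying on flush-from-userspace
semantics, and would still need the get_user_pages and extent map
guarantees discussed above before handing the range to an RDMA device.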

Last but not least we have an interesting additional case for modern
Mellanox hardware - On Demand Paging where we don't actually do a
get_user_pages but the hardware implements SVM and thus gets fed
virtual addresses directly.  My head spins when talking about the
implications for DAX mappings on that, so I'm just throwing that in
for now instead of trying to come up with a solution.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-13  9:24                 ` Christoph Hellwig
  (?)
@ 2017-08-13 20:31                   ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-13 20:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel, linux-xfs, Alexander Viro, Andy Lutomirski,
	linux-fsdevel

On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig <hch@lst.de> wrote:
> On Sat, Aug 12, 2017 at 12:19:50PM -0700, Dan Williams wrote:
>> The application does not need to know the storage address, it needs to
>> know that the storage address to file offset is fixed. With this
>> information it can make assumptions about the permanence of results it
>> gets from the kernel.
>
> Only if we clearly document that fact - and documenting the permanence
> is different from saying the block map won't change.

I can get on board with that.

>
>> For example get_user_pages() today makes no guarantees outside of
>> "page will not be freed",
>
> It also makes the extremely important guarantee that the page won't
> _move_ - e.g. that we won't do a memory migration for compaction or
> other reasons.  That's why for example RDMA can use it to register
> memory and then we can later set up memory windows that point to this
> registration from userspace and implement userspace RDMA.
>
>> but with immutable files and dax you now
>> have a mechanism for userspace to coordinate direct access to storage
>> addresses. Those raw storage addresses need not be exposed to the
>> application, as you say it doesn't need to know that detail. MAP_SYNC
>> does not fully satisfy this case because it requires agents that can
>> generate MMU faults to coordinate with the filesystem.
>
> The file system is always in the fault path, can you explain what other
> agents you are talking about?

Exactly the ones you mention below. SVM hardware can just use a
MAP_SYNC mapping and be sure that its metadata dirtying writes are
synchronized with the filesystem through the fault path. Hardware that
does not have SVM, or hypervisors like Xen that want to attach their
own static metadata about the file offset to physical block mapping,
need a mechanism to make sure the block map is sealed while they have
it mapped.
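
For reference, a rough userspace sketch of requesting the kind of
MAP_SYNC | MAP_POPULATE mapping discussed in this thread. It assumes the
MAP_SYNC and MAP_SHARED_VALIDATE flags as they were later exposed by
Linux; the fallback values are the generic-architecture ones, and
map_prefaulted() is a hypothetical helper name, not part of any patch
in this series.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers (generic-architecture uapi values). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Hypothetical helper: map the first `len` bytes of `fd` shared and
 * pre-faulted. MAP_SYNC is tried first; if the kernel or filesystem
 * refuses it (for example because the file is not on DAX), fall back
 * to an ordinary shared mapping. *got_sync reports which one we got.
 */
static void *map_prefaulted(int fd, size_t len, int *got_sync)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC | MAP_POPULATE, fd, 0);

	if (p != MAP_FAILED) {
		*got_sync = 1;
		return p;
	}

	*got_sync = 0;
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_POPULATE, fd, 0);
}
```

A caller that needs the synchronous guarantee would treat
*got_sync == 0 as an error instead of accepting the fallback.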

>> All I know is that SMB Direct for persistent memory seems like a
>> potential consumer. I know they're not going to use a userspace
>> filesystem or put an SMB server in the kernel.
>
> Last I talked to the Samba folks they didn't expect a userspace
> SMB direct implementation to work anyway due to the fact that
> libibverbs memory registrations interact badly with their fork()ing
> daemon model.  That being said during the recent submission of the
> RDMA client code some comments were made about userspace versions of
> it, so I'm not sure if that opinion has changed in one way or another.

Ok.

>
> That being said I think we absolutely should support RDMA memory
> registrations for DAX mappings.  I'm just not sure how S_IOMAP_IMMUTABLE
> helps with that.  We'll want a MAP_SYNC | MAP_POPULATE to make sure
> all the blocks are populated and all ptes are set up.  Second we need
> to make sure get_user_pages works, which for now means we'll need a
> struct page mapping for the region (which will be really annoying
> for PCIe mappings, like the upcoming NVMe persistent memory region),
> and we need to guarantee that the extent mapping won't change while
> the get_user_pages holds the pages inside it.  I think that is true
> due to side effects even with the current DAX code, but we'll need to
> make it explicit.  And maybe that's where we need to converge -
> "sealing" the extent map makes sense as such a temporary measure
> that is not persisted on disk, which automatically gets released
> when the holding process exits, because we sort of already do this
> implicitly.  It might also make sense to have explicitly breakable
> seals similar to what I do for the pNFS blocks kernel server, as
> any userspace RDMA file server would also need those semantics.

Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:

    1/ only succeed if the fault can be satisfied without page cache

    2/ only install a pte for the fault if it can do so without
triggering block map updates

So, I think it would still end up setting an inode flag to make
xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping
active. However, it would not record that state in the on-disk
metadata and it would automatically clear at munmap time. That should
be enough to support the host-persistent-memory, and
NVMe-persistent-memory use cases (provided we have struct page for
NVMe). Although, we need more safety infrastructure in the NVMe case
where we would need to software manage I/O coherence.
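
The two fault rules above can be modelled as a small decision function.
This is only an illustrative sketch, not kernel code: struct fault_req,
enum fault_result and map_direct_fault() are hypothetical names, and the
proposal does not say how a refused fault would be reported, so the
sketch just returns FAULT_FAIL.

```c
#include <stdbool.h>

/* Hypothetical summary of what a given fault would require. */
struct fault_req {
	bool needs_page_cache;   /* can only be satisfied via page cache */
	bool needs_block_update; /* would trigger a block map update */
};

enum fault_result { FAULT_INSTALL_PTE, FAULT_FAIL };

/* Apply rules 1/ and 2/ to faults on a MAP_DIRECT range. */
static enum fault_result map_direct_fault(bool map_direct,
					  struct fault_req req)
{
	/* Ordinary mappings are not restricted by the proposal. */
	if (!map_direct)
		return FAULT_INSTALL_PTE;

	/* 1/ only succeed if no page cache is needed */
	if (req.needs_page_cache)
		return FAULT_FAIL;

	/* 2/ only install a pte if the block map stays untouched */
	if (req.needs_block_update)
		return FAULT_FAIL;

	return FAULT_INSTALL_PTE;
}
```

In this model a MAP_DIRECT fault over a hole fails, because filling the
hole would need a block map update, while the same fault on an ordinary
mapping proceeds.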

> Last but not least we have an interesting additional case for modern
> Mellanox hardware - On Demand Paging where we don't actually do a
> get_user_pages but the hardware implements SVM and thus gets fed
> virtual addresses directly.  My head spins when talking about the
> implications for DAX mappings on that, so I'm just throwing that in
> for now instead of trying to come up with a solution.

Yeah, DAX + SVM needs more thought.

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-13  9:24                 ` Christoph Hellwig
@ 2017-08-13 23:46                   ` Dave Chinner
  -1 siblings, 0 replies; 108+ messages in thread
From: Dave Chinner @ 2017-08-13 23:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Linux API, Darrick J. Wong, linux-kernel,
	linux-xfs, Alexander Viro, Andy Lutomirski, linux-fsdevel

On Sun, Aug 13, 2017 at 11:24:36AM +0200, Christoph Hellwig wrote:
> And maybe that's where we need to converge - 
> "sealing" the extent map makes sense as such a temporary measure
> that is not persisted on disk, which automatically gets released
> when the holding process exits, because we sort of already do this
> implicitly.

That seems reasonable to me. Personally I don't need persistent
state, and I'd only intended persistence to be so that we didn't get
arbitrary processes whacking holes in the file when the DAX app
wasn't running that would then cause problems for userspace data sync. Seeing
as the interface is morphing away from a "fill holes and persist"
interface to just a "seal the existing map" interface, it'll be up
to the app/library to prep check file layout for sanity every time
it is sealed.


> It might also make sense to have explicitly breakable
> seals similar to what I do for the pNFS blocks kernel server, as
> any userspace RDMA file server would also need those semantics.

How would that work? IIUC, we'd need userspace to take out a file
lease so that it gets notified when the seal is going to be broken
by the filesystem via the break_layouts() interface, and the break
then blocks until the app releases the lease? So the seal lifetime
is bounded by the lease?
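
For comparison, the existing userspace file lease API looks roughly like
the sketch below. It only shows the fcntl(F_SETLEASE) lifecycle that
exists today; whether a block map seal would be tied to such a lease and
broken through break_layouts() is exactly the open question here.
lease_cycle() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Take a read lease on `path`, confirm that it is held, then release
 * it. Returns 0 on success or a negative errno value. While the lease
 * is held, the kernel notifies the holder (SIGIO by default) when a
 * conflicting open or truncate wants to break it.
 */
static int lease_cycle(const char *path)
{
	int fd = open(path, O_RDONLY);
	int ret = 0;

	if (fd < 0)
		return -errno;

	if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0) {
		ret = -errno;
		goto out_close;
	}

	if (fcntl(fd, F_GETLEASE) != F_RDLCK)
		ret = -EIO;

	/* ... an application would do its leased work here ... */

	if (fcntl(fd, F_SETLEASE, F_UNLCK) < 0 && ret == 0)
		ret = -errno;
out_close:
	close(fd);
	return ret;
}
```

Leases are not supported on every filesystem and need the caller to own
the file or hold CAP_LEASE, so a non-zero return is expected in some
environments.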

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-13 20:31                   ` Dan Williams
  (?)
@ 2017-08-14 12:40                     ` Jan Kara
  -1 siblings, 0 replies; 108+ messages in thread
From: Jan Kara @ 2017-08-14 12:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Peter Zijlstra, Linux API,
	Darrick J. Wong, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, Andy Lutomirski, linux-fsdevel,
	Christoph Hellwig

On Sun 13-08-17 13:31:45, Dan Williams wrote:
> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig <hch@lst.de> wrote:
> > That being said I think we absolutely should support RDMA memory
> > registrations for DAX mappings.  I'm just not sure how S_IOMAP_IMMUTABLE
> > helps with that.  We'll want a MAP_SYNC | MAP_POPULATE to make sure
> > all the blocks are populated and all ptes are set up.  Second we need
> > to make sure get_user_pages works, which for now means we'll need a
> > struct page mapping for the region (which will be really annoying
> > for PCIe mappings, like the upcoming NVMe persistent memory region),
> > and we need to guarantee that the extent mapping won't change while
> > the get_user_pages holds the pages inside it.  I think that is true
> > due to side effects even with the current DAX code, but we'll need to
> > make it explicit.  And maybe that's where we need to converge -
> > "sealing" the extent map makes sense as such a temporary measure
> > that is not persisted on disk, which automatically gets released
> > when the holding process exits, because we sort of already do this
> > implicitly.  It might also make sense to have explicitly breakable
> > seals similar to what I do for the pNFS blocks kernel server, as
> > any userspace RDMA file server would also need those semantics.
> 
> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
> 
>     1/ only succeed if the fault can be satisfied without page cache
> 
>     2/ only install a pte for the fault if it can do so without
> triggering block map updates
> 
> So, I think it would still end up setting an inode flag to make
> xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping
> active. However, it would not record that state in the on-disk
> metadata and it would automatically clear at munmap time. That should
> be enough to support the host-persistent-memory, and
> NVMe-persistent-memory use cases (provided we have struct page for
> NVMe). Although, we need more safety infrastructure in the NVMe case
> where we would need to software manage I/O coherence.

Hum, this proposal (and the problems you are trying to deal with) seems very
similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to
the DAX area (and so additionally complicated by the fact that filesystems
now have to care). The patch set was not merged, due to lack of interest I
think, but it looked sensible and the proposed API would make sense for more
stuff than just DAX, so maybe it would be better than a MAP_DIRECT flag?

[1] https://lwn.net/Articles/600502/

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-14 12:40                     ` Jan Kara
@ 2017-08-14 16:14                       ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-14 16:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Peter Zijlstra, Linux API, Darrick J. Wong,
	Dave Chinner, linux-kernel, linux-xfs, Alexander Viro,
	Andy Lutomirski, linux-fsdevel, Christoph Hellwig

On Mon, Aug 14, 2017 at 5:40 AM, Jan Kara <jack@suse.cz> wrote:
> On Sun 13-08-17 13:31:45, Dan Williams wrote:
>> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig <hch@lst.de> wrote:
>> > That being said I think we absolutely should support RDMA memory
>> > registrations for DAX mappings.  I'm just not sure how S_IOMAP_IMMUTABLE
>> > helps with that.  We'll want a MAP_SYNC | MAP_POPULATE to make sure
>> > all the blocks are populated and all ptes are set up.  Second we need
>> > to make sure get_user_pages works, which for now means we'll need a
>> > struct page mapping for the region (which will be really annoying
>> > for PCIe mappings, like the upcoming NVMe persistent memory region),
>> > and we need to guarantee that the extent mapping won't change while
>> > the get_user_pages holds the pages inside it.  I think that is true
>> > due to side effects even with the current DAX code, but we'll need to
>> > make it explicit.  And maybe that's where we need to converge -
>> > "sealing" the extent map makes sense as such a temporary measure
>> > that is not persisted on disk, which automatically gets released
>> > when the holding process exits, because we sort of already do this
>> > implicitly.  It might also make sense to have explicitly breakable
>> > seals similar to what I do for the pNFS blocks kernel server, as
>> > any userspace RDMA file server would also need those semantics.
>>
>> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
>>
>>     1/ only succeed if the fault can be satisfied without page cache
>>
>>     2/ only install a pte for the fault if it can do so without
>> triggering block map updates
>>
>> So, I think it would still end up setting an inode flag to make
>> xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping
>> active. However, it would not record that state in the on-disk
>> metadata and it would automatically clear at munmap time. That should
>> be enough to support the host-persistent-memory, and
>> NVMe-persistent-memory use cases (provided we have struct page for
>> NVMe). Although, we need more safety infrastructure in the NVMe case
>> where we would need to software manage I/O coherence.
>
> Hum, this proposal (and the problems you are trying to deal with) seems very
> similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to
> the DAX area (and so additionally complicated by the fact that filesystems
> now have to care). The patch set was not merged, due to lack of interest I
> think, but it looked sensible and the proposed API would make sense for more
> stuff than just DAX, so maybe it would be better than a MAP_DIRECT flag?

Interesting, but I'm not sure I see the correlation. mm_mpin() makes a
"no-fault" guarantee and fixes the accounting of locked System RAM.
MAP_DIRECT still allows faults, and DAX mappings don't consume System
RAM so the accounting problem is not there for DAX. mm_mpin() also does
not appear to have a relationship to file-backed memory like mmap
allows.
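
The two MAP_DIRECT fault rules and the temporary inode-level block on
xfs_bmapi_write() described above can be modelled in a few lines of
userspace C. This is only a sketch under assumptions: MAP_DIRECT is a
proposal in this thread, not an existing mmap flag, and every name below
(struct fault_req, struct inode_model, map_direct_fault(),
bmapi_write_allowed(), ...) is hypothetical and not taken from the patch
set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the MAP_DIRECT proposal; not kernel code. */

enum fault_result { FAULT_INSTALL_PTE, FAULT_FAIL };

/* What the filesystem would have to do to satisfy one fault. */
struct fault_req {
	bool needs_page_cache;   /* cannot be mapped straight from the device */
	bool needs_block_update; /* would allocate, unshare or convert extents */
};

/* In-memory state only: nothing here is written to on-disk metadata. */
struct inode_model {
	unsigned int map_direct_count; /* active MAP_DIRECT mappings */
};

/* Rules 1/ and 2/: refuse a fault that needs page cache or a block map change. */
static enum fault_result map_direct_fault(const struct fault_req *req)
{
	if (req->needs_page_cache || req->needs_block_update)
		return FAULT_FAIL;
	return FAULT_INSTALL_PTE;
}

/* mmap(MAP_DIRECT) and munmap() adjust the per-inode count. */
static void map_direct_mmap(struct inode_model *ino)
{
	ino->map_direct_count++;
}

static void map_direct_munmap(struct inode_model *ino)
{
	assert(ino->map_direct_count > 0);
	ino->map_direct_count--;
}

/* Stand-in for the check that would make xfs_bmapi_write() fail. */
static bool bmapi_write_allowed(const struct inode_model *ino)
{
	return ino->map_direct_count == 0;
}
```

The count (rather than a single bit) is a design choice of this sketch:
it lets the block clear automatically when the last mapping goes away,
which matches "while any process has a MAP_DIRECT mapping active".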

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-13 20:31                   ` Dan Williams
  (?)
@ 2017-08-14 21:46                     ` Darrick J. Wong
  -1 siblings, 0 replies; 108+ messages in thread
From: Darrick J. Wong @ 2017-08-14 21:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Linux API, Dave Chinner, linux-kernel,
	linux-xfs, Alexander Viro, Andy Lutomirski, linux-fsdevel,
	Christoph Hellwig

On Sun, Aug 13, 2017 at 01:31:45PM -0700, Dan Williams wrote:
> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig <hch@lst.de> wrote:
> > On Sat, Aug 12, 2017 at 12:19:50PM -0700, Dan Williams wrote:
> >> The application does not need to know the storage address, it needs to
> >> know that the storage address to file offset is fixed. With this
> >> information it can make assumptions about the permanence of results it
> >> gets from the kernel.
> >
> > Only if we clearly document that fact - and documenting the permanence
> > is different from saying the block map won't change.
> 
> I can get on board with that.
> 
> >
> >> For example get_user_pages() today makes no guarantees outside of
> >> "page will not be freed",
> >
> > It also makes the extremely important guarantee that the page won't
> > _move_ - e.g. that we won't do a memory migration for compaction or
> > other reasons.  That's why for example RDMA can use it to register
> > memory and then we can later set up memory windows that point to this
> > registration from userspace and implement userspace RDMA.
> >
> >> but with immutable files and dax you now
> >> have a mechanism for userspace to coordinate direct access to storage
> >> addresses. Those raw storage addresses need not be exposed to the
> >> application, as you say it doesn't need to know that detail. MAP_SYNC
> >> does not fully satisfy this case because it requires agents that can
> >> generate MMU faults to coordinate with the filesystem.
> >
> > The file system is always in the fault path, can you explain what other
> > agents you are talking about?
> 
> Exactly the ones you mention below. SVM hardware can just use a
> MAP_SYNC mapping and be sure that its metadata dirtying writes are
> synchronized with the filesystem through the fault path. Hardware that
> does not have SVM, or hypervisors like Xen that want to attach their
> own static metadata about the file offset to physical block mapping,
> need a mechanism to make sure the block map is sealed while they have
> it mapped.
> 
> >> All I know is that SMB Direct for persistent memory seems like a
> >> potential consumer. I know they're not going to use a userspace
> >> filesystem or put an SMB server in the kernel.
> >
> > Last I talked to the Samba folks they didn't expect a userspace
> > SMB direct implementation to work anyway due to the fact that
> > libibverbs memory registrations interact badly with their fork()ing
> > daemon model.  That being said during the recent submission of the
> > RDMA client code some comments were made about userspace versions of
> > it, so I'm not sure if that opinion has changed in one way or another.
> 
> Ok.
> 
> >
> > That being said I think we absolutely should support RDMA memory
> > registrations for DAX mappings.  I'm just not sure how S_IOMAP_IMMUTABLE
> > helps with that.  We'll want a MAP_SYNC | MAP_POPULATE to make sure
> > all the blocks are populated and all ptes are set up.  Second we need
> > to make sure get_user_pages works, which for now means we'll need a
> > struct page mapping for the region (which will be really annoying
> > for PCIe mappings, like the upcoming NVMe persistent memory region),
> > and we need to guarantee that the extent mapping won't change while
> > the get_user_pages holds the pages inside it.  I think that is true
> > due to side effects even with the current DAX code, but we'll need to
> > make it explicit.  And maybe that's where we need to converge -
> > "sealing" the extent map makes sense as such a temporary measure
> > that is not persisted on disk, which automatically gets released
> > when the holding process exits, because we sort of already do this
> > implicitly.  It might also make sense to have explicitly breakable
> > seals similar to what I do for the pNFS blocks kernel server, as
> > any userspace RDMA file server would also need those semantics.
> 
> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
> 
>     1/ only succeed if the fault can be satisfied without page cache
> 
>     2/ only install a pte for the fault if it can do so without
> triggering block map updates
> 
> So, I think it would still end up setting an inode flag to make
> xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping
> active. However, it would not record that state in the on-disk
> metadata and it would automatically clear at munmap time. That should

TBH even after the last round of 'do we need this on-disk flag?' I still
wasn't 100% convinced that we really needed a permanent flag vs.
requiring apps to ask for a sealed iomap mmap like what you just
described, so I'm glad this conversation has continued. :)

--D

> be enough to support the host-persistent-memory, and
> NVMe-persistent-memory use cases (provided we have struct page for
> NVMe). Although, we need more safety infrastructure in the NVMe case
> where we would need to software manage I/O coherence.
> 
> > Last but not least we have an interesting additional case for modern
> > Mellanox hardware - On Demand Paging where we don't actually do a
> > get_user_pages but the hardware implements SVM and thus gets fed
> > virtual addresses directly.  My head spins when talking about the
> > implications for DAX mappings on that, so I'm just throwing that in
> > for now instead of trying to come up with a solution.
> 
> Yeah, DAX + SVM needs more thought.

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-14 16:14                       ` Dan Williams
  (?)
@ 2017-08-15  8:37                         ` Jan Kara
  -1 siblings, 0 replies; 108+ messages in thread
From: Jan Kara @ 2017-08-15  8:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Peter Zijlstra, Linux API,
	Darrick J. Wong, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, Andy Lutomirski, linux-fsdevel,
	Christoph Hellwig

On Mon 14-08-17 09:14:42, Dan Williams wrote:
> On Mon, Aug 14, 2017 at 5:40 AM, Jan Kara <jack@suse.cz> wrote:
> > On Sun 13-08-17 13:31:45, Dan Williams wrote:
> >> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> > That being said I think we absolutely should support RDMA memory
> >> > registrations for DAX mappings.  I'm just not sure how S_IOMAP_IMMUTABLE
> >> > helps with that.  We'll want a MAP_SYNC | MAP_POPULATE to make sure
> >> > all the blocks are populated and all ptes are set up.  Second we need
> >> > to make sure get_user_pages works, which for now means we'll need a
> >> > struct page mapping for the region (which will be really annoying
> >> > for PCIe mappings, like the upcoming NVMe persistent memory region),
> >> > and we need to guarantee that the extent mapping won't change while
> >> > the get_user_pages holds the pages inside it.  I think that is true
> >> > due to side effects even with the current DAX code, but we'll need to
> >> > make it explicit.  And maybe that's where we need to converge -
> >> > "sealing" the extent map makes sense as such a temporary measure
> >> > that is not persisted on disk, which automatically gets released
> >> > when the holding process exits, because we sort of already do this
> >> > implicitly.  It might also make sense to have explicitly breakable
> >> > seals similar to what I do for the pNFS blocks kernel server, as
> >> > any userspace RDMA file server would also need those semantics.
> >>
> >> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
> >>
> >>     1/ only succeed if the fault can be satisfied without page cache
> >>
> >>     2/ only install a pte for the fault if it can do so without
> >> triggering block map updates
> >>
> >> So, I think it would still end up setting an inode flag to make
> >> xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping
> >> active. However, it would not record that state in the on-disk
> >> metadata and it would automatically clear at munmap time. That should
> >> be enough to support the host-persistent-memory, and
> >> NVMe-persistent-memory use cases (provided we have struct page for
> >> NVMe). Although, we need more safety infrastructure in the NVMe case
> >> where we would need to software manage I/O coherence.
> >
> > Hum, this proposal (and the problems you are trying to deal with) seem very
> > similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to
> > the DAX area (and so additionally complicated by the fact that filesystems
> > now have to care). The patch set was not merged due to lack of interest I
> > think but it looked sensible and the proposed API would make sense for more
> > stuff than just DAX so maybe it would be better than MAP_DIRECT flag?
> 
> Interesting, but I'm not sure I see the correlation. mm_mpin() makes a
> "no-fault" guarantee and fixes the accounting of locked System RAM.
> MAP_DIRECT still allows faults, and DAX mappings don't consume System
> RAM so the accounting problem is not there for DAX. mm_mpin() also does
> not appear to have a relationship to file-backed memory like mmap
> allows.

So the accounting part is probably non-interesting for DAX purposes and I
agree there are other differences as well. But mm_mpin() prevented page
migrations, which is parallel to your requirement of "offset->block mapping
is permanent".  Furthermore the mm_mpin() work was there for RDMA so that it
has a saner interface to pin pages than get_user_pages(), and you mention
RDMA and similar technologies as a use case for your work for similar reasons.
So my thought was that possibly we should have the same API for pinning
"storage" for RDMA transfers regardless of whether the backing is page
cache or pmem, and the API should be usable for in-kernel users as well?
An mmap flag seems a bit clumsy in this regard so maybe a form of a separate
syscall - be it mpin(start, len) or some other name - might be more
suitable?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
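
The idea of one pinning entry point for both page cache and pmem backing
could look roughly like the userspace model below. It is a sketch under
assumptions, not the interface from the 2014 mm_mpin() patches and not
anything proposed as code in this thread: the names mpin_model(),
munpin_model(), struct pinned_range and enum backing are all
hypothetical, and the two flags only record which guarantee the pin
would have to provide for each kind of backing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: what a pin must guarantee depends on the backing. */
enum backing { BACKING_PAGE_CACHE, BACKING_PMEM };

struct pinned_range {
	enum backing backing;
	size_t start;
	size_t len;
	int no_migrate; /* page cache: pages must not be migrated */
	int map_sealed; /* pmem: offset->block mapping must not change */
};

/* One entry point for both cases, in the spirit of mpin(start, len). */
static int mpin_model(struct pinned_range *r, enum backing b,
		      size_t start, size_t len)
{
	assert(r != NULL);
	if (len == 0)
		return -1; /* nothing to pin */
	r->backing = b;
	r->start = start;
	r->len = len;
	r->no_migrate = (b == BACKING_PAGE_CACHE);
	r->map_sealed = (b == BACKING_PMEM);
	return 0;
}

/* Dropping the pin releases whichever guarantee was taken. */
static void munpin_model(struct pinned_range *r)
{
	assert(r != NULL);
	r->no_migrate = 0;
	r->map_sealed = 0;
	r->len = 0;
}
```

A caller such as an RDMA registration path would use the same call for
either backing and would not need to know which guarantee was taken.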

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-15  8:37                         ` Jan Kara
@ 2017-08-15 23:50                           ` Dan Williams
  -1 siblings, 0 replies; 108+ messages in thread
From: Dan Williams @ 2017-08-15 23:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Peter Zijlstra, Linux API, Darrick J. Wong,
	Dave Chinner, linux-kernel, linux-xfs, Alexander Viro,
	Andy Lutomirski, linux-fsdevel, Christoph Hellwig

On Tue, Aug 15, 2017 at 1:37 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 14-08-17 09:14:42, Dan Williams wrote:
>> On Mon, Aug 14, 2017 at 5:40 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Sun 13-08-17 13:31:45, Dan Williams wrote:
>> >> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig <hch@lst.de> wrote:
>> >> > Thay being said I think we absolutely should support RDMA memory
>> >> > registrations for DAX mappings.  I'm just not sure how S_IOMAP_IMMUTABLE
>> >> > helps with that.  We'll want a MAP_SYNC | MAP_POPULATE to make sure
>> >> > all the blocks are polulated and all ptes are set up.  Second we need
>> >> > to make sure get_user_page works, which for now means we'll need a
>> >> > struct page mapping for the region (which will be really annoying
>> >> > for PCIe mappings, like the upcoming NVMe persistent memory region),
>> >> > and we need to gurantee that the extent mapping won't change while
>> >> > the get_user_pages holds the pages inside it.  I think that is true
>> >> > due to side effects even with the current DAX code, but we'll need to
>> >> > make it explicit.  And maybe that's where we need to converge -
>> >> > "sealing" the extent map makes sense as such a temporary measure
>> >> > that is not persisted on disk, which automatically gets released
>> >> > when the holding process exits, because we sort of already do this
>> >> > implicitly.  It might also make sense to have explicitl breakable
>> >> > seals similar to what I do for the pNFS blocks kernel server, as
>> >> > any userspace RDMA file server would also need those semantics.
>> >>
>> >> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
>> >>
>> >>     1/ only succeed if the fault can be satisfied without page cache
>> >>
>> >>     2/ only install a pte for the fault if it can do so without
>> >> triggering block map updates
>> >>
>> >> So, I think it would still end up setting an inode flag to make
>> >> xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping
>> >> active. However, it would not record that state in the on-disk
>> >> metadata and it would automatically clear at munmap time. That should
>> >> be enough to support the host-persistent-memory, and
>> >> NVMe-persistent-memory use cases (provided we have struct page for
>> >> NVMe). Although, we need more safety infrastructure in the NVMe case
>> >> where we would need to software manage I/O coherence.
>> >
>> > Hum, this proposal (and the problems you are trying to deal with) seem very
>> > similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to
>> > the DAX area (and so additionally complicated by the fact that filesystems
>> > now have to care). The patch set was not merged due to lack of interest I
>> > think but it looked sensible and the proposed API would make sense for more
>> > stuff than just DAX so maybe it would be better than MAP_DIRECT flag?
>>
>> Interesting, but I'm not sure I see the correlation. mm_mpin() makes a
>> "no-fault" guarantee and fixes the accounting of locked System RAM.
>> MAP_DIRECT still allows faults, and DAX mappings don't consume System
>> RAM so the accounting problem is not there for DAX. mm_pin() also does
>> not appear to have a relationship to a file backed memory like mmap
>> allows.
>
> So the accounting part is probably non-interesting for DAX purposes and I
> agree there are other differences as well. But mm_mpin() prevented page
> migrations which is parallel to your requirement of "offset->block mapping
> is permanent".  Furthermore mm_mpin() work was there for RDMA so that it
> has saner interface to pin pages than get_user_pages() and you mention RDMA
> and similar technologies as a usecase for your work for similar reasons.
> So my thought was that possibly we should have the same API for pinning
> "storage" for RDMA transfers regardless of whether the backing is page
> cache or pmem and the API should be usable for in-kernel users as well?
> mmap flag seems a bit clumsy in this regard so maybe a form of a separate
> syscall - be it mpin(start, len) or some other name - might be more
> suitable?

Can you say more about why an mmap flag for this feels awkward
to you? I think there's symmetry between O_SYNC / O_DIRECT setting up
synchronous / page-cache-bypass file descriptors and MAP_SYNC /
MAP_DIRECT setting up synchronous / page-cache-bypass mappings.
"Pinning" also feels like the wrong mechanism when you consider that
hardware is moving toward eliminating the pinning requirement over
time. SVM ("Shared Virtual Memory") hardware will just operate on CPU
virtual addresses directly and generate typical faults. On such
hardware MAP_DIRECT would be a no-op relative to MAP_SYNC, so you
wouldn't want your application to be stuck with the legacy concept
that pages need to be explicitly "pinned".
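
To make the symmetry concrete, here is a rough userspace sketch in C. It
assumes the MAP_SYNC / MAP_SHARED_VALIDATE interface as it exists in current
kernels, and falls back to a plain MAP_SHARED mapping when the file is not
DAX-capable. MAP_DIRECT is only a proposal in this thread, so it is
deliberately left out. The helper names map_sync_or_shared() and
store_through_mapping() are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values are the asm-generic UAPI ones. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Mapping-side analogue of O_SYNC: ask for a synchronous mapping and fall
 * back to MAP_SHARED if the kernel or filesystem refuses it.
 * *got_sync reports whether MAP_SYNC was actually granted.
 */
static void *map_sync_or_shared(int fd, size_t len, bool *got_sync)
{
	void *p;

	assert(got_sync != NULL);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	*got_sync = (p != MAP_FAILED);
	if (p != MAP_FAILED)
		return p;
	/* EOPNOTSUPP: file is not DAX; EINVAL: flags unknown to this kernel */
	if (errno != EOPNOTSUPP && errno != EINVAL)
		return MAP_FAILED;
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

/*
 * Descriptor side uses O_SYNC, mapping side uses MAP_SYNC where available.
 * Stores one byte through the mapping of a scratch file and reads it back.
 * Returns 0 on success, -1 on failure; the scratch file is removed.
 */
static int store_through_mapping(const char *path, bool *got_sync)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC | O_SYNC, 0600);
	unsigned char *p;
	int ret = -1;

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) == 0) {
		p = map_sync_or_shared(fd, len, got_sync);
		if (p != MAP_FAILED) {
			p[0] = 0x5a;
			ret = (p[0] == 0x5a) ? 0 : -1;
			munmap(p, len);
		}
	}
	close(fd);
	unlink(path);
	return ret;
}
```

On a non-DAX filesystem the first mmap() is expected to fail and the helper
takes the MAP_SHARED path, so the caller needs got_sync to know whether it
still has to msync().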
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-15 23:50                           ` Dan Williams
  (?)
@ 2017-08-16 13:57                             ` Jan Kara
  -1 siblings, 0 replies; 108+ messages in thread
From: Jan Kara @ 2017-08-16 13:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Peter Zijlstra, Linux API,
	Darrick J. Wong, Dave Chinner, linux-kernel, linux-xfs,
	Alexander Viro, Andy Lutomirski, linux-fsdevel,
	Christoph Hellwig

On Tue 15-08-17 16:50:55, Dan Williams wrote:
> On Tue, Aug 15, 2017 at 1:37 AM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 14-08-17 09:14:42, Dan Williams wrote:
> >> On Mon, Aug 14, 2017 at 5:40 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Sun 13-08-17 13:31:45, Dan Williams wrote:
> >> >> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig <hch@lst.de> wrote:
> >> >> > Thay being said I think we absolutely should support RDMA memory
> >> >> > registrations for DAX mappings.  I'm just not sure how S_IOMAP_IMMUTABLE
> >> >> > helps with that.  We'll want a MAP_SYNC | MAP_POPULATE to make sure
> >> >> > all the blocks are polulated and all ptes are set up.  Second we need
> >> >> > to make sure get_user_page works, which for now means we'll need a
> >> >> > struct page mapping for the region (which will be really annoying
> >> >> > for PCIe mappings, like the upcoming NVMe persistent memory region),
> >> >> > and we need to gurantee that the extent mapping won't change while
> >> >> > the get_user_pages holds the pages inside it.  I think that is true
> >> >> > due to side effects even with the current DAX code, but we'll need to
> >> >> > make it explicit.  And maybe that's where we need to converge -
> >> >> > "sealing" the extent map makes sense as such a temporary measure
> >> >> > that is not persisted on disk, which automatically gets released
> >> >> > when the holding process exits, because we sort of already do this
> >> >> > implicitly.  It might also make sense to have explicitl breakable
> >> >> > seals similar to what I do for the pNFS blocks kernel server, as
> >> >> > any userspace RDMA file server would also need those semantics.
> >> >>
> >> >> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
> >> >>
> >> >>     1/ only succeed if the fault can be satisfied without page cache
> >> >>
> >> >>     2/ only install a pte for the fault if it can do so without
> >> >> triggering block map updates
> >> >>
> >> >> So, I think it would still end up setting an inode flag to make
> >> >> xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping
> >> >> active. However, it would not record that state in the on-disk
> >> >> metadata and it would automatically clear at munmap time. That should
> >> >> be enough to support the host-persistent-memory, and
> >> >> NVMe-persistent-memory use cases (provided we have struct page for
> >> >> NVMe). Although, we need more safety infrastructure in the NVMe case
> >> >> where we would need to software manage I/O coherence.
> >> >
> >> > Hum, this proposal (and the problems you are trying to deal with) seem very
> >> > similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to
> >> > the DAX area (and so additionally complicated by the fact that filesystems
> >> > now have to care). The patch set was not merged due to lack of interest I
> >> > think but it looked sensible and the proposed API would make sense for more
> >> > stuff than just DAX so maybe it would be better than MAP_DIRECT flag?
> >>
> >> Interesting, but I'm not sure I see the correlation. mm_mpin() makes a
> >> "no-fault" guarantee and fixes the accounting of locked System RAM.
> >> MAP_DIRECT still allows faults, and DAX mappings don't consume System
> >> RAM so the accounting problem is not there for DAX. mm_pin() also does
> >> not appear to have a relationship to a file backed memory like mmap
> >> allows.
> >
> > So the accounting part is probably non-interesting for DAX purposes and I
> > agree there are other differences as well. But mm_mpin() prevented page
> > migrations which is parallel to your requirement of "offset->block mapping
> > is permanent".  Furthermore mm_mpin() work was there for RDMA so that it
> > has saner interface to pin pages than get_user_pages() and you mention RDMA
> > and similar technologies as a usecase for your work for similar reasons.
> > So my thought was that possibly we should have the same API for pinning
> > "storage" for RDMA transfers regardless of whether the backing is page
> > cache or pmem and the API should be usable for in-kernel users as well?
> > mmap flag seems a bit clumsy in this regard so maybe a form of a separate
> > syscall - be it mpin(start, len) or some other name - might be more
> > suitable?
> 
> Can you say about more about why an mmap flag for this feels awkward
> to you? I think there's symmetry between O_SYNC / O_DIRECT setting up
> synchronous / page-cache-bypass file descriptors and MAP_SYNC /
> MAP_DIRECT setting up synchronous and page-cache bypass mappings.

So my thinking was that for in-kernel users it might be a bit more
difficult to use an mmap flag directly, as they generally won't need to set
up the mapping. But that can certainly be dealt with by proper helpers for
in-kernel users.

> "Pinning" also feels like the wrong mechanism when you consider
> hardware is moving toward eliminating the pinning requirement over
> time. SVM "Shared Virtual Memory" hardware will just operate on cpu
> virtual addresses directly and generate typical faults. On such
> hardware MAP_DIRECT would be a nop relative to MAP_SYNC, so you
> wouldn't want your application to be stuck with the legacy concept
> that pages need to be explicitly "pinned".

OK, makes sense.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
@ 2017-08-16 13:57                             ` Jan Kara
  0 siblings, 0 replies; 108+ messages in thread
From: Jan Kara @ 2017-08-16 13:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Peter Zijlstra,
	Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA, Alexander Viro,
	Andy Lutomirski, linux-fsdevel, Christoph Hellwig

On Tue 15-08-17 16:50:55, Dan Williams wrote:
> On Tue, Aug 15, 2017 at 1:37 AM, Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> wrote:
> > On Mon 14-08-17 09:14:42, Dan Williams wrote:
> >> On Mon, Aug 14, 2017 at 5:40 AM, Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> wrote:
> >> > On Sun 13-08-17 13:31:45, Dan Williams wrote:
> >> >> On Sun, Aug 13, 2017 at 2:24 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
> >> >> > Thay being said I think we absolutely should support RDMA memory
> >> >> > registrations for DAX mappings.  I'm just not sure how S_IOMAP_IMMUTABLE
> >> >> > helps with that.  We'll want a MAP_SYNC | MAP_POPULATE to make sure
> >> >> > all the blocks are polulated and all ptes are set up.  Second we need
> >> >> > to make sure get_user_page works, which for now means we'll need a
> >> >> > struct page mapping for the region (which will be really annoying
> >> >> > for PCIe mappings, like the upcoming NVMe persistent memory region),
> >> >> > and we need to gurantee that the extent mapping won't change while
> >> >> > the get_user_pages holds the pages inside it.  I think that is true
> >> >> > due to side effects even with the current DAX code, but we'll need to
> >> >> > make it explicit.  And maybe that's where we need to converge -
> >> >> > "sealing" the extent map makes sense as such a temporary measure
> >> >> > that is not persisted on disk, which automatically gets released
> >> >> > when the holding process exits, because we sort of already do this
> >> >> > implicitly.  It might also make sense to have explicitl breakable
> >> >> > seals similar to what I do for the pNFS blocks kernel server, as
> >> >> > any userspace RDMA file server would also need those semantics.
> >> >>
> >> >> Ok, how about a MAP_DIRECT flag that arranges for faults to that range to:
> >> >>
> >> >>     1/ only succeed if the fault can be satisfied without page cache
> >> >>
> >> >>     2/ only install a pte for the fault if it can do so without
> >> >> triggering block map updates
> >> >>
> >> >> So, I think it would still end up setting an inode flag to make
> >> >> xfs_bmapi_write() fail while any process has a MAP_DIRECT mapping
> >> >> active. However, it would not record that state in the on-disk
> >> >> metadata and it would automatically clear at munmap time. That should
> >> >> be enough to support the host-persistent-memory, and
> >> >> NVMe-persistent-memory use cases (provided we have struct page for
> >> >> NVMe). Although, we need more safety infrastructure in the NVMe case
> >> >> where we would need to software manage I/O coherence.
> >> >
> >> > Hum, this proposal (and the problems you are trying to deal with) seem very
> >> > similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to
> >> > the DAX area (and so additionally complicated by the fact that filesystems
> >> > now have to care). The patch set was not merged due to lack of interest I
> >> > think but it looked sensible and the proposed API would make sense for more
> >> > stuff than just DAX so maybe it would be better than MAP_DIRECT flag?
> >>
> >> Interesting, but I'm not sure I see the correlation. mm_mpin() makes a
> >> "no-fault" guarantee and fixes the accounting of locked System RAM.
> >> MAP_DIRECT still allows faults, and DAX mappings don't consume System
> >> RAM so the accounting problem is not there for DAX. mm_pin() also does
> >> not appear to have a relationship to a file backed memory like mmap
> >> allows.
> >
> > So the accounting part is probably non-interesting for DAX purposes and I
> > agree there are other differences as well. But mm_mpin() prevented page
> > migrations which is parallel to your requirement of "offset->block mapping
> > is permanent".  Furthermore mm_mpin() work was there for RDMA so that it
> > has saner interface to pin pages than get_user_pages() and you mention RDMA
> > and similar technologies as a usecase for your work for similar reasons.
> > So my thought was that possibly we should have the same API for pinning
> > "storage" for RDMA transfers regardless of whether the backing is page
> > cache or pmem and the API should be usable for in-kernel users as well?
> > mmap flag seems a bit clumsy in this regard so maybe a form of a separate
> > syscall - be it mpin(start, len) or some other name - might be more
> > suitable?
> 
> Can you say more about why an mmap flag for this feels awkward
> to you? I think there's symmetry between O_SYNC / O_DIRECT setting up
> synchronous / page-cache-bypass file descriptors and MAP_SYNC /
> MAP_DIRECT setting up synchronous and page-cache bypass mappings.

So my thinking was that for in-kernel users it might be a bit more
difficult to use an mmap flag directly, as they generally won't need to
set up the mapping. But that can certainly be dealt with by proper helpers
for in-kernel users.

> "Pinning" also feels like the wrong mechanism when you consider
> hardware is moving toward eliminating the pinning requirement over
> time. SVM "Shared Virtual Memory" hardware will just operate on cpu
> virtual addresses directly and generate typical faults. On such
> hardware MAP_DIRECT would be a nop relative to MAP_SYNC, so you
> wouldn't want your application to be stuck with the legacy concept
> that pages need to be explicitly "pinned".

OK, makes sense.

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 108+ messages in thread

* Re: [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap
  2017-08-14 12:40                     ` Jan Kara
  (?)
@ 2017-08-21  9:16                       ` Peter Zijlstra
  -1 siblings, 0 replies; 108+ messages in thread
From: Peter Zijlstra @ 2017-08-21  9:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, Linux API, Darrick J. Wong, Dave Chinner,
	linux-kernel, linux-xfs, Alexander Viro, Andy Lutomirski,
	linux-fsdevel, Christoph Hellwig

On Mon, Aug 14, 2017 at 02:40:59PM +0200, Jan Kara wrote:
> Hum, this proposal (and the problems you are trying to deal with) seem very
> similar to Peter Zijlstra's mpin() proposal from 2014 [1], just moved to
> the DAX area (and so additionally complicated by the fact that filesystems
> now have to care). The patch set was not merged due to lack of interest I
> think but it looked sensible and the proposed API would make sense for more
> stuff than just DAX so maybe it would be better than MAP_DIRECT flag?
> 
> [1] https://lwn.net/Articles/600502/

Thanks for thinking of that. The main sticking point was that I never
got it working for RDMA, I got hopelessly lost in that code.

Also I felt (and still do) that mpin() would be very useful for CMA;
mpin() would be a good moment to migrate/compact the pages and get out
of the way.

^ permalink raw reply	[flat|nested] 108+ messages in thread

end of thread, other threads:[~2017-08-21  9:17 UTC | newest]

Thread overview: 108+ messages
2017-08-04  2:28 [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap Dan Williams
2017-08-04  2:28 ` [PATCH v2 1/5] fs, xfs: introduce S_IOMAP_IMMUTABLE Dan Williams
2017-08-04 20:00   ` Darrick J. Wong
2017-08-04 20:31     ` Dan Williams
2017-08-05  9:47   ` Christoph Hellwig
2017-08-07  0:25     ` Dave Chinner
2017-08-11 10:34       ` Christoph Hellwig
2017-08-04  2:28 ` [PATCH v2 2/5] fs, xfs: introduce FALLOC_FL_SEAL_BLOCK_MAP Dan Williams
2017-08-04 19:46   ` Darrick J. Wong
2017-08-04 19:52     ` Dan Williams
2017-08-04 23:31   ` Dave Chinner
2017-08-04 23:43     ` Dan Williams
2017-08-05  0:04       ` Dave Chinner
2017-08-04  2:28 ` [PATCH v2 3/5] fs, xfs: introduce FALLOC_FL_UNSEAL_BLOCK_MAP Dan Williams
2017-08-04 20:04   ` Darrick J. Wong
2017-08-04 20:36     ` Dan Williams
2017-08-04  2:28 ` [PATCH v2 4/5] xfs: introduce XFS_DIFLAG2_IOMAP_IMMUTABLE Dan Williams
2017-08-04 20:33   ` Darrick J. Wong
2017-08-04 20:45     ` Dan Williams
2017-08-04 23:46     ` Dave Chinner
2017-08-04 23:57       ` Darrick J. Wong
2017-08-04  2:28 ` [PATCH v2 5/5] xfs: toggle XFS_DIFLAG2_IOMAP_IMMUTABLE in response to fallocate Dan Williams
2017-08-04 20:14   ` Darrick J. Wong
2017-08-04 20:47     ` Dan Williams
2017-08-04 20:53       ` Darrick J. Wong
2017-08-04 20:55         ` Dan Williams
2017-08-04  2:38 ` [PATCH v2 0/5] fs, xfs: block map immutable files for dax, dma-to-storage, and swap Dan Williams
2017-08-05  9:50   ` Christoph Hellwig
2017-08-06 18:51     ` Dan Williams
2017-08-11 10:44       ` Christoph Hellwig
2017-08-11 22:26         ` Dan Williams
2017-08-12  3:57           ` Andy Lutomirski
2017-08-12  4:44             ` Dan Williams
2017-08-12  7:34             ` Christoph Hellwig
2017-08-12  7:33           ` Christoph Hellwig
2017-08-12 19:19             ` Dan Williams
2017-08-13  9:24               ` Christoph Hellwig
2017-08-13 20:31                 ` Dan Williams
2017-08-14 12:40                   ` Jan Kara
2017-08-14 16:14                     ` Dan Williams
2017-08-15  8:37                       ` Jan Kara
2017-08-15 23:50                         ` Dan Williams
2017-08-16 13:57                           ` Jan Kara
2017-08-21  9:16                     ` Peter Zijlstra
2017-08-14 21:46                   ` Darrick J. Wong
2017-08-13 23:46                 ` Dave Chinner