linux-api.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/2] daxfile: enable byte-addressable updates to pmem
@ 2017-06-17  1:15 Dan Williams
       [not found] ` <149766212410.22552.15957843500156182524.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  2017-06-17  1:15 ` [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem Dan Williams
  0 siblings, 2 replies; 39+ messages in thread
From: Dan Williams @ 2017-06-17  1:15 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Dave Chinner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig

Quoting PATCH 2/2:

    To date, the full promise of byte-addressable access to persistent
    memory has only been half realized via the filesystem-dax interface. The
    current filesystem-dax mechanism allows an application to consume (read)
    data from persistent storage at byte-size granularity, bypassing the
    full page reads required by traditional storage devices.
    
    Now, for writes, applications still need to contend with
    page-granularity dirtying and flushing semantics as well as filesystem
    coordination for metadata updates after any mmap write. The current
    situation precludes use cases that leverage byte-granularity / in-place
    updates to persistent media.
    
    To get around this limitation there are some specialized applications
    that are using the device-dax interface to bypass the overhead and
    data-safety problems of the current filesystem-dax mmap-write path.
    QEMU-KVM is forced to use device-dax to safely pass through persistent
    memory to a guest [1]. Some specialized databases are using device-dax
    for byte-granularity writes. Outside of those cases, device-dax is
    difficult for general purpose persistent memory applications to consume.
    There is demand for access to pmem without needing to contend with
    special device configuration and other device-dax limitations.
    
    The 'daxfile' interface satisfies this demand and realizes one of Dave
    Chinner's ideas for allowing pmem applications to safely bypass
    fsync/msync requirements. The idea is to make the file immutable with
    respect to the offset-to-block mappings for every extent in the file
    [2]. It turns out that filesystems already need to make this guarantee
    today. This property is needed for files marked as swap files.
    
    The new daxctl() syscall manages setting a file into 'static-dax' mode
    whereby it arranges for the file to be treated as a swapfile as far as
    the filesystem is concerned, but not registered with the core-mm as
    swapfile space. A file in this mode is then safe to be mapped and
    written without the requirement to fsync/msync the writes.  The cpu
    cache management for flushing data to persistence can be handled
    completely in userspace.
   
As can be seen in the patches there are still some TODOs to resolve in
the code, but this otherwise appears to solve the problem of persistent
memory applications needing to coordinate any and all writes to a file
mapping with fsync/msync.

[1]: https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg01207.html
[2]: https://lkml.org/lkml/2016/9/11/159

---

Dan Williams (2):
      mm: introduce bmap_walk()
      mm, fs: daxfile, an interface for byte-addressable updates to pmem


 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 include/linux/dax.h                    |    9 ++
 include/linux/fs.h                     |    3 +
 include/linux/syscalls.h               |    1 
 include/uapi/linux/dax.h               |    8 +
 mm/Kconfig                             |    5 +
 mm/Makefile                            |    1 
 mm/daxfile.c                           |  186 ++++++++++++++++++++++++++++++++
 mm/page_io.c                           |  117 +++++++++++++++++---
 9 files changed, 312 insertions(+), 19 deletions(-)
 create mode 100644 include/uapi/linux/dax.h
 create mode 100644 mm/daxfile.c

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC PATCH 1/2] mm: introduce bmap_walk()
       [not found] ` <149766212410.22552.15957843500156182524.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2017-06-17  1:15   ` Dan Williams
  2017-06-17  5:22     ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2017-06-17  1:15 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
  Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Dave Chinner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig

Refactor the core of generic_swapfile_activate() into bmap_walk() so
that it can be used by a new daxfile_activate() helper (to be added).

There should be no functional differences as a result of this change,
although it does add the capability to perform the bmap with a given
page-size. This is in support of daxfile users that want to ensure huge
page usage.

Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Cc: Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
Cc: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
Signed-off-by: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 mm/page_io.c |   86 +++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 67 insertions(+), 19 deletions(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index 23f6d0d3470f..5cec9a3d49f2 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -135,11 +135,22 @@ static void end_swap_bio_read(struct bio *bio)
 	bio_put(bio);
 }
 
-int generic_swapfile_activate(struct swap_info_struct *sis,
-				struct file *swap_file,
-				sector_t *span)
+enum bmap_check {
+	BMAP_WALK_UNALIGNED,
+	BMAP_WALK_DISCONTIG,
+	BMAP_WALK_FULLPAGE,
+	BMAP_WALK_DONE,
+};
+
+typedef int (*bmap_check_fn)(sector_t block, unsigned long page_no,
+		enum bmap_check type, void *ctx);
+
+static int bmap_walk(struct file *file, const unsigned page_size,
+		const unsigned long page_max, sector_t *span,
+		bmap_check_fn check, void *ctx)
 {
-	struct address_space *mapping = swap_file->f_mapping;
+	struct address_space *mapping = file->f_mapping;
+	const unsigned page_shift = ilog2(page_size);
 	struct inode *inode = mapping->host;
 	unsigned blocks_per_page;
 	unsigned long page_no;
@@ -152,7 +163,7 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 	int ret;
 
 	blkbits = inode->i_blkbits;
-	blocks_per_page = PAGE_SIZE >> blkbits;
+	blocks_per_page = page_size >> blkbits;
 
 	/*
 	 * Map all the blocks into the extent list.  This code doesn't try
@@ -162,7 +173,7 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 	page_no = 0;
 	last_block = i_size_read(inode) >> blkbits;
 	while ((probe_block + blocks_per_page) <= last_block &&
-			page_no < sis->max) {
+			page_no < page_max) {
 		unsigned block_in_page;
 		sector_t first_block;
 
@@ -173,11 +184,15 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 			goto bad_bmap;
 
 		/*
-		 * It must be PAGE_SIZE aligned on-disk
+		 * It must be @page_size aligned on-disk
 		 */
 		if (first_block & (blocks_per_page - 1)) {
 			probe_block++;
-			goto reprobe;
+			ret = check(first_block, page_no,
+					BMAP_WALK_UNALIGNED, ctx);
+			if (ret == -EAGAIN)
+				goto reprobe;
+			goto bad_bmap;
 		}
 
 		for (block_in_page = 1; block_in_page < blocks_per_page;
@@ -190,11 +205,15 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 			if (block != first_block + block_in_page) {
 				/* Discontiguity */
 				probe_block++;
-				goto reprobe;
+				ret = check(first_block, page_no,
+						BMAP_WALK_DISCONTIG, ctx);
+				if (ret == -EAGAIN)
+					goto reprobe;
+				goto bad_bmap;
 			}
 		}
 
-		first_block >>= (PAGE_SHIFT - blkbits);
+		first_block >>= (page_shift - blkbits);
 		if (page_no) {	/* exclude the header page */
 			if (first_block < lowest_block)
 				lowest_block = first_block;
@@ -203,9 +222,9 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 		}
 
 		/*
-		 * We found a PAGE_SIZE-length, PAGE_SIZE-aligned run of blocks
+		 * We found a @page_size-{length,aligned} run of blocks
 		 */
-		ret = add_swap_extent(sis, page_no, 1, first_block);
+		ret = check(first_block, page_no, BMAP_WALK_FULLPAGE, ctx);
 		if (ret < 0)
 			goto out;
 		nr_extents += ret;
@@ -215,20 +234,49 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 		continue;
 	}
 	ret = nr_extents;
-	*span = 1 + highest_block - lowest_block;
-	if (page_no == 0)
-		page_no = 1;	/* force Empty message */
-	sis->max = page_no;
-	sis->pages = page_no - 1;
-	sis->highest_bit = page_no - 1;
+	if (span)
+		*span = 1 + highest_block - lowest_block;
+	check(highest_block, page_no, BMAP_WALK_DONE, ctx);
 out:
 	return ret;
 bad_bmap:
-	pr_err("swapon: swapfile has holes\n");
 	ret = -EINVAL;
 	goto out;
 }
 
+static int swapfile_check(sector_t block, unsigned long page_no,
+		enum bmap_check type, void *_sis)
+{
+	struct swap_info_struct *sis = _sis;
+
+	if (type == BMAP_WALK_DONE) {
+		if (page_no == 0)
+			page_no = 1;	/* force Empty message */
+		sis->max = page_no;
+		sis->pages = page_no - 1;
+		sis->highest_bit = page_no - 1;
+		return 0;
+	}
+
+	if (type != BMAP_WALK_FULLPAGE)
+		return -EAGAIN;
+
+	return add_swap_extent(sis, page_no, 1, block);
+}
+
+int generic_swapfile_activate(struct swap_info_struct *sis,
+				struct file *swap_file,
+				sector_t *span)
+{
+	int rc = bmap_walk(swap_file, PAGE_SIZE, sis->max, span,
+			swapfile_check, sis);
+
+	if (rc < 0)
+		pr_err("swapon: swapfile has holes\n");
+
+	return rc;
+}
+
 /*
  * We may have stale swap cache pages in memory: notice
  * them here and get rid of the unnecessary final write.

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-17  1:15 [RFC PATCH 0/2] daxfile: enable byte-addressable updates to pmem Dan Williams
       [not found] ` <149766212410.22552.15957843500156182524.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2017-06-17  1:15 ` Dan Williams
       [not found]   ` <149766213493.22552.4057048843646200083.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  2017-06-20  5:22   ` Darrick J. Wong
  1 sibling, 2 replies; 39+ messages in thread
From: Dan Williams @ 2017-06-17  1:15 UTC (permalink / raw)
  To: akpm
  Cc: Jan Kara, linux-nvdimm, linux-api, Dave Chinner, linux-kernel,
	linux-mm, Jeff Moyer, linux-fsdevel, Ross Zwisler,
	Christoph Hellwig

To date, the full promise of byte-addressable access to persistent
memory has only been half realized via the filesystem-dax interface. The
current filesystem-dax mechanism allows an application to consume (read)
data from persistent storage at byte-size granularity, bypassing the
full page reads required by traditional storage devices.

Now, for writes, applications still need to contend with
page-granularity dirtying and flushing semantics as well as filesystem
coordination for metadata updates after any mmap write. The current
situation precludes use cases that leverage byte-granularity / in-place
updates to persistent media.

To get around this limitation there are some specialized applications
that are using the device-dax interface to bypass the overhead and
data-safety problems of the current filesystem-dax mmap-write path.
QEMU-KVM is forced to use device-dax to safely pass through persistent
memory to a guest [1]. Some specialized databases are using device-dax
for byte-granularity writes. Outside of those cases, device-dax is
difficult for general purpose persistent memory applications to consume.
There is demand for access to pmem without needing to contend with
special device configuration and other device-dax limitations.

The 'daxfile' interface satisfies this demand and realizes one of Dave
Chinner's ideas for allowing pmem applications to safely bypass
fsync/msync requirements. The idea is to make the file immutable with
respect to the offset-to-block mappings for every extent in the file
[2]. It turns out that filesystems already need to make this guarantee
today. This property is needed for files marked as swap files.

The new daxctl() syscall manages setting a file into 'static-dax' mode
whereby it arranges for the file to be treated as a swapfile as far as
the filesystem is concerned, but not registered with the core-mm as
swapfile space. A file in this mode is then safe to be mapped and
written without the requirement to fsync/msync the writes.  The cpu
cache management for flushing data to persistence can be handled
completely in userspace.

[1]: https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg01207.html
[2]: https://lkml.org/lkml/2016/9/11/159

Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 arch/x86/entry/syscalls/syscall_64.tbl |    1 
 include/linux/dax.h                    |    9 ++
 include/linux/fs.h                     |    3 +
 include/linux/syscalls.h               |    1 
 include/uapi/linux/dax.h               |    8 +
 mm/Kconfig                             |    5 +
 mm/Makefile                            |    1 
 mm/daxfile.c                           |  186 ++++++++++++++++++++++++++++++++
 mm/page_io.c                           |   31 +++++
 9 files changed, 245 insertions(+)
 create mode 100644 include/uapi/linux/dax.h
 create mode 100644 mm/daxfile.c

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..795eb93d6beb 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330	common	pkey_alloc		sys_pkey_alloc
 331	common	pkey_free		sys_pkey_free
 332	common	statx			sys_statx
+333	64	daxctl			sys_daxctl
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 5ec1f6c47716..5f1d0e0ed30f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -4,8 +4,17 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/radix-tree.h>
+#include <uapi/linux/dax.h>
 #include <asm/pgtable.h>
 
+/*
+ * TODO: make sys_daxctl() be the generic interface for toggling S_DAX
+ * across filesystems. For now, mark DAXCTL_F_DAX as an invalid flag
+ */
+#define DAXCTL_VALID_FLAGS (DAXCTL_F_GET | DAXCTL_F_STATIC)
+
+int daxfile_activate(struct file *daxfile, unsigned align);
+
 struct iomap_ops;
 struct dax_device;
 struct dax_operations {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3e68cabb8457..3af649fb669f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1824,8 +1824,10 @@ struct super_operations {
 #define S_NOSEC		4096	/* no suid or xattr security attributes */
 #ifdef CONFIG_FS_DAX
 #define S_DAX		8192	/* Direct Access, avoiding the page cache */
+#define S_DAXFILE	16384	/* no truncate (swapfile) semantics + dax */
 #else
 #define S_DAX		0	/* Make all the DAX code disappear */
+#define S_DAXFILE	0
 #endif
 
 /*
@@ -1865,6 +1867,7 @@ struct super_operations {
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
 #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
+#define IS_DAXFILE(inode)	((inode)->i_flags & S_DAXFILE)
 
 #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
 				 (inode)->i_rdev == WHITEOUT_DEV)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 980c3c9b06f8..49e5cc4c192e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -701,6 +701,7 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
 			unsigned long arg4, unsigned long arg5);
 asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags);
 asmlinkage long sys_swapoff(const char __user *specialfile);
+asmlinkage long sys_daxctl(const char __user *path, int flags, int align);
 asmlinkage long sys_sysctl(struct __sysctl_args __user *args);
 asmlinkage long sys_sysinfo(struct sysinfo __user *info);
 asmlinkage long sys_sysfs(int option,
diff --git a/include/uapi/linux/dax.h b/include/uapi/linux/dax.h
new file mode 100644
index 000000000000..78a41bb392c0
--- /dev/null
+++ b/include/uapi/linux/dax.h
@@ -0,0 +1,8 @@
+#ifndef _UAPI_LINUX_DAX_H
+#define _UAPI_LINUX_DAX_H
+
+#define DAXCTL_F_GET    (1 << 0)
+#define DAXCTL_F_DAX    (1 << 1)
+#define DAXCTL_F_STATIC (1 << 2)
+
+#endif /* _UAPI_LINUX_DAX_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index beb7a455915d..b874565c34eb 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -450,6 +450,11 @@ config	TRANSPARENT_HUGE_PAGECACHE
 	def_bool y
 	depends on TRANSPARENT_HUGEPAGE
 
+config DAXFILE
+	def_bool y
+	depends on FS_DAX
+	depends on SWAP
+
 #
 # UP and nommu archs use km based percpu allocator
 #
diff --git a/mm/Makefile b/mm/Makefile
index 026f6a828a50..38d9025a3e37 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -56,6 +56,7 @@ endif
 obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_DAXFILE)	+= daxfile.o
 obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
 obj-$(CONFIG_ZSWAP)	+= zswap.o
 obj-$(CONFIG_HAS_DMA)	+= dmapool.o
diff --git a/mm/daxfile.c b/mm/daxfile.c
new file mode 100644
index 000000000000..fe230199c855
--- /dev/null
+++ b/mm/daxfile.c
@@ -0,0 +1,186 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/dax.h>
+#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+
+/*
+ * TODO: a list to lookup daxfiles assumes a low number of instances,
+ * revisit.
+ */
+static LIST_HEAD(daxfiles);
+static DEFINE_SPINLOCK(dax_lock);
+
+struct dax_info {
+	struct list_head list;
+	struct file *daxfile;
+};
+
+static int daxfile_disable(struct file *victim)
+{
+	int found = 0;
+	struct dax_info *d;
+	struct inode *inode;
+	struct file *daxfile;
+	struct address_space *mapping;
+
+	mapping = victim->f_mapping;
+	spin_lock(&dax_lock);
+	list_for_each_entry(d, &daxfiles, list)
+		if (d->daxfile->f_mapping == mapping) {
+			list_del(&d->list);
+			found = 1;
+			break;
+		}
+	spin_unlock(&dax_lock);
+
+	if (!found)
+		return -EINVAL;
+
+	daxfile = d->daxfile;
+
+	inode = mapping->host;
+	inode->i_flags &= ~(S_SWAPFILE | S_DAXFILE);
+	filp_close(daxfile, NULL);
+
+	return 0;
+}
+
+static int claim_daxfile_checks(struct inode *inode)
+{
+	if (!S_ISREG(inode->i_mode))
+		return -EINVAL;
+
+	if (!IS_DAX(inode))
+		return -EINVAL;
+
+	if (IS_SWAPFILE(inode) || IS_DAXFILE(inode))
+		return -EBUSY;
+
+	return 0;
+}
+
+int daxfile_enable(struct file *daxfile, int align)
+{
+	struct address_space *mapping;
+	struct inode *inode;
+	struct dax_info *d;
+	int rc;
+
+	if (align < 0)
+		return -EINVAL;
+
+	mapping = daxfile->f_mapping;
+	inode = mapping->host;
+
+	rc = claim_daxfile_checks(inode);
+	if (rc)
+		return rc;
+
+	rc = daxfile_activate(daxfile, align);
+	if (rc)
+		return rc;
+
+	d = kzalloc(sizeof(*d), GFP_KERNEL);
+	if (!d)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&d->list);
+	d->daxfile = daxfile;
+
+	spin_lock(&dax_lock);
+	list_add(&d->list, &daxfiles);
+	spin_unlock(&dax_lock);
+
+	/*
+	 * We set S_SWAPFILE to gain "no truncate" / static block
+	 * allocation semantics, and S_DAXFILE so we can differentiate
+	 * traditional swapfiles and assume static block mappings in the
+	 * dax mmap path.
+	 */
+	inode->i_flags |= S_SWAPFILE | S_DAXFILE;
+	return 0;
+}
+
+SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align)
+{
+	int rc;
+	struct filename *name;
+	struct inode *inode = NULL;
+	struct file *daxfile = NULL;
+	struct address_space *mapping;
+
+	if (flags & ~DAXCTL_VALID_FLAGS)
+		return -EINVAL;
+
+	name = getname(path);
+	if (IS_ERR(name))
+		return PTR_ERR(name);
+
+	daxfile = file_open_name(name, O_RDWR|O_LARGEFILE, 0);
+	if (IS_ERR(daxfile)) {
+		rc = PTR_ERR(daxfile);
+		daxfile = NULL;
+		goto out;
+	}
+
+	mapping = daxfile->f_mapping;
+	inode = mapping->host;
+	if (flags & DAXCTL_F_GET) {
+		/*
+		 * We only report the state of DAXCTL_F_STATIC since
+		 * there is no actions for applications to take based on
+		 * the setting of S_DAX. However, if this interface is
+		 * used for toggling S_DAX presumably userspace would
+		 * want to know the state of the flag.
+		 *
+		 * TODO: revisit whether we want to report DAXCTL_F_DAX
+		 * in the IS_DAX() case.
+		 */
+		if (IS_DAXFILE(inode))
+			rc = DAXCTL_F_STATIC;
+		else
+			rc = 0;
+
+		goto out;
+	}
+
+	/*
+	 * TODO: Should unprivileged users be allowed to control daxfile
+	 * behavior? Perhaps a mount flag... is -o dax that flag?
+	 */
+	if (!capable(CAP_LINUX_IMMUTABLE)) {
+		rc = -EPERM;
+		goto out;
+	}
+
+	inode_lock(inode);
+	if (!IS_DAXFILE(inode) && (flags & DAXCTL_F_STATIC)) {
+		rc = daxfile_enable(daxfile, align);
+		/* if successfully enabled hold daxfile open */
+		if (rc == 0)
+			daxfile = NULL;
+	} else if (IS_DAXFILE(inode) && !(flags & DAXCTL_F_STATIC))
+		rc = daxfile_disable(daxfile);
+	else
+		rc = 0;
+	inode_unlock(inode);
+
+out:
+	if (daxfile)
+		filp_close(daxfile, NULL);
+	if (name)
+		putname(name);
+	return rc;
+}
diff --git a/mm/page_io.c b/mm/page_io.c
index 5cec9a3d49f2..35160ad9c51f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -244,6 +244,37 @@ static int bmap_walk(struct file *file, const unsigned page_size,
 	goto out;
 }
 
+static int daxfile_check(sector_t block, unsigned long page_no,
+		enum bmap_check type, void *none)
+{
+	if (type == BMAP_WALK_DONE)
+		return 0;
+
+	/*
+	 * Unlike the swapfile case, fail daxfile_activate() if any file
+	 * extent is not page aligned.
+	 */
+	if (type != BMAP_WALK_FULLPAGE)
+		return -EINVAL;
+	return 0;
+}
+
+int daxfile_activate(struct file *daxfile, unsigned align)
+{
+	int rc;
+
+	if (!align)
+		align = PAGE_SIZE;
+
+	if (align < PAGE_SIZE || !is_power_of_2(align))
+		return -EINVAL;
+
+	rc = bmap_walk(daxfile, align, ULONG_MAX, NULL, daxfile_check, NULL);
+	if (rc)
+		pr_debug("daxctl: daxfile has holes\n");
+	return rc;
+}
+
 static int swapfile_check(sector_t block, unsigned long page_no,
 		enum bmap_check type, void *_sis)
 {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/2] mm: introduce bmap_walk()
  2017-06-17  1:15   ` [RFC PATCH 1/2] mm: introduce bmap_walk() Dan Williams
@ 2017-06-17  5:22     ` Christoph Hellwig
       [not found]       ` <20170617052212.GA8246-jcswGhMUV9g@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2017-06-17  5:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Jan Kara, linux-nvdimm, linux-api, Dave Chinner,
	linux-kernel, linux-mm, Jeff Moyer, linux-fsdevel, Ross Zwisler,
	Christoph Hellwig

On Fri, Jun 16, 2017 at 06:15:29PM -0700, Dan Williams wrote:
> Refactor the core of generic_swapfile_activate() into bmap_walk() so
> that it can be used by a new daxfile_activate() helper (to be added).

No way in hell!  generic_swapfile_activate needs to day and no new users
of ->bmap over my dead body.  It's a guaranteed to fuck up your data left,
right and center.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/2] mm: introduce bmap_walk()
       [not found]       ` <20170617052212.GA8246-jcswGhMUV9g@public.gmane.org>
@ 2017-06-17 12:29         ` Dan Williams
  2017-06-18  7:51           ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2017-06-17 12:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Dave Chinner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Linux MM, Jeff Moyer,
	linux-fsdevel, Ross Zwisler

On Fri, Jun 16, 2017 at 10:22 PM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
> On Fri, Jun 16, 2017 at 06:15:29PM -0700, Dan Williams wrote:
>> Refactor the core of generic_swapfile_activate() into bmap_walk() so
>> that it can be used by a new daxfile_activate() helper (to be added).
>
> No way in hell!  generic_swapfile_activate needs to day and no new users
> of ->bmap over my dead body.  It's a guaranteed to fuck up your data left,
> right and center.

Certainly you're not saying that existing swapfiles are broken, so I
wonder what bugs you're talking about?

Unless you had plans to go remove bmap() I don't see how this gets in
your way at all. That said, I think "please don't add a new bmap()
user, use iomap instead" is a fair comment. You know me well enough to
know that would be all it takes to redirect my work, I can do without
the bluster.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
       [not found]   ` <149766213493.22552.4057048843646200083.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2017-06-17 16:25     ` Andy Lutomirski
  2017-06-17 21:52       ` Dan Williams
  0 siblings, 1 reply; 39+ messages in thread
From: Andy Lutomirski @ 2017-06-17 16:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Linux API, Dave Chinner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux FS Devel, Andrew Morton

On Fri, Jun 16, 2017 at 6:15 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> To date, the full promise of byte-addressable access to persistent
> memory has only been half realized via the filesystem-dax interface. The
> current filesystem-dax mechanism allows an application to consume (read)
> data from persistent storage at byte-size granularity, bypassing the
> full page reads required by traditional storage devices.
>
> Now, for writes, applications still need to contend with
> page-granularity dirtying and flushing semantics as well as filesystem
> coordination for metadata updates after any mmap write. The current
> situation precludes use cases that leverage byte-granularity / in-place
> updates to persistent media.
>
> To get around this limitation there are some specialized applications
> that are using the device-dax interface to bypass the overhead and
> data-safety problems of the current filesystem-dax mmap-write path.
> QEMU-KVM is forced to use device-dax to safely pass through persistent
> memory to a guest [1]. Some specialized databases are using device-dax
> for byte-granularity writes. Outside of those cases, device-dax is
> difficult for general purpose persistent memory applications to consume.
> There is demand for access to pmem without needing to contend with
> special device configuration and other device-dax limitations.
>
> The 'daxfile' interface satisfies this demand and realizes one of Dave
> Chinner's ideas for allowing pmem applications to safely bypass
> fsync/msync requirements. The idea is to make the file immutable with
> respect to the offset-to-block mappings for every extent in the file
> [2]. It turns out that filesystems already need to make this guarantee
> today. This property is needed for files marked as swap files.
>
> The new daxctl() syscall manages setting a file into 'static-dax' mode
> whereby it arranges for the file to be treated as a swapfile as far as
> the filesystem is concerned, but not registered with the core-mm as
> swapfile space. A file in this mode is then safe to be mapped and
> written without the requirement to fsync/msync the writes.  The cpu
> cache management for flushing data to persistence can be handled
> completely in userspace.

Can you remind those of us who haven't played with DAX in a while what
the problem is with mmapping a DAX file without this patchset?  If
there's some bookkkeeping needed to make sure that the filesystem will
invalidate all the mappings if it decides to move the file, maybe that
should be the default rather than needing a new syscall.

--Andy

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-17 16:25     ` Andy Lutomirski
@ 2017-06-17 21:52       ` Dan Williams
       [not found]         ` <CAPcyv4j4UEegViDJcLZjVv5AFGC18-DcvHFnhZatB0hH3BY85g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Dan Williams @ 2017-06-17 21:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Jan Kara, linux-nvdimm, Linux API, Dave Chinner,
	linux-kernel, linux-mm, Jeff Moyer, Linux FS Devel, Ross Zwisler,
	Christoph Hellwig

On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Jun 16, 2017 at 6:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> To date, the full promise of byte-addressable access to persistent
>> memory has only been half realized via the filesystem-dax interface. The
>> current filesystem-dax mechanism allows an application to consume (read)
>> data from persistent storage at byte-size granularity, bypassing the
>> full page reads required by traditional storage devices.
>>
>> Now, for writes, applications still need to contend with
>> page-granularity dirtying and flushing semantics as well as filesystem
>> coordination for metadata updates after any mmap write. The current
>> situation precludes use cases that leverage byte-granularity / in-place
>> updates to persistent media.
>>
>> To get around this limitation there are some specialized applications
>> that are using the device-dax interface to bypass the overhead and
>> data-safety problems of the current filesystem-dax mmap-write path.
>> QEMU-KVM is forced to use device-dax to safely pass through persistent
>> memory to a guest [1]. Some specialized databases are using device-dax
>> for byte-granularity writes. Outside of those cases, device-dax is
>> difficult for general purpose persistent memory applications to consume.
>> There is demand for access to pmem without needing to contend with
>> special device configuration and other device-dax limitations.
>>
>> The 'daxfile' interface satisfies this demand and realizes one of Dave
>> Chinner's ideas for allowing pmem applications to safely bypass
>> fsync/msync requirements. The idea is to make the file immutable with
>> respect to the offset-to-block mappings for every extent in the file
>> [2]. It turns out that filesystems already need to make this guarantee
>> today. This property is needed for files marked as swap files.
>>
>> The new daxctl() syscall manages setting a file into 'static-dax' mode
>> whereby it arranges for the file to be treated as a swapfile as far as
>> the filesystem is concerned, but not registered with the core-mm as
>> swapfile space. A file in this mode is then safe to be mapped and
>> written without the requirement to fsync/msync the writes.  The cpu
>> cache management for flushing data to persistence can be handled
>> completely in userspace.
>
> Can you remind those of us who haven't played with DAX in a while what
> the problem is with mmapping a DAX file without this patchset?  If
> there's some bookkkeeping needed to make sure that the filesystem will
> invalidate all the mappings if it decides to move the file, maybe that
> should be the default rather than needing a new syscall.

The bookkeeping to invalidate mappings when the filesystem moves a
block is already there.

Without this patchset an application needs to call fsync/msync after
any write to a DAX mapping otherwise there is no guarantee the
filesystem has written the metadata to find the updated block after a
crash or power loss event. Even if the sync operation is reduced to a
minimal cmpxchg in userspace to check if the filesystem-metadata is
dirty, that mechanism doesn't translate to a virtualized environment,
as requiring guests to trigger host fsync()s is not feasible. It's a
half-step solution when you can instead just ask the filesystem to
never move blocks, as Dave proposed many months back.

We stepped back from that proposal when it looked like a significant
amount of per-filesystem work to introduce the capability and it was
not clear that application developers would tolerate the side effects
of this 'immutable' semantic. However, the implementation is dead
simple since ext4 and xfs already need to make
block-allocation-immutable semantics available for swapfiles. We also
have application developers telling us they are ok with the semantics,
especially because it catches Linux up to other operating environments
that are already on board with allowing this type of access to pmem
through a filesystem. This patchset gives pmem application developers
what they want without any additional burden on filesystem
implementations.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
       [not found]         ` <CAPcyv4j4UEegViDJcLZjVv5AFGC18-DcvHFnhZatB0hH3BY85g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-17 23:50           ` Andy Lutomirski
  2017-06-18  3:15             ` Dan Williams
  0 siblings, 1 reply; 39+ messages in thread
From: Andy Lutomirski @ 2017-06-17 23:50 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Andrew Morton, Jan Kara, linux-nvdimm,
	Linux API, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Jeff Moyer, Linux FS Devel,
	Ross Zwisler, Christoph Hellwig

On Sat, Jun 17, 2017 at 2:52 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>>
>> Can you remind those of us who haven't played with DAX in a while what
>> the problem is with mmapping a DAX file without this patchset?  If
>> there's some bookkkeeping needed to make sure that the filesystem will
>> invalidate all the mappings if it decides to move the file, maybe that
>> should be the default rather than needing a new syscall.
>
> The bookkeeping to invalidate mappings when the filesystem moves a
> block is already there.
>
> Without this patchset an application needs to call fsync/msync after
> any write to a DAX mapping otherwise there is no guarantee the
> filesystem has written the metadata to find the updated block after a
> crash or power loss event. Even if the sync operation is reduced to a
> minimal cmpxchg in userspace to check if the filesystem-metadata is
> dirty, that mechanism doesn't translate to a virtualized environment,
> as requiring guests to trigger host fsync()s is not feasible. It's a
> half-step solution when you can instead just ask the filesystem to
> never move blocks, as Dave proposed many months back.
>
> We stepped back from that proposal when it looked like a significant
> amount of per-filesystem work to introduce the capability and it was
> not clear that application developers would tolerate the side effects
> of this 'immutable' semantic. However, the implementation is dead
> simple since ext4 and xfs already need to make
> block-allocation-immutable semantics available for swapfiles. We also
> have application developers telling us they are ok with the semantics,
> especially because it catches Linux up to other operating environments
> that are already on board with allowing this type of access to pmem
> through a filesystem. This patchset gives pmem application developers
> what they want without any additional burden on filesystem
> implementations.

I see.

I have a couple of minor-ish issues with the current proposal, then.
One is that I think the terminology should be changed to still make
sense if filesystems or VFS improves to make this approach
unnecessary.  Rather than saying "this file is now static", perhaps
users should set a flag with the explicit semantics that "mmaps of
this file are guaranteed not to lose data due to the kernel's
activities", IOW that mmaps will be at least as durable as a direct
mapping of DAX memory.  Then the kernel has the flexibility to add a
future implementation in which, instead of pinning the file, the
filesystem just knows to keep metadata synced before allowing
page_mkwrite to re-enable writes to an mmapped DAX file.

My other objection is that the syscall intentionally leaks a reference
to the file.  This means it needs overflow protection and it probably
shouldn't ever be allowed to use it without privilege.

Why can't the underlying issue be easily fixed, though?  Could
.page_mkwrite just make sure that metadata is synced when the FS uses
DAX?  On a DAX fs, syncing metadata should be extremely fast.  This
could be conditioned on an madvise or mmap flag if performance might
be an issue.  As far as I know, this change alone should be
sufficient.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-17 23:50           ` Andy Lutomirski
@ 2017-06-18  3:15             ` Dan Williams
  2017-06-18  5:05               ` Andy Lutomirski
  2017-06-18  8:18               ` Christoph Hellwig
  0 siblings, 2 replies; 39+ messages in thread
From: Dan Williams @ 2017-06-18  3:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Jan Kara, linux-nvdimm, Linux API, Dave Chinner,
	linux-kernel, linux-mm, Jeff Moyer, Linux FS Devel, Ross Zwisler,
	Christoph Hellwig

On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Sat, Jun 17, 2017 at 2:52 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Sat, Jun 17, 2017 at 9:25 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> Can you remind those of us who haven't played with DAX in a while what
>>> the problem is with mmapping a DAX file without this patchset?  If
>>> there's some bookkkeeping needed to make sure that the filesystem will
>>> invalidate all the mappings if it decides to move the file, maybe that
>>> should be the default rather than needing a new syscall.
>>
>> The bookkeeping to invalidate mappings when the filesystem moves a
>> block is already there.
>>
>> Without this patchset an application needs to call fsync/msync after
>> any write to a DAX mapping otherwise there is no guarantee the
>> filesystem has written the metadata to find the updated block after a
>> crash or power loss event. Even if the sync operation is reduced to a
>> minimal cmpxchg in userspace to check if the filesystem-metadata is
>> dirty, that mechanism doesn't translate to a virtualized environment,
>> as requiring guests to trigger host fsync()s is not feasible. It's a
>> half-step solution when you can instead just ask the filesystem to
>> never move blocks, as Dave proposed many months back.
>>
>> We stepped back from that proposal when it looked like a significant
>> amount of per-filesystem work to introduce the capability and it was
>> not clear that application developers would tolerate the side effects
>> of this 'immutable' semantic. However, the implementation is dead
>> simple since ext4 and xfs already need to make
>> block-allocation-immutable semantics available for swapfiles. We also
>> have application developers telling us they are ok with the semantics,
>> especially because it catches Linux up to other operating environments
>> that are already on board with allowing this type of access to pmem
>> through a filesystem. This patchset gives pmem application developers
>> what they want without any additional burden on filesystem
>> implementations.
>
> I see.
>
> I have a couple of minor-ish issues with the current proposal, then.
> One is that I think the terminology should be changed to still make
> sense if filesystems or VFS improves to make this approach
> unnecessary.  Rather than saying "this file is now static", perhaps
> users should set a flag with the explicit semantics that "mmaps of
> this file are guaranteed not to lose data due to the kernel's
> activities", IOW that mmaps will be at least as durable as a direct
> mapping of DAX memory.  Then the kernel has the flexibility to add a
> future implementation in which, instead of pinning the file, the
> filesystem just knows to keep metadata synced before allowing
> page_mkwrite to re-enable writes to an mmapped DAX file.

Yes, sounds good to me. Rename the flag to DAXCTL_F_SYNC to indicate
updates via mmap to this file are synchronous as far as block
allocation metadata is concerned. Future filesystems are then free to
always support this synchronous mode without using the swapfile hack.

> My other objection is that the syscall intentionally leaks a reference
> to the file.  This means it needs overflow protection and it probably
> shouldn't ever be allowed to use it without privilege.

We only hold the one reference while S_DAXFILE is set, so I think the
protection is there, and per Dave's original proposal this requires
CAP_LINUX_IMMUTABLE.

> Why can't the underlying issue be easily fixed, though?  Could
> .page_mkwrite just make sure that metadata is synced when the FS uses
> DAX?

Yes, it most definitely could and that idea has been floated.

> On a DAX fs, syncing metadata should be extremely fast.  This
> could be conditioned on an madvise or mmap flag if performance might
> be an issue.  As far as I know, this change alone should be
> sufficient.

The hang up is that it requires per-fs enabling as it needs to be
careful to manage mmap_sem vs fs journal locks for example. I know the
in-development NOVA [1] filesystem is planning to support this out of
the gate. ext4 would be open to implementing it, but I think xfs is
cold on the idea. Christoph originally proposed it here [2], before
Dave went on to propose immutable semantics.

[1]: https://github.com/NVSL/NOVA
[2]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004609.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-18  3:15             ` Dan Williams
@ 2017-06-18  5:05               ` Andy Lutomirski
       [not found]                 ` <CALCETrVY38h2ajpod2U_2pdHSp8zO4mG2p19h=OnnHmhGTairw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-06-18  8:18               ` Christoph Hellwig
  1 sibling, 1 reply; 39+ messages in thread
From: Andy Lutomirski @ 2017-06-18  5:05 UTC (permalink / raw)
  To: Dan Williams, Ross Zwisler, andy.rudoff
  Cc: Andy Lutomirski, Andrew Morton, Jan Kara, linux-nvdimm,
	Linux API, Dave Chinner, linux-kernel, linux-mm, Jeff Moyer,
	Linux FS Devel, Christoph Hellwig

On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> My other objection is that the syscall intentionally leaks a reference
>> to the file.  This means it needs overflow protection and it probably
>> shouldn't ever be allowed to use it without privilege.
>
> We only hold the one reference while S_DAXFILE is set, so I think the
> protection is there, and per Dave's original proposal this requires
> CAP_LINUX_IMMUTABLE.
>
>> Why can't the underlying issue be easily fixed, though?  Could
>> .page_mkwrite just make sure that metadata is synced when the FS uses
>> DAX?
>
> Yes, it most definitely could and that idea has been floated.
>
>> On a DAX fs, syncing metadata should be extremely fast.  This
>> could be conditioned on an madvise or mmap flag if performance might
>> be an issue.  As far as I know, this change alone should be
>> sufficient.
>
> The hang up is that it requires per-fs enabling as it needs to be
> careful to manage mmap_sem vs fs journal locks for example. I know the
> in-development NOVA [1] filesystem is planning to support this out of
> the gate. ext4 would be open to implementing it, but I think xfs is
> cold on the idea. Christoph originally proposed it here [2], before
> Dave went on to propose immutable semantics.

Hmm.  Given a choice between a very clean API that works without
privilege but is awkward to implement on XFS and an awkward-to-use
API, I'd personally choose the former.

Dave, even with the lock ordering issue, couldn't XFS implement
MAP_PMEM_AWARE by having .page_mkwrite work roughly like this:

if (metadata is dirty) {
  up_write(&mmap_sem);
  sync the metadata;
  down_write(&mmap_sem);
  return 0;  /* retry the fault */
} else {
  return whatever success code;
}

This might require returning VM_FAULT_RETRY instead of 0 and it might
require auditing the core mm code to make sure that it can handle
mmap_sem being dropped like this.  I don't see why it couldn't work in
principle, though.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/2] mm: introduce bmap_walk()
  2017-06-17 12:29         ` Dan Williams
@ 2017-06-18  7:51           ` Christoph Hellwig
  2017-06-19 16:18             ` Darrick J. Wong
       [not found]             ` <20170618075152.GA25871-jcswGhMUV9g@public.gmane.org>
  0 siblings, 2 replies; 39+ messages in thread
From: Christoph Hellwig @ 2017-06-18  7:51 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Andrew Morton, Jan Kara, linux-nvdimm,
	linux-api, Dave Chinner, linux-kernel, Linux MM, Jeff Moyer,
	linux-fsdevel, Ross Zwisler

On Sat, Jun 17, 2017 at 05:29:23AM -0700, Dan Williams wrote:
> On Fri, Jun 16, 2017 at 10:22 PM, Christoph Hellwig <hch@lst.de> wrote:
> > On Fri, Jun 16, 2017 at 06:15:29PM -0700, Dan Williams wrote:
> >> Refactor the core of generic_swapfile_activate() into bmap_walk() so
> >> that it can be used by a new daxfile_activate() helper (to be added).
> >
> > No way in hell!  generic_swapfile_activate needs to day and no new users
> > of ->bmap over my dead body.  It's a guaranteed to fuck up your data left,
> > right and center.
> 
> Certainly you're not saying that existing swapfiles are broken, so I
> wonder what bugs you're talking about?

They are somewhat broken, but we manage to paper over the fact.

And in fact if you plan to use a method marked:

	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
	sector_t (*bmap)(struct address_space *, sector_t);

I'd expect a little research.

By it's signature alone ->bmap can't do a whole lot - it can try to
translate the _current_ mapping of a relative block number to a physical
one, and do extremely crude error reporting.

Notice what it can't do:

 a) provide any guaranteed that the block mapping doesn't change any time
    after it returned
 b) deal with the fact that there might be anything like a physical block
 c) put the physical block into any sort of context, that is explain what
    device it actually is relative to

So yes, swap files are broken.  They sort of work by:

 a) ensuring that ->bmap is not implemented for anything fancy (btrfs), or
    living  with it doing I/O into random places (XFS RT subvolumes, *cough*)
 b) doing extremely heavy handed locking to ensure things don't change at all
    (S_SWAPFILE).  This might kinda sorta work for swapfiles which are
    part of the system and require privilegues, but an absolute no-go
    for anything else
 c) simply not using this brain-haired systems - see the swap over NFS
    support, or the WIP swap over btrfs patches.

> Unless you had plans to go remove bmap() I don't see how this gets in
> your way at all.

I'm not talking about getting in my way.  I'm talking about you doing
something incredibly stupid.  Don't do that.

> That said, I think "please don't add a new bmap()
> user, use iomap instead" is a fair comment. You know me well enough to
> know that would be all it takes to redirect my work, I can do without
> the bluster.

But that's not the point.  The point is that ->bmap() semantics simplify
do not work in practice because they don't make sense.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-18  3:15             ` Dan Williams
  2017-06-18  5:05               ` Andy Lutomirski
@ 2017-06-18  8:18               ` Christoph Hellwig
       [not found]                 ` <20170618081850.GA26332-jcswGhMUV9g@public.gmane.org>
  1 sibling, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2017-06-18  8:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andy Lutomirski, Andrew Morton, Jan Kara, linux-nvdimm,
	Linux API, Dave Chinner, linux-kernel, linux-mm, Jeff Moyer,
	Linux FS Devel, Ross Zwisler, Christoph Hellwig

On Sat, Jun 17, 2017 at 08:15:05PM -0700, Dan Williams wrote:
> The hang up is that it requires per-fs enabling as it needs to be
> careful to manage mmap_sem vs fs journal locks for example. I know the
> in-development NOVA [1] filesystem is planning to support this out of
> the gate. ext4 would be open to implementing it, but I think xfs is
> cold on the idea. Christoph originally proposed it here [2], before
> Dave went on to propose immutable semantics.
> 
> [1]: https://github.com/NVSL/NOVA
> [2]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004609.html

And I stand to that statement.  Let's get DAX stable first, and
properly cleaned up (e.g. follow on work with separating it entirely
from the block device).  Then think hard about how most of the 
persistent memory technologies actually work, including the point that
for a lot of workloads page cache will be required at least on the
write side.   And then come up with actual real use cases and we can
look into it.

And stop trying to shoe-horn crap like this in.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
       [not found]                 ` <20170618081850.GA26332-jcswGhMUV9g@public.gmane.org>
@ 2017-06-19  1:51                   ` Dan Williams
  0 siblings, 0 replies; 39+ messages in thread
From: Dan Williams @ 2017-06-19  1:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andy Lutomirski, Andrew Morton, Jan Kara, linux-nvdimm,
	Linux API, Dave Chinner, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Jeff Moyer, Linux FS Devel,
	Ross Zwisler

On Sun, Jun 18, 2017 at 1:18 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
> On Sat, Jun 17, 2017 at 08:15:05PM -0700, Dan Williams wrote:
>> The hang up is that it requires per-fs enabling as it needs to be
>> careful to manage mmap_sem vs fs journal locks for example. I know the
>> in-development NOVA [1] filesystem is planning to support this out of
>> the gate. ext4 would be open to implementing it, but I think xfs is
>> cold on the idea. Christoph originally proposed it here [2], before
>> Dave went on to propose immutable semantics.
>>
>> [1]: https://github.com/NVSL/NOVA
>> [2]: https://lists.01.org/pipermail/linux-nvdimm/2016-February/004609.html
>
> And I stand to that statement.  Let's get DAX stable first, and
> properly cleaned up (e.g. follow on work with separating it entirely
> from the block device).  Then think hard about how most of the
> persistent memory technologies actually work, including the point that
> for a lot of workloads page cache will be required at least on the
> write side.   And then come up with actual real use cases and we can
> look into it.

I see it differently. We're already at a good point in time to start
iterating on a fix for this issue. Ross and Jan have done a lot of
good work on the dax stability front, and the block-device separation
of dax is well underway.

> And stop trying to shoe-horn crap like this in.

The kernel shoe-horning all pmem+filesystem-dax applications into
abiding page-cache semantics is a problem, and this RFC has already
helped move the needle on a couple fronts. 1/ Swapfiles are subtly
broken which is something worth fixing, and if it gets us a
synchronous-dax mode without major filesystem surgery then that's all
for the better. 2/ There's an appetite for just fixing this
incrementally in each filesystem's fault handler, so if ext4 was able
to prove out an interface / implementation for synchronous faults we
could go with that instead of a pre-allocated + immutable interface
and let other filesystems set their own timelines.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
       [not found]                 ` <CALCETrVY38h2ajpod2U_2pdHSp8zO4mG2p19h=OnnHmhGTairw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-19 13:21                   ` Dave Chinner
  2017-06-19 15:22                     ` Andy Lutomirski
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2017-06-19 13:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, linux-nvdimm, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux FS Devel, Andrew Morton

On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote:
> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> >> My other objection is that the syscall intentionally leaks a reference
> >> to the file.  This means it needs overflow protection and it probably
> >> shouldn't ever be allowed to use it without privilege.
> >
> > We only hold the one reference while S_DAXFILE is set, so I think the
> > protection is there, and per Dave's original proposal this requires
> > CAP_LINUX_IMMUTABLE.
> >
> >> Why can't the underlying issue be easily fixed, though?  Could
> >> .page_mkwrite just make sure that metadata is synced when the FS uses
> >> DAX?
> >
> > Yes, it most definitely could and that idea has been floated.
> >
> >> On a DAX fs, syncing metadata should be extremely fast.

<sigh>

This again....

Persistent memory means the *I/O* is fast. It does not mean that
*complex filesystem operations* are fast.

Don't forget that there's an shitload of CPU that gets burnt to make
sure that the metadata is synced correctly. Do that /synchronously/
on *every* write page fault (which, BTW, modify mtime, so will
always have dirty metadata to sync) and now you have a serious
performance problem with your "fast" DAX access method.

And that's before we even consider all the problems with running
sync operations in page fault context....

> >> This
> >> could be conditioned on an madvise or mmap flag if performance might
> >> be an issue.  As far as I know, this change alone should be
> >> sufficient.
> >
> > The hang up is that it requires per-fs enabling as it needs to be
> > careful to manage mmap_sem vs fs journal locks for example. I know the
> > in-development NOVA [1] filesystem is planning to support this out of
> > the gate. ext4 would be open to implementing it, but I think xfs is
> > cold on the idea. Christoph originally proposed it here [2], before
> > Dave went on to propose immutable semantics.
> 
> Hmm.  Given a choice between a very clean API that works without
> privilege but is awkward to implement on XFS and an awkward-to-use
> API, I'd personally choose the former.

Yup, you have the choice of a clean kernel API that will be
substantially slower than the existing "dirty page" tracking and
having the app run fsync() when necessary, or having to do a little
more work in a library routine that preallocates a file and sets a
flag on it?

The apps will use the library API, not the kernel API, so who really
cares if there's a few steps to setting up the file state
appropriately?

> Dave, even with the lock ordering issue, couldn't XFS implement
> MAP_PMEM_AWARE by having .page_mkwrite work roughly like this:
> 
> if (metadata is dirty) {
>   up_write(&mmap_sem);
>   sync the metadata;
>   down_write(&mmap_sem);
>   return 0;  /* retry the fault */
> } else {
>   return whatever success code;
> }

How do you know that there is dependent filesystem metadata that
needs syncing at a level that you can safely manipulate the
mmap_sem? And how, exactly, do you do this without races? It'd be
trivial to DOS such retryable DAX faults simply by touching the file
in a tight loop in a separate process...

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-19 13:21                   ` Dave Chinner
@ 2017-06-19 15:22                     ` Andy Lutomirski
       [not found]                       ` <CALCETrUe0igzK0RZTSSondkCY3ApYQti89tOh00f0j_APrf_dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Andy Lutomirski @ 2017-06-19 15:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andy Lutomirski, Dan Williams, Ross Zwisler, andy.rudoff,
	Andrew Morton, Jan Kara, linux-nvdimm, Linux API, linux-kernel,
	linux-mm, Jeff Moyer, Linux FS Devel, Christoph Hellwig

On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote:
>> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> >> My other objection is that the syscall intentionally leaks a reference
>> >> to the file.  This means it needs overflow protection and it probably
>> >> shouldn't ever be allowed to use it without privilege.
>> >
>> > We only hold the one reference while S_DAXFILE is set, so I think the
>> > protection is there, and per Dave's original proposal this requires
>> > CAP_LINUX_IMMUTABLE.
>> >
>> >> Why can't the underlying issue be easily fixed, though?  Could
>> >> .page_mkwrite just make sure that metadata is synced when the FS uses
>> >> DAX?
>> >
>> > Yes, it most definitely could and that idea has been floated.
>> >
>> >> On a DAX fs, syncing metadata should be extremely fast.
>
> <sigh>
>
> This again....
>
> Persistent memory means the *I/O* is fast. It does not mean that
> *complex filesystem operations* are fast.
>
> Don't forget that there's an shitload of CPU that gets burnt to make
> sure that the metadata is synced correctly. Do that /synchronously/
> on *every* write page fault (which, BTW, modify mtime, so will
> always have dirty metadata to sync) and now you have a serious
> performance problem with your "fast" DAX access method.

I think the mtime issue can and should be solved separately.  But it'
s a fair point that there would be workloads for which this could be
excessively expensive.  In particular, simply creating a file,
mmapping a large range, and touching the pages one by one -- delalloc
would be completely defeated.

But here's a strawman for solving both issues.  First, mtime.  I
consider it to be either a bug or a misfeature that .page_mkwrite
*ever* dirties an inode just to update mtime.  I have old patches to
fix this, and those patches could be updated and merged.  With them
applied, there's just a set_bit() in .page_mkwrite() to handle mtime.

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4

Second: syncing extents.  Here's a straw man.  Forget the mmap() flag.
Instead add a new msync() operation:

msync(start, length, MSYNC_PMEM_PREPARE_WRITE);

If this operation succeeds, it guarantees that all future writes
through this mapping on this range will hit actual storage and that
all the metadata operations needed to make this write persistent will
hit storage such that they are ordered before the user's writes.

As an implementation detail, this will flush out the extents if
needed.  In addition, if the FS has any mechanism that would cause
problems asyncronously later on (dedupe?  deallocated extents full of
zeros?  defrag?), it may also need to set a flag on the VMA that
changes the behavior of future .page_mkwrite operations.

(On x86, for example, this would permit the FS to do WC/streaming
writes without SFENCE if the FS were structured in a way that this
worked.)

Now we have an API that should work going forward without introducing
baggage.  And XFS is free to implement this API by making the entire
file act like a swap file if XFS wants to do so, but this doesn't
force other filesystems (ext4? NOVA?) to do the same thing.

>
> And that's before we even consider all the problems with running
> sync operations in page fault context....
>
>> >> This
>> >> could be conditioned on an madvise or mmap flag if performance might
>> >> be an issue.  As far as I know, this change alone should be
>> >> sufficient.
>> >
>> > The hang up is that it requires per-fs enabling as it needs to be
>> > careful to manage mmap_sem vs fs journal locks for example. I know the
>> > in-development NOVA [1] filesystem is planning to support this out of
>> > the gate. ext4 would be open to implementing it, but I think xfs is
>> > cold on the idea. Christoph originally proposed it here [2], before
>> > Dave went on to propose immutable semantics.
>>
>> Hmm.  Given a choice between a very clean API that works without
>> privilege but is awkward to implement on XFS and an awkward-to-use
>> API, I'd personally choose the former.
>
> Yup, you have the choice of a clean kernel API that will be
> substantially slower than the existing "dirty page" tracking and
> having the app run fsync() when necessary, or having to do a little
> more work in a library routine that preallocates a file and sets a
> flag on it?
>
> The apps will use the library API, not the kernel API, so who really
> cares if there's a few steps to setting up the file state
> appropriately?
>
>> Dave, even with the lock ordering issue, couldn't XFS implement
>> MAP_PMEM_AWARE by having .page_mkwrite work roughly like this:
>>
>> if (metadata is dirty) {
>>   up_write(&mmap_sem);
>>   sync the metadata;
>>   down_write(&mmap_sem);
>>   return 0;  /* retry the fault */
>> } else {
>>   return whatever success code;
>> }
>
> How do you know that there is dependent filesystem metadata that
> needs syncing at a level that you can safely manipulate the
> mmap_sem? And how, exactly, do you do this without races?

I have no idea, but I expect that all the locking issues are solvable.

> It'd be
> trivial to DOS such retryable DAX faults simply by touching the file
> in a tight loop in a separate process...

If the code were smart enough to only cause a retry when the extent
being touched is dirty, this problem wouldn't exist.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/2] mm: introduce bmap_walk()
  2017-06-18  7:51           ` Christoph Hellwig
@ 2017-06-19 16:18             ` Darrick J. Wong
       [not found]             ` <20170618075152.GA25871-jcswGhMUV9g@public.gmane.org>
  1 sibling, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2017-06-19 16:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, Andrew Morton, Jan Kara, linux-nvdimm, linux-api,
	Dave Chinner, linux-kernel, Linux MM, Jeff Moyer, linux-fsdevel,
	Ross Zwisler

On Sun, Jun 18, 2017 at 09:51:52AM +0200, Christoph Hellwig wrote:
> On Sat, Jun 17, 2017 at 05:29:23AM -0700, Dan Williams wrote:
> > On Fri, Jun 16, 2017 at 10:22 PM, Christoph Hellwig <hch@lst.de> wrote:
> > > On Fri, Jun 16, 2017 at 06:15:29PM -0700, Dan Williams wrote:
> > >> Refactor the core of generic_swapfile_activate() into bmap_walk() so
> > >> that it can be used by a new daxfile_activate() helper (to be added).
> > >
> > > No way in hell!  generic_swapfile_activate needs to day and no new users
> > > of ->bmap over my dead body.  It's a guaranteed to fuck up your data left,
> > > right and center.
> > 
> > Certainly you're not saying that existing swapfiles are broken, so I
> > wonder what bugs you're talking about?
> 
> They are somewhat broken, but we manage to paper over the fact.
> 
> And in fact if you plan to use a method marked:
> 
> 	/* Unfortunately this kludge is needed for FIBMAP. Don't use it */
> 	sector_t (*bmap)(struct address_space *, sector_t);
> 
> I'd expect a little research.
> 
> By it's signature alone ->bmap can't do a whole lot - it can try to
> translate the _current_ mapping of a relative block number to a physical
> one, and do extremely crude error reporting.
> 
> Notice what it can't do:
> 
>  a) provide any guaranteed that the block mapping doesn't change any time
>     after it returned
>  b) deal with the fact that there might be anything like a physical block
>  c) put the physical block into any sort of context, that is explain what
>     device it actually is relative to
> 
> So yes, swap files are broken.  They sort of work by:
> 
>  a) ensuring that ->bmap is not implemented for anything fancy (btrfs), or
>     living  with it doing I/O into random places (XFS RT subvolumes, *cough*)

Ye $deities, it really /doesn't/ check XFS_IS_REALTIME_INODE(ip)!  AIEEEE!

Uh... patch soon.

>  b) doing extremely heavy handed locking to ensure things don't change at all
>     (S_SWAPFILE).  This might kinda sorta work for swapfiles which are
>     part of the system and require privilegues, but an absolute no-go
>     for anything else
>  c) simply not using this brain-haired systems - see the swap over NFS
>     support, or the WIP swap over btrfs patches.
> 
> > Unless you had plans to go remove bmap() I don't see how this gets in
> > your way at all.
> 
> I'm not talking about getting in my way.  I'm talking about you doing
> something incredibly stupid.  Don't do that.
> 
> > That said, I think "please don't add a new bmap()
> > user, use iomap instead" is a fair comment. You know me well enough to
> > know that would be all it takes to redirect my work, I can do without
> > the bluster.
> 
> But that's not the point.  The point is that ->bmap() semantics simplify
> do not work in practice because they don't make sense.

Seconded, bmap doesn't coordinate with the filesystem in any way to
guarantee that the mappings are stable, nor does it seem to care about
delayed alloc reservations.  Granted I suspect the dax usage model is
"all the blocks were already allocated" so there are no da reservations,
but still, ugh, bmap. :)

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/2] mm: introduce bmap_walk()
       [not found]             ` <20170618075152.GA25871-jcswGhMUV9g@public.gmane.org>
@ 2017-06-19 18:19               ` Al Viro
  2017-06-20  7:34                 ` Christoph Hellwig
  0 siblings, 1 reply; 39+ messages in thread
From: Al Viro @ 2017-06-19 18:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, Andrew Morton, Jan Kara,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Dave Chinner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Linux MM, Jeff Moyer,
	linux-fsdevel, Ross Zwisler

On Sun, Jun 18, 2017 at 09:51:52AM +0200, Christoph Hellwig wrote:

> > That said, I think "please don't add a new bmap()
> > user, use iomap instead" is a fair comment. You know me well enough to
> > know that would be all it takes to redirect my work, I can do without
> > the bluster.
> 
> But that's not the point.  The point is that ->bmap() semantics simplify
> do not work in practice because they don't make sense.

Speaking of iomap, what's supposed to happen when doing a write into what
used to be a hole?  Suppose we have a file with a megabyte hole in it
and there's some process mmapping that range.  Another process does
write over the entire range.  We call ->iomap_begin() and allocate
disk blocks.  Then we start copying data into those.  In the meanwhile,
the first process attempts to fetch from address in the middle of that
hole.  What should happen?

Should the blocks we'd allocated in ->iomap_begin() be immediately linked
into the whatever indirect locks/btree/whatnot we are using?  That would
require zeroing all of them first - otherwise that readpage will read
uninitialized block.  Another variant would be to delay linking them
in until ->iomap_end(), but...  Suppose we get the page evicted by
memory pressure after the writer is finished with it.  If ->readpage()
comes before ->iomap_end(), we'll need to somehow figure out that it's
not a hole anymore, or we'll end up with an uptodate page full of zeroes
observed by reads after successful write().

The comment you've got in linux/iomap.h would seem to suggest the second
interpretation, but neither it nor anything in Documentation discusses the
relations with readpage/writepage...

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
       [not found]                       ` <CALCETrUe0igzK0RZTSSondkCY3ApYQti89tOh00f0j_APrf_dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-20  0:46                         ` Dave Chinner
  2017-06-20  5:53                           ` Andy Lutomirski
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2017-06-20  0:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, linux-nvdimm, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux FS Devel, Andrew Morton

On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote:
> On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> > On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote:
> >> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> >> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> >> >> My other objection is that the syscall intentionally leaks a reference
> >> >> to the file.  This means it needs overflow protection and it probably
> >> >> shouldn't ever be allowed to use it without privilege.
> >> >
> >> > We only hold the one reference while S_DAXFILE is set, so I think the
> >> > protection is there, and per Dave's original proposal this requires
> >> > CAP_LINUX_IMMUTABLE.
> >> >
> >> >> Why can't the underlying issue be easily fixed, though?  Could
> >> >> .page_mkwrite just make sure that metadata is synced when the FS uses
> >> >> DAX?
> >> >
> >> > Yes, it most definitely could and that idea has been floated.
> >> >
> >> >> On a DAX fs, syncing metadata should be extremely fast.
> >
> > <sigh>
> >
> > This again....
> >
> > Persistent memory means the *I/O* is fast. It does not mean that
> > *complex filesystem operations* are fast.
> >
> > Don't forget that there's an shitload of CPU that gets burnt to make
> > sure that the metadata is synced correctly. Do that /synchronously/
> > on *every* write page fault (which, BTW, modify mtime, so will
> > always have dirty metadata to sync) and now you have a serious
> > performance problem with your "fast" DAX access method.
> 
> I think the mtime issue can and should be solved separately.  But it'
> s a fair point that there would be workloads for which this could be
> excessively expensive.  In particular, simply creating a file,
> mmapping a large range, and touching the pages one by one -- delalloc
> would be completely defeated.
> 
> But here's a strawman for solving both issues.  First, mtime.  I
> consider it to be either a bug or a misfeature that .page_mkwrite
> *ever* dirties an inode just to update mtime.  I have old patches to
> fix this, and those patches could be updated and merged.  With them
> applied, there's just a set_bit() in .page_mkwrite() to handle mtime.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4

Yup, I remember that - it delays the update to data writeback time,
IOWs the proposed MAP_SYNC page fault semantics result in the same
(poor) behaviour because the sync operation will trigger mtime
updates instead of the page fault.

Unless, of course, you are implying that MAP_SYNC should not
actually sync all known dirty metadata on an inode.

<smacks head on desk>

> Second: syncing extents.  Here's a straw man.  Forget the mmap() flag.
> Instead add a new msync() operation:
> 
> msync(start, length, MSYNC_PMEM_PREPARE_WRITE);

How's this any different from the fallocate command I proposed to do
this (apart from the fact that msync() is not intended to be abused
as a file preallocation/zeroing interface)?

> If this operation succeeds, it guarantees that all future writes
> through this mapping on this range will hit actual storage and that
> all the metadata operations needed to make this write persistent will
> hit storage such that they are ordered before the user's writes.
> As an implementation detail, this will flush out the extents if
> needed.  In addition, if the FS has any mechanism that would cause
> problems asyncronously later on (dedupe?  deallocated extents full
> of zeros?  defrag?),

Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and
other background filesystem maintenance operations, etc can all
change the extent layout asynchronously if there's no mechanism to
tell the fs not to modify the inode extent layout.

> it may also need to set a flag on the VMA
> that changes the behavior of future .page_mkwrite operations.
> 
> (On x86, for example, this would permit the FS to do WC/streaming
> writes without SFENCE if the FS were structured in a way that this
> worked.)
> 
> Now we have an API that should work going forward without
> introducing baggage.  And XFS is free to implement this API by
> making the entire file act like a swap file if XFS wants to do so,
> but this doesn't force other filesystems (ext4? NOVA?) to do the
> same thing.

Sure, you are providing a simple programmatic API, but this does not
provide a viable feature management strategy.

i.e. the API you are now proposing requires the filesystem to ensure
an inode's extent map cannot be modified ever again in the future
(that "guarantees all future writes" bit).  This requires, at
minimum, a persistent flag to be set on the inode so the VFS and
filesystem implementations can use it to prevent anything that, for
example, relies on copy-on-write semantics being done on those
files. That means the proposed msync operation will need to check
the filesystem can support this feature and *fail* if it can't.

Further, administrators need to be aware of this application
requirement so they can plan their backup and disaster recovery
operations appropriately (e.g. reflink and/or snapshots cannot be
used as part of thei backup strategy). Hence the point of such
restricted file manipulation functionality requiring permissions to
be granted - it ensures sysadmins know they've got something less
than obvious going on they may need special processes to handle
safely.

Unsurprisingly, this is exactly what the "DAX immutable" inode flag
I proposed provides.  It provides an explicit, standardised and
*documented* management strategy that is common across all
filesystems. It uses mechanisms that *already exist*, the VFS and
filesystems already implement, and adminstrators are familiar with
using to manage their systems (e.g. setting the "NODUMP" inode flag
to exclude files from backups). This also avoids the management
level fragmentation which would occur if filesystems each solve the
"DAX userspace data sync" problem differently via different
management tools, behaviours and semantics.

Keep in mind that there are more uses for immutable extent maps than
just DAX. e.g. every so often someone pops up and says "I have this
high speed data aquisition hardware and we'd like to DMA data direct
to the storage because it's far too slow pushing it through memory
and then the OS to get it to storage. How do I map the storage
blocks and guarantee the mapping won't change while we are
transferring data direct from the hardware?". A file with an
allocated, immutable extent map solves this problem for these sorts
of esoteric applications.

As such, can we please drop all the mmap/msync special API snowflake
proposals and instead address the problem how to set up and manage
files with immutable extent maps efficiently through fallocate?
Once we have them, pmem aware applications don't need to do anything
special with mmap to be able to use userspace data sync instructions
safely.  i.e. the pmem library should select the correct data sync
method according to how the file was set up and what functionality
the underlying filesystem DAX implementation supports.

> >> Dave, even with the lock ordering issue, couldn't XFS implement
> >> MAP_PMEM_AWARE by having .page_mkwrite work roughly like this:
> >>
> >> if (metadata is dirty) { up_write(&mmap_sem); sync the
> >> metadata; down_write(&mmap_sem); return 0;  /* retry the fault
> >> */ } else { return whatever success code; }
> >
> > How do you know that there is dependent filesystem metadata that
> > needs syncing at a level that you can safely manipulate the
> > mmap_sem? And how, exactly, do you do this without races?
> 
> I have no idea, but I expect that all the locking issues are
> solvable.

Yay, Dunning-Kruger to the rescue!

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-17  1:15 ` [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem Dan Williams
       [not found]   ` <149766213493.22552.4057048843646200083.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2017-06-20  5:22   ` Darrick J. Wong
  2017-06-20 15:42     ` Ross Zwisler
       [not found]     ` <20170620052214.GA3787-PTl6brltDGh4DFYR7WNSRA@public.gmane.org>
  1 sibling, 2 replies; 39+ messages in thread
From: Darrick J. Wong @ 2017-06-20  5:22 UTC (permalink / raw)
  To: Dan Williams
  Cc: akpm, Jan Kara, linux-nvdimm, linux-api, Dave Chinner,
	linux-kernel, linux-mm, Jeff Moyer, linux-fsdevel, Ross Zwisler,
	Christoph Hellwig, xfs

[add linux-xfs to the fray]

On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote:
> To date, the full promise of byte-addressable access to persistent
> memory has only been half realized via the filesystem-dax interface. The
> current filesystem-dax mechanism allows an application to consume (read)
> data from persistent storage at byte-size granularity, bypassing the
> full page reads required by traditional storage devices.
> 
> Now, for writes, applications still need to contend with
> page-granularity dirtying and flushing semantics as well as filesystem
> coordination for metadata updates after any mmap write. The current
> situation precludes use cases that leverage byte-granularity / in-place
> updates to persistent media.
> 
> To get around this limitation there are some specialized applications
> that are using the device-dax interface to bypass the overhead and
> data-safety problems of the current filesystem-dax mmap-write path.
> QEMU-KVM is forced to use device-dax to safely pass through persistent
> memory to a guest [1]. Some specialized databases are using device-dax
> for byte-granularity writes. Outside of those cases, device-dax is
> difficult for general purpose persistent memory applications to consume.
> There is demand for access to pmem without needing to contend with
> special device configuration and other device-dax limitations.
> 
> The 'daxfile' interface satisfies this demand and realizes one of Dave
> Chinner's ideas for allowing pmem applications to safely bypass
> fsync/msync requirements. The idea is to make the file immutable with
> respect to the offset-to-block mappings for every extent in the file
> [2]. It turns out that filesystems already need to make this guarantee
> today. This property is needed for files marked as swap files.
> 
> The new daxctl() syscall manages setting a file into 'static-dax' mode
> whereby it arranges for the file to be treated as a swapfile as far as
> the filesystem is concerned, but not registered with the core-mm as
> swapfile space. A file in this mode is then safe to be mapped and
> written without the requirement to fsync/msync the writes.  The cpu
> cache management for flushing data to persistence can be handled
> completely in userspace.
> 
> [1]: https://lists.gnu.org/archive/html/qemu-devel/2017-06/msg01207.html
> [2]: https://lkml.org/lkml/2016/9/11/159
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Jeff Moyer <jmoyer@redhat.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  arch/x86/entry/syscalls/syscall_64.tbl |    1 
>  include/linux/dax.h                    |    9 ++
>  include/linux/fs.h                     |    3 +
>  include/linux/syscalls.h               |    1 
>  include/uapi/linux/dax.h               |    8 +
>  mm/Kconfig                             |    5 +
>  mm/Makefile                            |    1 
>  mm/daxfile.c                           |  186 ++++++++++++++++++++++++++++++++
>  mm/page_io.c                           |   31 +++++
>  9 files changed, 245 insertions(+)
>  create mode 100644 include/uapi/linux/dax.h
>  create mode 100644 mm/daxfile.c
> 
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 5aef183e2f85..795eb93d6beb 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -339,6 +339,7 @@
>  330	common	pkey_alloc		sys_pkey_alloc
>  331	common	pkey_free		sys_pkey_free
>  332	common	statx			sys_statx
> +333	64	daxctl			sys_daxctl
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 5ec1f6c47716..5f1d0e0ed30f 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -4,8 +4,17 @@
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/radix-tree.h>
> +#include <uapi/linux/dax.h>
>  #include <asm/pgtable.h>
>  
> +/*
> + * TODO: make sys_daxctl() be the generic interface for toggling S_DAX
> + * across filesystems. For now, mark DAXCTL_F_DAX as an invalid flag
> + */
> +#define DAXCTL_VALID_FLAGS (DAXCTL_F_GET | DAXCTL_F_STATIC)
> +
> +int daxfile_activate(struct file *daxfile, unsigned align);
> +
>  struct iomap_ops;
>  struct dax_device;
>  struct dax_operations {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3e68cabb8457..3af649fb669f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1824,8 +1824,10 @@ struct super_operations {
>  #define S_NOSEC		4096	/* no suid or xattr security attributes */
>  #ifdef CONFIG_FS_DAX
>  #define S_DAX		8192	/* Direct Access, avoiding the page cache */
> +#define S_DAXFILE	16384	/* no truncate (swapfile) semantics + dax */
>  #else
>  #define S_DAX		0	/* Make all the DAX code disappear */
> +#define S_DAXFILE	0
>  #endif
>  
>  /*
> @@ -1865,6 +1867,7 @@ struct super_operations {
>  #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
>  #define IS_NOSEC(inode)		((inode)->i_flags & S_NOSEC)
>  #define IS_DAX(inode)		((inode)->i_flags & S_DAX)
> +#define IS_DAXFILE(inode)	((inode)->i_flags & S_DAXFILE)
>  
>  #define IS_WHITEOUT(inode)	(S_ISCHR(inode->i_mode) && \
>  				 (inode)->i_rdev == WHITEOUT_DEV)
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 980c3c9b06f8..49e5cc4c192e 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -701,6 +701,7 @@ asmlinkage long sys_prctl(int option, unsigned long arg2, unsigned long arg3,
>  			unsigned long arg4, unsigned long arg5);
>  asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags);
>  asmlinkage long sys_swapoff(const char __user *specialfile);
> +asmlinkage long sys_daxctl(const char __user *path, int flags, int align);
>  asmlinkage long sys_sysctl(struct __sysctl_args __user *args);
>  asmlinkage long sys_sysinfo(struct sysinfo __user *info);
>  asmlinkage long sys_sysfs(int option,
> diff --git a/include/uapi/linux/dax.h b/include/uapi/linux/dax.h
> new file mode 100644
> index 000000000000..78a41bb392c0
> --- /dev/null
> +++ b/include/uapi/linux/dax.h
> @@ -0,0 +1,8 @@
> +#ifndef _UAPI_LINUX_DAX_H
> +#define _UAPI_LINUX_DAX_H
> +
> +#define DAXCTL_F_GET    (1 << 0)
> +#define DAXCTL_F_DAX    (1 << 1)
> +#define DAXCTL_F_STATIC (1 << 2)
> +
> +#endif /* _UAPI_LINUX_DAX_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index beb7a455915d..b874565c34eb 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -450,6 +450,11 @@ config	TRANSPARENT_HUGE_PAGECACHE
>  	def_bool y
>  	depends on TRANSPARENT_HUGEPAGE
>  
> +config DAXFILE
> +	def_bool y
> +	depends on FS_DAX
> +	depends on SWAP
> +
>  #
>  # UP and nommu archs use km based percpu allocator
>  #
> diff --git a/mm/Makefile b/mm/Makefile
> index 026f6a828a50..38d9025a3e37 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -56,6 +56,7 @@ endif
>  obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
>  
>  obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
> +obj-$(CONFIG_DAXFILE)	+= daxfile.o
>  obj-$(CONFIG_FRONTSWAP)	+= frontswap.o
>  obj-$(CONFIG_ZSWAP)	+= zswap.o
>  obj-$(CONFIG_HAS_DMA)	+= dmapool.o
> diff --git a/mm/daxfile.c b/mm/daxfile.c
> new file mode 100644
> index 000000000000..fe230199c855
> --- /dev/null
> +++ b/mm/daxfile.c
> @@ -0,0 +1,186 @@
> +/*
> + * Copyright(c) 2017 Intel Corporation. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of version 2 of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/dax.h>
> +#include <linux/slab.h>
> +#include <linux/highmem.h>
> +#include <linux/pagemap.h>
> +#include <linux/syscalls.h>
> +
> +/*
> + * TODO: a list to lookup daxfiles assumes a low number of instances,
> + * revisit.
> + */
> +static LIST_HEAD(daxfiles);
> +static DEFINE_SPINLOCK(dax_lock);
> +
> +struct dax_info {
> +	struct list_head list;
> +	struct file *daxfile;
> +};
> +
> +static int daxfile_disable(struct file *victim)
> +{
> +	int found = 0;
> +	struct dax_info *d;
> +	struct inode *inode;
> +	struct file *daxfile;
> +	struct address_space *mapping;
> +
> +	mapping = victim->f_mapping;
> +	spin_lock(&dax_lock);
> +	list_for_each_entry(d, &daxfiles, list)
> +		if (d->daxfile->f_mapping == mapping) {
> +			list_del(&d->list);
> +			found = 1;
> +			break;
> +		}
> +	spin_unlock(&dax_lock);
> +
> +	if (!found)
> +		return -EINVAL;
> +
> +	daxfile = d->daxfile;
> +
> +	inode = mapping->host;
> +	inode->i_flags &= ~(S_SWAPFILE | S_DAXFILE);
> +	filp_close(daxfile, NULL);
> +
> +	return 0;
> +}
> +
> +static int claim_daxfile_checks(struct inode *inode)
> +{
> +	if (!S_ISREG(inode->i_mode))
> +		return -EINVAL;
> +
> +	if (!IS_DAX(inode))
> +		return -EINVAL;
> +
> +	if (IS_SWAPFILE(inode) || IS_DAXFILE(inode))
> +		return -EBUSY;
> +
> +	return 0;
> +}
> +
> +int daxfile_enable(struct file *daxfile, int align)
> +{
> +	struct address_space *mapping;
> +	struct inode *inode;
> +	struct dax_info *d;
> +	int rc;
> +
> +	if (align < 0)
> +		return -EINVAL;
> +
> +	mapping = daxfile->f_mapping;
> +	inode = mapping->host;
> +
> +	rc = claim_daxfile_checks(inode);
> +	if (rc)
> +		return rc;
> +
> +	rc = daxfile_activate(daxfile, align);
> +	if (rc)
> +		return rc;
> +
> +	d = kzalloc(sizeof(*d), GFP_KERNEL);
> +	if (!d)
> +		return -ENOMEM;
> +	INIT_LIST_HEAD(&d->list);
> +	d->daxfile = daxfile;
> +
> +	spin_lock(&dax_lock);
> +	list_add(&d->list, &daxfiles);
> +	spin_unlock(&dax_lock);
> +
> +	/*
> +	 * We set S_SWAPFILE to gain "no truncate" / static block
> +	 * allocation semantics, and S_DAXFILE so we can differentiate
> +	 * traditional swapfiles and assume static block mappings in the
> +	 * dax mmap path.
> +	 */
> +	inode->i_flags |= S_SWAPFILE | S_DAXFILE;

Yikes.  You know, I hadn't even thought about considering swap files as
a subcase of files with immutable block maps, but here we are.  Both
swap files and DAX require absolutely stable block mappings, they are
both (probably) intolerant of inode metadata changes (size, mtime, etc.)

But on the other hand, the bmap interface is so... yuck.  We return zero
to indicate no mapping or error or whatever, it doesn't actually tell us
/which/ device it's returning offsets into, etc.  I was writing a
regression test earlier to check that we've sealed off XFS RT files from
becoming swap files (because bmap is broken, not (afaik) because of any
weird limitation of xfs) and forgot that quirk long enough to waste time
wondering why it failed to fail on a 4.11 kernel.  That's right, the
first rt file gets block zero and magically doesn't fail to fail if
that's the swap file.

Gross.  I've now ranted twice this month about how bmap() doesn't work
on reflinked files on XFS.

Honestly, I realize we've gone back, forth, and around all over the
place on this.  I still prefer something similar to a permanent flag,
similar to what Dave suggested, though I hate the name PMEM_IMMUTABLE
and some of the semantics.

First, a new inode flag S_IOMAP_FROZEN that means the file's block map
cannot change.

Second, some kind of function to toggle the S_IOMAP_FROZEN flag.
Turning it on will lock the inode, check the extent map for holes,
shared, or unwritten bits, and bail out if it finds any, or set the
flag.  Not sure if we should require CAP_LINUX_IMMUTABLE -- probably
yes, at least at first.  I don't currently have any objection to writing
non-iomap inode metadata out to disk.

Third, the flag can only be cleared if the file isn't mapped.

Fourth, the VFS entry points for things like read, write, truncate,
utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a
file, so that the block map cannot be modified.  mmap is still allowed,
as we've discussed.  /Maybe/ we can allow fallocate to extend a file
with zeroed extents (it will be slow) as I've heard murmurs about
wanting to be able to extend a file, maybe not.

Fifth, swapfiles now require the S_IOMAP_FROZEN flag since they want
stable iomap but probably don't care about things like mtime.  Maybe
they can call iomap too.

Sixth, XFS can record the S_IOMAP_FROZEN state in di_flags2 and set it
whenever the in-core inode gets constructed.  This enables us to
prohibit reflinking and other such undesirable activity.

If we actually /do/ come up with a reference implementation for XFS, I'd
be ok with tacking it on the end of my dev branch, which will give us a
loooong runway to try it out.  The end of the dev branch is beyond
online XFS fsck and repair and the "root metadata btrees in inodes"
rework; since that's ~90 patches with my name on it that I cannot also
review, it won't go in for a long time indeed!

(Yes, that was also sort of a plea for someone to go review the XFS
scrub patches.)

> +	return 0;
> +}
> +
> +SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align)

I was /about/ to grouse about this syscall, then realized that maybe it
/is/ useful to be able to check a specific alignment.  Maybe not, since
I had something more permanent in mind anyway.  In any case, just pass
in an opened fd if this sticks around.

--D

> +{
> +	int rc;
> +	struct filename *name;
> +	struct inode *inode = NULL;
> +	struct file *daxfile = NULL;
> +	struct address_space *mapping;
> +
> +	if (flags & ~DAXCTL_VALID_FLAGS)
> +		return -EINVAL;
> +
> +	name = getname(path);
> +	if (IS_ERR(name))
> +		return PTR_ERR(name);
> +
> +	daxfile = file_open_name(name, O_RDWR|O_LARGEFILE, 0);
> +	if (IS_ERR(daxfile)) {
> +		rc = PTR_ERR(daxfile);
> +		daxfile = NULL;
> +		goto out;
> +	}
> +
> +	mapping = daxfile->f_mapping;
> +	inode = mapping->host;
> +	if (flags & DAXCTL_F_GET) {
> +		/*
> +		 * We only report the state of DAXCTL_F_STATIC since
> +		 * there is no actions for applications to take based on
> +		 * the setting of S_DAX. However, if this interface is
> +		 * used for toggling S_DAX presumably userspace would
> +		 * want to know the state of the flag.
> +		 *
> +		 * TODO: revisit whether we want to report DAXCTL_F_DAX
> +		 * in the IS_DAX() case.
> +		 */
> +		if (IS_DAXFILE(inode))
> +			rc = DAXCTL_F_STATIC;
> +		else
> +			rc = 0;
> +
> +		goto out;
> +	}
> +
> +	/*
> +	 * TODO: Should unprivileged users be allowed to control daxfile
> +	 * behavior? Perhaps a mount flag... is -o dax that flag?
> +	 */
> +	if (!capable(CAP_LINUX_IMMUTABLE)) {
> +		rc = -EPERM;
> +		goto out;
> +	}
> +
> +	inode_lock(inode);
> +	if (!IS_DAXFILE(inode) && (flags & DAXCTL_F_STATIC)) {
> +		rc = daxfile_enable(daxfile, align);
> +		/* if successfully enabled hold daxfile open */
> +		if (rc == 0)
> +			daxfile = NULL;
> +	} else if (IS_DAXFILE(inode) && !(flags & DAXCTL_F_STATIC))
> +		rc = daxfile_disable(daxfile);
> +	else
> +		rc = 0;
> +	inode_unlock(inode);
> +
> +out:
> +	if (daxfile)
> +		filp_close(daxfile, NULL);
> +	if (name)
> +		putname(name);
> +	return rc;
> +}
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 5cec9a3d49f2..35160ad9c51f 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -244,6 +244,37 @@ static int bmap_walk(struct file *file, const unsigned page_size,
>  	goto out;
>  }
>  
> +static int daxfile_check(sector_t block, unsigned long page_no,
> +		enum bmap_check type, void *none)
> +{
> +	if (type == BMAP_WALK_DONE)
> +		return 0;
> +
> +	/*
> +	 * Unlike the swapfile case, fail daxfile_activate() if any file
> +	 * extent is not page aligned.
> +	 */
> +	if (type != BMAP_WALK_FULLPAGE)
> +		return -EINVAL;
> +	return 0;
> +}
> +
> +int daxfile_activate(struct file *daxfile, unsigned align)
> +{
> +	int rc;
> +
> +	if (!align)
> +		align = PAGE_SIZE;
> +
> +	if (align < PAGE_SIZE || !is_power_of_2(align))
> +		return -EINVAL;
> +
> +	rc = bmap_walk(daxfile, align, ULONG_MAX, NULL, daxfile_check, NULL);
> +	if (rc)
> +		pr_debug("daxctl: daxfile has holes\n");
> +	return rc;
> +}
> +
>  static int swapfile_check(sector_t block, unsigned long page_no,
>  		enum bmap_check type, void *_sis)
>  {
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-20  0:46                         ` Dave Chinner
@ 2017-06-20  5:53                           ` Andy Lutomirski
  2017-06-20  8:49                             ` Christoph Hellwig
       [not found]                             ` <CALCETrVuoPDRuuhc9X8eVCYiFUzWLSTRkcjbD6jas_2J2GixNQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 2 replies; 39+ messages in thread
From: Andy Lutomirski @ 2017-06-20  5:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-nvdimm, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andy Lutomirski, Linux FS Devel,
	Andrew Morton

On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote:
>> On Mon, Jun 19, 2017 at 6:21 AM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
>> > On Sat, Jun 17, 2017 at 10:05:45PM -0700, Andy Lutomirski wrote:
>> >> On Sat, Jun 17, 2017 at 8:15 PM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> >> > On Sat, Jun 17, 2017 at 4:50 PM, Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>> >> >> My other objection is that the syscall intentionally leaks a reference
>> >> >> to the file.  This means it needs overflow protection and it probably
>> >> >> shouldn't ever be allowed to use it without privilege.
>> >> >
>> >> > We only hold the one reference while S_DAXFILE is set, so I think the
>> >> > protection is there, and per Dave's original proposal this requires
>> >> > CAP_LINUX_IMMUTABLE.
>> >> >
>> >> >> Why can't the underlying issue be easily fixed, though?  Could
>> >> >> .page_mkwrite just make sure that metadata is synced when the FS uses
>> >> >> DAX?
>> >> >
>> >> > Yes, it most definitely could and that idea has been floated.
>> >> >
>> >> >> On a DAX fs, syncing metadata should be extremely fast.
>> >
>> > <sigh>
>> >
>> > This again....
>> >
>> > Persistent memory means the *I/O* is fast. It does not mean that
>> > *complex filesystem operations* are fast.
>> >
>> > Don't forget that there's an shitload of CPU that gets burnt to make
>> > sure that the metadata is synced correctly. Do that /synchronously/
>> > on *every* write page fault (which, BTW, modify mtime, so will
>> > always have dirty metadata to sync) and now you have a serious
>> > performance problem with your "fast" DAX access method.
>>
>> I think the mtime issue can and should be solved separately.  But it'
>> s a fair point that there would be workloads for which this could be
>> excessively expensive.  In particular, simply creating a file,
>> mmapping a large range, and touching the pages one by one -- delalloc
>> would be completely defeated.
>>
>> But here's a strawman for solving both issues.  First, mtime.  I
>> consider it to be either a bug or a misfeature that .page_mkwrite
>> *ever* dirties an inode just to update mtime.  I have old patches to
>> fix this, and those patches could be updated and merged.  With them
>> applied, there's just a set_bit() in .page_mkwrite() to handle mtime.
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=mmap_mtime/patch_v4
>
> Yup, I remember that - it delays the update to data writeback time,
> IOWs the proposed MAP_SYNC page fault semantics result in the same
> (poor) behaviour because the sync operation will trigger mtime
> updates instead of the page fault.
>
> Unless, of course, you are implying that MAP_SYNC should not
> actually sync all known dirty metadata on an inode.
>
> <smacks head on desk>
>
>> Second: syncing extents.  Here's a straw man.  Forget the mmap() flag.
>> Instead add a new msync() operation:
>>
>> msync(start, length, MSYNC_PMEM_PREPARE_WRITE);
>
> How's this any different from the fallocate command I proposed to do
> this (apart from the fact that msync() is not intended to be abused
> as a file preallocation/zeroing interface)?

I must have missed that suggestion.

But it's different in a major way.  fallocate() takes an fd parameter,
which means that, if some flag gets set, it's set on the struct file.
The persistence property seems to me like it belongs on the vma, not
the file.  But it doesn't have to be msync() -- it could be madvise or
even a new mallocate().  (Although mallocate() is possible the worst
name ever.)

>
>> If this operation succeeds, it guarantees that all future writes
>> through this mapping on this range will hit actual storage and that
>> all the metadata operations needed to make this write persistent will
>> hit storage such that they are ordered before the user's writes.
>> As an implementation detail, this will flush out the extents if
>> needed.  In addition, if the FS has any mechanism that would cause
>> problems asyncronously later on (dedupe?  deallocated extents full
>> of zeros?  defrag?),
>
> Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and
> other background filesystem maintenance operations, etc can all
> change the extent layout asynchronously if there's no mechanism to
> tell the fs not to modify the inode extent layout.

But that's my whole point.  The kernel doesn't really need to prevent
all these background maintenance operations -- it just needs to block
.page_mkwrite until they are synced.  I think that whatever new
mechanism we add for this should be sticky, but I see no reason why
the filesystem should have to block reflink on a DAX file entirely.

In fact, the daxctl() proposal seems actively problematic for some
usecases.  I think I should be able to mmap() a DAX file and then,
while it's still mapped, extend the file, mmap the new part (with the
appropriate flag, madvise(), msync(), fallocate(), whatever), and
write directly through that mapping and through the original mapping,
concurrently, with the full persistence guarantee.  This seems really
awkward to do using daxctl().

>
>> it may also need to set a flag on the VMA
>> that changes the behavior of future .page_mkwrite operations.
>>
>> (On x86, for example, this would permit the FS to do WC/streaming
>> writes without SFENCE if the FS were structured in a way that this
>> worked.)
>>
>> Now we have an API that should work going forward without
>> introducing baggage.  And XFS is free to implement this API by
>> making the entire file act like a swap file if XFS wants to do so,
>> but this doesn't force other filesystems (ext4? NOVA?) to do the
>> same thing.
>
> Sure, you are providing a simple programmatic API, but this does not
> provide a viable feature management strategy.
>
> i.e. the API you are now proposing requires the filesystem to ensure
> an inode's extent map cannot be modified ever again in the future
> (that "guarantees all future writes" bit).  This requires, at
> minimum, a persistent flag to be set on the inode so the VFS and
> filesystem implementations can use it to prevent anything that, for
> example, relies on copy-on-write semantics being done on those
> files. That means the proposed msync operation will need to check
> the filesystem can support this feature and *fail* if it can't.

No it doesn't.  A filesystem *could* implement it like that, but it
could also implement it using .page_mkwrite.  And yes, a filesystem
that can't can't guarantee durability with CLFLUSHOPT; SFENCE on a
mapping should fail this operation to indicate that it can't support
it.

>
> Further, administrators need to be aware of this application
> requirement so they can plan their backup and disaster recovery
> operations appropriately (e.g. reflink and/or snapshots cannot be
> used as part of thei backup strategy).

Or they could use a filesystem that will understand that the new
operation needs to break COW.

>
> Unsurprisingly, this is exactly what the "DAX immutable" inode flag
> I proposed provides.  It provides an explicit, standardised and
> *documented* management strategy that is common across all
> filesystems. It uses mechanisms that *already exist*, the VFS and
> filesystems already implement, and adminstrators are familiar with
> using to manage their systems (e.g. setting the "NODUMP" inode flag
> to exclude files from backups). This also avoids the management
> level fragmentation which would occur if filesystems each solve the
> "DAX userspace data sync" problem differently via different
> management tools, behaviours and semantics.

The DAX immutable flag is really nasty for my software that would like
to use DAX.  I have quite a few processes, all unprivileged, that
create files that they'd like to map using DAX and write to durably
without needing to fsync() (using CLFLUSHOPT; SFENCE or perhaps a WT
mapping).  If I were to use daxctl(), I'd have to have to write a
privileged daemon to manage it, and that would be rather nasty.

If, instead, we had a nice unprivileged per-vma or per-fd mechanism to
tell the filesystem that I want DAX durability, I could just use it
without any fuss.  If it worked on ext4 before it worked on xfs, then
I'd use ext4.  If it ended up being heavier weight on XFS than it was
on ext4 because XFS needed to lock down the extent map for the inode
whereas ext4 could manage it through .page_mkwrite(), then I'd
benchmark it and see which fs would win.  (For my particular use case,
I doubt it would matter, since I aggressively offload fs metadata
operations to a thread whose performance I don't really care about.)


--Andy

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 1/2] mm: introduce bmap_walk()
  2017-06-19 18:19               ` Al Viro
@ 2017-06-20  7:34                 ` Christoph Hellwig
  0 siblings, 0 replies; 39+ messages in thread
From: Christoph Hellwig @ 2017-06-20  7:34 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Dan Williams, Andrew Morton, Jan Kara,
	linux-nvdimm, linux-api, Dave Chinner, linux-kernel, Linux MM,
	Jeff Moyer, linux-fsdevel, Ross Zwisler

On Mon, Jun 19, 2017 at 07:19:57PM +0100, Al Viro wrote:
> Speaking of iomap, what's supposed to happen when doing a write into what
> used to be a hole?  Suppose we have a file with a megabyte hole in it
> and there's some process mmapping that range.  Another process does
> write over the entire range.  We call ->iomap_begin() and allocate
> disk blocks.  Then we start copying data into those.  In the meanwhile,
> the first process attempts to fetch from address in the middle of that
> hole.  What should happen?

Right now the buffered iomap code expects delayed allocations.
So ->iomap_begin will only reserve block in memory, and not even
mark the blocks as allocated in the page / buffer_head.  The fact
that the block is allocated is only propagated into the page buffer_head
on a page by page basis in the actor.

> Should the blocks we'd allocated in ->iomap_begin() be immediately linked
> into the whatever indirect locks/btree/whatnot we are using?  That would
> require zeroing all of them first - otherwise that readpage will read
> uninitialized block.  Another variant would be to delay linking them
> in until ->iomap_end(), but...  Suppose we get the page evicted by
> memory pressure after the writer is finished with it.  If ->readpage()
> comes before ->iomap_end(), we'll need to somehow figure out that it's
> not a hole anymore, or we'll end up with an uptodate page full of zeroes
> observed by reads after successful write().

Delayed blocks are ignored by the read code, so it will read 'through'
them.

> The comment you've got in linux/iomap.h would seem to suggest the second
> interpretation, but neither it nor anything in Documentation discusses the
> relations with readpage/writepage...

I'll see if I can come up with some better documentation.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-20  5:53                           ` Andy Lutomirski
@ 2017-06-20  8:49                             ` Christoph Hellwig
       [not found]                               ` <20170620084924.GA9752-jcswGhMUV9g@public.gmane.org>
       [not found]                             ` <CALCETrVuoPDRuuhc9X8eVCYiFUzWLSTRkcjbD6jas_2J2GixNQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 39+ messages in thread
From: Christoph Hellwig @ 2017-06-20  8:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Chinner, Dan Williams, Ross Zwisler, andy.rudoff,
	Andrew Morton, Jan Kara, linux-nvdimm, Linux API, linux-kernel,
	linux-mm, Jeff Moyer, Linux FS Devel, Christoph Hellwig

[stripped giant fullquotes]

On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
> But that's my whole point.  The kernel doesn't really need to prevent
> all these background maintenance operations -- it just needs to block
> .page_mkwrite until they are synced.  I think that whatever new
> mechanism we add for this should be sticky, but I see no reason why
> the filesystem should have to block reflink on a DAX file entirely.

Agreed - IFF we want to support write through semantics this is the
only somewhat feasible way.  It still has massive downsides of forcing
the full sync machinery to run from the page fauly handler, which
I'm rather scared off, but that's still better than creating a magic
special case that isn't managable at all.

> If, instead, we had a nice unprivileged per-vma or per-fd mechanism to
> tell the filesystem that I want DAX durability, I could just use it
> without any fuss.  If it worked on ext4 before it worked on xfs, then
> I'd use ext4.  If it ended up being heavier weight on XFS than it was
> on ext4 because XFS needed to lock down the extent map for the inode
> whereas ext4 could manage it through .page_mkwrite(), then I'd
> benchmark it and see which fs would win.  (For my particular use case,
> I doubt it would matter, since I aggressively offload fs metadata
> operations to a thread whose performance I don't really care about.)

ext4 and XFS have the same fundamental issue:  both have a file system
wide log of modified data that needs to be flushed to stable storage
to ensure everything is safe.  So if you solve the issue for one of
them you've solved it for the other one as well modulo implementation
details.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
       [not found]                             ` <CALCETrVuoPDRuuhc9X8eVCYiFUzWLSTRkcjbD6jas_2J2GixNQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-20 10:11                               ` Dave Chinner
  2017-06-20 16:14                                 ` Andy Lutomirski
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2017-06-20 10:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, linux-nvdimm, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux FS Devel, Andrew Morton

On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
> On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote:
> >> Second: syncing extents.  Here's a straw man.  Forget the mmap() flag.
> >> Instead add a new msync() operation:
> >>
> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE);
> >
> > How's this any different from the fallocate command I proposed to do
> > this (apart from the fact that msync() is not intended to be abused
> > as a file preallocation/zeroing interface)?
> 
> I must have missed that suggestion.
> 
> But it's different in a major way.  fallocate() takes an fd parameter,
> which means that, if some flag gets set, it's set on the struct file.

DAX is a property of the inode, not the VMA or struct file as it
needs to be consistent across all VMAs and struct files that
reference that inode. Also, fallocate() manipulates state and
metadata hidden behind the struct inode, not the struct file, so it
seems to me like the right API to use.

And, as mmap() requires a fd to set up the mapping and fallocate()
would have to be run *before* mmap() is used to access the data
directly, I don't see why using fallocate would be a problem here...

> >> If this operation succeeds, it guarantees that all future writes
> >> through this mapping on this range will hit actual storage and that
> >> all the metadata operations needed to make this write persistent will
> >> hit storage such that they are ordered before the user's writes.
> >> As an implementation detail, this will flush out the extents if
> >> needed.  In addition, if the FS has any mechanism that would cause
> >> problems asyncronously later on (dedupe?  deallocated extents full
> >> of zeros?  defrag?),
> >
> > Hole punch, truncate, reflink, dedupe, snapshots, scrubbing and
> > other background filesystem maintenance operations, etc can all
> > change the extent layout asynchronously if there's no mechanism to
> > tell the fs not to modify the inode extent layout.
> 
> But that's my whole point.  The kernel doesn't really need to prevent
> all these background maintenance operations -- it just needs to block
> .page_mkwrite until they are synced.  I think that whatever new
> mechanism we add for this should be sticky, but I see no reason why
> the filesystem should have to block reflink on a DAX file entirely.

I understand the problem quite well, thank you very much. Yes,
COW operations (and other things) can be handled by invalidating DAX
mappings and blocking new page faults.  I see little difference
between this and running the sync path after page-mkwrite has
triggered filesystem metadata changes (e.g.  block allocation). i.e.
If MAP_SYNC is going to be used, then all the things you are talking
about comes along for the ride via invalidations.

The MAP_SYNC proposal is effectively "run the metadata side of
fdatasync() on every page fault". If the inode is not metadata
dirty, then it will do nothing, otherwise it will do what it needs
to stabilise the inode for userspace to be able to sync the data and
it will block until it is done.

Prediction for the MAP_SYNC future: frequent bug reports about huge,
unpredictable page fault latencies on DAX files because every so
often a page fault is required to sync tens of thousands of
unrelated dirty objects because of filesystem journal ordering
constraints....

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-20  5:22   ` Darrick J. Wong
@ 2017-06-20 15:42     ` Ross Zwisler
  2017-06-22  7:09       ` Darrick J. Wong
       [not found]     ` <20170620052214.GA3787-PTl6brltDGh4DFYR7WNSRA@public.gmane.org>
  1 sibling, 1 reply; 39+ messages in thread
From: Ross Zwisler @ 2017-06-20 15:42 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, akpm, Jan Kara, linux-nvdimm, linux-api,
	Dave Chinner, linux-kernel, linux-mm, Jeff Moyer, linux-fsdevel,
	Ross Zwisler, Christoph Hellwig, xfs

On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote:
<>
> Fourth, the VFS entry points for things like read, write, truncate,
> utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a
> file, so that the block map cannot be modified.  mmap is still allowed,
> as we've discussed.  /Maybe/ we can allow fallocate to extend a file
> with zeroed extents (it will be slow) as I've heard murmurs about
> wanting to be able to extend a file, maybe not.

Read and write should still be allowed, right?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-20 10:11                               ` Dave Chinner
@ 2017-06-20 16:14                                 ` Andy Lutomirski
  2017-06-21  1:40                                   ` Dave Chinner
  0 siblings, 1 reply; 39+ messages in thread
From: Andy Lutomirski @ 2017-06-20 16:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-nvdimm, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andy Lutomirski, Linux FS Devel,
	Andrew Morton

On Tue, Jun 20, 2017 at 3:11 AM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
>> On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
>> > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote:
>> >> Second: syncing extents.  Here's a straw man.  Forget the mmap() flag.
>> >> Instead add a new msync() operation:
>> >>
>> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE);
>> >
>> > How's this any different from the fallocate command I proposed to do
>> > this (apart from the fact that msync() is not intended to be abused
>> > as a file preallocation/zeroing interface)?
>>
>> I must have missed that suggestion.
>>
>> But it's different in a major way.  fallocate() takes an fd parameter,
>> which means that, if some flag gets set, it's set on the struct file.
>
> DAX is a property of the inode, not the VMA or struct file as it
> needs to be consistent across all VMAs and struct files that
> reference that inode. Also, fallocate() manipulates state and
> metadata hidden behind the struct inode, not the struct file, so it
> seems to me like the right API to use.

I'm not sure I see why.  I can think of a few different scenarios:

 - Reflink while a DAX-using program is running.  It would be nice for
performance, but not critical for functionality, if trying to reflink
a file that's mapped for DAX would copy instead of COWing.  But
breaking COW on the next page_mkwrite would work, too.  A per-inode
count of the number of live DAX mappings or of the number of struct
file instances that have requested DAX would work here.

 - Trying to use DAX on a file that is already reflinked.  The order
of operations doesn't matter hugely, except that breaking COW for the
entire range in question all at once would be faster and result in
better allocation.

 - Checksumming filesystems.  I think it's basically impossible to do
DAX writes to a file like this.  A filesystem could skip checksumming
on extents that are actively mapped for DAX, but something like chattr
to tell, say, btrfs that a given file is intended for DAX is probably
better.  (But, if so, it should presumably be persistent like chattr
and not something that's purely in memory like daxctl().)

 - Defrag and such on an otherwise simple filesystem.  Defragging a
file while it's not actively mapped for DAX should be allowed.
Defragging a file while it is actively mapped for DAX may be slow, but
it a filesystem could certainly make it work.

 - RAID.  Not going to work.

>
> And, as mmap() requires a fd to set up the mapping and fallocate()
> would have to be run *before* mmap() is used to access the data
> directly, I don't see why using fallocate would be a problem here...

Fair enough.  But ISTM fallocate() has no business setting up
important state in the struct file.  It's an operation on inodes.

Looking at the above scenarios, it seems to may that two or three
separate mechanisms may be ideal here.

1. The most important: some way to tell the kernel that an open file
description or an mmap or some subrange thereof is going to be used
for DAX writes and for the kernel to respond by saying "yes, okay" or
"no, not okay".  This should be 100% reliable, which means that all
the corner cases have to work.  This means that, if one task says
"make this file work for DAX" and another task extends the file using
truncate (without calling fallocate), then whatever the kernel
promised to the first task should remain true.

2. Some way to tell filesystems like btrfs to make a file that will be
DAX-able.  chattr +C might already fit the bill.  Without this, I'd
expect the normal incantation to DAX-map a file on btrfs to return an
error.

3. (Not strictly related to DAX.) A way to tell the kernel "I have
this file mmapped for write.  Please go out of your way to avoid
future page faults."  I've wanted this for ordinary files on ext4.
The kernel could, but presently does not, use hardware dirty tracking
instead of software dirty tracking to decide when to write the page
back.  The kernel could also, in principle, write back dirty pages
without ever write protecting them.  For DAX, this might change
behavior to prevent any operation that would relocate blocks or to
have the kernel go out of its way to only do such operations when
absolutely necessary and to immediately update and unwriteprotect the
relevant pages.

(3) is optional and could be delayed to the distant future.  It's not
needed for correctness.

I really don't want to see a scenario where DAX works if you use some
fancy special-purpose library exactly as intended but occasionally
eats your data if you don't use the library exactly as intended.
Getting a valid DAX mapping, writing to it, and doing
CLFLUSHOPT;SFENCE must make that write durable no matter what (unless
the underlying hardware actually fails).

> The MAP_SYNC proposal is effectively "run the metadata side of
> fdatasync() on every page fault". If the inode is not metadata
> dirty, then it will do nothing, otherwise it will do what it needs
> to stabilise the inode for userspace to be able to sync the data and
> it will block until it is done.
>
> Prediction for the MAP_SYNC future: frequent bug reports about huge,
> unpredictable page fault latencies on DAX files because every so
> often a page fault is required to sync tens of thousands of
> unrelated dirty objects because of filesystem journal ordering
> constraints....

Is this really so bad?  Someone might ask for relaxed journal ordering
or they might switch to a different filesystem.  IIRC some filesystems
(ZFS?) have explicit support for this use case.  I suspect that users
saying "your filesystem is slower than I'd like for such-and-such use
case" isn't all that rare.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
       [not found]                               ` <20170620084924.GA9752-jcswGhMUV9g@public.gmane.org>
@ 2017-06-20 16:17                                 ` Dan Williams
       [not found]                                   ` <CAPcyv4jkH6iwDoG4NnCaTNXozwYgVXiJDe2iFSONcE63KvGQoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2017-06-20 23:53                                   ` Dave Chinner
  0 siblings, 2 replies; 39+ messages in thread
From: Dan Williams @ 2017-06-20 16:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Linux API, Dave Chinner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andy Lutomirski, Linux FS Devel,
	Andrew Morton

On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
> [stripped giant fullquotes]
>
> On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
>> But that's my whole point.  The kernel doesn't really need to prevent
>> all these background maintenance operations -- it just needs to block
>> .page_mkwrite until they are synced.  I think that whatever new
>> mechanism we add for this should be sticky, but I see no reason why
>> the filesystem should have to block reflink on a DAX file entirely.
>
> Agreed - IFF we want to support write through semantics this is the
> only somewhat feasible way.  It still has massive downsides of forcing
> the full sync machinery to run from the page fauly handler, which
> I'm rather scared off, but that's still better than creating a magic
> special case that isn't managable at all.

An immutable-extent DAX-file and a reflink-capable DAX-file are not
mutually exclusive, and I have yet to hear a need for reflink support
without fsync/msync. Instead I have heard the need for an immutable
file for RDMA purposes, especially for hardware that can't trigger an
mmu fault. The special management of an immutable file is acceptable
to get these capabilities.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
       [not found]                                   ` <CAPcyv4jkH6iwDoG4NnCaTNXozwYgVXiJDe2iFSONcE63KvGQoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-20 16:26                                     ` Andy Lutomirski
  0 siblings, 0 replies; 39+ messages in thread
From: Andy Lutomirski @ 2017-06-20 16:26 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, linux-nvdimm, Linux API, Dave Chinner,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andy Lutomirski, Linux FS Devel,
	Andrew Morton

On Tue, Jun 20, 2017 at 9:17 AM, Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org> wrote:
>> [stripped giant fullquotes]
>>
>> On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
>>> But that's my whole point.  The kernel doesn't really need to prevent
>>> all these background maintenance operations -- it just needs to block
>>> .page_mkwrite until they are synced.  I think that whatever new
>>> mechanism we add for this should be sticky, but I see no reason why
>>> the filesystem should have to block reflink on a DAX file entirely.
>>
>> Agreed - IFF we want to support write through semantics this is the
>> only somewhat feasible way.  It still has massive downsides of forcing
>> the full sync machinery to run from the page fauly handler, which
>> I'm rather scared off, but that's still better than creating a magic
>> special case that isn't managable at all.
>
> An immutable-extent DAX-file and a reflink-capable DAX-file are not
> mutually exclusive, and I have yet to hear a need for reflink support
> without fsync/msync. Instead I have heard the need for an immutable
> file for RDMA purposes, especially for hardware that can't trigger an
> mmu fault. The special management of an immutable file is acceptable
> to get these capabilities.

I guess this applies to any user of get_user_pages() on a DAX-mapped file.  Hmm.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-20 16:17                                 ` Dan Williams
       [not found]                                   ` <CAPcyv4jkH6iwDoG4NnCaTNXozwYgVXiJDe2iFSONcE63KvGQoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-20 23:53                                   ` Dave Chinner
  2017-06-21  1:24                                     ` Darrick J. Wong
  1 sibling, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2017-06-20 23:53 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Andy Lutomirski, Ross Zwisler, Rudoff, Andy,
	Andrew Morton, Jan Kara, linux-nvdimm, Linux API, linux-kernel,
	linux-mm, Jeff Moyer, Linux FS Devel

On Tue, Jun 20, 2017 at 09:17:36AM -0700, Dan Williams wrote:
> On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig <hch@lst.de> wrote:
> > [stripped giant fullquotes]
> >
> > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
> >> But that's my whole point.  The kernel doesn't really need to prevent
> >> all these background maintenance operations -- it just needs to block
> >> .page_mkwrite until they are synced.  I think that whatever new
> >> mechanism we add for this should be sticky, but I see no reason why
> >> the filesystem should have to block reflink on a DAX file entirely.
> >
> > Agreed - IFF we want to support write through semantics this is the
> > only somewhat feasible way.  It still has massive downsides of forcing
> > the full sync machinery to run from the page fauly handler, which
> > I'm rather scared off, but that's still better than creating a magic
> > special case that isn't managable at all.
> 
> An immutable-extent DAX-file and a reflink-capable DAX-file are not
> mutually exclusive,

Actually, they are mutually exclusive: when the immutable extent DAX
inode is breaking the extent sharing done during the reflink
operation, the copy-on-write operation requires allocating and
freeing extents on the inode that has immutable extents. Which, if
the inode really has immutable extents, cannot be done.

That said, if the extent sharing is broken on the other side of the
reflink (i.e. the non-immutable inode created by the reflink) then
the extent map of the inode with immutable extents will remain
unchanged. i.e. there are two sides to this, and if you only see one
side you might come to the wrong conclusion.

However, we cannot guarantee that no writes occur to the inode with
immutable extent maps (especially as the whole point is to allow
userspace writes and commits without the kernel being involved), so
extent sharing on immutable extent maps cannot be allowed...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-20 23:53                                   ` Dave Chinner
@ 2017-06-21  1:24                                     ` Darrick J. Wong
  2017-06-21  2:19                                       ` Dave Chinner
  0 siblings, 1 reply; 39+ messages in thread
From: Darrick J. Wong @ 2017-06-21  1:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, Christoph Hellwig, Andy Lutomirski, Ross Zwisler,
	Rudoff, Andy, Andrew Morton, Jan Kara, linux-nvdimm, Linux API,
	linux-kernel, linux-mm, Jeff Moyer, Linux FS Devel

On Wed, Jun 21, 2017 at 09:53:46AM +1000, Dave Chinner wrote:
> On Tue, Jun 20, 2017 at 09:17:36AM -0700, Dan Williams wrote:
> > On Tue, Jun 20, 2017 at 1:49 AM, Christoph Hellwig <hch@lst.de> wrote:
> > > [stripped giant fullquotes]
> > >
> > > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
> > >> But that's my whole point.  The kernel doesn't really need to prevent
> > >> all these background maintenance operations -- it just needs to block
> > >> .page_mkwrite until they are synced.  I think that whatever new
> > >> mechanism we add for this should be sticky, but I see no reason why
> > >> the filesystem should have to block reflink on a DAX file entirely.
> > >
> > > Agreed - IFF we want to support write through semantics this is the
> > > only somewhat feasible way.  It still has massive downsides of forcing
> > > the full sync machinery to run from the page fauly handler, which
> > > I'm rather scared off, but that's still better than creating a magic
> > > special case that isn't managable at all.
> > 
> > An immutable-extent DAX-file and a reflink-capable DAX-file are not
> > mutually exclusive,
> 
> Actually, they are mutually exclusive: when the immutable extent DAX
> inode is breaking the extent sharing done during the reflink
> operation, the copy-on-write operation requires allocating and
> freeing extents on the inode that has immutable extents. Which, if
> the inode really has immutable extents, cannot be done.
> 
> That said, if the extent sharing is broken on the other side of the
> reflink (i.e. the non-immutable inode created by the reflink) then
> the extent map of the inode with immutable extents will remain
> unchanged. i.e. there are two sides to this, and if you only see one
> side you might come to the wrong conclusion.
> 
> However, we cannot guarantee that no writes occur to the inode with
> immutable extent maps (especially as the whole point is to allow
> userspace writes and commits without the kernel being involved), so
> extent sharing on immutable extent maps cannot be allowed...

Just to play devil's advocate...

/If/ you have rmap and /if/ you discover that there's only one
IOMAP_IMMUTABLE file owning this same block and /if/ you're willing to
relocate every other mapping on the whole filesystem, /then/ you could
/in theory/ support shared daxfiles.

However, that's so many on-disk metadata lookups to shove into a
pagefault handler that I don't think anyone in XFSland would entertain
such an ugly fantasy.  You'd be making a lot of metadata requests, and
you'd have to lock the rmapbt while grabbing inodes, which is insane.

Much easier to have a per-inode flag that says "the block map of this
file does not change" and put up with the restricted semantics.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-20 16:14                                 ` Andy Lutomirski
@ 2017-06-21  1:40                                   ` Dave Chinner
  2017-06-21  5:18                                     ` Andy Lutomirski
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2017-06-21  1:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dan Williams, Ross Zwisler, andy.rudoff, Andrew Morton, Jan Kara,
	linux-nvdimm, Linux API, linux-kernel, linux-mm, Jeff Moyer,
	Linux FS Devel, Christoph Hellwig

On Tue, Jun 20, 2017 at 09:14:24AM -0700, Andy Lutomirski wrote:
> On Tue, Jun 20, 2017 at 3:11 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Jun 19, 2017 at 10:53:12PM -0700, Andy Lutomirski wrote:
> >> On Mon, Jun 19, 2017 at 5:46 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Mon, Jun 19, 2017 at 08:22:10AM -0700, Andy Lutomirski wrote:
> >> >> Second: syncing extents.  Here's a straw man.  Forget the mmap() flag.
> >> >> Instead add a new msync() operation:
> >> >>
> >> >> msync(start, length, MSYNC_PMEM_PREPARE_WRITE);
> >> >
> >> > How's this any different from the fallocate command I proposed to do
> >> > this (apart from the fact that msync() is not intended to be abused
> >> > as a file preallocation/zeroing interface)?
> >>
> >> I must have missed that suggestion.
> >>
> >> But it's different in a major way.  fallocate() takes an fd parameter,
> >> which means that, if some flag gets set, it's set on the struct file.
> >
> > DAX is a property of the inode, not the VMA or struct file as it
> > needs to be consistent across all VMAs and struct files that
> > reference that inode. Also, fallocate() manipulates state and
> > metadata hidden behind the struct inode, not the struct file, so it
> > seems to me like the right API to use.
> 
> I'm not sure I see why.  I can think of a few different scenarios:
> 
>  - Reflink while a DAX-using program is running.  It would be nice for
> performance, but not critical for functionality, if trying to reflink
> a file that's mapped for DAX would copy instead of COWing.  But
> breaking COW on the next page_mkwrite would work, too. 

Your mangling terminology here. We don't "break COW" - we *use*
copy-on-write to break *extent sharing*. We can break extent sharing
in page_mkwrite - that's exactly what we do for normal pagecache
based mmap writes, and it's done in page_mkwrite.

It hasn't been enabled it for DAX yet because it simply hasn't been
robustly tested yet.

> A per-inode
> count of the number of live DAX mappings or of the number of struct
> file instances that have requested DAX would work here.

For what purpose does this serve? The reflink invalidates all the
existing mappings, so the next write access causes a fault and then
page_mkwrite is called and the shared extent will get COWed....

>  - Trying to use DAX on a file that is already reflinked.  The order
> of operations doesn't matter hugely, except that breaking COW for the
> entire range in question all at once would be faster and result in
> better allocation.

We have COW extent size hints for that. i.e. if you want to COW a
huge page at a time, set the COW extent size hint to the huge page
size...

>  - Checksumming filesystems.  I think it's basically impossible to do
> DAX writes to a file like this.

No shit, sherlock. See my previous comments about compression and
encryption. DAX cannot be used where in-place data manipulations
would be required by the IO path.

> A filesystem could skip checksumming
> on extents that are actively mapped for DAX, but something like chattr
> to tell, say, btrfs that a given file is intended for DAX is probably
> better.  (But, if so, it should presumably be persistent like chattr
> and not something that's purely in memory like daxctl().)

You can already do this on btrfs regardless of DAX. There's an inode
flag:

#define BTRFS_INODE_NODATASUM           (1 << 0)

And you set it by turning off copy-on-write when the file is empty.
i.e. just after creation, run:

	ioctl(FS_IOC_GETFLAGS, &flags).
	flags |= FS_NOCOW_FL;
	ioctl(FS_IOC_SETFLAGS, &flags).

You're talking about stuff that already exists - stop trying to tell
me about basic filesystem functionality and DAX requirements that I
understood years ago. Please start with the assumption that I know a
lot more about this than you do, and if there's something you write
that I don't understand I'll ask you to explain....

>  - Defrag and such on an otherwise simple filesystem.  Defragging a
> file while it's not actively mapped for DAX should be allowed.
> Defragging a file while it is actively mapped for DAX may be slow, but
> it a filesystem could certainly make it work.

FYI, XFS already does this - it skips mapped files altogether, so
DAX state simply doesn't matter here. We also have the FS_NODEFRAG
ioctl flag, so admins can choose to mark files that they want defrag
to skip...

> > And, as mmap() requires a fd to set up the mapping and fallocate()
> > would have to be run *before* mmap() is used to access the data
> > directly, I don't see why using fallocate would be a problem here...
> 
> Fair enough.  But ISTM fallocate() has no business setting up
> important state in the struct file.  It's an operation on inodes.

Yup, the state in question is kept in inode->i_flags rather than
duplicated into the struct file. Which leads to code like this in
the page fault handlers. e.g. in xfs_filemap_page_mkwrite():

	struct inode            *inode = file_inode(vma->vm_file);
	....
	if (IS_DAX(inode)) {
		.....

Yeah, the page fault behaviour is determined by state kept on the
inode, not the struct file or vma....

> Looking at the above scenarios, it seems to may that two or three
> separate mechanisms may be ideal here.
> 
> 1. The most important: some way to tell the kernel that an open file
> description or an mmap or some subrange thereof is going to be used
> for DAX writes and for the kernel to respond by saying "yes, okay" or
> "no, not okay". 

A file can only be accessed by DAX or through the page cache at a
point in time - it can't do both because the mapping infrastructure
(e.g. the radix tree) can only handle one or the other at any point
in time. So, you want to change the DAX access mode of an open file?
We have that on XFS. Turn on DAX:

	ioctl(FS_IOC_FSGETXATTR, &flags).
	flags |= FS_XFLAG_DAX;
	ioctl(FS_IOC_FSSETXATTR, &flags).

This will throw an error if the kernel and/or fs does not support
DAX. The fs will lock out page faults on the inode, invalidate all
the existing mappings, remove the cached pages and switch to DAX
mode.

Turn off DAX:

	ioctl(FS_IOC_FSGETXATTR, &flags).
	flags &= ~FS_XFLAG_DAX;
	ioctl(FS_IOC_FSSETXATTR, &flags).

And the filesystem will lock out page faults, invalidate all the DAX
mappings and switch to cached mode...

> 2. Some way to tell filesystems like btrfs to make a file that will be
> DAX-able.

See above.

> > Prediction for the MAP_SYNC future: frequent bug reports about
> > huge, unpredictable page fault latencies on DAX files because
> > every so often a page fault is required to sync tens of
> > thousands of unrelated dirty objects because of filesystem
> > journal ordering constraints....
> 
> Is this really so bad?

Apparently it is. There are people telling us that mtime
updates in page faults introduce too much unpredictable latency and
that screws over their low latency real time applications.

Those same people are telling use that dirty tracking in page faults
for msync/fsync on DAX is too heavyweight and calling msync is too
onerous and has unpredictable latencies because it might result in
having to sync tens of thousands of unrelated dirty objects. Hence
they want to use userspace data sync primitives to avoid this
overhead and so filesystems need to make it possible to provide this
userspace idata sync capability.

So, we came up with a method that removed all overhead and
unpredictability from the page fault path, and now we're being told
that calling fallocate() is too hard and difficult and the
restrictions that prevent all the page fault overhead (operations
which you won't want to be doing for low-latency apps in the first
place) are too onerous.

And so the proposed solution is an API that requires the filesystem
to *run fdatasync in every write page fault*?

Can you see the contradictions in the requirements here? On one hand
we're being told page faults have to be low overhead and have
predictable latency, but when presented with a solution we're told
you want page faults to be blocked arbitrarily and for unbound
lengths of time.

Put simply: I don't care what gets implemented. What I care about is
that everyone understands that we're being given contradictory
requirements and that neither of the proposed solutions solves both
sets of requirements.

> I suspect that users saying "your filesystem is slower than I'd
> like for such-and-such use case" isn't all that rare.

Except the people asking for userspace data sync are asking for it
for *performance reasons*. Making that explicit case *slower* is the
exact opposite of what we've been asked to provide....

Cheers,

Dave.


-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-21  1:24                                     ` Darrick J. Wong
@ 2017-06-21  2:19                                       ` Dave Chinner
  0 siblings, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2017-06-21  2:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, Christoph Hellwig, Andy Lutomirski, Ross Zwisler,
	Rudoff, Andy, Andrew Morton, Jan Kara, linux-nvdimm, Linux API,
	linux-kernel, linux-mm, Jeff Moyer, Linux FS Devel

On Tue, Jun 20, 2017 at 06:24:03PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 21, 2017 at 09:53:46AM +1000, Dave Chinner wrote:
> > On Tue, Jun 20, 2017 at 09:17:36AM -0700, Dan Williams wrote:
> > > An immutable-extent DAX-file and a reflink-capable DAX-file are not
> > > mutually exclusive,
> > 
> > Actually, they are mutually exclusive: when the immutable extent DAX
> > inode is breaking the extent sharing done during the reflink
> > operation, the copy-on-write operation requires allocating and
> > freeing extents on the inode that has immutable extents. Which, if
> > the inode really has immutable extents, cannot be done.
> > 
> > That said, if the extent sharing is broken on the other side of the
> > reflink (i.e. the non-immutable inode created by the reflink) then
> > the extent map of the inode with immutable extents will remain
> > unchanged. i.e. there are two sides to this, and if you only see one
> > side you might come to the wrong conclusion.
> > 
> > However, we cannot guarantee that no writes occur to the inode with
> > immutable extent maps (especially as the whole point is to allow
> > userspace writes and commits without the kernel being involved), so
> > extent sharing on immutable extent maps cannot be allowed...
> 
> Just to play devil's advocate...
> 
> /If/ you have rmap and /if/ you discover that there's only one
> IOMAP_IMMUTABLE file owning this same block and /if/ you're willing to
> relocate every other mapping on the whole filesystem, /then/ you could
> /in theory/ support shared daxfiles.

I figured that nobody apart from experienced filesystem developers
would understand the complexities of rmap and refcounts and how they
could be abused to do this. I also assumed that that people like you
would understand this is possible but completely impractical....

> However, that's so many on-disk metadata lookups to shove into a
> pagefault handler that I don't think anyone in XFSland would entertain
> such an ugly fantasy.  You'd be making a lot of metadata requests, and
> you'd have to lock the rmapbt while grabbing inodes, which is insane.

Exactly. But while I understand this, consider the amount of assumed
filesystem and XFS knowledge in that one simple paragraph. Most
non-experts would have stopped *understanding* at "/If/ you have
rmap" and go away with the wrong ideas in their heads. Hence I now
tend to omit mentioning "possible but impractical" things in mixed
expertise discussions....

> Much easier to have a per-inode flag that says "the block map of this
> file does not change" and put up with the restricted semantics.

In a nutshell.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-21  1:40                                   ` Dave Chinner
@ 2017-06-21  5:18                                     ` Andy Lutomirski
       [not found]                                       ` <CALCETrVYmbyNS-btvsN_M-QyWPZA_Y_4JXOM893g7nhZA+WviQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 39+ messages in thread
From: Andy Lutomirski @ 2017-06-21  5:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-nvdimm, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andy Lutomirski, Linux FS Devel,
	Andrew Morton

On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> Your mangling terminology here. We don't "break COW" - we *use*
> copy-on-write to break *extent sharing*. We can break extent sharing
> in page_mkwrite - that's exactly what we do for normal pagecache
> based mmap writes, and it's done in page_mkwrite.

Right, my bad.

>
> It hasn't been enabled it for DAX yet because it simply hasn't been
> robustly tested yet.
>
>> A per-inode
>> count of the number of live DAX mappings or of the number of struct
>> file instances that have requested DAX would work here.
>
> For what purpose does this serve? The reflink invalidates all the
> existing mappings, so the next write access causes a fault and then
> page_mkwrite is called and the shared extent will get COWed....

The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it
right), except that IMO an API that doesn't involve making a change to
an inode that sticks around would be nice.  The inode flag has the
unfortunate property that, if two different programs each try to set
the flag, mmap, write, and clear the flag, they'll stomp on each other
and risk data corruption.

I admit I'm now thoroughly confused as to exactly what XFS does here
-- does FS_XFLAG_DAX persist across unmount/mount?

>
>>  - Trying to use DAX on a file that is already reflinked.  The order
>> of operations doesn't matter hugely, except that breaking COW for the
>> entire range in question all at once would be faster and result in
>> better allocation.
>
> We have COW extent size hints for that. i.e. if you want to COW a
> huge page at a time, set the COW extent size hint to the huge page
> size...

Nifty.

> Apparently it is. There are people telling us that mtime
> updates in page faults introduce too much unpredictable latency and
> that screws over their low latency real time applications.

I was one of those, and I even wrote patches.  I should try to dust them off.

>
> Those same people are telling use that dirty tracking in page faults
> for msync/fsync on DAX is too heavyweight and calling msync is too
> onerous and has unpredictable latencies because it might result in
> having to sync tens of thousands of unrelated dirty objects. Hence
> they want to use userspace data sync primitives to avoid this
> overhead and so filesystems need to make it possible to provide this
> userspace idata sync capability.

If I were using DAX in production, I'd have exactly this issue.  Let
me quote myself:

On Tue, Jun 20, 2017 at 9:14 AM, Andy Lutomirski <luto-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> 3. (Not strictly related to DAX.) A way to tell the kernel "I have
> this file mmapped for write.  Please go out of your way to avoid
> future page faults."  I've wanted this for ordinary files on ext4.
> The kernel could, but presently does not, use hardware dirty tracking
> instead of software dirty tracking to decide when to write the page
> back.  The kernel could also, in principle, write back dirty pages
> without ever write protecting them.  For DAX, this might change
> behavior to prevent any operation that would relocate blocks or to
> have the kernel go out of its way to only do such operations when
> absolutely necessary and to immediately update and unwriteprotect the
> relevant pages.

I agree that this is a real issue, but it's not limited to DAX.  I've
wanted a mode where I tell the kernel "I'm a high-performance
application mmapping this file and I'm going to write to it a lot.  Do
your best to avoid any page faults, even if it adversely affects the
performance of the system."  This mode could do lots of things.  It
could cause the system to leave the page writable even after writeback
and, if possible, to use hardware dirty tracking.  It could cause the
system to make a copy of the page and write back from the copy if
there is anything in play that could need stable pages during
writeback.  And, for DAX, it could tell the system to keep the page
pinned and disallow moving it and reflinking it.

(Of course, the above requires that we either deal with mtime like my
patches do or that this heavyweight mechanism disable mtime updates.
I prefer the former.)

Here's the overall point I'm trying to make: unprivileged programs
that want to write to DAX files with userspace commit mechanisms
(CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without
privilege, and with reasonably clean APIs.  Ideally they could do this
to any file they have write access to.  Programs that want to write to
mmapped files, DAX or otherwise, without latency spikes due to
.page_mkwrite should be able to opt in to a heavier weight mechanism.
But these two issues are someone independent, and I think they should
be solved separately.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
       [not found]     ` <20170620052214.GA3787-PTl6brltDGh4DFYR7WNSRA@public.gmane.org>
@ 2017-06-21 23:37       ` Dave Chinner
  2017-06-22  7:23         ` Darrick J. Wong
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2017-06-21 23:37 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Dan Williams, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Jan Kara,
	linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw,
	linux-api-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Jeff Moyer,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Ross Zwisler,
	Christoph Hellwig, xfs

On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote:
> [add linux-xfs to the fray]
> 
> On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote:
> > +	spin_lock(&dax_lock);
> > +	list_add(&d->list, &daxfiles);
> > +	spin_unlock(&dax_lock);
> > +
> > +	/*
> > +	 * We set S_SWAPFILE to gain "no truncate" / static block
> > +	 * allocation semantics, and S_DAXFILE so we can differentiate
> > +	 * traditional swapfiles and assume static block mappings in the
> > +	 * dax mmap path.
> > +	 */
> > +	inode->i_flags |= S_SWAPFILE | S_DAXFILE;
> 
> Yikes.  You know, I hadn't even thought about considering swap files as
> a subcase of files with immutable block maps, but here we are.  Both
> swap files and DAX require absolutely stable block mappings, they are
> both (probably) intolerant of inode metadata changes (size, mtime, etc.)

Swap files are intolerant of any metadata changes because once the
mapping has been sucked into the swapfile code, the inode is never
looked at again. DAX file data access always goes through the inode,
so they is much more tolerant of metadata changes given certain
constraints.

<snip bmap rant>

> Honestly, I realize we've gone back, forth, and around all over the
> place on this.  I still prefer something similar to a permanent flag,
> similar to what Dave suggested, though I hate the name PMEM_IMMUTABLE
> and some of the semantics.
> 
> First, a new inode flag S_IOMAP_FROZEN that means the file's block map
> cannot change.

I've been calling it "immutable extents" - freezing has implications
that it's only temporary (i.e. freezing filesystems) and will be
followed shortly by a thaw. That isn't the case here - we truly want
the extent/block map to be immutable....

> Second, some kind of function to toggle the S_IOMAP_FROZEN flag.
> Turning it on will lock the inode, check the extent map for holes,
> shared, or unwritten bits, and bail out if it finds any, or set the
> flag. 

Hmmm, I disagree on the unwritten state here.  We want swap files to
be able to use unwritten extents - it means we can preallocate the
swap file and hand it straight to swapon without having to zero it
(i.e. no I/O needed to demand allocate more swap space when memory
is very low).  Also, anyone who tries to read the swap file from
userspace will be reading unwritten extents, which will always
return zeros rather than whatever is in the swap file...

> Not sure if we should require CAP_LINUX_IMMUTABLE -- probably
> yes, at least at first.  I don't currently have any objection to writing
> non-iomap inode metadata out to disk.
> 
> Third, the flag can only be cleared if the file isn't mapped.

How do we check this from the fs without racing? AFAICT we can't
prevent a concurrent map operation from occurring while we are
changing the state of the inode - we can only block page faults
after then inode is mapped....

> Fourth, the VFS entry points for things like read, write, truncate,
> utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a
> file, so that the block map cannot be modified.
> mmap is still allowed,
> as we've discussed.  /Maybe/ we can allow fallocate to extend a file
> with zeroed extents (it will be slow) as I've heard murmurs about
> wanting to be able to extend a file, maybe not.

read is fine, write should be fine as long as the iomap call can
error out operations that would require extent map modifications.
fallocate should be allowed to modify the extent map, too, because
it should be the mechanism used be applications to set up file
extents in the correct form for applications to use as immutable
(i.e. lock out page faults, allocate, zero, extend and fsync in
one atomic operation)....

> Fifth, swapfiles now require the S_IOMAP_FROZEN flag since they want
> stable iomap but probably don't care about things like mtime.  Maybe
> they can call iomap too.
> 
> Sixth, XFS can record the S_IOMAP_FROZEN state in di_flags2 and set it
> whenever the in-core inode gets constructed.  This enables us to
> prohibit reflinking and other such undesirable activity.

*nod*

> If we actually /do/ come up with a reference implementation for XFS, I'd
> be ok with tacking it on the end of my dev branch, which will give us a
> loooong runway to try it out.  The end of the dev branch is beyond
> online XFS fsck and repair and the "root metadata btrees in inodes"
> rework; since that's ~90 patches with my name on it that I cannot also
> review, it won't go in for a long time indeed!

I don't think it's so complex to need such a long dev time -
all the infrastructure we need is pretty much there already...

> (Yes, that was also sort of a plea for someone to go review the XFS
> scrub patches.)
> 
> > +	return 0;
> > +}
> > +
> > +SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align)
> 
> I was /about/ to grouse about this syscall, then realized that maybe it
> /is/ useful to be able to check a specific alignment.  Maybe not, since
> I had something more permanent in mind anyway.  In any case, just pass
> in an opened fd if this sticks around.

We can do all that via fallocate(), too...

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
       [not found]                                       ` <CALCETrVYmbyNS-btvsN_M-QyWPZA_Y_4JXOM893g7nhZA+WviQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2017-06-22  0:02                                         ` Dave Chinner
  2017-06-22  4:07                                           ` Andy Lutomirski
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2017-06-22  0:02 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jan Kara, linux-nvdimm, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Linux FS Devel, Andrew Morton

On Tue, Jun 20, 2017 at 10:18:24PM -0700, Andy Lutomirski wrote:
> On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
> >> A per-inode
> >> count of the number of live DAX mappings or of the number of struct
> >> file instances that have requested DAX would work here.
> >
> > For what purpose does this serve? The reflink invalidates all the
> > existing mappings, so the next write access causes a fault and then
> > page_mkwrite is called and the shared extent will get COWed....
> 
> The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it
> right), except that IMO an API that doesn't involve making a change to
> an inode that sticks around would be nice.  The inode flag has the
> unfortunate property that, if two different programs each try to set
> the flag, mmap, write, and clear the flag, they'll stomp on each other
> and risk data corruption.
> 
> I admit I'm now thoroughly confused as to exactly what XFS does here
> -- does FS_XFLAG_DAX persist across unmount/mount?

Yes, it is.

i.e. DAX on XFS does not rely on a naive fs-wide mount option. You
can have applications on pmem filesystems use either DAX or normal
IO based on directory/inode flags.  Something doesn't work with DAX,
so just remove the DAX flags from the directories/inodes, and it
will safely and transparently switch to page-cache based IO.

<snip>

> Here's the overall point I'm trying to make: unprivileged programs
> that want to write to DAX files with userspace commit mechanisms
> (CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without
> privilege, and with reasonably clean APIs.  Ideally they could do this
> to any file they have write access to.

The privilege argument is irrelevant now - it was /suggested/
initially as a way of preventing people from shooting themselves in
the foot based on the immutable file model. It's clear that's not
desired, and it's not a show stopper. 

> Programs that want to write to
> mmapped files, DAX or otherwise, without latency spikes due to
> .page_mkwrite should be able to opt in to a heavier weight mechanism.
> But these two issues are someone independent, and I think they should
> be solved separately.

You seem to be calling the "fdatasync on every page fault" the
"lightweight" option. That's the brute-force-with-big-hammer
solution - it's most definitely not lightweight as every page fault
has extra overhead to call ->fsync(). Sure, the API is simple, but
the runtime overhead is significant.

The lightweight *runtime* option is to set up the file in such a
way that there is never any extra overhead at page fault time.  This
is what immutable extent maps provide.  Indeed, because the mappings
never change, you could use hardware dirty tracking if you wanted,
as there's no need to look up the filesystem to do writeback as
everything needed for writeback was mapped at page fault time.  This
"map first and then just write when you need to" is *exactly how
swap files work*.

Even if you are considering the complexity of the APIs, it's hardly
a "heavyweight" when it only requires a single call to fallocate()
before mmap() to set up the immutable extents on the file...

Cheers,

Dave.
-- 
Dave Chinner
david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-22  0:02                                         ` Dave Chinner
@ 2017-06-22  4:07                                           ` Andy Lutomirski
  2017-06-23  0:52                                             ` Dave Chinner
  0 siblings, 1 reply; 39+ messages in thread
From: Andy Lutomirski @ 2017-06-22  4:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-nvdimm, Linux API,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andy Lutomirski, Linux FS Devel,
	Andrew Morton

On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:
>
> You seem to be calling the "fdatasync on every page fault" the

It's the opposite of fdatasync().  It needs to sync whatever metadata
is needed to find the data.  The data doesn't need to be synced.

> "lightweight" option. That's the brute-force-with-big-hammer
> solution - it's most definitely not lightweight as every page fault
> has extra overhead to call ->fsync(). Sure, the API is simple, but
> the runtime overhead is significant.

It's lightweight in terms of its impact on the filesystem.  It doesn't
need any persistent setup -- you can just use it.

> Even if you are considering the complexity of the APIs, it's hardly
> a "heavyweight" when it only requires a single call to fallocate()
> before mmap() to set up the immutable extents on the file...

So what would the exact semantics be?  In particular, how can it fail?
 If I do the fallocate(), is it absolutely promised that the extent
map won't get out of sync between what mmap sees and what's on disk?
Do user programs need to worry about colliding with each other when
one does fallocate() to DAXify a file and the other does fallocate()
to unDAXify a file?  Does this particular fallocate() call still keep
its effect after a reboot?

These issues are why I think it would be nicer to have an API that
makes a particular mapping or fd be unconditionally *correct* and then
to provide something else that makes it avoid latency spikes.

Is there an actual concrete proposal that's reviewable?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-20 15:42     ` Ross Zwisler
@ 2017-06-22  7:09       ` Darrick J. Wong
  0 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2017-06-22  7:09 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, akpm, Jan Kara, linux-nvdimm,
	linux-api, Dave Chinner, linux-kernel, linux-mm, Jeff Moyer,
	linux-fsdevel, Christoph Hellwig, xfs

On Tue, Jun 20, 2017 at 09:42:55AM -0600, Ross Zwisler wrote:
> On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote:
> <>
> > Fourth, the VFS entry points for things like read, write, truncate,
> > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a
> > file, so that the block map cannot be modified.  mmap is still allowed,
> > as we've discussed.  /Maybe/ we can allow fallocate to extend a file
> > with zeroed extents (it will be slow) as I've heard murmurs about
> > wanting to be able to extend a file, maybe not.
> 
> Read and write should still be allowed, right?

<shrug> I had thought the usage model was pretty slanted towards mmap,
but it's not a big deal to turn read/writes into glorified memcpy,
provided we reject the io request if it goes past EOF.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-21 23:37       ` Dave Chinner
@ 2017-06-22  7:23         ` Darrick J. Wong
  0 siblings, 0 replies; 39+ messages in thread
From: Darrick J. Wong @ 2017-06-22  7:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dan Williams, akpm, Jan Kara, linux-nvdimm, linux-api,
	linux-kernel, linux-mm, Jeff Moyer, linux-fsdevel, Ross Zwisler,
	Christoph Hellwig, xfs

On Thu, Jun 22, 2017 at 09:37:14AM +1000, Dave Chinner wrote:
> On Mon, Jun 19, 2017 at 10:22:14PM -0700, Darrick J. Wong wrote:
> > [add linux-xfs to the fray]
> > 
> > On Fri, Jun 16, 2017 at 06:15:35PM -0700, Dan Williams wrote:
> > > +	spin_lock(&dax_lock);
> > > +	list_add(&d->list, &daxfiles);
> > > +	spin_unlock(&dax_lock);
> > > +
> > > +	/*
> > > +	 * We set S_SWAPFILE to gain "no truncate" / static block
> > > +	 * allocation semantics, and S_DAXFILE so we can differentiate
> > > +	 * traditional swapfiles and assume static block mappings in the
> > > +	 * dax mmap path.
> > > +	 */
> > > +	inode->i_flags |= S_SWAPFILE | S_DAXFILE;
> > 
> > Yikes.  You know, I hadn't even thought about considering swap files as
> > a subcase of files with immutable block maps, but here we are.  Both
> > swap files and DAX require absolutely stable block mappings, they are
> > both (probably) intolerant of inode metadata changes (size, mtime, etc.)
> 
> Swap files are intolerant of any metadata changes because once the
> mapping has been sucked into the swapfile code, the inode is never
> looked at again. DAX file data access always goes through the inode,
> so they is much more tolerant of metadata changes given certain
> constraints.

Fair enough.

> <snip bmap rant>
> 
> > Honestly, I realize we've gone back, forth, and around all over the
> > place on this.  I still prefer something similar to a permanent flag,
> > similar to what Dave suggested, though I hate the name PMEM_IMMUTABLE
> > and some of the semantics.
> > 
> > First, a new inode flag S_IOMAP_FROZEN that means the file's block map
> > cannot change.
> 
> I've been calling it "immutable extents" - freezing has implications
> that it's only temporary (i.e. freezing filesystems) and will be
> followed shortly by a thaw. That isn't the case here - we truly want
> the extent/block map to be immutable....

<nod> S_IOMAP_IMMUTABLE it is, then.

> > Second, some kind of function to toggle the S_IOMAP_FROZEN flag.
> > Turning it on will lock the inode, check the extent map for holes,
> > shared, or unwritten bits, and bail out if it finds any, or set the
> > flag. 
> 
> Hmmm, I disagree on the unwritten state here.  We want swap files to
> be able to use unwritten extents - it means we can preallocate the
> swap file and hand it straight to swapon without having to zero it
> (i.e. no I/O needed to demand allocate more swap space when memory
> is very low).  Also, anyone who tries to read the swap file from
> userspace will be reading unwritten extents, which will always
> return zeros rather than whatever is in the swap file...

Now I've twisted all the way around to thinking that swap files
should be /totally/ unwritten, except for the file header. :)

> > Not sure if we should require CAP_LINUX_IMMUTABLE -- probably
> > yes, at least at first.  I don't currently have any objection to writing
> > non-iomap inode metadata out to disk.
> > 
> > Third, the flag can only be cleared if the file isn't mapped.
> 
> How do we check this from the fs without racing? AFAICT we can't
> prevent a concurrent map operation from occurring while we are
> changing the state of the inode - we can only block page faults
> after then inode is mapped....

I'd thought we could coordinate that via xfs_file_mmap, but tbh my brain
paged that out a while ago.

> > Fourth, the VFS entry points for things like read, write, truncate,
> > utimes, fallocate, etc. all just bail out if S_IOMAP_FROZEN is set on a
> > file, so that the block map cannot be modified.
> > mmap is still allowed,
> > as we've discussed.  /Maybe/ we can allow fallocate to extend a file
> > with zeroed extents (it will be slow) as I've heard murmurs about
> > wanting to be able to extend a file, maybe not.
> 
> read is fine, write should be fine as long as the iomap call can
> error out operations that would require extent map modifications.

Ok.

> fallocate should be allowed to modify the extent map, too, because
> it should be the mechanism used be applications to set up file
> extents in the correct form for applications to use as immutable
> (i.e. lock out page faults, allocate, zero, extend and fsync in
> one atomic operation)....

<nod>

> > Fifth, swapfiles now require the S_IOMAP_FROZEN flag since they want
> > stable iomap but probably don't care about things like mtime.  Maybe
> > they can call iomap too.
> > 
> > Sixth, XFS can record the S_IOMAP_FROZEN state in di_flags2 and set it
> > whenever the in-core inode gets constructed.  This enables us to
> > prohibit reflinking and other such undesirable activity.
> 
> *nod*
> 
> > If we actually /do/ come up with a reference implementation for XFS, I'd
> > be ok with tacking it on the end of my dev branch, which will give us a
> > loooong runway to try it out.  The end of the dev branch is beyond
> > online XFS fsck and repair and the "root metadata btrees in inodes"
> > rework; since that's ~90 patches with my name on it that I cannot also
> > review, it won't go in for a long time indeed!
> 
> I don't think it's so complex to need such a long dev time -
> all the infrastructure we need is pretty much there already...

It definitely is, but we've been bikeshedding so long it now has
momentum. :)

That said, it (and having a go at dax+reflink) are things that I'd like
to look at (after pushing the scrub stuff) for the rest of the year.

> > (Yes, that was also sort of a plea for someone to go review the XFS
> > scrub patches.)
> > 
> > > +	return 0;
> > > +}
> > > +
> > > +SYSCALL_DEFINE3(daxctl, const char __user *, path, int, flags, int, align)
> > 
> > I was /about/ to grouse about this syscall, then realized that maybe it
> > /is/ useful to be able to check a specific alignment.  Maybe not, since
> > I had something more permanent in mind anyway.  In any case, just pass
> > in an opened fd if this sticks around.
> 
> We can do all that via fallocate(), too...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-22  4:07                                           ` Andy Lutomirski
@ 2017-06-23  0:52                                             ` Dave Chinner
  2017-06-23  3:07                                               ` Andy Lutomirski
  0 siblings, 1 reply; 39+ messages in thread
From: Dave Chinner @ 2017-06-23  0:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dan Williams, Ross Zwisler, Rudoff, Andy, Andrew Morton,
	Jan Kara, linux-nvdimm, Linux API, linux-kernel, linux-mm,
	Jeff Moyer, Linux FS Devel, Christoph Hellwig

On Wed, Jun 21, 2017 at 09:07:57PM -0700, Andy Lutomirski wrote:
> On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > You seem to be calling the "fdatasync on every page fault" the
> 
> It's the opposite of fdatasync().  It needs to sync whatever metadata
> is needed to find the data.  The data doesn't need to be synced.

So much wrong with that statement.

Andy, what does fdatasync() do when you have a data-clean,
metadata-dirty file (e.g. you just punched a hole  or preallocated
more space via fallocate())?  Hint: it doesn't sync any data
because the mapping tree is clean, but it still syncs the dirty
metadata needed to access the data.

Now, what does a file where we do direct IO writes look like? Yup,
the mapping tree always remains clean and so it's only ever going to
appear to the kernel as a *data-clean, metadata-dirty* file. So,
after a direct IO write is done, what operation do we need to run to
ensure that we can always access the data?

Yup, it's fdatasync().

So, what does a DAX file that does userspace data flushes look like
to the kernel? Yup, again the mapping tree always remains clean and
so it's only ever going to be a *data-clean, metadata-dirty* file.

It should be clear now why I said "fdatasync on every page fault"
because that's exactly the mechanism we'd use to implement this
functionality....

It should also be clear that DAX is not introducing any new data
integrity problems to the filesystems that direct IO hasn't already
introduced. Both DAX with userspace data sync and Direct IO writes
are completely untracked by the kernel.  IOWs, direct IO is a form
of "kernel bypass", just like DAX+userspace data sync is.  All that
is different is the method by which data is written to the storage
media from userspace, which in the case of DAX is via mmap rather
than read/write.

> > "lightweight" option. That's the brute-force-with-big-hammer
> > solution - it's most definitely not lightweight as every page fault
> > has extra overhead to call ->fsync(). Sure, the API is simple, but
> > the runtime overhead is significant.
> 
> It's lightweight in terms of its impact on the filesystem.  It doesn't
> need any persistent setup -- you can just use it.

Well, no, that's wrong, because we have to co-ordinate multiple
concurrent accesses to the data in the kernel. What happens when
some other process writes to the file *at the same time* but does
not use userspace sync? We aren't tracking dirty regions on the
inode mapping because we've been told not to do that, so fsync()
from that other process *won't sync the data it wrote*. IOws, the
kernel has failed to provide the guarantee that userspace wants it
to provide.

The single mapping tree is central to the problem here - we can't
mix modes of dirty tracking across different processes. Either
everything uses userspace sync, or everything uses kernel controlled
dirty tracking so fsync() works correctly in all cases. Put simply -
dirty tracking is a per-inode function, not a per-file or per-vma
function.

As the direct IO kernel-bypass model demonstrates, as soon as you
start considering multi-process data coherency and durability with
mixed kernel+kernel bypass methods in play, lots of potential
problems and issues crop up that can't easily be solved by the
kernel or filesystems. We try to minimise the problems, but we don't
guarantee mixed mode coherency (and hence integrity) as we've
delegated data coherency and integrity responsibility to the app
bypassing the kernel data coherency and integrity mechanisms.

What I'd like to avoid is creating another kernel bypass mechanism
where we allow coherency and/or integrity to be fucked up in a way that
we can't fix without giving up all the performance that the kernel
bypass provides userspace apps. Constrain the cases where kernel
bypass is allowed, and we avoid all the crappy corner cases where
our only answer to users with corrupt data is "the man page advises
application developers not to do that".

If in future we work out how to implement everything without
needing immutable extents in the inode, we can relax the
restrictions we've placed on userspace DAX data sync....

> > Even if you are considering the complexity of the APIs, it's hardly
> > a "heavyweight" when it only requires a single call to fallocate()
> > before mmap() to set up the immutable extents on the file...
> 
> So what would the exact semantics be?  In particular, how can it fail?
>  If I do the fallocate(), is it absolutely promised that the extent
> map won't get out of sync between what mmap sees and what's on disk?

That's precisely the guarantee I documented would be given by
immutable extents in my very first proposal.

> Do user programs need to worry about colliding with each other when
> one does fallocate() to DAXify a file and the other does fallocate()
> to unDAXify a file?

Yes, it can. This was one of the reasons for putting it under
privilege - so only the app has full control of the extent map
changes and nobody else can fuck with it.

> Does this particular fallocate() call still keep
> its effect after a reboot?

Yes, it does, because it has to be transparent and behave
consistently with all of userspace, not just the app that owns the
file, and not just while that app is running. (e.g. defrag could be
running on the file before the app starts, and then you're screwed
when defrag modifies the extent map after app startup...)

> Is there an actual concrete proposal that's reviewable?

Yes, the first posting where I proposed this functionality many
months ago spelled this all out in detail.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
  2017-06-23  0:52                                             ` Dave Chinner
@ 2017-06-23  3:07                                               ` Andy Lutomirski
  0 siblings, 0 replies; 39+ messages in thread
From: Andy Lutomirski @ 2017-06-23  3:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andy Lutomirski, Dan Williams, Ross Zwisler, Rudoff, Andy,
	Andrew Morton, Jan Kara, linux-nvdimm, Linux API, linux-kernel,
	linux-mm, Jeff Moyer, Linux FS Devel, Christoph Hellwig

On Thu, Jun 22, 2017 at 5:52 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Jun 21, 2017 at 09:07:57PM -0700, Andy Lutomirski wrote:
>> On Wed, Jun 21, 2017 at 5:02 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >
>> > You seem to be calling the "fdatasync on every page fault" the
>>
>> It's the opposite of fdatasync().  It needs to sync whatever metadata
>> is needed to find the data.  The data doesn't need to be synced.
>
> So much wrong with that statement.
>
> Andy, what does fdatasync() do when you have a data-clean,
> metadata-dirty file (e.g. you just punched a hole  or preallocated
> more space via fallocate())?  Hint: it doesn't sync any data
> because the mapping tree is clean, but it still syncs the dirty
> metadata needed to access the data.
>
> Now, what does a file where we do direct IO writes look like? Yup,
> the mapping tree always remains clean and so it's only ever going to
> appear to the kernel as a *data-clean, metadata-dirty* file. So,
> after a direct IO write is done, what operation do we need to run to
> ensure that we can always access the data?
>
> Yup, it's fdatasync().

Fair enough.  Except that fdatasync() goes through dax_writeback_one()
(I think), which deals with cache flushes (via wb_cache_pmem()).  This
special type of sync shouldn't need to do that, so it's not really
quite fdatasync().

>> > "lightweight" option. That's the brute-force-with-big-hammer
>> > solution - it's most definitely not lightweight as every page fault
>> > has extra overhead to call ->fsync(). Sure, the API is simple, but
>> > the runtime overhead is significant.
>>
>> It's lightweight in terms of its impact on the filesystem.  It doesn't
>> need any persistent setup -- you can just use it.
>
> Well, no, that's wrong, because we have to co-ordinate multiple
> concurrent accesses to the data in the kernel. What happens when
> some other process writes to the file *at the same time* but does
> not use userspace sync? We aren't tracking dirty regions on the
> inode mapping because we've been told not to do that, so fsync()
> from that other process *won't sync the data it wrote*. IOws, the
> kernel has failed to provide the guarantee that userspace wants it
> to provide.

...

> What I'd like to avoid is creating another kernel bypass mechanism
> where we allow coherency and/or integrity to be fucked up in a way that
> we can't fix without giving up all the performance that the kernel
> bypass provides userspace apps. Constrain the cases where kernel
> bypass is allowed, and we avoid all the crappy corner cases where
> our only answer to users with corrupt data is "the man page advises
> application developers not to do that".

Ah, I see, a DAX file makes regular write() flush out the cache
automatically.  But I think the situation may be fucked up
integrity-wise anyway.  If you make an immutable-extent DAX file and a
DAX-unaware process mmaps() it and writes to the mapping, what flushes
the CPU cache?  Isn't part of the point of the magic immutable-extent
mode that it wouldn't have to track dirty extents?  Can it keep track
of which mappings are DAX-aware (via an mmap() flag, I assume)?  Would
all mappings of a DAX immutable-extent file be forced to be uncached
(or writethrough or WC or some type that allows fsync to be fast)?

Can you send a link to your fallocate email?  I'm having trouble finding it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2017-06-23  3:07 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-17  1:15 [RFC PATCH 0/2] daxfile: enable byte-addressable updates to pmem Dan Williams
     [not found] ` <149766212410.22552.15957843500156182524.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-06-17  1:15   ` [RFC PATCH 1/2] mm: introduce bmap_walk() Dan Williams
2017-06-17  5:22     ` Christoph Hellwig
     [not found]       ` <20170617052212.GA8246-jcswGhMUV9g@public.gmane.org>
2017-06-17 12:29         ` Dan Williams
2017-06-18  7:51           ` Christoph Hellwig
2017-06-19 16:18             ` Darrick J. Wong
     [not found]             ` <20170618075152.GA25871-jcswGhMUV9g@public.gmane.org>
2017-06-19 18:19               ` Al Viro
2017-06-20  7:34                 ` Christoph Hellwig
2017-06-17  1:15 ` [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem Dan Williams
     [not found]   ` <149766213493.22552.4057048843646200083.stgit-p8uTFz9XbKj2zm6wflaqv1nYeNYlB/vhral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2017-06-17 16:25     ` Andy Lutomirski
2017-06-17 21:52       ` Dan Williams
     [not found]         ` <CAPcyv4j4UEegViDJcLZjVv5AFGC18-DcvHFnhZatB0hH3BY85g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-17 23:50           ` Andy Lutomirski
2017-06-18  3:15             ` Dan Williams
2017-06-18  5:05               ` Andy Lutomirski
     [not found]                 ` <CALCETrVY38h2ajpod2U_2pdHSp8zO4mG2p19h=OnnHmhGTairw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-19 13:21                   ` Dave Chinner
2017-06-19 15:22                     ` Andy Lutomirski
     [not found]                       ` <CALCETrUe0igzK0RZTSSondkCY3ApYQti89tOh00f0j_APrf_dQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-20  0:46                         ` Dave Chinner
2017-06-20  5:53                           ` Andy Lutomirski
2017-06-20  8:49                             ` Christoph Hellwig
     [not found]                               ` <20170620084924.GA9752-jcswGhMUV9g@public.gmane.org>
2017-06-20 16:17                                 ` Dan Williams
     [not found]                                   ` <CAPcyv4jkH6iwDoG4NnCaTNXozwYgVXiJDe2iFSONcE63KvGQoA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-20 16:26                                     ` Andy Lutomirski
2017-06-20 23:53                                   ` Dave Chinner
2017-06-21  1:24                                     ` Darrick J. Wong
2017-06-21  2:19                                       ` Dave Chinner
     [not found]                             ` <CALCETrVuoPDRuuhc9X8eVCYiFUzWLSTRkcjbD6jas_2J2GixNQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-20 10:11                               ` Dave Chinner
2017-06-20 16:14                                 ` Andy Lutomirski
2017-06-21  1:40                                   ` Dave Chinner
2017-06-21  5:18                                     ` Andy Lutomirski
     [not found]                                       ` <CALCETrVYmbyNS-btvsN_M-QyWPZA_Y_4JXOM893g7nhZA+WviQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-22  0:02                                         ` Dave Chinner
2017-06-22  4:07                                           ` Andy Lutomirski
2017-06-23  0:52                                             ` Dave Chinner
2017-06-23  3:07                                               ` Andy Lutomirski
2017-06-18  8:18               ` Christoph Hellwig
     [not found]                 ` <20170618081850.GA26332-jcswGhMUV9g@public.gmane.org>
2017-06-19  1:51                   ` Dan Williams
2017-06-20  5:22   ` Darrick J. Wong
2017-06-20 15:42     ` Ross Zwisler
2017-06-22  7:09       ` Darrick J. Wong
     [not found]     ` <20170620052214.GA3787-PTl6brltDGh4DFYR7WNSRA@public.gmane.org>
2017-06-21 23:37       ` Dave Chinner
2017-06-22  7:23         ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).