nvdimm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/7] dax, ext4: Synchronous page faults
@ 2017-07-27 13:12 Jan Kara
  2017-07-27 13:12 ` [PATCH 1/7] mm: Remove VM_FAULT_HWPOISON_LARGE_MASK Jan Kara
                   ` (8 more replies)
  0 siblings, 9 replies; 41+ messages in thread
From: Jan Kara @ 2017-07-27 13:12 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, linux-ext4

Hello,

after last discussions about whether / how to make flushing of DAX mappings
possible from userspace so that they can be flushed on finer than page
granularity and also avoid the overhead of a syscall, I've decided to give
a stab at implementing "synchronous page faults" idea for ext4 so that
we can see whether that is reasonably possible to implement and how would
such implementation look like. This patch set is the result.

So the functionality this patches implement: We have an inode flag (currently
I abuse S_SYNC inode flag for this and IMHO it kind of makes sense but if
people hate that I'm certainly open to using new flag in the final
implementation) that marks inode as requiring synchronous page faults.
The guarantee provided by this flag on inode is: While a block is writeably
mapped into page tables, it is guaranteed to be visible in the file at that
offset also after a crash.

How I implement this is that ->iomap_begin() indicates by a flag that inode
block mapping metadata is unstable and may need flushing (use the same test as
whether fdatasync() has metadata to write). If yes, DAX maps page table entries
read-only and returns special flag VM_FAULT_RO to the filesystem fault handler.
The handler then calls fdatasync() (vfs_fsync_range()) for the affected range
and after that calls DAX code to write-enable the page table entries.

>From my (fairly limited) knowledge of XFS it seems XFS should be able to do the
same and it should be even possible for filesystem to implement safe remapping
of a file offset to a different block (i.e. break reflink, do defrag, or
similar stuff) like:

1) Block page faults
2) fdatasync() remapped range (there can be outstanding data modifications
   not yet flushed)
3) unmap_mapping_range()
4) Now remap blocks
5) Unblock page faults

Basically we do the same on events like punch hole so there is not much new
there.

There are couple of open questions with this implementation:

1) Is it worth the hassle?
2) Is S_SYNC good flag to use or should we use a new inode flag?
3) VM_FAULT_RO and especially passing of resulting 'pfn' from
   dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
   vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
   things to make this cleaner.

Anyway, here are the patches, comments are welcome.

								Honza
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/7] mm: Remove VM_FAULT_HWPOISON_LARGE_MASK
  2017-07-27 13:12 [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jan Kara
@ 2017-07-27 13:12 ` Jan Kara
  2017-07-27 21:57   ` Ross Zwisler
  2017-08-01 10:52   ` Christoph Hellwig
  2017-07-27 13:12 ` [PATCH 2/7] dax: Add sync argument to dax_iomap_fault() Jan Kara
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 41+ messages in thread
From: Jan Kara @ 2017-07-27 13:12 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, linux-ext4

It is unused.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/mm.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 483e84cf9fc6..fa036093e76c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1143,8 +1143,6 @@ static inline void clear_page_pfmemalloc(struct page *page)
 #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
 #define VM_FAULT_DONE_COW   0x1000	/* ->fault has fully handled COW */
 
-#define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
-
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | \
 			 VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
 			 VM_FAULT_FALLBACK)
-- 
2.12.3

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 2/7] dax: Add sync argument to dax_iomap_fault()
  2017-07-27 13:12 [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jan Kara
  2017-07-27 13:12 ` [PATCH 1/7] mm: Remove VM_FAULT_HWPOISON_LARGE_MASK Jan Kara
@ 2017-07-27 13:12 ` Jan Kara
  2017-07-27 22:06   ` Ross Zwisler
  2017-07-27 13:12 ` [PATCH 3/7] dax: Simplify arguments of dax_insert_mapping() Jan Kara
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2017-07-27 13:12 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, linux-ext4

Add 'sync' argument to dax_iomap_fault(). It will be used to communicate
the fact that synchronous fault is requested.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 12 ++++++------
 fs/ext2/file.c      |  2 +-
 fs/ext4/file.c      |  2 +-
 fs/xfs/xfs_file.c   |  8 ++++----
 include/linux/dax.h |  2 +-
 5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 75760f7bdf5d..44d9cce4399b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1069,7 +1069,7 @@ static int dax_fault_return(int error)
 	return VM_FAULT_SIGBUS;
 }
 
-static int dax_iomap_pte_fault(struct vm_fault *vmf,
+static int dax_iomap_pte_fault(struct vm_fault *vmf, bool sync,
 			       const struct iomap_ops *ops)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
@@ -1300,7 +1300,7 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
 	return VM_FAULT_FALLBACK;
 }
 
-static int dax_iomap_pmd_fault(struct vm_fault *vmf,
+static int dax_iomap_pmd_fault(struct vm_fault *vmf, bool sync,
 			       const struct iomap_ops *ops)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -1422,7 +1422,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
 	return result;
 }
 #else
-static int dax_iomap_pmd_fault(struct vm_fault *vmf,
+static int dax_iomap_pmd_fault(struct vm_fault *vmf, bool sync,
 			       const struct iomap_ops *ops)
 {
 	return VM_FAULT_FALLBACK;
@@ -1440,13 +1440,13 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
  * successfully.
  */
 int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
-		    const struct iomap_ops *ops)
+		    bool sync, const struct iomap_ops *ops)
 {
 	switch (pe_size) {
 	case PE_SIZE_PTE:
-		return dax_iomap_pte_fault(vmf, ops);
+		return dax_iomap_pte_fault(vmf, sync, ops);
 	case PE_SIZE_PMD:
-		return dax_iomap_pmd_fault(vmf, ops);
+		return dax_iomap_pmd_fault(vmf, sync, ops);
 	default:
 		return VM_FAULT_FALLBACK;
 	}
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ff3a3636a5ca..430e6ecf4e0e 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_fault *vmf)
 	}
 	down_read(&ei->dax_sem);
 
-	ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &ext2_iomap_ops);
+	ret = dax_iomap_fault(vmf, PE_SIZE_PTE, false, &ext2_iomap_ops);
 
 	up_read(&ei->dax_sem);
 	if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 9dda70edba74..d401403e5095 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -291,7 +291,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
 		down_read(&EXT4_I(inode)->i_mmap_sem);
 	}
 	if (!IS_ERR(handle))
-		result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops);
+		result = dax_iomap_fault(vmf, pe_size, false, &ext4_iomap_ops);
 	else
 		result = VM_FAULT_SIGBUS;
 	if (write) {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 62db8ffa83b9..9fc8b5668578 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1032,7 +1032,7 @@ xfs_filemap_page_mkwrite(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (IS_DAX(inode)) {
-		ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vmf, PE_SIZE_PTE, false, &xfs_iomap_ops);
 	} else {
 		ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops);
 		ret = block_page_mkwrite_return(ret);
@@ -1059,7 +1059,7 @@ xfs_filemap_fault(
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	if (IS_DAX(inode))
-		ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vmf, PE_SIZE_PTE, false, &xfs_iomap_ops);
 	else
 		ret = filemap_fault(vmf);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
@@ -1094,7 +1094,7 @@ xfs_filemap_huge_fault(
 	}
 
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-	ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops);
+	ret = dax_iomap_fault(vmf, pe_size, false, &xfs_iomap_ops);
 	xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
 	if (vmf->flags & FAULT_FLAG_WRITE)
@@ -1130,7 +1130,7 @@ xfs_filemap_pfn_mkwrite(
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
 	else if (IS_DAX(inode))
-		ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops);
+		ret = dax_iomap_fault(vmf, PE_SIZE_PTE, false, &xfs_iomap_ops);
 	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d0e32729ad1e..98950f4d127e 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -91,7 +91,7 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc);
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops);
 int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
-		    const struct iomap_ops *ops);
+		    bool sync, const struct iomap_ops *ops);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
-- 
2.12.3

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 3/7] dax: Simplify arguments of dax_insert_mapping()
  2017-07-27 13:12 [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jan Kara
  2017-07-27 13:12 ` [PATCH 1/7] mm: Remove VM_FAULT_HWPOISON_LARGE_MASK Jan Kara
  2017-07-27 13:12 ` [PATCH 2/7] dax: Add sync argument to dax_iomap_fault() Jan Kara
@ 2017-07-27 13:12 ` Jan Kara
  2017-07-27 22:09   ` Ross Zwisler
  2017-08-01 10:54   ` Christoph Hellwig
  2017-07-27 13:12 ` [PATCH 4/7] dax: Make dax_insert_mapping() return VM_FAULT_ state Jan Kara
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 41+ messages in thread
From: Jan Kara @ 2017-07-27 13:12 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, linux-ext4

dax_insert_mapping() has lots of arguments and a lot of them is actuall
duplicated by passing vm_fault structure as well. Change the function to
take the same arguments as dax_pmd_insert_mapping().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 44d9cce4399b..0673efd72f53 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -814,23 +814,30 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
-static int dax_insert_mapping(struct address_space *mapping,
-		struct block_device *bdev, struct dax_device *dax_dev,
-		sector_t sector, size_t size, void *entry,
-		struct vm_area_struct *vma, struct vm_fault *vmf)
+static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
+{
+	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
+}
+
+static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
+			      loff_t pos, void *entry)
 {
+	const sector_t sector = dax_iomap_sector(iomap, pos);
+	struct vm_area_struct *vma = vmf->vma;
+	struct address_space *mapping = vma->vm_file->f_mapping;
 	unsigned long vaddr = vmf->address;
 	void *ret, *kaddr;
 	pgoff_t pgoff;
 	int id, rc;
 	pfn_t pfn;
 
-	rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+	rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
 	if (rc)
 		return rc;
 
 	id = dax_read_lock();
-	rc = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), &kaddr, &pfn);
+	rc = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(PAGE_SIZE),
+			       &kaddr, &pfn);
 	if (rc < 0) {
 		dax_read_unlock(id);
 		return rc;
@@ -930,11 +937,6 @@ int __dax_zero_page_range(struct block_device *bdev,
 }
 EXPORT_SYMBOL_GPL(__dax_zero_page_range);
 
-static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
-{
-	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
-}
-
 static loff_t
 dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct iomap *iomap)
@@ -1076,7 +1078,6 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, bool sync,
 	struct inode *inode = mapping->host;
 	unsigned long vaddr = vmf->address;
 	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
-	sector_t sector;
 	struct iomap iomap = { 0 };
 	unsigned flags = IOMAP_FAULT;
 	int error, major = 0;
@@ -1129,9 +1130,9 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, bool sync,
 		goto error_finish_iomap;
 	}
 
-	sector = dax_iomap_sector(&iomap, pos);
-
 	if (vmf->cow_page) {
+		sector_t sector = dax_iomap_sector(&iomap, pos);
+
 		switch (iomap.type) {
 		case IOMAP_HOLE:
 		case IOMAP_UNWRITTEN:
@@ -1164,8 +1165,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, bool sync,
 			count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 			major = VM_FAULT_MAJOR;
 		}
-		error = dax_insert_mapping(mapping, iomap.bdev, iomap.dax_dev,
-				sector, PAGE_SIZE, entry, vmf->vma, vmf);
+		error = dax_insert_mapping(vmf, &iomap, pos, entry);
 		/* -EBUSY is fine, somebody else faulted on the same PTE */
 		if (error == -EBUSY)
 			error = 0;
-- 
2.12.3

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 4/7] dax: Make dax_insert_mapping() return VM_FAULT_ state
  2017-07-27 13:12 [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jan Kara
                   ` (2 preceding siblings ...)
  2017-07-27 13:12 ` [PATCH 3/7] dax: Simplify arguments of dax_insert_mapping() Jan Kara
@ 2017-07-27 13:12 ` Jan Kara
  2017-07-27 22:22   ` Ross Zwisler
  2017-07-27 13:12 ` [PATCH 5/7] dax, iomap: Add support for synchronous faults Jan Kara
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2017-07-27 13:12 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, linux-ext4

Currently dax_insert_mapping() returns normal error code which is later
converted to VM_FAULT_ state. Since we will need to do more state
modifications specific to dax_insert_mapping() it does not make sense to
push them up to the caller of dax_insert_mapping(). Instead make
dax_insert_mapping() return VM_FAULT_ state the same way as
dax_pmd_insert_mapping() does that.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c | 45 +++++++++++++++++++++++++--------------------
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0673efd72f53..9658975b926a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -814,6 +814,15 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
 
+static int dax_fault_return(int error)
+{
+	if (error == 0)
+		return VM_FAULT_NOPAGE;
+	if (error == -ENOMEM)
+		return VM_FAULT_OOM;
+	return VM_FAULT_SIGBUS;
+}
+
 static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
 {
 	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
@@ -828,7 +837,7 @@ static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
 	unsigned long vaddr = vmf->address;
 	void *ret, *kaddr;
 	pgoff_t pgoff;
-	int id, rc;
+	int id, rc, vmf_ret;
 	pfn_t pfn;
 
 	rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
@@ -850,9 +859,18 @@ static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
 
 	trace_dax_insert_mapping(mapping->host, vmf, ret);
 	if (vmf->flags & FAULT_FLAG_WRITE)
-		return vm_insert_mixed_mkwrite(vma, vaddr, pfn);
+		rc = vm_insert_mixed_mkwrite(vma, vaddr, pfn);
 	else
-		return vm_insert_mixed(vma, vaddr, pfn);
+		rc = vm_insert_mixed(vma, vaddr, pfn);
+
+	/* -EBUSY is fine, somebody else faulted on the same PTE */
+	if (rc == -EBUSY)
+		rc = 0;
+
+	vmf_ret = dax_fault_return(rc);
+	if (iomap->flags & IOMAP_F_NEW)
+		vmf_ret |= VM_FAULT_MAJOR;
+	return vmf_ret;
 }
 
 /*
@@ -1062,15 +1080,6 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 }
 EXPORT_SYMBOL_GPL(dax_iomap_rw);
 
-static int dax_fault_return(int error)
-{
-	if (error == 0)
-		return VM_FAULT_NOPAGE;
-	if (error == -ENOMEM)
-		return VM_FAULT_OOM;
-	return VM_FAULT_SIGBUS;
-}
-
 static int dax_iomap_pte_fault(struct vm_fault *vmf, bool sync,
 			       const struct iomap_ops *ops)
 {
@@ -1080,7 +1089,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, bool sync,
 	loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
 	struct iomap iomap = { 0 };
 	unsigned flags = IOMAP_FAULT;
-	int error, major = 0;
+	int error;
 	int vmf_ret = 0;
 	void *entry;
 
@@ -1163,13 +1172,9 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, bool sync,
 		if (iomap.flags & IOMAP_F_NEW) {
 			count_vm_event(PGMAJFAULT);
 			count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
-			major = VM_FAULT_MAJOR;
 		}
-		error = dax_insert_mapping(vmf, &iomap, pos, entry);
-		/* -EBUSY is fine, somebody else faulted on the same PTE */
-		if (error == -EBUSY)
-			error = 0;
-		break;
+		vmf_ret = dax_insert_mapping(vmf, &iomap, pos, entry);
+		goto finish_iomap;
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
 		if (!(vmf->flags & FAULT_FLAG_WRITE)) {
@@ -1184,7 +1189,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, bool sync,
 	}
 
  error_finish_iomap:
-	vmf_ret = dax_fault_return(error) | major;
+	vmf_ret = dax_fault_return(error);
  finish_iomap:
 	if (ops->iomap_end) {
 		int copied = PAGE_SIZE;
-- 
2.12.3

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 5/7] dax, iomap: Add support for synchronous faults
  2017-07-27 13:12 [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jan Kara
                   ` (3 preceding siblings ...)
  2017-07-27 13:12 ` [PATCH 4/7] dax: Make dax_insert_mapping() return VM_FAULT_ state Jan Kara
@ 2017-07-27 13:12 ` Jan Kara
  2017-07-27 22:42   ` Ross Zwisler
  2017-07-27 13:12 ` [PATCH 6/7] dax: Implement dax_pfn_mkwrite() Jan Kara
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2017-07-27 13:12 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, linux-ext4

Add a flag to iomap interface informing the caller that inode needs
fdstasync(2) for returned extent to become persistent and use it in DAX
fault code so that we map such extents only read only. We propagate the
information that the page table entry has been inserted write-protected
from dax_iomap_fault() with a new VM_FAULT_RO flag. Filesystem fault
handler is then responsible for calling fdatasync(2) and updating page
tables to map pfns read-write. dax_iomap_fault() also takes care of
updating vmf->orig_pte to match the PTE that was inserted so that we can
safely recheck that PTE did not change while write-enabling it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c              | 42 +++++++++++++++++++++++++++++++++++-------
 include/linux/iomap.h |  2 ++
 include/linux/mm.h    |  2 ++
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9658975b926a..8a6cf158c691 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -829,7 +829,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
 }
 
 static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
-			      loff_t pos, void *entry)
+			      loff_t pos, void *entry, bool force_ro)
 {
 	const sector_t sector = dax_iomap_sector(iomap, pos);
 	struct vm_area_struct *vma = vmf->vma;
@@ -858,7 +858,7 @@ static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
 		return PTR_ERR(ret);
 
 	trace_dax_insert_mapping(mapping->host, vmf, ret);
-	if (vmf->flags & FAULT_FLAG_WRITE)
+	if ((vmf->flags & FAULT_FLAG_WRITE) && !force_ro)
 		rc = vm_insert_mixed_mkwrite(vma, vaddr, pfn);
 	else
 		rc = vm_insert_mixed(vma, vaddr, pfn);
@@ -870,6 +870,14 @@ static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
 	vmf_ret = dax_fault_return(rc);
 	if (iomap->flags & IOMAP_F_NEW)
 		vmf_ret |= VM_FAULT_MAJOR;
+	if (!rc && (vmf->flags & FAULT_FLAG_WRITE) && force_ro) {
+		vmf_ret |= VM_FAULT_RO;
+		/*
+		 * Hack: Store PFN here so that we can pass it to
+		 * vm_insert_mixed_mkwrite() when changing PTE to RW.
+		 */
+		vmf->orig_pte = pfn_t_pte(pfn, vma->vm_page_prot);
+	}
 	return vmf_ret;
 }
 
@@ -1092,6 +1100,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, bool sync,
 	int error;
 	int vmf_ret = 0;
 	void *entry;
+	bool force_ro;
 
 	trace_dax_pte_fault(inode, vmf, vmf_ret);
 	/*
@@ -1167,13 +1176,15 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, bool sync,
 		goto finish_iomap;
 	}
 
+	force_ro = (vmf->flags & FAULT_FLAG_WRITE) && sync &&
+			(iomap.flags & IOMAP_F_NEEDDSYNC);
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
 		if (iomap.flags & IOMAP_F_NEW) {
 			count_vm_event(PGMAJFAULT);
 			count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
 		}
-		vmf_ret = dax_insert_mapping(vmf, &iomap, pos, entry);
+		vmf_ret = dax_insert_mapping(vmf, &iomap, pos, entry, force_ro);
 		goto finish_iomap;
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
@@ -1219,7 +1230,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, bool sync,
 #define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
 
 static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
-		loff_t pos, void *entry)
+		loff_t pos, void *entry, bool force_ro)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
 	const sector_t sector = dax_iomap_sector(iomap, pos);
@@ -1232,6 +1243,7 @@ static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
 	pgoff_t pgoff;
 	pfn_t pfn;
 	int id;
+	int result;
 
 	if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
 		goto fallback;
@@ -1256,8 +1268,19 @@ static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
 		goto fallback;
 
 	trace_dax_pmd_insert_mapping(inode, vmf, length, pfn, ret);
-	return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd,
-			pfn, vmf->flags & FAULT_FLAG_WRITE);
+	result = vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd,
+			pfn, (vmf->flags & FAULT_FLAG_WRITE) && !force_ro);
+	/* Did we insert RO PMD despite the fault being a write one? */
+	if (!(result & VM_FAULT_ERROR) && (vmf->flags & FAULT_FLAG_WRITE) &&
+	    force_ro) {
+		result |= VM_FAULT_RO;
+		/*
+		 * Hack: Store PFN here so that we can pass it to
+		 * vmf_insert_pfn_pmd() when changing PMD to RW.
+		 */
+		vmf->orig_pte = pfn_t_pte(pfn, vmf->vma->vm_page_prot);
+	}
+	return result;
 
 unlock_fallback:
 	dax_read_unlock(id);
@@ -1320,6 +1343,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, bool sync,
 	void *entry;
 	loff_t pos;
 	int error;
+	bool force_ro;
 
 	/*
 	 * Check whether offset isn't beyond end of file now. Caller is
@@ -1385,9 +1409,13 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, bool sync,
 	if (iomap.offset + iomap.length < pos + PMD_SIZE)
 		goto finish_iomap;
 
+	force_ro = (vmf->flags & FAULT_FLAG_WRITE) && sync &&
+			(iomap.flags & IOMAP_F_NEEDDSYNC);
+
 	switch (iomap.type) {
 	case IOMAP_MAPPED:
-		result = dax_pmd_insert_mapping(vmf, &iomap, pos, entry);
+		result = dax_pmd_insert_mapping(vmf, &iomap, pos, entry,
+						force_ro);
 		break;
 	case IOMAP_UNWRITTEN:
 	case IOMAP_HOLE:
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f64dc6ce5161..957463602f6e 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -22,6 +22,8 @@ struct vm_fault;
  * Flags for all iomap mappings:
  */
 #define IOMAP_F_NEW	0x01	/* blocks have been newly allocated */
+#define IOMAP_F_NEEDDSYNC	0x02	/* inode needs fdatasync for storage to
+					 * become persistent */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa036093e76c..5085647d9f2f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1142,6 +1142,8 @@ static inline void clear_page_pfmemalloc(struct page *page)
 #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
 #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
 #define VM_FAULT_DONE_COW   0x1000	/* ->fault has fully handled COW */
+#define VM_FAULT_RO	0x2000		/* Write fault was handled just by
+					 * inserting RO page table entry for DAX */
 
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | \
 			 VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
-- 
2.12.3

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 6/7] dax: Implement dax_pfn_mkwrite()
  2017-07-27 13:12 [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jan Kara
                   ` (4 preceding siblings ...)
  2017-07-27 13:12 ` [PATCH 5/7] dax, iomap: Add support for synchronous faults Jan Kara
@ 2017-07-27 13:12 ` Jan Kara
  2017-07-27 22:53   ` Ross Zwisler
  2017-07-27 13:12 ` [PATCH 7/7] ext4: Support for synchronous DAX faults Jan Kara
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2017-07-27 13:12 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, linux-ext4

Implement a function that marks existing page table entry (PTE or PMD)
as writeable and takes care of marking it dirty in the radix tree. This
function will be used to finish synchronous page fault where the page
table entry is first inserted as read-only and then needs to be marked
as read-write.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/dax.c            | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |  1 +
 2 files changed, 49 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 8a6cf158c691..90b763c86dc2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1485,3 +1485,51 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 	}
 }
 EXPORT_SYMBOL_GPL(dax_iomap_fault);
+
+/**
+ * dax_pfn_mkwrite - make page table entry writeable on a DAX file
+ * @vmf: The description of the fault
+ * @pe_size: size of entry to be marked writeable
+ *
+ * This function mark PTE or PMD entry as writeable in page tables for mmaped
+ * DAX file. It takes care of marking corresponding radix tree entry as dirty
+ * as well.
+ */
+int dax_pfn_mkwrite(struct vm_fault *vmf, enum page_entry_size pe_size)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	void *entry, **slot;
+	pgoff_t index = vmf->pgoff;
+	pfn_t pfn = pfn_to_pfn_t(pte_pfn(vmf->orig_pte));
+	int vmf_ret, error;
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = get_unlocked_mapping_entry(mapping, index, &slot);
+	/* Did we race with someone splitting entry or so? */
+	if (!entry || (pe_size == PE_SIZE_PTE && !dax_is_pte_entry(entry)) ||
+	    (pe_size == PE_SIZE_PMD && !dax_is_pmd_entry(entry))) {
+		put_unlocked_mapping_entry(mapping, index, entry);
+		spin_unlock_irq(&mapping->tree_lock);
+		return VM_FAULT_NOPAGE;
+	}
+	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
+	entry = lock_slot(mapping, slot);
+	spin_unlock_irq(&mapping->tree_lock);
+	switch (pe_size) {
+	case PE_SIZE_PTE:
+		error = vm_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
+		vmf_ret = dax_fault_return(error);
+		break;
+#ifdef CONFIG_FS_DAX_PMD
+	case PE_SIZE_PMD:
+		vmf_ret = vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd,
+			pfn, true);
+		break;
+#endif
+	default:
+		vmf_ret = VM_FAULT_FALLBACK;
+	}
+	put_locked_mapping_entry(mapping, index);
+	return vmf_ret;
+}
+EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 98950f4d127e..6ce5912e4516 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -92,6 +92,7 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
 		const struct iomap_ops *ops);
 int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
 		    bool sync, const struct iomap_ops *ops);
+int dax_pfn_mkwrite(struct vm_fault *vmf, enum page_entry_size pe_size);
 int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
 int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 				      pgoff_t index);
-- 
2.12.3

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 7/7] ext4: Support for synchronous DAX faults
  2017-07-27 13:12 [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jan Kara
                   ` (5 preceding siblings ...)
  2017-07-27 13:12 ` [PATCH 6/7] dax: Implement dax_pfn_mkwrite() Jan Kara
@ 2017-07-27 13:12 ` Jan Kara
  2017-07-27 22:57   ` Ross Zwisler
  2017-07-27 14:09 ` [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jeff Moyer
  2017-08-01 10:52 ` Christoph Hellwig
  8 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2017-07-27 13:12 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, linux-ext4

We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
synchronous write fault when inode has some uncommitted metadata
changes. In the fault handler ext4_dax_fault() we then detect this case,
call vfs_fsync_range() to make sure all metadata is committed, and call
dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
dirty corresponding radix tree entry which is what we want - fsync(2)
will still provide data integrity guarantees for applications not using
userspace flushing. And applications using userspace flushing can avoid
calling fsync(2) and thus avoid the performance overhead.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c       | 35 +++++++++++++++++++++++++++++------
 fs/ext4/inode.c      |  4 ++++
 fs/jbd2/journal.c    | 16 ++++++++++++++++
 include/linux/jbd2.h |  1 +
 4 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d401403e5095..b221d0b546b0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -287,16 +287,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
 		down_read(&EXT4_I(inode)->i_mmap_sem);
 		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
 					       EXT4_DATA_TRANS_BLOCKS(sb));
+		if (IS_ERR(handle)) {
+			up_read(&EXT4_I(inode)->i_mmap_sem);
+			sb_end_pagefault(sb);
+			return VM_FAULT_SIGBUS;
+		}
 	} else {
 		down_read(&EXT4_I(inode)->i_mmap_sem);
 	}
-	if (!IS_ERR(handle))
-		result = dax_iomap_fault(vmf, pe_size, false, &ext4_iomap_ops);
-	else
-		result = VM_FAULT_SIGBUS;
+	result = dax_iomap_fault(vmf, pe_size, IS_SYNC(inode), &ext4_iomap_ops);
 	if (write) {
-		if (!IS_ERR(handle))
-			ext4_journal_stop(handle);
+		ext4_journal_stop(handle);
+		/* Write fault but PFN mapped only RO? */
+		if (result & VM_FAULT_RO) {
+			int err;
+			loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
+			size_t len = 0;
+
+			if (pe_size == PE_SIZE_PTE)
+				len = PAGE_SIZE;
+#ifdef CONFIG_FS_DAX_PMD
+			else if (pe_size == PE_SIZE_PMD)
+				len = HPAGE_PMD_SIZE;
+			else
+				WARN_ON_ONCE(1);
+#endif
+			WARN_ON_ONCE(!IS_SYNC(inode));
+			err = vfs_fsync_range(vmf->vma->vm_file, start,
+					      start + len - 1, 1);
+			if (err)
+				result = VM_FAULT_SIGBUS;
+			else
+				result = dax_pfn_mkwrite(vmf, pe_size);
+		}
 		up_read(&EXT4_I(inode)->i_mmap_sem);
 		sb_end_pagefault(sb);
 	} else {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3c600f02673f..e68231bb227c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3429,6 +3429,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	}
 
 	iomap->flags = 0;
+	if ((flags & IOMAP_FAULT) && (flags & IOMAP_WRITE) && IS_SYNC(inode) &&
+	    !jbd2_transaction_committed(EXT4_SB(inode->i_sb)->s_journal,
+					EXT4_I(inode)->i_datasync_tid))
+		iomap->flags |= IOMAP_F_NEEDDSYNC;
 	bdev = inode->i_sb->s_bdev;
 	iomap->bdev = bdev;
 	if (blk_queue_dax(bdev->bd_queue))
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 7d5ef3bf3f3e..e0f436c42d67 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -738,6 +738,22 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 	return err;
 }
 
+/* Return 1 when transaction with given tid has already committed. */
+int jbd2_transaction_committed(journal_t *journal, tid_t tid)
+{
+	int ret = 1;
+
+	read_lock(&journal->j_state_lock);
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == tid)
+		ret = 0;
+	if (journal->j_committing_transaction &&
+	    journal->j_committing_transaction->t_tid == tid)
+		ret = 0;
+	read_unlock(&journal->j_state_lock);
+	return ret;
+}
+
 /*
  * When this function returns the transaction corresponding to tid
  * will be completed.  If the transaction has currently running, start
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 606b6bce3a5b..296d1e0ea87b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1367,6 +1367,7 @@ int jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_transaction_committed(journal_t *journal, tid_t tid);
 int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
 int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
-- 
2.12.3

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-07-27 13:12 [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jan Kara
                   ` (6 preceding siblings ...)
  2017-07-27 13:12 ` [PATCH 7/7] ext4: Support for synchronous DAX faults Jan Kara
@ 2017-07-27 14:09 ` Jeff Moyer
  2017-07-27 21:57   ` Ross Zwisler
  2017-08-01 10:52 ` Christoph Hellwig
  8 siblings, 1 reply; 41+ messages in thread
From: Jeff Moyer @ 2017-07-27 14:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-xfs, linux-nvdimm, Dave Chinner, Christoph Hellwig,
	Andy Lutomirski, linux-fsdevel, linux-ext4

Jan Kara <jack@suse.cz> writes:

Hi, Jan,

Thanks for looking into this!

> There are couple of open questions with this implementation:
>
> 1) Is it worth the hassle?
> 2) Is S_SYNC good flag to use or should we use a new inode flag?
> 3) VM_FAULT_RO and especially passing of resulting 'pfn' from
>    dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
>    vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
>    things to make this cleaner.

4) How does an application discover that it is safe to flush from
   userspace?

-Jeff
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-07-27 14:09 ` [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jeff Moyer
@ 2017-07-27 21:57   ` Ross Zwisler
  2017-07-28  2:05     ` Andy Lutomirski
  0 siblings, 1 reply; 41+ messages in thread
From: Ross Zwisler @ 2017-07-27 21:57 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu, Jul 27, 2017 at 10:09:07AM -0400, Jeff Moyer wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> Hi, Jan,
> 
> Thanks for looking into this!
> 
> > There are couple of open questions with this implementation:
> >
> > 1) Is it worth the hassle?
> > 2) Is S_SYNC good flag to use or should we use a new inode flag?
> > 3) VM_FAULT_RO and especially passing of resulting 'pfn' from
> >    dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
> >    vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
> >    things to make this cleaner.
> 
> 4) How does an application discover that it is safe to flush from
>    userspace?

I think that we would be best off with a new flag available via
lsattr(1)/chattr(1).  This would have the following advantages:

1) We could only set the flag if the inode supported DAX (either via the mount
option or via the individual DAX flag).  This would give NVML et al. one
central way to detect whether it was safe to flush from userspace because the
FS supported synchronous faults.

2) Defining a new flag prevents any confusion about whether the kernel version
you have supports sync faults.  Otherwise NVML would have to do something like
look at the trio of (kernel version, S_SYNC flag, mount/inode option for DAX)
which is complex and of course breaks for OS kernel versions.

3) Defining the flag in a generic way via lsattr/chattr opens the door for the
same API and flag to be used by other filesystems in the future.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/7] mm: Remove VM_FAULT_HWPOISON_LARGE_MASK
  2017-07-27 13:12 ` [PATCH 1/7] mm: Remove VM_FAULT_HWPOISON_LARGE_MASK Jan Kara
@ 2017-07-27 21:57   ` Ross Zwisler
  2017-08-01 10:52   ` Christoph Hellwig
  1 sibling, 0 replies; 41+ messages in thread
From: Ross Zwisler @ 2017-07-27 21:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-nvdimm, Dave Chinner, linux-xfs,
	Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu, Jul 27, 2017 at 03:12:39PM +0200, Jan Kara wrote:
> It is unused.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/7] dax: Add sync argument to dax_iomap_fault()
  2017-07-27 13:12 ` [PATCH 2/7] dax: Add sync argument to dax_iomap_fault() Jan Kara
@ 2017-07-27 22:06   ` Ross Zwisler
  2017-07-28  9:40     ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: Ross Zwisler @ 2017-07-27 22:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-nvdimm, Dave Chinner, linux-xfs,
	Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu, Jul 27, 2017 at 03:12:40PM +0200, Jan Kara wrote:
> Add 'sync' argument to dax_iomap_fault(). It will be used to communicate
> the fact that synchronous fault is requested.

I don't actually think you need to pass this 'sync' parameter around.  I think
you can completely rely on IOMAP_F_NEEDSYNC being set in iomap.flags.  The DAX
fault handlers can call ops->iomap_begin() and use that flag for all the
tests and make it our once source of truth.

That flag also tells us that we are doing a write fault (from
ext4_iomap_begin()):

	if ((flags & IOMAP_FAULT) && (flags & IOMAP_WRITE) && IS_SYNC(inode) &&
	    !jbd2_transaction_committed(EXT4_SB(inode->i_sb)->s_journal,
					EXT4_I(inode)->i_datasync_tid))
		iomap->flags |= IOMAP_F_NEEDDSYNC;

So conditionals like this from dax_iomap_pte_fault():

	force_ro = (vmf->flags & FAULT_FLAG_WRITE) && sync &&
			(iomap.flags & IOMAP_F_NEEDDSYNC);

can be simplified to:

	force_ro = (iomap.flags & IOMAP_F_NEEDDSYNC);
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/7] dax: Simplify arguments of dax_insert_mapping()
  2017-07-27 13:12 ` [PATCH 3/7] dax: Simplify arguments of dax_insert_mapping() Jan Kara
@ 2017-07-27 22:09   ` Ross Zwisler
  2017-08-01 10:54   ` Christoph Hellwig
  1 sibling, 0 replies; 41+ messages in thread
From: Ross Zwisler @ 2017-07-27 22:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-nvdimm, Dave Chinner, linux-xfs,
	Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu, Jul 27, 2017 at 03:12:41PM +0200, Jan Kara wrote:
> dax_insert_mapping() has lots of arguments and a lot of them is actuall
								  actually
> duplicated by passing vm_fault structure as well. Change the function to
> take the same arguments as dax_pmd_insert_mapping().

Yay, this is much better!

> Signed-off-by: Jan Kara <jack@suse.cz>

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7] dax: Make dax_insert_mapping() return VM_FAULT_ state
  2017-07-27 13:12 ` [PATCH 4/7] dax: Make dax_insert_mapping() return VM_FAULT_ state Jan Kara
@ 2017-07-27 22:22   ` Ross Zwisler
  2017-07-28  9:43     ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: Ross Zwisler @ 2017-07-27 22:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-nvdimm, Dave Chinner, linux-xfs,
	Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu, Jul 27, 2017 at 03:12:42PM +0200, Jan Kara wrote:
> Currently dax_insert_mapping() returns normal error code which is later
> converted to VM_FAULT_ state. Since we will need to do more state
> modifications specific to dax_insert_mapping() it does not make sense to
> push them up to the caller of dax_insert_mapping(). Instead make
> dax_insert_mapping() return VM_FAULT_ state the same way as
> dax_pmd_insert_mapping() does that.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c | 45 +++++++++++++++++++++++++--------------------
>  1 file changed, 25 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 0673efd72f53..9658975b926a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -814,6 +814,15 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
>  
> +static int dax_fault_return(int error)
> +{
> +	if (error == 0)
> +		return VM_FAULT_NOPAGE;
> +	if (error == -ENOMEM)
> +		return VM_FAULT_OOM;
> +	return VM_FAULT_SIGBUS;
> +}
> +
>  static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
>  {
>  	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
> @@ -828,7 +837,7 @@ static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
>  	unsigned long vaddr = vmf->address;
>  	void *ret, *kaddr;
>  	pgoff_t pgoff;
> -	int id, rc;
> +	int id, rc, vmf_ret;
>  	pfn_t pfn;
>  
>  	rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
> @@ -850,9 +859,18 @@ static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
>  
>  	trace_dax_insert_mapping(mapping->host, vmf, ret);
>  	if (vmf->flags & FAULT_FLAG_WRITE)
> -		return vm_insert_mixed_mkwrite(vma, vaddr, pfn);
> +		rc = vm_insert_mixed_mkwrite(vma, vaddr, pfn);
>  	else
> -		return vm_insert_mixed(vma, vaddr, pfn);
> +		rc = vm_insert_mixed(vma, vaddr, pfn);
> +
> +	/* -EBUSY is fine, somebody else faulted on the same PTE */
> +	if (rc == -EBUSY)
> +		rc = 0;
> +
> +	vmf_ret = dax_fault_return(rc);

Prior to this point where we convert our 'rc' (which is a normal error code
like 0, -ENOMEM, etc) to a 'vmf_ret' (which is a VM return code like
VM_FAULT_NOPAGE, VM_FAULT_SIGBUS, etc), there are several paths in
dax_insert_mapping() where we could still return a normal error code.  I think
this will confuse the fault handler because this code will get returned all
the way up to do_fault() which expects to get a VM return code.

So, I think we either need to:

a) Make sure all return paths through dax_insert_mapping() end up going
through dax_fault_return(), as we do in the PMD case, or

b) Keep allowing this function to return normal error codes, and just teach
dax_fault_return() to look for IOMAP_F_NEEDSYNC flag so it knows when to set
VM_FAULT_RO.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/7] dax, iomap: Add support for synchronous faults
  2017-07-27 13:12 ` [PATCH 5/7] dax, iomap: Add support for synchronous faults Jan Kara
@ 2017-07-27 22:42   ` Ross Zwisler
  2017-08-01 10:56     ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Ross Zwisler @ 2017-07-27 22:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-nvdimm, Dave Chinner, linux-xfs,
	Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu, Jul 27, 2017 at 03:12:43PM +0200, Jan Kara wrote:
> Add a flag to iomap interface informing the caller that inode needs
> fdstasync(2) for returned extent to become persistent and use it in DAX
> fault code so that we map such extents only read only. We propagate the
> information that the page table entry has been inserted write-protected
> from dax_iomap_fault() with a new VM_FAULT_RO flag. Filesystem fault
> handler is then responsible for calling fdatasync(2) and updating page
> tables to map pfns read-write. dax_iomap_fault() also takes care of
> updating vmf->orig_pte to match the PTE that was inserted so that we can
> safely recheck that PTE did not change while write-enabling it.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
<>
> @@ -1385,9 +1409,13 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, bool sync,
>  	if (iomap.offset + iomap.length < pos + PMD_SIZE)
>  		goto finish_iomap;
>  
> +	force_ro = (vmf->flags & FAULT_FLAG_WRITE) && sync &&
> +			(iomap.flags & IOMAP_F_NEEDDSYNC);

I already mentioned this in my response to your cover letter, but I think that
we can just lean really heavily on IOMAP_F_NEEDDSYNC and have that let us know
that we are doing sync faults and that the fault is a write.  That simplifies
a few things in this patch, with the above just becoming:

	force_ro = (iomap.flags & IOMAP_F_NEEDDSYNC);

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fa036093e76c..5085647d9f2f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1142,6 +1142,8 @@ static inline void clear_page_pfmemalloc(struct page *page)
>  #define VM_FAULT_RETRY	0x0400	/* ->fault blocked, must retry */
>  #define VM_FAULT_FALLBACK 0x0800	/* huge page fault failed, fall back to small */
>  #define VM_FAULT_DONE_COW   0x1000	/* ->fault has fully handled COW */
> +#define VM_FAULT_RO	0x2000		/* Write fault was handled just by
> +					 * inserting RO page table entry for DAX */

I wonder if we should name this flag something a little stronger and more
specific to its usage with respect to DAX and sync faults?  Maybe
"VM_FAULT_NEEDDSYNC" for consistency with the iomap flag?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/7] dax: Implement dax_pfn_mkwrite()
  2017-07-27 13:12 ` [PATCH 6/7] dax: Implement dax_pfn_mkwrite() Jan Kara
@ 2017-07-27 22:53   ` Ross Zwisler
  2017-07-27 23:04     ` Ross Zwisler
  2017-07-28 10:37     ` Jan Kara
  0 siblings, 2 replies; 41+ messages in thread
From: Ross Zwisler @ 2017-07-27 22:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-nvdimm, Dave Chinner, linux-xfs,
	Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu, Jul 27, 2017 at 03:12:44PM +0200, Jan Kara wrote:
> Implement a function that marks existing page table entry (PTE or PMD)
> as writeable and takes care of marking it dirty in the radix tree. This
> function will be used to finish synchronous page fault where the page
> table entry is first inserted as read-only and then needs to be marked
> as read-write.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/dax.c            | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/dax.h |  1 +
>  2 files changed, 49 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 8a6cf158c691..90b763c86dc2 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1485,3 +1485,51 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
>  	}
>  }
>  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> +
> +/**
> + * dax_pfn_mkwrite - make page table entry writeable on a DAX file
> + * @vmf: The description of the fault
> + * @pe_size: size of entry to be marked writeable
> + *
> + * This function mark PTE or PMD entry as writeable in page tables for mmaped
> + * DAX file. It takes care of marking corresponding radix tree entry as dirty
> + * as well.
> + */
> +int dax_pfn_mkwrite(struct vm_fault *vmf, enum page_entry_size pe_size)

I wonder if this incarnation of this function should be named something other
than *_pfn_mkwrite so that it's clear that unlike in previous versions of the
codd, this version isn't supposed to be called via
vm_operations_struct->pfn_mkwrite, but is instead a helper for sync faults?
Maybe just dax_mkwrite()?

> +{
> +	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> +	void *entry, **slot;
> +	pgoff_t index = vmf->pgoff;
> +	pfn_t pfn = pfn_to_pfn_t(pte_pfn(vmf->orig_pte));
> +	int vmf_ret, error;
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> +	/* Did we race with someone splitting entry or so? */
> +	if (!entry || (pe_size == PE_SIZE_PTE && !dax_is_pte_entry(entry)) ||
> +	    (pe_size == PE_SIZE_PMD && !dax_is_pmd_entry(entry))) {
> +		put_unlocked_mapping_entry(mapping, index, entry);
> +		spin_unlock_irq(&mapping->tree_lock);

The previous version of this function had tracepoints in this failure path and
in the successful completion path.  I use this kind of tracing daily for
debugging, so lets add it back in.

> +		return VM_FAULT_NOPAGE;
> +	}
> +	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
> +	entry = lock_slot(mapping, slot);
> +	spin_unlock_irq(&mapping->tree_lock);
> +	switch (pe_size) {
> +	case PE_SIZE_PTE:
> +		error = vm_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);

This path goes through the 'mkwrite' branch in insert_pfn() where we validate
that the PFN we are about to map matches the one pointed to by the existing
PTE, but I don't see any checks in this path that validate against
vmf->orig_pte?

This kind of check was done by the old
dax_pfn_mkwrite()->finish_mkwrite_fault() path via the pte_same() check in
finish_mkwrite_fault().

Do we need to add an equivalent check somewhere in this path, since we're
going through the trouble of setting vmf->orig_pte in the DAX fault handlers?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 7/7] ext4: Support for synchronous DAX faults
  2017-07-27 13:12 ` [PATCH 7/7] ext4: Support for synchronous DAX faults Jan Kara
@ 2017-07-27 22:57   ` Ross Zwisler
  0 siblings, 0 replies; 41+ messages in thread
From: Ross Zwisler @ 2017-07-27 22:57 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-nvdimm, Dave Chinner, linux-xfs,
	Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu, Jul 27, 2017 at 03:12:45PM +0200, Jan Kara wrote:
> We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> synchronous write fault when inode has some uncommitted metadata
> changes. In the fault handler ext4_dax_fault() we then detect this case,
> call vfs_fsync_range() to make sure all metadata is committed, and call
> dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
> dirty corresponding radix tree entry which is what we want - fsync(2)
> will still provide data integrity guarantees for applications not using
> userspace flushing. And applications using userspace flushing can avoid
> calling fsync(2) and thus avoid the performance overhead.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/file.c       | 35 +++++++++++++++++++++++++++++------
>  fs/ext4/inode.c      |  4 ++++
>  fs/jbd2/journal.c    | 16 ++++++++++++++++
>  include/linux/jbd2.h |  1 +
>  4 files changed, 50 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index d401403e5095..b221d0b546b0 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -287,16 +287,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
>  		down_read(&EXT4_I(inode)->i_mmap_sem);
>  		handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
>  					       EXT4_DATA_TRANS_BLOCKS(sb));
> +		if (IS_ERR(handle)) {
> +			up_read(&EXT4_I(inode)->i_mmap_sem);
> +			sb_end_pagefault(sb);
> +			return VM_FAULT_SIGBUS;
> +		}

Yay, this error handling seems cleaner to me anyway.

>  	} else {
>  		down_read(&EXT4_I(inode)->i_mmap_sem);
>  	}
> -	if (!IS_ERR(handle))
> -		result = dax_iomap_fault(vmf, pe_size, false, &ext4_iomap_ops);
> -	else
> -		result = VM_FAULT_SIGBUS;
> +	result = dax_iomap_fault(vmf, pe_size, IS_SYNC(inode), &ext4_iomap_ops);
>  	if (write) {
> -		if (!IS_ERR(handle))
> -			ext4_journal_stop(handle);
> +		ext4_journal_stop(handle);
> +		/* Write fault but PFN mapped only RO? */
> +		if (result & VM_FAULT_RO) {
> +			int err;
> +			loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
> +			size_t len = 0;
> +
> +			if (pe_size == PE_SIZE_PTE)
> +				len = PAGE_SIZE;
> +#ifdef CONFIG_FS_DAX_PMD
> +			else if (pe_size == PE_SIZE_PMD)
> +				len = HPAGE_PMD_SIZE;
> +			else
> +				WARN_ON_ONCE(1);

I think this "else WARN_ON_ONCE(1);" should live outside of the
CONFIG_FS_DAX_PMD so that we get warned in all configs if we get an
unsupported pe_size.

> +#endif
> +			WARN_ON_ONCE(!IS_SYNC(inode));
> +			err = vfs_fsync_range(vmf->vma->vm_file, start,
> +					      start + len - 1, 1);
> +			if (err)
> +				result = VM_FAULT_SIGBUS;
> +			else
> +				result = dax_pfn_mkwrite(vmf, pe_size);
> +		}
>  		up_read(&EXT4_I(inode)->i_mmap_sem);
>  		sb_end_pagefault(sb);
>  	} else {
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3c600f02673f..e68231bb227c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3429,6 +3429,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
>  	}
>  
>  	iomap->flags = 0;
> +	if ((flags & IOMAP_FAULT) && (flags & IOMAP_WRITE) && IS_SYNC(inode) &&
> +	    !jbd2_transaction_committed(EXT4_SB(inode->i_sb)->s_journal,
> +					EXT4_I(inode)->i_datasync_tid))
> +		iomap->flags |= IOMAP_F_NEEDDSYNC;

Do we need to check for (flags & IOMAP_FAULT), or can we rely on the fact that
we are in ext4_iomap_begin()?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/7] dax: Implement dax_pfn_mkwrite()
  2017-07-27 22:53   ` Ross Zwisler
@ 2017-07-27 23:04     ` Ross Zwisler
  2017-07-28 10:37     ` Jan Kara
  1 sibling, 0 replies; 41+ messages in thread
From: Ross Zwisler @ 2017-07-27 23:04 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara, linux-fsdevel, linux-ext4, Dan Williams,
	Andy Lutomirski, linux-nvdimm, linux-xfs, Christoph Hellwig,
	Dave Chinner

On Thu, Jul 27, 2017 at 04:53:22PM -0600, Ross Zwisler wrote:
> On Thu, Jul 27, 2017 at 03:12:44PM +0200, Jan Kara wrote:
> > Implement a function that marks existing page table entry (PTE or PMD)
> > as writeable and takes care of marking it dirty in the radix tree. This
> > function will be used to finish synchronous page fault where the page
> > table entry is first inserted as read-only and then needs to be marked
> > as read-write.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
<>
> > +	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
> > +	entry = lock_slot(mapping, slot);
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	switch (pe_size) {
> > +	case PE_SIZE_PTE:
> > +		error = vm_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
> 
> This path goes through the 'mkwrite' branch in insert_pfn() where we validate
> that the PFN we are about to map matches the one pointed to by the existing
> PTE, but I don't see any checks in this path that validate against
> vmf->orig_pte?
> 
> This kind of check was done by the old
> dax_pfn_mkwrite()->finish_mkwrite_fault() path via the pte_same() check in
> finish_mkwrite_fault().
> 
> Do we need to add an equivalent check somewhere in this path, since we're
> going through the trouble of setting vmf->orig_pte in the DAX fault handlers?

Or, actually, does the fact that we do the PFN based sanity check in
insert_pfn() mean that we don't actually need to set vmf->orig_pte in the DAX
fault handlers at all?  We already sanity check the PFN of the RW PTE we
insert against what is already in the page table, and maybe we don't need to
keep another copy around in vmf->orig_pte to verify against?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-07-27 21:57   ` Ross Zwisler
@ 2017-07-28  2:05     ` Andy Lutomirski
  2017-07-28  9:38       ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: Andy Lutomirski @ 2017-07-28  2:05 UTC (permalink / raw)
  To: Ross Zwisler, Jeff Moyer, Jan Kara, linux-xfs, linux-nvdimm,
	Dave Chinner, Christoph Hellwig, Andy Lutomirski, Linux FS Devel,
	linux-ext4

On Thu, Jul 27, 2017 at 2:57 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Thu, Jul 27, 2017 at 10:09:07AM -0400, Jeff Moyer wrote:
>> Jan Kara <jack@suse.cz> writes:
>>
>> Hi, Jan,
>>
>> Thanks for looking into this!
>>
>> > There are couple of open questions with this implementation:
>> >
>> > 1) Is it worth the hassle?
>> > 2) Is S_SYNC good flag to use or should we use a new inode flag?
>> > 3) VM_FAULT_RO and especially passing of resulting 'pfn' from
>> >    dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
>> >    vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
>> >    things to make this cleaner.
>>
>> 4) How does an application discover that it is safe to flush from
>>    userspace?
>
> I think that we would be best off with a new flag available via
> lsattr(1)/chattr(1).  This would have the following advantages:
>
> 1) We could only set the flag if the inode supported DAX (either via the mount
> option or via the individual DAX flag).  This would give NVML et al. one
> central way to detect whether it was safe to flush from userspace because the
> FS supported synchronous faults.
>
> 2) Defining a new flag prevents any confusion about whether the kernel version
> you have supports sync faults.  Otherwise NVML would have to do something like
> look at the trio of (kernel version, S_SYNC flag, mount/inode option for DAX)
> which is complex and of course breaks for OS kernel versions.
>
> 3) Defining the flag in a generic way via lsattr/chattr opens the door for the
> same API and flag to be used by other filesystems in the future.

I would advocate using a new fcntl() instead of lsattr for the
following reason: ISTM the fact that it's an *inode* flag in this
patchset is a bit of an implementation detail.  I can easily imagine a
future implementation that makes it per-struct-file instead.  A
fcntl() that asks "can I flush from userspace" would still work under
than scenario.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-07-28  2:05     ` Andy Lutomirski
@ 2017-07-28  9:38       ` Jan Kara
  2017-08-01 11:02         ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2017-07-28  9:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Linux FS Devel, linux-ext4

On Thu 27-07-17 19:05:24, Andy Lutomirski wrote:
> On Thu, Jul 27, 2017 at 2:57 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Thu, Jul 27, 2017 at 10:09:07AM -0400, Jeff Moyer wrote:
> >> Jan Kara <jack@suse.cz> writes:
> >>
> >> Hi, Jan,
> >>
> >> Thanks for looking into this!
> >>
> >> > There are couple of open questions with this implementation:
> >> >
> >> > 1) Is it worth the hassle?
> >> > 2) Is S_SYNC good flag to use or should we use a new inode flag?
> >> > 3) VM_FAULT_RO and especially passing of resulting 'pfn' from
> >> >    dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
> >> >    vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
> >> >    things to make this cleaner.
> >>
> >> 4) How does an application discover that it is safe to flush from
> >>    userspace?
> >
> > I think that we would be best off with a new flag available via
> > lsattr(1)/chattr(1).  This would have the following advantages:
> >
> > 1) We could only set the flag if the inode supported DAX (either via the mount
> > option or via the individual DAX flag).  This would give NVML et al. one
> > central way to detect whether it was safe to flush from userspace because the
> > FS supported synchronous faults.
> >
> > 2) Defining a new flag prevents any confusion about whether the kernel version
> > you have supports sync faults.  Otherwise NVML would have to do something like
> > look at the trio of (kernel version, S_SYNC flag, mount/inode option for DAX)
> > which is complex and of course breaks for OS kernel versions.
> >
> > 3) Defining the flag in a generic way via lsattr/chattr opens the door for the
> > same API and flag to be used by other filesystems in the future.
> 
> I would advocate using a new fcntl() instead of lsattr for the
> following reason: ISTM the fact that it's an *inode* flag in this
> patchset is a bit of an implementation detail.  I can easily imagine a
> future implementation that makes it per-struct-file instead.  A
> fcntl() that asks "can I flush from userspace" would still work under
> than scenario.

Well, you are right I can make the implementation work with struct file
flag as well - let's call it O_DAXDSYNC. However there are filesystem
operations where you may need to answer question: Is there any fd with
O_DAXDSYNC open against this inode (for operations that change file offset
-> block mapping)? And in that case inode flag is straightforward while
file flag is a bit awkward (you need to implement counter of fd's with that
flag in the inode).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/7] dax: Add sync argument to dax_iomap_fault()
  2017-07-27 22:06   ` Ross Zwisler
@ 2017-07-28  9:40     ` Jan Kara
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Kara @ 2017-07-28  9:40 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu 27-07-17 16:06:37, Ross Zwisler wrote:
> On Thu, Jul 27, 2017 at 03:12:40PM +0200, Jan Kara wrote:
> > Add 'sync' argument to dax_iomap_fault(). It will be used to communicate
> > the fact that synchronous fault is requested.
> 
> I don't actually think you need to pass this 'sync' parameter around.  I think
> you can completely rely on IOMAP_F_NEEDSYNC being set in iomap.flags.  The DAX
> fault handlers can call ops->iomap_begin() and use that flag for all the
> tests and make it our once source of truth.
> 
> That flag also tells us that we are doing a write fault (from
> ext4_iomap_begin()):
> 
> 	if ((flags & IOMAP_FAULT) && (flags & IOMAP_WRITE) && IS_SYNC(inode) &&
> 	    !jbd2_transaction_committed(EXT4_SB(inode->i_sb)->s_journal,
> 					EXT4_I(inode)->i_datasync_tid))
> 		iomap->flags |= IOMAP_F_NEEDDSYNC;
> 
> So conditionals like this from dax_iomap_pte_fault():
> 
> 	force_ro = (vmf->flags & FAULT_FLAG_WRITE) && sync &&
> 			(iomap.flags & IOMAP_F_NEEDDSYNC);
> 
> can be simplified to:
> 
> 	force_ro = (iomap.flags & IOMAP_F_NEEDDSYNC);

Yeah, probably you're right. I'll look into changing this.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/7] dax: Make dax_insert_mapping() return VM_FAULT_ state
  2017-07-27 22:22   ` Ross Zwisler
@ 2017-07-28  9:43     ` Jan Kara
  0 siblings, 0 replies; 41+ messages in thread
From: Jan Kara @ 2017-07-28  9:43 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu 27-07-17 16:22:30, Ross Zwisler wrote:
> On Thu, Jul 27, 2017 at 03:12:42PM +0200, Jan Kara wrote:
> > Currently dax_insert_mapping() returns normal error code which is later
> > converted to VM_FAULT_ state. Since we will need to do more state
> > modifications specific to dax_insert_mapping() it does not make sense to
> > push them up to the caller of dax_insert_mapping(). Instead make
> > dax_insert_mapping() return VM_FAULT_ state the same way as
> > dax_pmd_insert_mapping() does that.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c | 45 +++++++++++++++++++++++++--------------------
> >  1 file changed, 25 insertions(+), 20 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 0673efd72f53..9658975b926a 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -814,6 +814,15 @@ int dax_writeback_mapping_range(struct address_space *mapping,
> >  }
> >  EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
> >  
> > +static int dax_fault_return(int error)
> > +{
> > +	if (error == 0)
> > +		return VM_FAULT_NOPAGE;
> > +	if (error == -ENOMEM)
> > +		return VM_FAULT_OOM;
> > +	return VM_FAULT_SIGBUS;
> > +}
> > +
> >  static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
> >  {
> >  	return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
> > @@ -828,7 +837,7 @@ static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
> >  	unsigned long vaddr = vmf->address;
> >  	void *ret, *kaddr;
> >  	pgoff_t pgoff;
> > -	int id, rc;
> > +	int id, rc, vmf_ret;
> >  	pfn_t pfn;
> >  
> >  	rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
> > @@ -850,9 +859,18 @@ static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
> >  
> >  	trace_dax_insert_mapping(mapping->host, vmf, ret);
> >  	if (vmf->flags & FAULT_FLAG_WRITE)
> > -		return vm_insert_mixed_mkwrite(vma, vaddr, pfn);
> > +		rc = vm_insert_mixed_mkwrite(vma, vaddr, pfn);
> >  	else
> > -		return vm_insert_mixed(vma, vaddr, pfn);
> > +		rc = vm_insert_mixed(vma, vaddr, pfn);
> > +
> > +	/* -EBUSY is fine, somebody else faulted on the same PTE */
> > +	if (rc == -EBUSY)
> > +		rc = 0;
> > +
> > +	vmf_ret = dax_fault_return(rc);
> 
> Prior to this point where we convert our 'rc' (which is a normal error code
> like 0, -ENOMEM, etc) to a 'vmf_ret' (which is a VM return code like
> VM_FAULT_NOPAGE, VM_FAULT_SIGBUS, etc), there are several paths in
> dax_insert_mapping() where we could still return a normal error code.  I think
> this will confuse the fault handler because this code will get returned all
> the way up to do_fault() which expects to get a VM return code.
> 
> So, I think we either need to:
> 
> a) Make sure all return paths through dax_insert_mapping() end up going
> through dax_fault_return(), as we do in the PMD case, or

Yeah, this was the intention (as my eventually I'd like to make
dax_insert_mapping() and dax_pmd_insert_mapping() one function - they are
already currently very similar). I'll fix the missed cases.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/7] dax: Implement dax_pfn_mkwrite()
  2017-07-27 22:53   ` Ross Zwisler
  2017-07-27 23:04     ` Ross Zwisler
@ 2017-07-28 10:37     ` Jan Kara
  1 sibling, 0 replies; 41+ messages in thread
From: Jan Kara @ 2017-07-28 10:37 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu 27-07-17 16:53:22, Ross Zwisler wrote:
> On Thu, Jul 27, 2017 at 03:12:44PM +0200, Jan Kara wrote:
> > Implement a function that marks existing page table entry (PTE or PMD)
> > as writeable and takes care of marking it dirty in the radix tree. This
> > function will be used to finish synchronous page fault where the page
> > table entry is first inserted as read-only and then needs to be marked
> > as read-write.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/dax.c            | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/dax.h |  1 +
> >  2 files changed, 49 insertions(+)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 8a6cf158c691..90b763c86dc2 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1485,3 +1485,51 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
> >  	}
> >  }
> >  EXPORT_SYMBOL_GPL(dax_iomap_fault);
> > +
> > +/**
> > + * dax_pfn_mkwrite - make page table entry writeable on a DAX file
> > + * @vmf: The description of the fault
> > + * @pe_size: size of entry to be marked writeable
> > + *
> > + * This function mark PTE or PMD entry as writeable in page tables for mmaped
> > + * DAX file. It takes care of marking corresponding radix tree entry as dirty
> > + * as well.
> > + */
> > +int dax_pfn_mkwrite(struct vm_fault *vmf, enum page_entry_size pe_size)
> 
> I wonder if this incarnation of this function should be named something other
> than *_pfn_mkwrite so that it's clear that unlike in previous versions of the
> codd, this version isn't supposed to be called via
> vm_operations_struct->pfn_mkwrite, but is instead a helper for sync faults?
> Maybe just dax_mkwrite()?

Yeah, I'll change the name.

> > +{
> > +	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> > +	void *entry, **slot;
> > +	pgoff_t index = vmf->pgoff;
> > +	pfn_t pfn = pfn_to_pfn_t(pte_pfn(vmf->orig_pte));
> > +	int vmf_ret, error;
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	entry = get_unlocked_mapping_entry(mapping, index, &slot);
> > +	/* Did we race with someone splitting entry or so? */
> > +	if (!entry || (pe_size == PE_SIZE_PTE && !dax_is_pte_entry(entry)) ||
> > +	    (pe_size == PE_SIZE_PMD && !dax_is_pmd_entry(entry))) {
> > +		put_unlocked_mapping_entry(mapping, index, entry);
> > +		spin_unlock_irq(&mapping->tree_lock);
> 
> The previous version of this function had tracepoints in this failure path and
> in the successful completion path.  I use this kind of tracing daily for
> debugging, so lets add it back in.

OK, I will add the tracepoints.

> > +		return VM_FAULT_NOPAGE;
> > +	}
> > +	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
> > +	entry = lock_slot(mapping, slot);
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	switch (pe_size) {
> > +	case PE_SIZE_PTE:
> > +		error = vm_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
> 
> This path goes through the 'mkwrite' branch in insert_pfn() where we validate
> that the PFN we are about to map matches the one pointed to by the existing
> PTE, but I don't see any checks in this path that validate against
> vmf->orig_pte?

Yeah, and that's deliberate. The current PTE (or PMD in PMD fault case) is
the read-only entry installed by dax_iomap_fault() a while ago. We pass the
PFN we map there in vmf->orig_pte (in a very crude way) and in this function
we extract it and pass it to vm_insert_mixed_mkwrite() to verify PFN didn't
change (which would be a bug) and make the PTE writeable. We cannot use
vmf->orig_pte directly for verification as that is just entry we
constructed for passing the PFN, not the real entry. And ultimately I want
to get rid of this hack and just pass PFN in some other way and leave
vmf->orig_pte untouched (which would then be pte_none()).

> This kind of check was done by the old
> dax_pfn_mkwrite()->finish_mkwrite_fault() path via the pte_same() check in
> finish_mkwrite_fault().

Yes, but here the situation is special - we actually know PTE has changed
from pte_none() to the read-only PTE. I was thinking about not changing the
PTE at all during the initial dax_iomap_fault() call (i.e., not mapping the
block read-only), just pass the PFN to the filesystem call handler, and
then install directly writeable PTE with that PFN in the new DAX helper.
Then we could also perform the orig_pte check.

What I dislike about this option is that dax_iomap_fault() would in some
cases not install the PTE and in other cases it would. However I guess it
is not such a huge difference from dax_iomap_fault() sometimes mapping the
entry read-only so maybe this is a cleaner way to go.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-07-27 13:12 [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jan Kara
                   ` (7 preceding siblings ...)
  2017-07-27 14:09 ` [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jeff Moyer
@ 2017-08-01 10:52 ` Christoph Hellwig
  8 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2017-08-01 10:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-nvdimm, Dave Chinner, linux-xfs,
	Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu, Jul 27, 2017 at 03:12:38PM +0200, Jan Kara wrote:
> So the functionality this patches implement: We have an inode flag (currently
> I abuse S_SYNC inode flag for this and IMHO it kind of makes sense but if
> people hate that I'm certainly open to using new flag in the final
> implementation) that marks inode as requiring synchronous page faults.
> The guarantee provided by this flag on inode is: While a block is writeably
> mapped into page tables, it is guaranteed to be visible in the file at that
> offset also after a crash.

I think the right interface for page fault behavior is a mmap
flag, MAP_SYNC or similar, which will be optional and a failure of
a MAP_SYNC mmap will indicated that this behavior can't be provided
for the given file descriptor.

> >From my (fairly limited) knowledge of XFS it seems XFS should be able to do the
> same and it should be even possible for filesystem to implement safe remapping
> of a file offset to a different block (i.e. break reflink, do defrag, or
> similar stuff) like:

It should.  But what I'm worried about for both ext4 and XFS is the
worst case behavior that the page faul path can now hit, e.g. flushing
a potentially full log.  Do you have any numbers of how long your
ext4 page faults take with this in the worst case?

> There are couple of open questions with this implementation:
> 
> 1) Is it worth the hassle?

For that I'd really like to see performance numbers.  And compared to
the immutable nightmare that Dan proposed this looks orders of magnitude
better.

> 2) Is S_SYNC good flag to use or should we use a new inode flag?

I think the right interface is mmap as said above.  But even if not
we should not simply reuse existing flags with a well defined (although
not particular useful) behavior.

> 3) VM_FAULT_RO and especially passing of resulting 'pfn' from
>    dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
>    vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
>    things to make this cleaner.

I'll take a look.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/7] mm: Remove VM_FAULT_HWPOISON_LARGE_MASK
  2017-07-27 13:12 ` [PATCH 1/7] mm: Remove VM_FAULT_HWPOISON_LARGE_MASK Jan Kara
  2017-07-27 21:57   ` Ross Zwisler
@ 2017-08-01 10:52   ` Christoph Hellwig
  1 sibling, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2017-08-01 10:52 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-nvdimm, Dave Chinner, linux-xfs,
	Andy Lutomirski, linux-fsdevel, linux-ext4

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/7] dax: Simplify arguments of dax_insert_mapping()
  2017-07-27 13:12 ` [PATCH 3/7] dax: Simplify arguments of dax_insert_mapping() Jan Kara
  2017-07-27 22:09   ` Ross Zwisler
@ 2017-08-01 10:54   ` Christoph Hellwig
  1 sibling, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2017-08-01 10:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-nvdimm, Dave Chinner, linux-xfs,
	Andy Lutomirski, linux-fsdevel, linux-ext4

On Thu, Jul 27, 2017 at 03:12:41PM +0200, Jan Kara wrote:
> dax_insert_mapping() has lots of arguments and a lot of them is actuall
> duplicated by passing vm_fault structure as well. Change the function to
> take the same arguments as dax_pmd_insert_mapping().

Some of this is probably for historic reasons, when I did the initial
pass at iomap based DAX dax_insert_mapping still had to be usable for
the non-DAX case.

Either way this looks like a great cleanup:

Reviewed-by: Christoph Hellwig <hch@lst.de>
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/7] dax, iomap: Add support for synchronous faults
  2017-07-27 22:42   ` Ross Zwisler
@ 2017-08-01 10:56     ` Christoph Hellwig
  0 siblings, 0 replies; 41+ messages in thread
From: Christoph Hellwig @ 2017-08-01 10:56 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara, linux-fsdevel, linux-ext4, Dan Williams,
	Andy Lutomirski, linux-nvdimm, linux-xfs, Christoph Hellwig,
	Dave Chinner

On Thu, Jul 27, 2017 at 04:42:45PM -0600, Ross Zwisler wrote:
> I wonder if we should name this flag something a little stronger and more
> specific to its usage with respect to DAX and sync faults?  Maybe
> "VM_FAULT_NEEDDSYNC" for consistency with the iomap flag?

Agreed.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-07-28  9:38       ` Jan Kara
@ 2017-08-01 11:02         ` Christoph Hellwig
  2017-08-01 11:26           ` Jan Kara
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2017-08-01 11:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-nvdimm, Dave Chinner, linux-xfs,
	Andy Lutomirski, Linux FS Devel, linux-ext4

On Fri, Jul 28, 2017 at 11:38:21AM +0200, Jan Kara wrote:
> Well, you are right I can make the implementation work with struct file
> flag as well - let's call it O_DAXDSYNC. However there are filesystem
> operations where you may need to answer question: Is there any fd with
> O_DAXDSYNC open against this inode (for operations that change file offset
> -> block mapping)? And in that case inode flag is straightforward while
> file flag is a bit awkward (you need to implement counter of fd's with that
> flag in the inode).

We can still keep and inode flag as the internal implementation
detail.  As mentioned earlier the right flag to control behavior
of a mapping is an mmap flag.  And the initial naive implementation
would simply mark the inode as sync once the first MAP_SYNC open happens
on it.  We could then move to more precise tracking if/when needed.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-01 11:02         ` Christoph Hellwig
@ 2017-08-01 11:26           ` Jan Kara
  2017-08-08  0:24             ` Dan Williams
  0 siblings, 1 reply; 41+ messages in thread
From: Jan Kara @ 2017-08-01 11:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-xfs, Andy Lutomirski,
	Linux FS Devel, linux-ext4

On Tue 01-08-17 04:02:41, Christoph Hellwig wrote:
> On Fri, Jul 28, 2017 at 11:38:21AM +0200, Jan Kara wrote:
> > Well, you are right I can make the implementation work with struct file
> > flag as well - let's call it O_DAXDSYNC. However there are filesystem
> > operations where you may need to answer question: Is there any fd with
> > O_DAXDSYNC open against this inode (for operations that change file offset
> > -> block mapping)? And in that case inode flag is straightforward while
> > file flag is a bit awkward (you need to implement counter of fd's with that
> > flag in the inode).
> 
> We can still keep and inode flag as the internal implementation
> detail.  As mentioned earlier the right flag to control behavior
> of a mapping is an mmap flag.  And the initial naive implementation
> would simply mark the inode as sync once the first MAP_SYNC open happens
> on it.  We could then move to more precise tracking if/when needed.

OK, makes sense and I like the MAP_SYNC proposal. I'll change it in my
implementation.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-01 11:26           ` Jan Kara
@ 2017-08-08  0:24             ` Dan Williams
  2017-08-11 10:03               ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Dan Williams @ 2017-08-08  0:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-xfs, linux-nvdimm, Dave Chinner, Christoph Hellwig,
	Andy Lutomirski, Linux FS Devel, linux-ext4

On Tue, Aug 1, 2017 at 4:26 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 01-08-17 04:02:41, Christoph Hellwig wrote:
>> On Fri, Jul 28, 2017 at 11:38:21AM +0200, Jan Kara wrote:
>> > Well, you are right I can make the implementation work with struct file
>> > flag as well - let's call it O_DAXDSYNC. However there are filesystem
>> > operations where you may need to answer question: Is there any fd with
>> > O_DAXDSYNC open against this inode (for operations that change file offset
>> > -> block mapping)? And in that case inode flag is straightforward while
>> > file flag is a bit awkward (you need to implement counter of fd's with that
>> > flag in the inode).
>>
>> We can still keep and inode flag as the internal implementation
>> detail.  As mentioned earlier the right flag to control behavior
>> of a mapping is an mmap flag.  And the initial naive implementation
>> would simply mark the inode as sync once the first MAP_SYNC open happens
>> on it.  We could then move to more precise tracking if/when needed.
>
> OK, makes sense and I like the MAP_SYNC proposal. I'll change it in my
> implementation.

Does sys_mmap() reject unknown flag values today? I'm either not
looking in the right place or it's missing and we'll need some
interface/mechanism to check if MAP_SYNC is honored.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-08  0:24             ` Dan Williams
@ 2017-08-11 10:03               ` Christoph Hellwig
  2017-08-13  2:44                 ` Dan Williams
  0 siblings, 1 reply; 41+ messages in thread
From: Christoph Hellwig @ 2017-08-11 10:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-xfs, Jan Kara, linux-nvdimm, Dave Chinner,
	Christoph Hellwig, Andy Lutomirski, Linux FS Devel, linux-ext4

On Mon, Aug 07, 2017 at 05:24:08PM -0700, Dan Williams wrote:
> Does sys_mmap() reject unknown flag values today? I'm either not
> looking in the right place or it's missing and we'll need some
> interface/mechanism to check if MAP_SYNC is honored.

It doesn't seem to reject unknown flags.

The lack of flags checking strikes back once more, sighṫ.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-11 10:03               ` Christoph Hellwig
@ 2017-08-13  2:44                 ` Dan Williams
  2017-08-13  9:25                   ` Christoph Hellwig
  0 siblings, 1 reply; 41+ messages in thread
From: Dan Williams @ 2017-08-13  2:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-xfs, Andy Lutomirski,
	Linux FS Devel, linux-ext4

On Fri, Aug 11, 2017 at 3:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Mon, Aug 07, 2017 at 05:24:08PM -0700, Dan Williams wrote:
>> Does sys_mmap() reject unknown flag values today? I'm either not
>> looking in the right place or it's missing and we'll need some
>> interface/mechanism to check if MAP_SYNC is honored.
>
> It doesn't seem to reject unknown flags.
>
> The lack of flags checking strikes back once more, sighṫ.

How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
should get -EINVAL, and on new kernels it means SYNC+SHARED.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-13  2:44                 ` Dan Williams
@ 2017-08-13  9:25                   ` Christoph Hellwig
  2017-08-13 17:08                     ` Dan Williams
                                       ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Christoph Hellwig @ 2017-08-13  9:25 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-xfs, Jan Kara, linux-nvdimm, Dave Chinner,
	Christoph Hellwig, Andy Lutomirski, Linux FS Devel, linux-ext4

On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
> How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
> should get -EINVAL, and on new kernels it means SYNC+SHARED.

Cute trick, but I'd hate to waster it just for our little flag.

How about:

#define __MAP_VALIDATE		MAP_SHARED|MAP_PRIVATE
#define MAP_SYNC		0x??? | __MAP_VALIDATE

so that we can reuse that trick for any new flag?
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-13  9:25                   ` Christoph Hellwig
@ 2017-08-13 17:08                     ` Dan Williams
  2017-08-14  8:30                     ` Jan Kara
  2017-08-14 14:04                     ` Boaz Harrosh
  2 siblings, 0 replies; 41+ messages in thread
From: Dan Williams @ 2017-08-13 17:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-xfs, Andy Lutomirski,
	Linux FS Devel, linux-ext4

On Sun, Aug 13, 2017 at 2:25 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
>> How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
>> should get -EINVAL, and on new kernels it means SYNC+SHARED.
>
> Cute trick, but I'd hate to waster it just for our little flag.
>
> How about:
>
> #define __MAP_VALIDATE          MAP_SHARED|MAP_PRIVATE
> #define MAP_SYNC                0x??? | __MAP_VALIDATE
>
> so that we can reuse that trick for any new flag?

Yes, even better. I think that effectively kicks the need for mmap3()
down the road a bit.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-13  9:25                   ` Christoph Hellwig
  2017-08-13 17:08                     ` Dan Williams
@ 2017-08-14  8:30                     ` Jan Kara
  2017-08-14 14:04                     ` Boaz Harrosh
  2 siblings, 0 replies; 41+ messages in thread
From: Jan Kara @ 2017-08-14  8:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-xfs, Andy Lutomirski,
	Linux FS Devel, linux-ext4

On Sun 13-08-17 02:25:56, Christoph Hellwig wrote:
> On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
> > How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
> > should get -EINVAL, and on new kernels it means SYNC+SHARED.
> 
> Cute trick, but I'd hate to waster it just for our little flag.
> 
> How about:
> 
> #define __MAP_VALIDATE		MAP_SHARED|MAP_PRIVATE
> #define MAP_SYNC		0x??? | __MAP_VALIDATE
> 
> so that we can reuse that trick for any new flag?

Heh, that's such an ugly hack that it is cute ;)... I'll use this
definition of MAP_SYNC in my patches and send new version of the series
(probably today in the afternoon if the testing goes fine).

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-13  9:25                   ` Christoph Hellwig
  2017-08-13 17:08                     ` Dan Williams
  2017-08-14  8:30                     ` Jan Kara
@ 2017-08-14 14:04                     ` Boaz Harrosh
  2017-08-14 16:03                       ` Dan Williams
  2017-08-17 16:08                       ` Jan Kara
  2 siblings, 2 replies; 41+ messages in thread
From: Boaz Harrosh @ 2017-08-14 14:04 UTC (permalink / raw)
  To: Christoph Hellwig, Dan Williams, Jan Kara
  Cc: linux-nvdimm, Dave Chinner, linux-xfs, Andy Lutomirski,
	Linux FS Devel, linux-ext4, Amit Golander


Thank you Jan, I'm patiently waiting for this MAP_SYNC flag since I asked for
it in 2014. I'm so glad its time is finally do.

Thank you for working on this. Please CC me on future patches.
(note the new Netapp email)

On 13/08/17 12:25, Christoph Hellwig wrote:
> On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
>> How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
>> should get -EINVAL, and on new kernels it means SYNC+SHARED.
> 
> Cute trick, but I'd hate to waster it just for our little flag.
> 
> How about:
> 
> #define __MAP_VALIDATE		MAP_SHARED|MAP_PRIVATE
> #define MAP_SYNC		0x??? | __MAP_VALIDATE
> 
> so that we can reuse that trick for any new flag?
> 

YES! And please create a mask for all new flags and in validation
code if ((m_flags & __MAP_VALIDATE) == __MAP_VALIDATE) then you
want that (m_flags & __MAP_NEWFLAGS) does not come empty, this
way you actually preserve the old check that SHARED and PRIVATE
do not co exist.

Few Comments on this new MAP_ flag

0] The name at least needs to be MAP_MSYNC because only meta-data is
    synced not the data pointed to. That is the responsibility of the app

1] This flag you have named MAP_SYNC but it is very much related to
   dax and the ability for user-mode to "flush" the data pointed by this
   now "synced" meta data.
   For example in ext4, this flag set on an inode that is *not* IS_DAX
   should fail the mmap. Because there is no point of synced meta if the
   data is actually in page-cache and we know for sure it was not yet synced,
   And there is no way for user-mode to directly "sync" the data as well.

2] The code should be constructed that the default check for the MAP_SYNC
   should fail, and only Hopped in FSs are allowed.
   (So not to modify all Implementations of file_operations->mmap() )

3] /dev/pmem could start serving DAX pages in mmap, if asked for MAP_MSYNC
   (which is also an API that says "I know I need to cl_flush". See 1. )

4] Once we have this flag. And properly implemented at least in one FS
   and optionally in /dev/pmemX we no longer have any justification for
   /dev/daxX and it can die a slow and happy death.

Thanks, Cheers
Boaz

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-14 14:04                     ` Boaz Harrosh
@ 2017-08-14 16:03                       ` Dan Williams
  2017-08-15  9:06                         ` Boaz Harrosh
  2017-08-21 19:57                         ` Ross Zwisler
  2017-08-17 16:08                       ` Jan Kara
  1 sibling, 2 replies; 41+ messages in thread
From: Dan Williams @ 2017-08-14 16:03 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-xfs, Jan Kara, linux-nvdimm, Dave Chinner,
	Christoph Hellwig, Andy Lutomirski, Linux FS Devel, linux-ext4,
	Amit Golander

On Mon, Aug 14, 2017 at 7:04 AM, Boaz Harrosh <boazh@netapp.com> wrote:
>
> Thank you Jan, I'm patiently waiting for this MAP_SYNC flag since I asked for
> it in 2014. I'm so glad its time is finally do.
>
> Thank you for working on this. Please CC me on future patches.
> (note the new Netapp email)
>
> On 13/08/17 12:25, Christoph Hellwig wrote:
>> On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
>>> How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
>>> should get -EINVAL, and on new kernels it means SYNC+SHARED.
>>
>> Cute trick, but I'd hate to waster it just for our little flag.
>>
>> How about:
>>
>> #define __MAP_VALIDATE                MAP_SHARED|MAP_PRIVATE
>> #define MAP_SYNC              0x??? | __MAP_VALIDATE
>>
>> so that we can reuse that trick for any new flag?
>>
>
> YES! And please create a mask for all new flags and in validation
> code if ((m_flags & __MAP_VALIDATE) == __MAP_VALIDATE) then you
> want that (m_flags & __MAP_NEWFLAGS) does not come empty, this
> way you actually preserve the old check that SHARED and PRIVATE
> do not co exist.
>
> Few Comments on this new MAP_ flag
>
> 0] The name at least needs to be MAP_MSYNC because only meta-data is
>     synced not the data pointed to. That is the responsibility of the app
>
> 1] This flag you have named MAP_SYNC but it is very much related to
>    dax and the ability for user-mode to "flush" the data pointed by this
>    now "synced" meta data.
>    For example in ext4, this flag set on an inode that is *not* IS_DAX
>    should fail the mmap. Because there is no point of synced meta if the
>    data is actually in page-cache and we know for sure it was not yet synced,
>    And there is no way for user-mode to directly "sync" the data as well.
>
> 2] The code should be constructed that the default check for the MAP_SYNC
>    should fail, and only Hopped in FSs are allowed.
>    (So not to modify all Implementations of file_operations->mmap() )
>
> 3] /dev/pmem could start serving DAX pages in mmap, if asked for MAP_MSYNC
>    (which is also an API that says "I know I need to cl_flush". See 1. )
>
> 4] Once we have this flag. And properly implemented at least in one FS
>    and optionally in /dev/pmemX we no longer have any justification for
>    /dev/daxX and it can die a slow and happy death.

I'm all for replacing /dev/dax with filesystem equivalent
functionality, but I don't think MAP_SYNC gets us fully there. That's
what the MAP_DIRECT proposal [1] is meant to address.

[1]: https://lkml.org/lkml/2017/8/13/160
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-14 16:03                       ` Dan Williams
@ 2017-08-15  9:06                         ` Boaz Harrosh
  2017-08-15  9:44                           ` Boaz Harrosh
  2017-08-21 19:57                         ` Ross Zwisler
  1 sibling, 1 reply; 41+ messages in thread
From: Boaz Harrosh @ 2017-08-15  9:06 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, Linux FS Devel, linux-ext4,
	Amit Golander

On 14/08/17 19:03, Dan Williams wrote:
> On Mon, Aug 14, 2017 at 7:04 AM, Boaz Harrosh <boazh@netapp.com> wrote:
>>
>> Thank you Jan, I'm patiently waiting for this MAP_SYNC flag since I asked for
>> it in 2014. I'm so glad its time is finally do.
>>

<>

>> 4] Once we have this flag. And properly implemented at least in one FS
>>    and optionally in /dev/pmemX we no longer have any justification for
>>    /dev/daxX and it can die a slow and happy death.
> 
> I'm all for replacing /dev/dax with filesystem equivalent
> functionality, but I don't think MAP_SYNC gets us fully there. That's
> what the MAP_DIRECT proposal [1] is meant to address.
> 

OK This is true. Could you please summarise for us the exact semantics of both
proposed flags?

That said I think the big difference is the movability of physical blocks
underneath the mmap mapping. Now for swap files that is a problem. because
of the deadlocks that can happen with memory needed if blocks start moving.
But for an application like nvml? why does it care. Why can't an nvml image file
not be cloned and COWed underneath the NVM application transparently.

Sorry for being slow but I don't see why you need MAP_DIRECT from user-mode
If you have MAP_SYNC. Please advise

(not that the immutable patchset is not a very needed fixing)

> [1]: https://lkml.org/lkml/2017/8/13/160
> 

Thanks
Boaz

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-15  9:06                         ` Boaz Harrosh
@ 2017-08-15  9:44                           ` Boaz Harrosh
  0 siblings, 0 replies; 41+ messages in thread
From: Boaz Harrosh @ 2017-08-15  9:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Jan Kara, linux-nvdimm, Dave Chinner,
	linux-xfs, Andy Lutomirski, Linux FS Devel, linux-ext4,
	Amit Golander

On 15/08/17 12:06, Boaz Harrosh wrote:
> On 14/08/17 19:03, Dan Williams wrote:
>> On Mon, Aug 14, 2017 at 7:04 AM, Boaz Harrosh <boazh@netapp.com> wrote:
>>>
>>> Thank you Jan, I'm patiently waiting for this MAP_SYNC flag since I asked for
>>> it in 2014. I'm so glad its time is finally do.
>>>
> 
> <>
> 
>>> 4] Once we have this flag. And properly implemented at least in one FS
>>>    and optionally in /dev/pmemX we no longer have any justification for
>>>    /dev/daxX and it can die a slow and happy death.
>>
>> I'm all for replacing /dev/dax with filesystem equivalent
>> functionality, but I don't think MAP_SYNC gets us fully there. That's
>> what the MAP_DIRECT proposal [1] is meant to address.
>>
> 
> OK This is true. Could you please summarise for us the exact semantics of both
> proposed flags?
> 
> That said I think the big difference is the movability of physical blocks
> underneath the mmap mapping. Now for swap files that is a problem. because
> of the deadlocks that can happen with memory needed if blocks start moving.
> But for an application like nvml? why does it care. Why can't an nvml image file
> not be cloned and COWed underneath the NVM application transparently.
> 

OK Sorry didn't do my homework. Struck this out, it is all about RDMA and friends
from an "immutable" file.

I'll go dig into this now. 

Thanks
Boaz

> Sorry for being slow but I don't see why you need MAP_DIRECT from user-mode
> If you have MAP_SYNC. Please advise
> 
> (not that the immutable patchset is not a very needed fixing)
> 
>> [1]: https://lkml.org/lkml/2017/8/13/160
>>
> 
> Thanks
> Boaz
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-14 14:04                     ` Boaz Harrosh
  2017-08-14 16:03                       ` Dan Williams
@ 2017-08-17 16:08                       ` Jan Kara
  1 sibling, 0 replies; 41+ messages in thread
From: Jan Kara @ 2017-08-17 16:08 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: linux-xfs, Jan Kara, linux-nvdimm, Dave Chinner,
	Christoph Hellwig, Andy Lutomirski, Linux FS Devel, linux-ext4,
	Amit Golander

On Mon 14-08-17 17:04:17, Boaz Harrosh wrote:
> 
> Thank you Jan, I'm patiently waiting for this MAP_SYNC flag since I asked for
> it in 2014. I'm so glad its time is finally do.
> 
> Thank you for working on this. Please CC me on future patches.
> (note the new Netapp email)
> 
> On 13/08/17 12:25, Christoph Hellwig wrote:
> > On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
> >> How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
> >> should get -EINVAL, and on new kernels it means SYNC+SHARED.
> > 
> > Cute trick, but I'd hate to waster it just for our little flag.
> > 
> > How about:
> > 
> > #define __MAP_VALIDATE		MAP_SHARED|MAP_PRIVATE
> > #define MAP_SYNC		0x??? | __MAP_VALIDATE
> > 
> > so that we can reuse that trick for any new flag?
> > 
> 
> YES! And please create a mask for all new flags and in validation
> code if ((m_flags & __MAP_VALIDATE) == __MAP_VALIDATE) then you
> want that (m_flags & __MAP_NEWFLAGS) does not come empty, this
> way you actually preserve the old check that SHARED and PRIVATE
> do not co exist.

For now I did just a crude hack. Dan is working on new mmap syscall which
checks flags which will be cleaner...

> Few Comments on this new MAP_ flag
> 
> 0] The name at least needs to be MAP_MSYNC because only meta-data is
>     synced not the data pointed to. That is the responsibility of the app

So we actually do normal fdatasync() call so we do flush data as well. This
way we don't have to be afraid of stale data exposure or other strange
effects. So I've kept the name to be MAP_SYNC.

> 1] This flag you have named MAP_SYNC but it is very much related to
>    dax and the ability for user-mode to "flush" the data pointed by this
>    now "synced" meta data.
>    For example in ext4, this flag set on an inode that is *not* IS_DAX
>    should fail the mmap. Because there is no point of synced meta if the
>    data is actually in page-cache and we know for sure it was not yet synced,
>    And there is no way for user-mode to directly "sync" the data as well.

Yes, done.

> 2] The code should be constructed that the default check for the MAP_SYNC
>    should fail, and only Hopped in FSs are allowed.
>    (So not to modify all Implementations of file_operations->mmap() )

Agreed but for now I've skipped this as I wait for new mmap syscall and
how Dan implements flag checking there.

> 3] /dev/pmem could start serving DAX pages in mmap, if asked for MAP_MSYNC
>    (which is also an API that says "I know I need to cl_flush". See 1. )

MAP_SYNC is rather more like: I can also use clflush instead of
fdatasync(2). And this is rather important as all legacy applications are
100% safe in the new scheme.
 
> 4] Once we have this flag. And properly implemented at least in one FS
>    and optionally in /dev/pmemX we no longer have any justification for
>    /dev/daxX and it can die a slow and happy death.

This will be more complex I guess - see MAP_DIRECT proposal...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults
  2017-08-14 16:03                       ` Dan Williams
  2017-08-15  9:06                         ` Boaz Harrosh
@ 2017-08-21 19:57                         ` Ross Zwisler
  1 sibling, 0 replies; 41+ messages in thread
From: Ross Zwisler @ 2017-08-21 19:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, linux-ext4, Jan Kara, linux-nvdimm,
	Dave Chinner, linux-xfs, Andy Lutomirski, Linux FS Devel,
	Boaz Harrosh, Amit Golander

On Mon, Aug 14, 2017 at 09:03:59AM -0700, Dan Williams wrote:
> On Mon, Aug 14, 2017 at 7:04 AM, Boaz Harrosh <boazh@netapp.com> wrote:
> >
> > Thank you Jan, I'm patiently waiting for this MAP_SYNC flag since I asked for
> > it in 2014. I'm so glad its time is finally do.
> >
> > Thank you for working on this. Please CC me on future patches.
> > (note the new Netapp email)
> >
> > On 13/08/17 12:25, Christoph Hellwig wrote:
> >> On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
> >>> How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
> >>> should get -EINVAL, and on new kernels it means SYNC+SHARED.
> >>
> >> Cute trick, but I'd hate to waster it just for our little flag.
> >>
> >> How about:
> >>
> >> #define __MAP_VALIDATE                MAP_SHARED|MAP_PRIVATE
> >> #define MAP_SYNC              0x??? | __MAP_VALIDATE
> >>
> >> so that we can reuse that trick for any new flag?
> >>
> >
> > YES! And please create a mask for all new flags and in validation
> > code if ((m_flags & __MAP_VALIDATE) == __MAP_VALIDATE) then you
> > want that (m_flags & __MAP_NEWFLAGS) does not come empty, this
> > way you actually preserve the old check that SHARED and PRIVATE
> > do not co exist.
> >
> > Few Comments on this new MAP_ flag
> >
> > 0] The name at least needs to be MAP_MSYNC because only meta-data is
> >     synced not the data pointed to. That is the responsibility of the app
> >
> > 1] This flag you have named MAP_SYNC but it is very much related to
> >    dax and the ability for user-mode to "flush" the data pointed by this
> >    now "synced" meta data.
> >    For example in ext4, this flag set on an inode that is *not* IS_DAX
> >    should fail the mmap. Because there is no point of synced meta if the
> >    data is actually in page-cache and we know for sure it was not yet synced,
> >    And there is no way for user-mode to directly "sync" the data as well.
> >
> > 2] The code should be constructed that the default check for the MAP_SYNC
> >    should fail, and only Hopped in FSs are allowed.
> >    (So not to modify all Implementations of file_operations->mmap() )
> >
> > 3] /dev/pmem could start serving DAX pages in mmap, if asked for MAP_MSYNC
> >    (which is also an API that says "I know I need to cl_flush". See 1. )
> >
> > 4] Once we have this flag. And properly implemented at least in one FS
> >    and optionally in /dev/pmemX we no longer have any justification for
> >    /dev/daxX and it can die a slow and happy death.
> 
> I'm all for replacing /dev/dax with filesystem equivalent
> functionality, but I don't think MAP_SYNC gets us fully there. That's
> what the MAP_DIRECT proposal [1] is meant to address.
> 
> [1]: https://lkml.org/lkml/2017/8/13/160

I think we may also need to get DAX working on raw block devices again before
/dev/dax can go away.  AFAIK some customers still prefer to use DAX on a
per-device basis and not have a filesystem.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2017-08-21 19:55 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-27 13:12 [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jan Kara
2017-07-27 13:12 ` [PATCH 1/7] mm: Remove VM_FAULT_HWPOISON_LARGE_MASK Jan Kara
2017-07-27 21:57   ` Ross Zwisler
2017-08-01 10:52   ` Christoph Hellwig
2017-07-27 13:12 ` [PATCH 2/7] dax: Add sync argument to dax_iomap_fault() Jan Kara
2017-07-27 22:06   ` Ross Zwisler
2017-07-28  9:40     ` Jan Kara
2017-07-27 13:12 ` [PATCH 3/7] dax: Simplify arguments of dax_insert_mapping() Jan Kara
2017-07-27 22:09   ` Ross Zwisler
2017-08-01 10:54   ` Christoph Hellwig
2017-07-27 13:12 ` [PATCH 4/7] dax: Make dax_insert_mapping() return VM_FAULT_ state Jan Kara
2017-07-27 22:22   ` Ross Zwisler
2017-07-28  9:43     ` Jan Kara
2017-07-27 13:12 ` [PATCH 5/7] dax, iomap: Add support for synchronous faults Jan Kara
2017-07-27 22:42   ` Ross Zwisler
2017-08-01 10:56     ` Christoph Hellwig
2017-07-27 13:12 ` [PATCH 6/7] dax: Implement dax_pfn_mkwrite() Jan Kara
2017-07-27 22:53   ` Ross Zwisler
2017-07-27 23:04     ` Ross Zwisler
2017-07-28 10:37     ` Jan Kara
2017-07-27 13:12 ` [PATCH 7/7] ext4: Support for synchronous DAX faults Jan Kara
2017-07-27 22:57   ` Ross Zwisler
2017-07-27 14:09 ` [RFC PATCH 0/7] dax, ext4: Synchronous page faults Jeff Moyer
2017-07-27 21:57   ` Ross Zwisler
2017-07-28  2:05     ` Andy Lutomirski
2017-07-28  9:38       ` Jan Kara
2017-08-01 11:02         ` Christoph Hellwig
2017-08-01 11:26           ` Jan Kara
2017-08-08  0:24             ` Dan Williams
2017-08-11 10:03               ` Christoph Hellwig
2017-08-13  2:44                 ` Dan Williams
2017-08-13  9:25                   ` Christoph Hellwig
2017-08-13 17:08                     ` Dan Williams
2017-08-14  8:30                     ` Jan Kara
2017-08-14 14:04                     ` Boaz Harrosh
2017-08-14 16:03                       ` Dan Williams
2017-08-15  9:06                         ` Boaz Harrosh
2017-08-15  9:44                           ` Boaz Harrosh
2017-08-21 19:57                         ` Ross Zwisler
2017-08-17 16:08                       ` Jan Kara
2017-08-01 10:52 ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).