linux-kernel.vger.kernel.org archive mirror
* [PATCH v8 0/9] DAX fsync/msync support
@ 2016-01-08  5:27 Ross Zwisler
  2016-01-08  5:27 ` [PATCH v8 1/9] dax: fix NULL pointer dereference in __dax_dbg() Ross Zwisler
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Ross Zwisler @ 2016-01-08  5:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

Changes since v7 [1]:

1) Update patch 1 so that we initialize bh->b_bdev before passing it to
get_block() instead of working around the fact that it could still be NULL
after get_block() completes. (Dan)

2) Add a check to dax_radix_entry() so that we WARN_ON_ONCE() and exit
gracefully if we find a page cache entry still in the radix tree when
trying to insert a DAX entry.

This series replaces v7 in the MM tree and in the "akpm" branch of the
linux-next tree.  A working tree can be found here:

https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_v8

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-January/003886.html

Ross Zwisler (9):
  dax: fix NULL pointer dereference in __dax_dbg()
  dax: fix conversion of holes to PMDs
  pmem: add wb_cache_pmem() to the PMEM API
  dax: support dirty DAX entries in radix tree
  mm: add find_get_entries_tag()
  dax: add support for fsync/msync
  ext2: call dax_pfn_mkwrite() for DAX fsync/msync
  ext4: call dax_pfn_mkwrite() for DAX fsync/msync
  xfs: call dax_pfn_mkwrite() for DAX fsync/msync

 arch/x86/include/asm/pmem.h |  11 +--
 fs/block_dev.c              |   2 +-
 fs/dax.c                    | 215 ++++++++++++++++++++++++++++++++++++++++----
 fs/ext2/file.c              |   4 +-
 fs/ext4/file.c              |   4 +-
 fs/inode.c                  |   2 +-
 fs/xfs/xfs_file.c           |   7 +-
 include/linux/dax.h         |   7 ++
 include/linux/fs.h          |   3 +-
 include/linux/pagemap.h     |   3 +
 include/linux/pmem.h        |  22 ++++-
 include/linux/radix-tree.h  |   9 ++
 mm/filemap.c                |  91 +++++++++++++++++--
 mm/truncate.c               |  69 +++++++-------
 mm/vmscan.c                 |   9 +-
 mm/workingset.c             |   4 +-
 16 files changed, 393 insertions(+), 69 deletions(-)

-- 
2.5.0

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v8 1/9] dax: fix NULL pointer dereference in __dax_dbg()
  2016-01-08  5:27 [PATCH v8 0/9] DAX fsync/msync support Ross Zwisler
@ 2016-01-08  5:27 ` Ross Zwisler
  2016-01-12  9:34   ` Jan Kara
  2016-01-08  5:27 ` [PATCH v8 2/9] dax: fix conversion of holes to PMDs Ross Zwisler
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Ross Zwisler @ 2016-01-08  5:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

In __dax_pmd_fault() we currently assume that get_block() will always set
bh.b_bdev and we unconditionally dereference it in __dax_dbg().  This
assumption isn't always true - when called for reads of holes
ext4_dax_mmap_get_block() returns a buffer head where bh->b_bdev is never
set.  I hit this BUG while testing the DAX PMD fault path.

Instead, initialize bh.b_bdev before passing bh into get_block().  It is
possible that the filesystem's get_block() will update bh.b_bdev, and this
is fine - we just want to initialize bh.b_bdev to something reasonable so
that the calls to __dax_dbg() work and print something useful.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/dax.c b/fs/dax.c
index 7af8797..513bba5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -624,6 +624,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	}
 
 	memset(&bh, 0, sizeof(bh));
+	bh.b_bdev = inode->i_sb->s_bdev;
 	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
 
 	bh.b_size = PMD_SIZE;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 2/9] dax: fix conversion of holes to PMDs
  2016-01-08  5:27 [PATCH v8 0/9] DAX fsync/msync support Ross Zwisler
  2016-01-08  5:27 ` [PATCH v8 1/9] dax: fix NULL pointer dereference in __dax_dbg() Ross Zwisler
@ 2016-01-08  5:27 ` Ross Zwisler
  2016-01-12  9:44   ` Jan Kara
  2016-01-08  5:27 ` [PATCH v8 3/9] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Ross Zwisler @ 2016-01-08  5:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

When we get a DAX PMD fault for a write it is possible that there could be
some number of 4k zero pages already present for the same range that were
inserted to service reads from a hole.  These 4k zero pages need to be
unmapped from the VMAs and removed from the struct address_space radix tree
before the real DAX PMD entry can be inserted.

For PTE faults this same use case also exists and is handled by a
combination of unmap_mapping_range() to unmap the VMAs and
delete_from_page_cache() to remove the page from the address_space radix
tree.

For PMD faults we do have a call to unmap_mapping_range() (protected by a
buffer_new() check), but nothing clears out the radix tree entry.  The
buffer_new() check is also incorrect as the current ext4 and XFS filesystem
code will never return a buffer_head with BH_New set, even when allocating
new blocks over a hole.  Instead the filesystem will zero the blocks
manually and return a buffer_head with only BH_Mapped set.

Fix this situation by removing the buffer_new() check and adding a call to
truncate_inode_pages_range() to clear out the radix tree entries before we
insert the DAX PMD.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 513bba5..5b84a46 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -589,6 +589,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	bool write = flags & FAULT_FLAG_WRITE;
 	struct block_device *bdev;
 	pgoff_t size, pgoff;
+	loff_t lstart, lend;
 	sector_t block;
 	int result = 0;
 
@@ -643,15 +644,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		goto fallback;
 	}
 
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (buffer_new(&bh)) {
-		i_mmap_unlock_read(mapping);
-		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
-		i_mmap_lock_read(mapping);
-	}
+	/* make sure no process has any zero pages covering this hole */
+	lstart = pgoff << PAGE_SHIFT;
+	lend = lstart + PMD_SIZE - 1; /* inclusive */
+	i_mmap_unlock_read(mapping);
+	unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
+	truncate_inode_pages_range(mapping, lstart, lend);
+	i_mmap_lock_read(mapping);
 
 	/*
 	 * If a truncate happened while we were allocating blocks, we may
@@ -665,7 +664,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		goto out;
 	}
 	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(&bh, address, "pgoff unaligned");
+		dax_pmd_dbg(&bh, address,
+				"offset + huge page size > file size");
 		goto fallback;
 	}
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 3/9] pmem: add wb_cache_pmem() to the PMEM API
  2016-01-08  5:27 [PATCH v8 0/9] DAX fsync/msync support Ross Zwisler
  2016-01-08  5:27 ` [PATCH v8 1/9] dax: fix NULL pointer dereference in __dax_dbg() Ross Zwisler
  2016-01-08  5:27 ` [PATCH v8 2/9] dax: fix conversion of holes to PMDs Ross Zwisler
@ 2016-01-08  5:27 ` Ross Zwisler
  2016-01-08  5:27 ` [PATCH v8 4/9] dax: support dirty DAX entries in radix tree Ross Zwisler
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Ross Zwisler @ 2016-01-08  5:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

The function __arch_wb_cache_pmem() was already an internal implementation
detail of the x86 PMEM API, but this functionality needs to be exported as
part of the general PMEM API to handle the fsync/msync case for DAX mmaps.

One thing worth noting is that we really do want this to be part of the
PMEM API as opposed to a stand-alone function like clflush_cache_range()
because of ordering restrictions.  By having wb_cache_pmem() as part of the
PMEM API we can leave it unordered, call it multiple times to write back
large amounts of memory, and then order the multiple calls with a single
wmb_pmem().
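
As a rough illustration of that model, a caller can issue the unordered
write-backs in a loop and fence once at the end.  A minimal sketch, assuming
a hypothetical flush_range array built by the caller (wb_cache_pmem() and
wmb_pmem() are the PMEM API calls from this series):

#include <linux/pmem.h>

/*
 * Sketch: write back several persistent-memory ranges.  Each
 * wb_cache_pmem() call is unordered on its own; the single wmb_pmem()
 * at the end orders all of them.  'struct flush_range' and the caller
 * that builds the array are hypothetical.
 */
struct flush_range {
	void __pmem	*addr;
	size_t		size;
};

static void flush_ranges(struct flush_range *r, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		wb_cache_pmem(r[i].addr, r[i].size);
	wmb_pmem();	/* one ordering point for all the write-backs above */
}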

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 arch/x86/include/asm/pmem.h | 11 ++++++-----
 include/linux/pmem.h        | 22 +++++++++++++++++++++-
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 1544fab..c57fd1e 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void)
 }
 
 /**
- * __arch_wb_cache_pmem - write back a cache range with CLWB
+ * arch_wb_cache_pmem - write back a cache range with CLWB
  * @vaddr:	virtual start address
  * @size:	number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
  * instruction.  This function requires explicit ordering with an
- * arch_wmb_pmem() call.  This API is internal to the x86 PMEM implementation.
+ * arch_wmb_pmem() call.
  */
-static inline void __arch_wb_cache_pmem(void *vaddr, size_t size)
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
 {
 	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
 	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vaddr = (void __force *)addr;
 	void *vend = vaddr + size;
 	void *p;
 
@@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
 	len = copy_from_iter_nocache(vaddr, bytes, i);
 
 	if (__iter_needs_pmem_wb(i))
-		__arch_wb_cache_pmem(vaddr, bytes);
+		arch_wb_cache_pmem(addr, bytes);
 
 	return len;
 }
@@ -133,7 +134,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 	void *vaddr = (void __force *)addr;
 
 	memset(vaddr, 0, size);
-	__arch_wb_cache_pmem(vaddr, size);
+	arch_wb_cache_pmem(addr, size);
 }
 
 static inline bool __arch_has_wmb_pmem(void)
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index acfea8c..7c3d11a 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	BUG();
 }
+
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	BUG();
+}
 #endif
 
 /*
  * Architectures that define ARCH_HAS_PMEM_API must provide
  * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
- * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
+ * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem()
+ * and arch_has_wmb_pmem().
  */
 static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
 {
@@ -178,4 +184,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size)
 	else
 		default_clear_pmem(addr, size);
 }
+
+/**
+ * wb_cache_pmem - write back processor cache for PMEM memory range
+ * @addr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back the processor cache range starting at 'addr' for 'size' bytes.
+ * This function requires explicit ordering with a wmb_pmem() call.
+ */
+static inline void wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	if (arch_has_pmem_api())
+		arch_wb_cache_pmem(addr, size);
+}
 #endif /* __PMEM_H__ */
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 4/9] dax: support dirty DAX entries in radix tree
  2016-01-08  5:27 [PATCH v8 0/9] DAX fsync/msync support Ross Zwisler
                   ` (2 preceding siblings ...)
  2016-01-08  5:27 ` [PATCH v8 3/9] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
@ 2016-01-08  5:27 ` Ross Zwisler
  2016-01-13  9:44   ` Jan Kara
  2016-01-08  5:27 ` [PATCH v8 5/9] mm: add find_get_entries_tag() Ross Zwisler
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Ross Zwisler @ 2016-01-08  5:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new exceptional
entries into the radix tree that represent dirty DAX PTE or PMD pages.
These exceptional entries will also contain the writeback sectors for the
PTE or PMD faults that we can use at fsync/msync time.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  We rely on
the fact that only one type of exceptional entry can be found in a given
radix tree based on its usage.  This happens for free with DAX vs shmem but
we explicitly prevent shadow entries from being added to radix trees for
DAX mappings.

The only shadow entries that would be generated for DAX radix trees would
be those tracking zero page mappings created for holes.  These pages
would receive minimal benefit from having shadow entries, and the choice
to have only one type of exceptional entry in a given radix tree makes the
logic simpler both in clear_exceptional_entry() and in the rest of DAX.
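
A minimal sketch of how the new entry encoding round-trips, using the
RADIX_DAX_* macros added to include/linux/radix-tree.h below (the helper
function itself is hypothetical, for illustration only):

#include <linux/kernel.h>
#include <linux/radix-tree.h>
#include <linux/types.h>

/*
 * Sketch: pack a writeback sector into a PTE-sized DAX exceptional
 * entry and unpack it again.  The low bits carry the entry type, the
 * remaining bits the sector.
 */
static sector_t dax_entry_roundtrip(sector_t sector)
{
	void *entry = RADIX_DAX_ENTRY(sector, false);	/* false => PTE entry */

	WARN_ON(RADIX_DAX_TYPE(entry) != RADIX_DAX_PTE);
	return RADIX_DAX_SECTOR(entry);			/* equals 'sector' */
}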

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c             |  2 +-
 fs/inode.c                 |  2 +-
 include/linux/dax.h        |  5 ++++
 include/linux/fs.h         |  3 +-
 include/linux/radix-tree.h |  9 ++++++
 mm/filemap.c               | 17 ++++++++----
 mm/truncate.c              | 69 ++++++++++++++++++++++++++--------------------
 mm/vmscan.c                |  9 +++++-
 mm/workingset.c            |  4 +--
 9 files changed, 78 insertions(+), 42 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e123641..303b7cd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -75,7 +75,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index 4c8f719..6e3e5d0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -495,7 +495,7 @@ void clear_inode(struct inode *inode)
 	 */
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
-	BUG_ON(inode->i_data.nrshadows);
+	BUG_ON(inode->i_data.nrexceptional);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..e9d57f68 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 }
+
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ddb7bad..fdab768 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -433,7 +433,8 @@ struct address_space {
 	struct rw_semaphore	i_mmap_rwsem;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
-	unsigned long		nrshadows;	/* number of shadow entries */
+	/* number of shadow or DAX exceptional entries */
+	unsigned long		nrexceptional;
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 57e7d87..7c88ad1 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,6 +51,15 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_SHIFT	4
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
diff --git a/mm/filemap.c b/mm/filemap.c
index 847ee43..7b8be78 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -11,6 +11,7 @@
  */
 #include <linux/export.h>
 #include <linux/compiler.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
 #include <linux/capability.h>
@@ -123,9 +124,9 @@ static void page_cache_tree_delete(struct address_space *mapping,
 	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
 
 	if (shadow) {
-		mapping->nrshadows++;
+		mapping->nrexceptional++;
 		/*
-		 * Make sure the nrshadows update is committed before
+		 * Make sure the nrexceptional update is committed before
 		 * the nrpages update so that final truncate racing
 		 * with reclaim does not see both counters 0 at the
 		 * same time and miss a shadow entry.
@@ -579,9 +580,13 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
+
+		if (WARN_ON(dax_mapping(mapping)))
+			return -EINVAL;
+
 		if (shadowp)
 			*shadowp = p;
-		mapping->nrshadows--;
+		mapping->nrexceptional--;
 		if (node)
 			workingset_node_shadows_dec(node);
 	}
@@ -1245,9 +1250,9 @@ repeat:
 			if (radix_tree_deref_retry(page))
 				goto restart;
 			/*
-			 * A shadow entry of a recently evicted page,
-			 * or a swap entry from shmem/tmpfs.  Return
-			 * it without attempting to raise page count.
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
 			 */
 			goto export;
 		}
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad..e3ee0e2 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -9,6 +9,7 @@
 
 #include <linux/kernel.h>
 #include <linux/backing-dev.h>
+#include <linux/dax.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/swap.h>
@@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
 		return;
 
 	spin_lock_irq(&mapping->tree_lock);
-	/*
-	 * Regular page slots are stabilized by the page lock even
-	 * without the tree itself locked.  These unlocked entries
-	 * need verification under the tree lock.
-	 */
-	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
-		goto unlock;
-	if (*slot != entry)
-		goto unlock;
-	radix_tree_replace_slot(slot, NULL);
-	mapping->nrshadows--;
-	if (!node)
-		goto unlock;
-	workingset_node_shadows_dec(node);
-	/*
-	 * Don't track node without shadow entries.
-	 *
-	 * Avoid acquiring the list_lru lock if already untracked.
-	 * The list_empty() test is safe as node->private_list is
-	 * protected by mapping->tree_lock.
-	 */
-	if (!workingset_node_shadows(node) &&
-	    !list_empty(&node->private_list))
-		list_lru_del(&workingset_shadow_nodes, &node->private_list);
-	__radix_tree_delete_node(&mapping->page_tree, node);
+
+	if (dax_mapping(mapping)) {
+		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
+			mapping->nrexceptional--;
+	} else {
+		/*
+		 * Regular page slots are stabilized by the page lock even
+		 * without the tree itself locked.  These unlocked entries
+		 * need verification under the tree lock.
+		 */
+		if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+					&slot))
+			goto unlock;
+		if (*slot != entry)
+			goto unlock;
+		radix_tree_replace_slot(slot, NULL);
+		mapping->nrexceptional--;
+		if (!node)
+			goto unlock;
+		workingset_node_shadows_dec(node);
+		/*
+		 * Don't track node without shadow entries.
+		 *
+		 * Avoid acquiring the list_lru lock if already untracked.
+		 * The list_empty() test is safe as node->private_list is
+		 * protected by mapping->tree_lock.
+		 */
+		if (!workingset_node_shadows(node) &&
+		    !list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes,
+					&node->private_list);
+		__radix_tree_delete_node(&mapping->page_tree, node);
+	}
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
@@ -228,7 +237,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		return;
 
 	/* Offsets within partial pages */
@@ -402,7 +411,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
  */
 void truncate_inode_pages_final(struct address_space *mapping)
 {
-	unsigned long nrshadows;
+	unsigned long nrexceptional;
 	unsigned long nrpages;
 
 	/*
@@ -416,14 +425,14 @@ void truncate_inode_pages_final(struct address_space *mapping)
 
 	/*
 	 * When reclaim installs eviction entries, it increases
-	 * nrshadows first, then decreases nrpages.  Make sure we see
+	 * nrexceptional first, then decreases nrpages.  Make sure we see
 	 * this in the right order or we might miss an entry.
 	 */
 	nrpages = mapping->nrpages;
 	smp_rmb();
-	nrshadows = mapping->nrshadows;
+	nrexceptional = mapping->nrexceptional;
 
-	if (nrpages || nrshadows) {
+	if (nrpages || nrexceptional) {
 		/*
 		 * As truncation uses a lockless tree lookup, cycle
 		 * the tree lock to make sure any ongoing tree
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44ec50f..30e0cd7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -46,6 +46,7 @@
 #include <linux/oom.h>
 #include <linux/prefetch.h>
 #include <linux/printk.h>
+#include <linux/dax.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -671,9 +672,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		 * inode reclaim needs to empty out the radix tree or
 		 * the nodes are lost.  Don't plant shadows behind its
 		 * back.
+		 *
+		 * We also don't store shadows for DAX mappings because the
+		 * only page cache pages found in these are zero pages
+		 * covering holes, and because we don't want to mix DAX
+		 * exceptional entries and shadow exceptional entries in the
+		 * same page_tree.
 		 */
 		if (reclaimed && page_is_file_cache(page) &&
-		    !mapping_exiting(mapping))
+		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow, memcg);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
diff --git a/mm/workingset.c b/mm/workingset.c
index aa01713..61ead9e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -351,8 +351,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 			node->slots[i] = NULL;
 			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
 			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
-			BUG_ON(!mapping->nrshadows);
-			mapping->nrshadows--;
+			BUG_ON(!mapping->nrexceptional);
+			mapping->nrexceptional--;
 		}
 	}
 	BUG_ON(node->count);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 5/9] mm: add find_get_entries_tag()
  2016-01-08  5:27 [PATCH v8 0/9] DAX fsync/msync support Ross Zwisler
                   ` (3 preceding siblings ...)
  2016-01-08  5:27 ` [PATCH v8 4/9] dax: support dirty DAX entries in radix tree Ross Zwisler
@ 2016-01-08  5:27 ` Ross Zwisler
  2016-01-08  5:27 ` [PATCH v8 6/9] dax: add support for fsync/msync Ross Zwisler
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Ross Zwisler @ 2016-01-08  5:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

Add find_get_entries_tag() to the family of functions that include
find_get_entries(), find_get_pages() and find_get_pages_tag().  This is
needed for DAX dirty page handling because we need a list of both page
offsets and radix tree entries ('indices' and 'entries' in this function)
that are marked with the PAGECACHE_TAG_TOWRITE tag.
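
A minimal sketch of the intended calling pattern, mirroring the pagevec loop
used by the DAX fsync patch later in this series (handle_entry() is a
hypothetical per-entry callback and is expected to clear the TOWRITE tag for
each entry it processes, so the walk terminates):

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>

/*
 * Sketch: visit every PAGECACHE_TAG_TOWRITE entry starting at 'start',
 * receiving both the radix tree entries and their page cache indices.
 * handle_entry() is a hypothetical callback that also clears the
 * TOWRITE tag on the entry it handled.
 */
static void walk_towrite_entries(struct address_space *mapping, pgoff_t start)
{
	pgoff_t indices[PAGEVEC_SIZE];
	struct pagevec pvec;
	int i;

	pagevec_init(&pvec, 0);
	for (;;) {
		pvec.nr = find_get_entries_tag(mapping, start,
				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
				pvec.pages, indices);
		if (pvec.nr == 0)
			break;

		for (i = 0; i < pvec.nr; i++)
			handle_entry(mapping, indices[i], pvec.pages[i]);
	}
}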

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/pagemap.h |  3 +++
 mm/filemap.c            | 68 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4d08b6c..92395a0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -361,6 +361,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
 			       unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+			int tag, unsigned int nr_entries,
+			struct page **entries, pgoff_t *indices);
 
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
 			pgoff_t index, unsigned flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b8be78..1e215fc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1499,6 +1499,74 @@ repeat:
 }
 EXPORT_SYMBOL(find_get_pages_tag);
 
+/**
+ * find_get_entries_tag - find and return entries that match @tag
+ * @mapping:	the address_space to search
+ * @start:	the starting page cache index
+ * @tag:	the tag index
+ * @nr_entries:	the maximum number of entries
+ * @entries:	where the resulting entries are placed
+ * @indices:	the cache indices corresponding to the entries in @entries
+ *
+ * Like find_get_entries, except we only return entries which are tagged with
+ * @tag.
+ */
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+			int tag, unsigned int nr_entries,
+			struct page **entries, pgoff_t *indices)
+{
+	void **slot;
+	unsigned int ret = 0;
+	struct radix_tree_iter iter;
+
+	if (!nr_entries)
+		return 0;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_tagged(slot, &mapping->page_tree,
+				   &iter, start, tag) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				goto restart;
+			}
+
+			/*
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
+			 */
+			goto export;
+		}
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+export:
+		indices[ret] = iter.index;
+		entries[ret] = page;
+		if (++ret == nr_entries)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL(find_get_entries_tag);
+
 /*
  * CD/DVDs are error prone. When a medium error occurs, the driver may fail
  * a _large_ part of the i/o request. Imagine the worst scenario:
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 6/9] dax: add support for fsync/msync
  2016-01-08  5:27 [PATCH v8 0/9] DAX fsync/msync support Ross Zwisler
                   ` (4 preceding siblings ...)
  2016-01-08  5:27 ` [PATCH v8 5/9] mm: add find_get_entries_tag() Ross Zwisler
@ 2016-01-08  5:27 ` Ross Zwisler
  2016-01-12 10:57   ` Jan Kara
  2016-02-06 14:33   ` Dmitry Monakhov
  2016-01-08  5:27 ` [PATCH v8 7/9] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Ross Zwisler
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 28+ messages in thread
From: Ross Zwisler @ 2016-01-08  5:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

To properly handle fsync/msync in an efficient way DAX needs to track dirty
pages so it is able to flush them durably to media on demand.

The tracking of dirty pages is done via the radix tree in struct
address_space.  This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file, and
it already has support for exceptional (non struct page*) entries.  We
build upon these features to add exceptional entries to the radix tree for
DAX dirty PMD or PTE pages at fault time.
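
For context, a minimal sketch of how the new tracking gets consumed at sync
time, assuming a hypothetical filesystem fsync method;
filemap_write_and_wait_range() learns about DAX exceptional entries in this
patch:

#include <linux/fs.h>

/*
 * Sketch: a filesystem fsync path.  For a DAX mapping with exceptional
 * entries, filemap_write_and_wait_range() (as modified by this patch)
 * calls dax_writeback_mapping_range(), which flushes each dirty
 * PTE/PMD range with wb_cache_pmem() and fences once with wmb_pmem().
 * The function name is hypothetical; real filesystems also take their
 * own locks and write metadata here.
 */
static int example_dax_fsync(struct file *file, loff_t start, loff_t end,
			     int datasync)
{
	return filemap_write_and_wait_range(file->f_mapping, start, end);
}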

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dax.h |   2 +
 mm/filemap.c        |   6 ++
 3 files changed, 196 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5b84a46..0db21ea 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -24,6 +24,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/pagevec.h>
 #include <linux/pmem.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
@@ -324,6 +325,174 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
+#define NO_SECTOR -1
+
+static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
+		sector_t sector, bool pmd_entry, bool dirty)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int type, error = 0;
+	void *entry;
+
+	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = radix_tree_lookup(page_tree, index);
+
+	if (entry) {
+		type = RADIX_DAX_TYPE(entry);
+		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
+					type != RADIX_DAX_PMD)) {
+			error = -EIO;
+			goto unlock;
+		}
+
+		if (!pmd_entry || type == RADIX_DAX_PMD)
+			goto dirty;
+		radix_tree_delete(&mapping->page_tree, index);
+		mapping->nrexceptional--;
+	}
+
+	if (sector == NO_SECTOR) {
+		/*
+		 * This can happen during correct operation if our pfn_mkwrite
+		 * fault raced against a hole punch operation.  If this
+		 * happens the pte that was hole punched will have been
+		 * unmapped and the radix tree entry will have been removed by
+		 * the time we are called, but the call will still happen.  We
+		 * will return all the way up to wp_pfn_shared(), where the
+		 * pte_same() check will fail, eventually causing page fault
+		 * to be retried by the CPU.
+		 */
+		goto unlock;
+	}
+
+	error = radix_tree_insert(page_tree, index,
+			RADIX_DAX_ENTRY(sector, pmd_entry));
+	if (error)
+		goto unlock;
+
+	mapping->nrexceptional++;
+ dirty:
+	if (dirty)
+		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
+ unlock:
+	spin_unlock_irq(&mapping->tree_lock);
+	return error;
+}
+
+static int dax_writeback_one(struct block_device *bdev,
+		struct address_space *mapping, pgoff_t index, void *entry)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int type = RADIX_DAX_TYPE(entry);
+	struct radix_tree_node *node;
+	struct blk_dax_ctl dax;
+	void **slot;
+	int ret = 0;
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+
+	/* another fsync thread may have already written back this entry */
+	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+		goto unlock;
+
+	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+
+	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
+		ret = -EIO;
+		goto unlock;
+	}
+
+	dax.sector = RADIX_DAX_SECTOR(entry);
+	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/*
+	 * We cannot hold tree_lock while calling dax_map_atomic() because it
+	 * eventually calls cond_resched().
+	 */
+	ret = dax_map_atomic(bdev, &dax);
+	if (ret < 0)
+		return ret;
+
+	if (WARN_ON_ONCE(ret < dax.size)) {
+		ret = -EIO;
+		goto unmap;
+	}
+
+	wb_cache_pmem(dax.addr, dax.size);
+ unmap:
+	dax_unmap_atomic(bdev, &dax);
+	return ret;
+
+ unlock:
+	spin_unlock_irq(&mapping->tree_lock);
+	return ret;
+}
+
+/*
+ * Flush the mapping to the persistent domain within the byte range of [start,
+ * end]. This is required by data integrity operations to ensure file data is
+ * on persistent storage prior to completion of the operation.
+ */
+int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+		loff_t end)
+{
+	struct inode *inode = mapping->host;
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	pgoff_t indices[PAGEVEC_SIZE];
+	pgoff_t start_page, end_page;
+	struct pagevec pvec;
+	void *entry;
+	int i, ret = 0;
+
+	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
+		return -EIO;
+
+	rcu_read_lock();
+	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
+	rcu_read_unlock();
+
+	/* see if the start of our range is covered by a PMD entry */
+	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		start &= PMD_MASK;
+
+	start_page = start >> PAGE_CACHE_SHIFT;
+	end_page = end >> PAGE_CACHE_SHIFT;
+
+	tag_pages_for_writeback(mapping, start_page, end_page);
+
+	pagevec_init(&pvec, 0);
+	while (1) {
+		pvec.nr = find_get_entries_tag(mapping, start_page,
+				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
+				pvec.pages, indices);
+
+		if (pvec.nr == 0)
+			break;
+
+		for (i = 0; i < pvec.nr; i++) {
+			ret = dax_writeback_one(bdev, mapping, indices[i],
+					pvec.pages[i]);
+			if (ret < 0)
+				return ret;
+		}
+	}
+	wmb_pmem();
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
@@ -363,6 +532,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
+	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
+			vmf->flags & FAULT_FLAG_WRITE);
+	if (error)
+		goto out;
+
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
 
  out:
@@ -487,6 +661,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		delete_from_page_cache(page);
 		unlock_page(page);
 		page_cache_release(page);
+		page = NULL;
 	}
 
 	/*
@@ -591,7 +766,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	pgoff_t size, pgoff;
 	loff_t lstart, lend;
 	sector_t block;
-	int result = 0;
+	int error, result = 0;
 
 	/* dax pmd mappings require pfn_t_devmap() */
 	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -733,6 +908,16 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, &dax);
 
+		if (write) {
+			error = dax_radix_entry(mapping, pgoff, dax.sector,
+					true, true);
+			if (error) {
+				dax_pmd_dbg(&bh, address,
+						"PMD radix insertion failed");
+				goto fallback;
+			}
+		}
+
 		dev_dbg(part_to_dev(bdev->bd_part),
 				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
 				__func__, current->comm, address,
@@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
- *
  */
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+	struct file *file = vma->vm_file;
 
-	sb_start_pagefault(sb);
-	file_update_time(vma->vm_file);
-	sb_end_pagefault(sb);
+	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9d57f68..8204c3d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,4 +41,6 @@ static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
+int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+		loff_t end);
 #endif
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e215fc..2e7c8d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -482,6 +482,12 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;
 
+	if (dax_mapping(mapping) && mapping->nrexceptional) {
+		err = dax_writeback_mapping_range(mapping, lstart, lend);
+		if (err)
+			return err;
+	}
+
 	if (mapping->nrpages) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 7/9] ext2: call dax_pfn_mkwrite() for DAX fsync/msync
  2016-01-08  5:27 [PATCH v8 0/9] DAX fsync/msync support Ross Zwisler
                   ` (5 preceding siblings ...)
  2016-01-08  5:27 ` [PATCH v8 6/9] dax: add support for fsync/msync Ross Zwisler
@ 2016-01-08  5:27 ` Ross Zwisler
  2016-01-08  5:27 ` [PATCH v8 8/9] ext4: " Ross Zwisler
  2016-01-08  5:27 ` [PATCH v8 9/9] xfs: " Ross Zwisler
  8 siblings, 0 replies; 28+ messages in thread
From: Ross Zwisler @ 2016-01-08  5:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

To properly support the new DAX fsync/msync infrastructure filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/file.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 11a42c5..2c88d68 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 {
 	struct inode *inode = file_inode(vma->vm_file);
 	struct ext2_inode_info *ei = EXT2_I(inode);
-	int ret = VM_FAULT_NOPAGE;
 	loff_t size;
+	int ret;
 
 	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
@@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vma, vmf);
 
 	up_read(&ei->dax_sem);
 	sb_end_pagefault(inode->i_sb);
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 8/9] ext4: call dax_pfn_mkwrite() for DAX fsync/msync
  2016-01-08  5:27 [PATCH v8 0/9] DAX fsync/msync support Ross Zwisler
                   ` (6 preceding siblings ...)
  2016-01-08  5:27 ` [PATCH v8 7/9] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Ross Zwisler
@ 2016-01-08  5:27 ` Ross Zwisler
  2016-01-08  5:27 ` [PATCH v8 9/9] xfs: " Ross Zwisler
  8 siblings, 0 replies; 28+ messages in thread
From: Ross Zwisler @ 2016-01-08  5:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

To properly support the new DAX fsync/msync infrastructure filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 60683ab..fa899c9 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -291,8 +291,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
 {
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
-	int ret = VM_FAULT_NOPAGE;
 	loff_t size;
+	int ret;
 
 	sb_start_pagefault(sb);
 	file_update_time(vma->vm_file);
@@ -300,6 +300,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vma, vmf);
 	up_read(&EXT4_I(inode)->i_mmap_sem);
 	sb_end_pagefault(sb);
 
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v8 9/9] xfs: call dax_pfn_mkwrite() for DAX fsync/msync
  2016-01-08  5:27 [PATCH v8 0/9] DAX fsync/msync support Ross Zwisler
                   ` (7 preceding siblings ...)
  2016-01-08  5:27 ` [PATCH v8 8/9] ext4: " Ross Zwisler
@ 2016-01-08  5:27 ` Ross Zwisler
  8 siblings, 0 replies; 28+ messages in thread
From: Ross Zwisler @ 2016-01-08  5:27 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

To properly support the new DAX fsync/msync infrastructure filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_file.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ebe9b82..55e16e2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1610,9 +1610,8 @@ xfs_filemap_pmd_fault(
 /*
  * pfn_mkwrite was originally inteneded to ensure we capture time stamp
  * updates on write faults. In reality, it's need to serialise against
- * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite()
- * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault
- * barrier in place.
+ * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED
+ * to ensure we serialise the fault barrier in place.
  */
 static int
 xfs_filemap_pfn_mkwrite(
@@ -1635,6 +1634,8 @@ xfs_filemap_pfn_mkwrite(
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else if (IS_DAX(inode))
+		ret = dax_pfn_mkwrite(vma, vmf);
 	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
-- 
2.5.0

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 1/9] dax: fix NULL pointer dereference in __dax_dbg()
  2016-01-08  5:27 ` [PATCH v8 1/9] dax: fix NULL pointer dereference in __dax_dbg() Ross Zwisler
@ 2016-01-12  9:34   ` Jan Kara
  2016-01-13  7:08     ` Ross Zwisler
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2016-01-12  9:34 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Thu 07-01-16 22:27:51, Ross Zwisler wrote:
> In __dax_pmd_fault() we currently assume that get_block() will always set
> bh.b_bdev and we unconditionally dereference it in __dax_dbg().  This
> assumption isn't always true - when called for reads of holes
> ext4_dax_mmap_get_block() returns a buffer head where bh->b_bdev is never
> set.  I hit this BUG while testing the DAX PMD fault path.
> 
> Instead, initialize bh.b_bdev before passing bh into get_block().  It is
> possible that the filesystem's get_block() will update bh.b_bdev, and this
> is fine - we just want to initialize bh.b_bdev to something reasonable so
> that the calls to __dax_dbg() work and print something useful.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>

Looks good. But don't you need to do the same for __dax_fault(),
dax_zero_page_range() and similar places passing bh to dax functions?

								Honza
> ---
>  fs/dax.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 7af8797..513bba5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -624,6 +624,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	}
>  
>  	memset(&bh, 0, sizeof(bh));
> +	bh.b_bdev = inode->i_sb->s_bdev;
>  	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
>  
>  	bh.b_size = PMD_SIZE;
> -- 
> 2.5.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 2/9] dax: fix conversion of holes to PMDs
  2016-01-08  5:27 ` [PATCH v8 2/9] dax: fix conversion of holes to PMDs Ross Zwisler
@ 2016-01-12  9:44   ` Jan Kara
  2016-01-13  7:37     ` Ross Zwisler
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2016-01-12  9:44 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Thu 07-01-16 22:27:52, Ross Zwisler wrote:
> When we get a DAX PMD fault for a write it is possible that there could be
> some number of 4k zero pages already present for the same range that were
> inserted to service reads from a hole.  These 4k zero pages need to be
> unmapped from the VMAs and removed from the struct address_space radix tree
> before the real DAX PMD entry can be inserted.
> 
> For PTE faults this same use case also exists and is handled by a
> combination of unmap_mapping_range() to unmap the VMAs and
> delete_from_page_cache() to remove the page from the address_space radix
> tree.
> 
> For PMD faults we do have a call to unmap_mapping_range() (protected by a
> buffer_new() check), but nothing clears out the radix tree entry.  The
> buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> code will never return a buffer_head with BH_New set, even when allocating
> new blocks over a hole.  Instead the filesystem will zero the blocks
> manually and return a buffer_head with only BH_Mapped set.
> 
> Fix this situation by removing the buffer_new() check and adding a call to
> truncate_inode_pages_range() to clear out the radix tree entries before we
> insert the DAX PMD.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Dan Williams <dan.j.williams@intel.com>
> Tested-by: Dan Williams <dan.j.williams@intel.com>

Just two nits below. Nothing serious so you can add:

Reviewed-by: Jan Kara <jack@suse.cz>

> ---
>  fs/dax.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 513bba5..5b84a46 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -589,6 +589,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	bool write = flags & FAULT_FLAG_WRITE;
>  	struct block_device *bdev;
>  	pgoff_t size, pgoff;
> +	loff_t lstart, lend;
>  	sector_t block;
>  	int result = 0;
>  
> @@ -643,15 +644,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		goto fallback;
>  	}
>  
> -	/*
> -	 * If we allocated new storage, make sure no process has any
> -	 * zero pages covering this hole
> -	 */
> -	if (buffer_new(&bh)) {
> -		i_mmap_unlock_read(mapping);
> -		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> -		i_mmap_lock_read(mapping);
> -	}
> +	/* make sure no process has any zero pages covering this hole */
> +	lstart = pgoff << PAGE_SHIFT;
> +	lend = lstart + PMD_SIZE - 1; /* inclusive */
> +	i_mmap_unlock_read(mapping);

Just a nit, but is there a reason why we grab i_mmap_lock_read(mapping) only
to release it a few lines below? The bh checks inside the locked region
don't seem to rely on i_mmap_lock...

> +	unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> +	truncate_inode_pages_range(mapping, lstart, lend);

These two calls can be shortened as:

truncate_pagecache_range(inode, lstart, lend);


								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 6/9] dax: add support for fsync/msync
  2016-01-08  5:27 ` [PATCH v8 6/9] dax: add support for fsync/msync Ross Zwisler
@ 2016-01-12 10:57   ` Jan Kara
  2016-01-13  7:30     ` Ross Zwisler
  2016-02-06 14:33   ` Dmitry Monakhov
  1 sibling, 1 reply; 28+ messages in thread
From: Jan Kara @ 2016-01-12 10:57 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Thu 07-01-16 22:27:56, Ross Zwisler wrote:
> To properly handle fsync/msync in an efficient way DAX needs to track dirty
> pages so it is able to flush them durably to media on demand.
> 
> The tracking of dirty pages is done via the radix tree in struct
> address_space.  This radix tree is already used by the page writeback
> infrastructure for tracking dirty pages associated with an open file, and
> it already has support for exceptional (non struct page*) entries.  We
> build upon these features to add exceptional entries to the radix tree for
> DAX dirty PMD or PTE pages at fault time.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Some comments below.

> ---
>  fs/dax.c            | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  include/linux/dax.h |   2 +
>  mm/filemap.c        |   6 ++
>  3 files changed, 196 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5b84a46..0db21ea 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -24,6 +24,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/mm.h>
>  #include <linux/mutex.h>
> +#include <linux/pagevec.h>
>  #include <linux/pmem.h>
>  #include <linux/sched.h>
>  #include <linux/uio.h>
> @@ -324,6 +325,174 @@ static int copy_user_bh(struct page *to, struct inode *inode,
>  	return 0;
>  }
>  
> +#define NO_SECTOR -1
> +
> +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> +		sector_t sector, bool pmd_entry, bool dirty)
> +{
> +	struct radix_tree_root *page_tree = &mapping->page_tree;
> +	int type, error = 0;
> +	void *entry;
> +
> +	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	entry = radix_tree_lookup(page_tree, index);
> +
> +	if (entry) {
> +		type = RADIX_DAX_TYPE(entry);
> +		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> +					type != RADIX_DAX_PMD)) {
> +			error = -EIO;
> +			goto unlock;
> +		}
> +
> +		if (!pmd_entry || type == RADIX_DAX_PMD)
> +			goto dirty;
> +		radix_tree_delete(&mapping->page_tree, index);
> +		mapping->nrexceptional--;

In theory, you can delete a DIRTY / TOWRITE PTE entry here and insert a
clean PMD entry in its place. That will cause fsync() to miss some flushes,
so you should make sure you transfer all the tags to the new entry.
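
A minimal sketch of one way to carry the tags over, assuming it sits at this
point in dax_radix_entry() under mapping->tree_lock (radix_tree_tag_get()
and radix_tree_tag_set() are the existing radix tree helpers):

	/*
	 * Sketch: remember the tags on the PTE entry before deleting it,
	 * then re-apply them to the PMD entry inserted further down.
	 */
	bool dirty_tag = radix_tree_tag_get(page_tree, index,
					    PAGECACHE_TAG_DIRTY);
	bool towrite_tag = radix_tree_tag_get(page_tree, index,
					      PAGECACHE_TAG_TOWRITE);

	radix_tree_delete(page_tree, index);
	mapping->nrexceptional--;

	/* ... radix_tree_insert() of the new PMD entry, as in the patch ... */

	if (dirty_tag)
		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
	if (towrite_tag)
		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_TOWRITE);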

> +static int dax_writeback_one(struct block_device *bdev,
> +		struct address_space *mapping, pgoff_t index, void *entry)
> +{
> +	struct radix_tree_root *page_tree = &mapping->page_tree;
> +	int type = RADIX_DAX_TYPE(entry);
> +	struct radix_tree_node *node;
> +	struct blk_dax_ctl dax;
> +	void **slot;
> +	int ret = 0;
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	/*
> +	 * Regular page slots are stabilized by the page lock even
> +	 * without the tree itself locked.  These unlocked entries
> +	 * need verification under the tree lock.
> +	 */
> +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> +		goto unlock;
> +	if (*slot != entry)
> +		goto unlock;
> +
> +	/* another fsync thread may have already written back this entry */
> +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> +		goto unlock;
> +
> +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> +
> +	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> +		ret = -EIO;
> +		goto unlock;
> +	}
> +
> +	dax.sector = RADIX_DAX_SECTOR(entry);
> +	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> +	spin_unlock_irq(&mapping->tree_lock);

This seems to be somewhat racy as well - if there are two fsyncs running
against the same inode, one wins the race and clears the TOWRITE tag; the
second then bails out and may finish before the skipped page gets flushed.

So we should clear the TOWRITE tag only after the range is flushed.  This
can result in some amount of duplicate flushing, but I don't think the race
will happen frequently enough in practice to be performance relevant.

And secondly: you must write-protect all mappings of the flushed range so
that you get a fault when the sector gets written to again. We spoke about
this in the past already, but somehow it got lost and I forgot about it as
well. You need something like rmap_walk_file()...
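
A rough sketch of the first point, reordering dax_writeback_one() so the
flush happens before the tag is cleared (re-taking tree_lock just for the
tag clear is an assumption, the size check from the patch is omitted, and
the write-protect/rmap walk is a separate change):

	/*
	 * Sketch: flush first, clear PAGECACHE_TAG_TOWRITE only after the
	 * range is known to be written back.  Duplicate flushes between
	 * concurrent fsync() calls are tolerated.
	 */
	dax.sector = RADIX_DAX_SECTOR(entry);
	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
	spin_unlock_irq(&mapping->tree_lock);

	ret = dax_map_atomic(bdev, &dax);
	if (ret < 0)
		return ret;
	wb_cache_pmem(dax.addr, dax.size);
	dax_unmap_atomic(bdev, &dax);

	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
	spin_unlock_irq(&mapping->tree_lock);
	return ret;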

> +	/*
> +	 * We cannot hold tree_lock while calling dax_map_atomic() because it
> +	 * eventually calls cond_resched().
> +	 */
> +	ret = dax_map_atomic(bdev, &dax);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (WARN_ON_ONCE(ret < dax.size)) {
> +		ret = -EIO;
> +		goto unmap;
> +	}
> +
> +	wb_cache_pmem(dax.addr, dax.size);
> + unmap:
> +	dax_unmap_atomic(bdev, &dax);
> +	return ret;
> +
> + unlock:
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return ret;
> +}

...

> @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
>   * dax_pfn_mkwrite - handle first write to DAX page
>   * @vma: The virtual memory area where the fault occurred
>   * @vmf: The description of the fault
> - *
>   */
>  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> +	struct file *file = vma->vm_file;
>  
> -	sb_start_pagefault(sb);
> -	file_update_time(vma->vm_file);
> -	sb_end_pagefault(sb);
> +	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);

Why is NO_SECTOR argument correct here?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 1/9] dax: fix NULL pointer dereference in __dax_dbg()
  2016-01-12  9:34   ` Jan Kara
@ 2016-01-13  7:08     ` Ross Zwisler
  2016-01-13  9:07       ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Ross Zwisler @ 2016-01-13  7:08 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Tue, Jan 12, 2016 at 10:34:58AM +0100, Jan Kara wrote:
> On Thu 07-01-16 22:27:51, Ross Zwisler wrote:
> > In __dax_pmd_fault() we currently assume that get_block() will always set
> > bh.b_bdev and we unconditionally dereference it in __dax_dbg().  This
> > assumption isn't always true - when called for reads of holes
> > ext4_dax_mmap_get_block() returns a buffer head where bh->b_bdev is never
> > set.  I hit this BUG while testing the DAX PMD fault path.
> > 
> > Instead, initialize bh.b_bdev before passing bh into get_block().  It is
> > possible that the filesystem's get_block() will update bh.b_bdev, and this
> > is fine - we just want to initialize bh.b_bdev to something reasonable so
> > that the calls to __dax_dbg() work and print something useful.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> 
> Looks good. But don't you need to do the same for __dax_fault(),
> dax_zero_page_range() and similar places passing bh to dax functions?
> 
> 								Honza

I don't think we need it anywhere else.  The only reason that we need to
initialize the bh.b_bdev manually in the __dax_pmd_fault() path is that if the
get_block() call ends up finding a hole (so doesn't fill out b_bdev) we still
go through the dax_pmd_dbg() path to print an error message, which uses
b_bdev.  I believe that in the other paths where we hit a hole, such as
__dax_fault(), we don't use b_bdev because we don't have the same error path
prints, and the regular code for handling holes doesn't use b_bdev.

That being said, if you feel like it's cleaner to initialize it everywhere so
everything is consistent and we don't have to worry about it, I'm fine to make
the change.

> > ---
> >  fs/dax.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 7af8797..513bba5 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -624,6 +624,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	}
> >  
> >  	memset(&bh, 0, sizeof(bh));
> > +	bh.b_bdev = inode->i_sb->s_bdev;
> >  	block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
> >  
> >  	bh.b_size = PMD_SIZE;
> > -- 
> > 2.5.0
> > 
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 6/9] dax: add support for fsync/msync
  2016-01-12 10:57   ` Jan Kara
@ 2016-01-13  7:30     ` Ross Zwisler
  2016-01-13  9:35       ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Ross Zwisler @ 2016-01-13  7:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Tue, Jan 12, 2016 at 11:57:16AM +0100, Jan Kara wrote:
> On Thu 07-01-16 22:27:56, Ross Zwisler wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> > 
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space.  This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries.  We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> Some comments below.
> 
> > ---
> >  fs/dax.c            | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> >  include/linux/dax.h |   2 +
> >  mm/filemap.c        |   6 ++
> >  3 files changed, 196 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 5b84a46..0db21ea 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -24,6 +24,7 @@
> >  #include <linux/memcontrol.h>
> >  #include <linux/mm.h>
> >  #include <linux/mutex.h>
> > +#include <linux/pagevec.h>
> >  #include <linux/pmem.h>
> >  #include <linux/sched.h>
> >  #include <linux/uio.h>
> > @@ -324,6 +325,174 @@ static int copy_user_bh(struct page *to, struct inode *inode,
> >  	return 0;
> >  }
> >  
> > +#define NO_SECTOR -1
> > +
> > +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> > +		sector_t sector, bool pmd_entry, bool dirty)
> > +{
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int type, error = 0;
> > +	void *entry;
> > +
> > +	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	entry = radix_tree_lookup(page_tree, index);
> > +
> > +	if (entry) {
> > +		type = RADIX_DAX_TYPE(entry);
> > +		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> > +					type != RADIX_DAX_PMD)) {
> > +			error = -EIO;
> > +			goto unlock;
> > +		}
> > +
> > +		if (!pmd_entry || type == RADIX_DAX_PMD)
> > +			goto dirty;
> > +		radix_tree_delete(&mapping->page_tree, index);
> > +		mapping->nrexceptional--;
> 
> In theory, you can delete a DIRTY / TOWRITE PTE entry here and insert a clean
> PMD entry in its place. That will cause fsync() to miss some flushes, so
> you should make sure you transfer all the tags to the new entry.

Ah, great catch, I'll address it in v9 which I'll send out tomorrow.

> > +static int dax_writeback_one(struct block_device *bdev,
> > +		struct address_space *mapping, pgoff_t index, void *entry)
> > +{
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int type = RADIX_DAX_TYPE(entry);
> > +	struct radix_tree_node *node;
> > +	struct blk_dax_ctl dax;
> > +	void **slot;
> > +	int ret = 0;
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	/*
> > +	 * Regular page slots are stabilized by the page lock even
> > +	 * without the tree itself locked.  These unlocked entries
> > +	 * need verification under the tree lock.
> > +	 */
> > +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> > +		goto unlock;
> > +	if (*slot != entry)
> > +		goto unlock;
> > +
> > +	/* another fsync thread may have already written back this entry */
> > +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> > +		goto unlock;
> > +
> > +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> > +
> > +	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> > +		ret = -EIO;
> > +		goto unlock;
> > +	}
> > +
> > +	dax.sector = RADIX_DAX_SECTOR(entry);
> > +	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> > +	spin_unlock_irq(&mapping->tree_lock);
> 
> This seems to be somewhat racy as well - if there are two fsyncs running
> against the same inode, one wins the race and clears TOWRITE tag, the
> second then bails out and may finish before the skipped page gets flushed.
> 
> So we should clear the TOWRITE tag only after the range is flushed.  This
> can result in some amount of duplicate flushing, but I don't think the race
> will happen frequently enough in practice to be performance relevant.

Yep, this makes sense.  I'll also fix that in v9.

> And secondly: You must write-protect all mappings of the flushed range so
> that you get a fault when the sector gets written to again. We spoke about
> this in the past already but somehow it got lost and I forgot about it as
> well. You need something like rmap_walk_file()...

The code that write protected mappings and then cleaned the radix tree entries
did get written, and was part of v2:

https://lkml.org/lkml/2015/11/13/759

I removed all the code that cleaned PTE entries and radix tree entries for v3.
The reason behind this was that there was a race that I couldn't figure out
how to solve between the cleaning of the PTEs and the cleaning of the radix
tree entries.

The race goes like this:

Thread 1 (write)			Thread 2 (fsync)
================			================
wp_pfn_shared()
pfn_mkwrite()
dax_radix_entry()
radix_tree_tag_set(DIRTY)
					dax_writeback_mapping_range()
					dax_writeback_one()
					radix_tag_clear(DIRTY)
					pgoff_mkclean()
... return up to wp_pfn_shared()
wp_page_reuse()
pte_mkdirty()

After this sequence we end up with a dirty PTE that is writeable, but with a
clean radix tree entry.  This means that users can write to the page, but that
a follow-up fsync or msync won't flush this dirty data to media.

The overall issue is that in the write path that goes through wp_pfn_shared(),
the DAX code has control over when the radix tree entry is dirtied but not
when the PTE is made dirty and writeable.  This happens up in wp_page_reuse().
This means that we can't easily add locking, etc. to protect ourselves.

I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
easy solutions presented themselves in the absence of a page lock.  I do have
one idea, but I think it's pretty invasive and will need to wait for another
kernel cycle.

The current code that leaves the radix tree entry will give us correct
behavior - it'll just be less efficient because we will have an ever-growing
dirty set to flush.

> > +	/*
> > +	 * We cannot hold tree_lock while calling dax_map_atomic() because it
> > +	 * eventually calls cond_resched().
> > +	 */
> > +	ret = dax_map_atomic(bdev, &dax);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	if (WARN_ON_ONCE(ret < dax.size)) {
> > +		ret = -EIO;
> > +		goto unmap;
> > +	}
> > +
> > +	wb_cache_pmem(dax.addr, dax.size);
> > + unmap:
> > +	dax_unmap_atomic(bdev, &dax);
> > +	return ret;
> > +
> > + unlock:
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +	return ret;
> > +}
> 
> ...
> 
> > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> >   * dax_pfn_mkwrite - handle first write to DAX page
> >   * @vma: The virtual memory area where the fault occurred
> >   * @vmf: The description of the fault
> > - *
> >   */
> >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >  {
> > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > +	struct file *file = vma->vm_file;
> >  
> > -	sb_start_pagefault(sb);
> > -	file_update_time(vma->vm_file);
> > -	sb_end_pagefault(sb);
> > +	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> 
> Why is NO_SECTOR argument correct here?

Right - so NO_SECTOR means "I expect there to already be an entry in the radix
tree - just make that entry dirty".  This works because pfn_mkwrite() always
follows a normal __dax_fault() or __dax_pmd_fault() call.  These fault calls
will insert the radix tree entry, regardless of whether the fault was for a
read or a write.  If the fault was for a write, the radix tree entry will also
be made dirty.

For reads the radix tree entry will be inserted but left clean.  When the
first write happens we will get a pfn_mkwrite() call, which will call
dax_radix_entry() with the NO_SECTOR argument.  This will look up the radix
tree entry & set the dirty tag.

This also has the added benefit that the pfn_mkwrite() path can remain minimal
- if we needed to actually insert a radix tree entry with sector information
we'd have to duplicate a bunch of the fault path code so that we could call
get_block().
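
To make that contract concrete, here is a hedged sketch of the lookup-only
path NO_SECTOR implies; the helper name is hypothetical, and in the patch the
equivalent logic lives inside dax_radix_entry() itself:

/*
 * Hypothetical helper illustrating the NO_SECTOR semantics described above:
 * the entry (PTE or PMD sized) must already exist, so we only set the dirty
 * tag and never need to call get_block() from the mkwrite path.
 */
static void dax_dirty_existing_entry(struct address_space *mapping,
                                     pgoff_t index)
{
        struct radix_tree_root *page_tree = &mapping->page_tree;

        __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

        spin_lock_irq(&mapping->tree_lock);
        if (radix_tree_lookup(page_tree, index))
                radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
        spin_unlock_irq(&mapping->tree_lock);
}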

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 2/9] dax: fix conversion of holes to PMDs
  2016-01-12  9:44   ` Jan Kara
@ 2016-01-13  7:37     ` Ross Zwisler
  0 siblings, 0 replies; 28+ messages in thread
From: Ross Zwisler @ 2016-01-13  7:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Tue, Jan 12, 2016 at 10:44:51AM +0100, Jan Kara wrote:
> On Thu 07-01-16 22:27:52, Ross Zwisler wrote:
> > When we get a DAX PMD fault for a write it is possible that there could be
> > some number of 4k zero pages already present for the same range that were
> > inserted to service reads from a hole.  These 4k zero pages need to be
> > unmapped from the VMAs and removed from the struct address_space radix tree
> > before the real DAX PMD entry can be inserted.
> > 
> > For PTE faults this same use case also exists and is handled by a
> > combination of unmap_mapping_range() to unmap the VMAs and
> > delete_from_page_cache() to remove the page from the address_space radix
> > tree.
> > 
> > For PMD faults we do have a call to unmap_mapping_range() (protected by a
> > buffer_new() check), but nothing clears out the radix tree entry.  The
> > buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> > code will never return a buffer_head with BH_New set, even when allocating
> > new blocks over a hole.  Instead the filesystem will zero the blocks
> > manually and return a buffer_head with only BH_Mapped set.
> > 
> > Fix this situation by removing the buffer_new() check and adding a call to
> > truncate_inode_pages_range() to clear out the radix tree entries before we
> > insert the DAX PMD.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reported-by: Dan Williams <dan.j.williams@intel.com>
> > Tested-by: Dan Williams <dan.j.williams@intel.com>
> 
> Just two nits below. Nothing serious so you can add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Cool, thank you for the review!

> > ---
> >  fs/dax.c | 20 ++++++++++----------
> >  1 file changed, 10 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 513bba5..5b84a46 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -589,6 +589,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  	bool write = flags & FAULT_FLAG_WRITE;
> >  	struct block_device *bdev;
> >  	pgoff_t size, pgoff;
> > +	loff_t lstart, lend;
> >  	sector_t block;
> >  	int result = 0;
> >  
> > @@ -643,15 +644,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >  		goto fallback;
> >  	}
> >  
> > -	/*
> > -	 * If we allocated new storage, make sure no process has any
> > -	 * zero pages covering this hole
> > -	 */
> > -	if (buffer_new(&bh)) {
> > -		i_mmap_unlock_read(mapping);
> > -		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> > -		i_mmap_lock_read(mapping);
> > -	}
> > +	/* make sure no process has any zero pages covering this hole */
> > +	lstart = pgoff << PAGE_SHIFT;
> > +	lend = lstart + PMD_SIZE - 1; /* inclusive */
> > +	i_mmap_unlock_read(mapping);
> 
> Just a nit, but is there a reason why we grab i_mmap_lock_read(mapping) only
> to release it a few lines below? The bh checks inside the locked region
> don't seem to rely on i_mmap_lock...

I think we can probably just take it when we're done with the truncate() -
I'll fix for v9.

> > +	unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> > +	truncate_inode_pages_range(mapping, lstart, lend);
> 
> These two calls can be shortened as:
> 
> truncate_pagecache_range(inode, lstart, lend);

Nice.  I'll change it for v9.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 1/9] dax: fix NULL pointer dereference in __dax_dbg()
  2016-01-13  7:08     ` Ross Zwisler
@ 2016-01-13  9:07       ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2016-01-13  9:07 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Wed 13-01-16 00:08:29, Ross Zwisler wrote:
> On Tue, Jan 12, 2016 at 10:34:58AM +0100, Jan Kara wrote:
> > On Thu 07-01-16 22:27:51, Ross Zwisler wrote:
> > > In __dax_pmd_fault() we currently assume that get_block() will always set
> > > bh.b_bdev and we unconditionally dereference it in __dax_dbg().  This
> > > assumption isn't always true - when called for reads of holes
> > > ext4_dax_mmap_get_block() returns a buffer head where bh->b_bdev is never
> > > set.  I hit this BUG while testing the DAX PMD fault path.
> > > 
> > > Instead, initialize bh.b_bdev before passing bh into get_block().  It is
> > > possible that the filesystem's get_block() will update bh.b_bdev, and this
> > > is fine - we just want to initialize bh.b_bdev to something reasonable so
> > > that the calls to __dax_dbg() work and print something useful.
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > 
> > Looks good. But don't you need to do the same for __dax_fault(),
> > dax_zero_page_range() and similar places passing bh to dax functions?
> > 
> > 								Honza
> 
> I don't think we need it anywhere else.  The only reason that we need to
> initialize the bh.b_bdev manually in the __dax_pmd_fault() path is that if the
> get_block() call ends up finding a hole (so doesn't fill out b_bdev) we still
> go through the dax_pmd_dbg() path to print an error message, which uses
> b_bdev.  I believe that in the other paths where we hit a hole, such as
> __dax_fault(), we don't use b_bdev because we don't have the same error path
> prints, and the regular code for handling holes doesn't use b_bdev.
> 
> That being said, if you feel like it's cleaner to initialize it
> everywhere so everything is consistent and we don't have to worry about
> it, I'm fine to make the change.

Well, it seems more futureproof to me. In case someone decides to add some
debug message later on...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 6/9] dax: add support for fsync/msync
  2016-01-13  7:30     ` Ross Zwisler
@ 2016-01-13  9:35       ` Jan Kara
  2016-01-13 18:58         ` Ross Zwisler
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2016-01-13  9:35 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Wed 13-01-16 00:30:19, Ross Zwisler wrote:
> > And secondly: You must write-protect all mappings of the flushed range so
> > that you get a fault when the sector gets written to again. We spoke about
> > this in the past already but somehow it got lost and I forgot about it as
> > well. You need something like rmap_walk_file()...
> 
> The code that write protected mappings and then cleaned the radix tree entries
> did get written, and was part of v2:
> 
> https://lkml.org/lkml/2015/11/13/759
> 
> I removed all the code that cleaned PTE entries and radix tree entries for v3.
> The reason behind this was that there was a race that I couldn't figure out
> how to solve between the cleaning of the PTEs and the cleaning of the radix
> tree entries.
> 
> The race goes like this:
> 
> Thread 1 (write)			Thread 2 (fsync)
> ================			================
> wp_pfn_shared()
> pfn_mkwrite()
> dax_radix_entry()
> radix_tree_tag_set(DIRTY)
> 					dax_writeback_mapping_range()
> 					dax_writeback_one()
> 					radix_tag_clear(DIRTY)
> 					pgoff_mkclean()
> ... return up to wp_pfn_shared()
> wp_page_reuse()
> pte_mkdirty()
> 
> After this sequence we end up with a dirty PTE that is writeable, but with a
> clean radix tree entry.  This means that users can write to the page, but that
> a follow-up fsync or msync won't flush this dirty data to media.
> 
> The overall issue is that in the write path that goes through wp_pfn_shared(),
> the DAX code has control over when the radix tree entry is dirtied but not
> when the PTE is made dirty and writeable.  This happens up in wp_page_reuse().
> This means that we can't easily add locking, etc. to protect ourselves.
> 
> I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
> easy solutions presented themselves in the absence of a page lock.  I do have
> one idea, but I think it's pretty invasive and will need to wait for another
> kernel cycle.
> 
> The current code that leaves the radix tree entry will give us correct
> behavior - it'll just be less efficient because we will have an ever-growing
> dirty set to flush.

Ahaa! Somehow I imagined tag_pages_for_writeback() clears DIRTY radix tree
tags but it does not (I should have known, I wrote that function a few
years ago ;). Makes sense. Thanks for the clarification.

> > > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > >   * dax_pfn_mkwrite - handle first write to DAX page
> > >   * @vma: The virtual memory area where the fault occurred
> > >   * @vmf: The description of the fault
> > > - *
> > >   */
> > >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > >  {
> > > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > +	struct file *file = vma->vm_file;
> > >  
> > > -	sb_start_pagefault(sb);
> > > -	file_update_time(vma->vm_file);
> > > -	sb_end_pagefault(sb);
> > > +	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> > 
> > Why is NO_SECTOR argument correct here?
> 
> Right - so NO_SECTOR means "I expect there to already be an entry in the radix
> tree - just make that entry dirty".  This works because pfn_mkwrite() always
> follows a normal __dax_fault() or __dax_pmd_fault() call.  These fault calls
> will insert the radix tree entry, regardless of whether the fault was for a
> read or a write.  If the fault was for a write, the radix tree entry will also
> be made dirty.
>
> For reads the radix tree entry will be inserted but left clean.  When the
> first write happens we will get a pfn_mkwrite() call, which will call
> dax_radix_entry() with the NO_SECTOR argument.  This will look up the radix
> tree entry & set the dirty tag.

So the explanation of this should be somewhere so that everyone knows that
we must have radix tree entries even for clean mapped blocks, because up to
now that was not clear to me.  Also __dax_pmd_fault() seems to insert
entries only for write faults, so the assumption doesn't seem to hold there?

I'm somewhat uneasy that a bug in this logic can be hidden as a simple race
with hole punching. But I guess I can live with that.
 
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 4/9] dax: support dirty DAX entries in radix tree
  2016-01-08  5:27 ` [PATCH v8 4/9] dax: support dirty DAX entries in radix tree Ross Zwisler
@ 2016-01-13  9:44   ` Jan Kara
  2016-01-13 18:48     ` Ross Zwisler
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2016-01-13  9:44 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Thu 07-01-16 22:27:54, Ross Zwisler wrote:
> Add support for tracking dirty DAX entries in the struct address_space
> radix tree.  This tree is already used for dirty page writeback, and it
> already supports the use of exceptional (non struct page*) entries.
> 
> In order to properly track dirty DAX pages we will insert new exceptional
> entries into the radix tree that represent dirty DAX PTE or PMD pages.
> These exceptional entries will also contain the writeback sectors for the
> PTE or PMD faults that we can use at fsync/msync time.
> 
> There are currently two types of exceptional entries (shmem and shadow)
> that can be placed into the radix tree, and this adds a third.  We rely on
> the fact that only one type of exceptional entry can be found in a given
> radix tree based on its usage.  This happens for free with DAX vs shmem but
> we explicitly prevent shadow entries from being added to radix trees for
> DAX mappings.
> 
> The only shadow entries that would be generated for DAX radix trees would
> be to track zero page mappings that were created for holes.  These pages
> would receive minimal benefit from having shadow entries, and the choice
> to have only one type of exceptional entry in a given radix tree makes the
> logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

I have realized there's one issue with this code. See below:

> @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
>  		return;
>  
>  	spin_lock_irq(&mapping->tree_lock);
> -	/*
> -	 * Regular page slots are stabilized by the page lock even
> -	 * without the tree itself locked.  These unlocked entries
> -	 * need verification under the tree lock.
> -	 */
> -	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> -		goto unlock;
> -	if (*slot != entry)
> -		goto unlock;
> -	radix_tree_replace_slot(slot, NULL);
> -	mapping->nrshadows--;
> -	if (!node)
> -		goto unlock;
> -	workingset_node_shadows_dec(node);
> -	/*
> -	 * Don't track node without shadow entries.
> -	 *
> -	 * Avoid acquiring the list_lru lock if already untracked.
> -	 * The list_empty() test is safe as node->private_list is
> -	 * protected by mapping->tree_lock.
> -	 */
> -	if (!workingset_node_shadows(node) &&
> -	    !list_empty(&node->private_list))
> -		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> -	__radix_tree_delete_node(&mapping->page_tree, node);
> +
> +	if (dax_mapping(mapping)) {
> +		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
> +			mapping->nrexceptional--;

So when you punch a hole in a file, you can delete a PMD entry from the radix
tree that covers a part of the file which still stays. So in this case you
have to split the PMD entry into PTE entries (probably that needs to happen
up in truncate_inode_pages_range()) or something similar...
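
A hedged sketch of what such a split could look like; the helper name is
hypothetical, RADIX_DAX_SECTOR()/RADIX_DAX_ENTRY() are assumed to be the
entry helpers from this series, and locking plus radix tree preloading are
glossed over:

/*
 * Hypothetical sketch: replace one PMD-sized DAX entry with PTE-sized
 * entries for the sub-range that survives the hole punch, carrying the
 * dirty tag over so fsync() still flushes it.  Caller holds tree_lock.
 */
static void dax_split_pmd_entry(struct address_space *mapping,
                pgoff_t pmd_index, void *entry,
                pgoff_t keep_start, pgoff_t keep_end)
{
        struct radix_tree_root *page_tree = &mapping->page_tree;
        bool dirty = radix_tree_tag_get(page_tree, pmd_index,
                                        PAGECACHE_TAG_DIRTY);
        sector_t sector = RADIX_DAX_SECTOR(entry);
        pgoff_t index;

        radix_tree_delete(page_tree, pmd_index);
        mapping->nrexceptional--;

        for (index = keep_start; index <= keep_end; index++) {
                sector_t pte_sector = sector +
                        ((index - pmd_index) << (PAGE_SHIFT - 9));

                if (radix_tree_insert(page_tree, index,
                                RADIX_DAX_ENTRY(pte_sector, false)))
                        continue;       /* best effort in a sketch */
                mapping->nrexceptional++;
                if (dirty)
                        radix_tree_tag_set(page_tree, index,
                                           PAGECACHE_TAG_DIRTY);
        }
}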

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 4/9] dax: support dirty DAX entries in radix tree
  2016-01-13  9:44   ` Jan Kara
@ 2016-01-13 18:48     ` Ross Zwisler
  2016-01-15 13:22       ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Ross Zwisler @ 2016-01-13 18:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Wed, Jan 13, 2016 at 10:44:11AM +0100, Jan Kara wrote:
> On Thu 07-01-16 22:27:54, Ross Zwisler wrote:
> > Add support for tracking dirty DAX entries in the struct address_space
> > radix tree.  This tree is already used for dirty page writeback, and it
> > already supports the use of exceptional (non struct page*) entries.
> > 
> > In order to properly track dirty DAX pages we will insert new exceptional
> > entries into the radix tree that represent dirty DAX PTE or PMD pages.
> > These exceptional entries will also contain the writeback sectors for the
> > PTE or PMD faults that we can use at fsync/msync time.
> > 
> > There are currently two types of exceptional entries (shmem and shadow)
> > that can be placed into the radix tree, and this adds a third.  We rely on
> > the fact that only one type of exceptional entry can be found in a given
> > radix tree based on its usage.  This happens for free with DAX vs shmem but
> > we explicitly prevent shadow entries from being added to radix trees for
> > DAX mappings.
> > 
> > The only shadow entries that would be generated for DAX radix trees would
> > be to track zero page mappings that were created for holes.  These pages
> > would receive minimal benefit from having shadow entries, and the choice
> > to have only one type of exceptional entry in a given radix tree makes the
> > logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> 
> I have realized there's one issue with this code. See below:
> 
> > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
> >  		return;
> >  
> >  	spin_lock_irq(&mapping->tree_lock);
> > -	/*
> > -	 * Regular page slots are stabilized by the page lock even
> > -	 * without the tree itself locked.  These unlocked entries
> > -	 * need verification under the tree lock.
> > -	 */
> > -	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> > -		goto unlock;
> > -	if (*slot != entry)
> > -		goto unlock;
> > -	radix_tree_replace_slot(slot, NULL);
> > -	mapping->nrshadows--;
> > -	if (!node)
> > -		goto unlock;
> > -	workingset_node_shadows_dec(node);
> > -	/*
> > -	 * Don't track node without shadow entries.
> > -	 *
> > -	 * Avoid acquiring the list_lru lock if already untracked.
> > -	 * The list_empty() test is safe as node->private_list is
> > -	 * protected by mapping->tree_lock.
> > -	 */
> > -	if (!workingset_node_shadows(node) &&
> > -	    !list_empty(&node->private_list))
> > -		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > -	__radix_tree_delete_node(&mapping->page_tree, node);
> > +
> > +	if (dax_mapping(mapping)) {
> > +		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
> > +			mapping->nrexceptional--;
> 
> So when you punch a hole in a file, you can delete a PMD entry from the radix
> tree that covers a part of the file which still stays. So in this case you
> have to split the PMD entry into PTE entries (probably that needs to happen
> up in truncate_inode_pages_range()) or something similar...

I think (and will verify) that the DAX code just unmaps the entire PMD range
when we receive a hole punch request inside of the PMD.  If this is true then
I think the radix tree code should behave the same way and just remove the PMD
entry in the radix tree.

This will cause new accesses that used to land in the PMD range to get new
page faults.  These faults will call get_blocks(), where presumably the
filesystem will tell us that we don't have a contiguous 2MiB range anymore, so
we will fall back to PTE faults.  These PTEs will fill in both the radix tree
and the page tables.

So, I think the work here is to verify the behavior of DAX wrt hole punches
for PMD ranges, and make the radix tree code match that behavior.  Sound good?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 6/9] dax: add support for fsync/msync
  2016-01-13  9:35       ` Jan Kara
@ 2016-01-13 18:58         ` Ross Zwisler
  2016-01-15 13:10           ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Ross Zwisler @ 2016-01-13 18:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Wed, Jan 13, 2016 at 10:35:25AM +0100, Jan Kara wrote:
> On Wed 13-01-16 00:30:19, Ross Zwisler wrote:
> > > And secondly: You must write-protect all mappings of the flushed range so
> > > that you get a fault when the sector gets written to again. We spoke about
> > > this in the past already but somehow it got lost and I forgot about it as
> > > well. You need something like rmap_walk_file()...
> > 
> > The code that write protected mappings and then cleaned the radix tree entries
> > did get written, and was part of v2:
> > 
> > https://lkml.org/lkml/2015/11/13/759
> > 
> > I removed all the code that cleaned PTE entries and radix tree entries for v3.
> > The reason behind this was that there was a race that I couldn't figure out
> > how to solve between the cleaning of the PTEs and the cleaning of the radix
> > tree entries.
> > 
> > The race goes like this:
> > 
> > Thread 1 (write)			Thread 2 (fsync)
> > ================			================
> > wp_pfn_shared()
> > pfn_mkwrite()
> > dax_radix_entry()
> > radix_tree_tag_set(DIRTY)
> > 					dax_writeback_mapping_range()
> > 					dax_writeback_one()
> > 					radix_tag_clear(DIRTY)
> > 					pgoff_mkclean()
> > ... return up to wp_pfn_shared()
> > wp_page_reuse()
> > pte_mkdirty()
> > 
> > After this sequence we end up with a dirty PTE that is writeable, but with a
> > clean radix tree entry.  This means that users can write to the page, but that
> > a follow-up fsync or msync won't flush this dirty data to media.
> > 
> > The overall issue is that in the write path that goes through wp_pfn_shared(),
> > the DAX code has control over when the radix tree entry is dirtied but not
> > when the PTE is made dirty and writeable.  This happens up in wp_page_reuse().
> > This means that we can't easily add locking, etc. to protect ourselves.
> > 
> > I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
> > easy solutions presented themselves in the absence of a page lock.  I do have
> > one idea, but I think it's pretty invasive and will need to wait for another
> > kernel cycle.
> > 
> > The current code that leaves the radix tree entry will give us correct
> > behavior - it'll just be less efficient because we will have an ever-growing
> > dirty set to flush.
> 
> Ahaa! Somehow I imagined tag_pages_for_writeback() clears DIRTY radix tree
> tags but it does not (I should have known, I wrote that function a few
> years ago ;). Makes sense. Thanks for the clarification.
> 
> > > > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > > >   * dax_pfn_mkwrite - handle first write to DAX page
> > > >   * @vma: The virtual memory area where the fault occurred
> > > >   * @vmf: The description of the fault
> > > > - *
> > > >   */
> > > >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > >  {
> > > > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > > +	struct file *file = vma->vm_file;
> > > >  
> > > > -	sb_start_pagefault(sb);
> > > > -	file_update_time(vma->vm_file);
> > > > -	sb_end_pagefault(sb);
> > > > +	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> > > 
> > > Why is NO_SECTOR argument correct here?
> > 
> > Right - so NO_SECTOR means "I expect there to already be an entry in the radix
> > tree - just make that entry dirty".  This works because pfn_mkwrite() always
> > follows a normal __dax_fault() or __dax_pmd_fault() call.  These fault calls
> > will insert the radix tree entry, regardless of whether the fault was for a
> > read or a write.  If the fault was for a write, the radix tree entry will also
> > be made dirty.
> >
> > For reads the radix tree entry will be inserted but left clean.  When the
> > first write happens we will get a pfn_mkwrite() call, which will call
> > dax_radix_entry() with the NO_SECTOR argument.  This will look up the radix
> > tree entry & set the dirty tag.
> 
> So the explanation of this should be somewhere so that everyone knows that
> we must have radix tree entries even for clean mapped blocks, because up to
> now that was not clear to me.  Also __dax_pmd_fault() seems to insert
> entries only for write faults, so the assumption doesn't seem to hold there?

Ah, right, sorry, the read fault() -> pfn_mkwrite() sequence only happens for
4k pages.  You are right about our handling of 2MiB pages - for a read
followed by a write we will just call into the normal __dax_pmd_fault() code
again, which will do the get_block() call and insert a dirty radix tree entry.
Because we have to go all the way through the fault handler again at write
time there isn't a benefit to inserting a clean radix tree entry on read, so
we just skip it.

> I'm somewhat uneasy that a bug in this logic can be hidden as a simple race
> with hole punching. But I guess I can live with that.
>  
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 6/9] dax: add support for fsync/msync
  2016-01-13 18:58         ` Ross Zwisler
@ 2016-01-15 13:10           ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2016-01-15 13:10 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Wed 13-01-16 11:58:02, Ross Zwisler wrote:
> On Wed, Jan 13, 2016 at 10:35:25AM +0100, Jan Kara wrote:
> > On Wed 13-01-16 00:30:19, Ross Zwisler wrote:
> > > > And secondly: You must write-protect all mappings of the flushed range so
> > > > that you get a fault when the sector gets written to again. We spoke about
> > > > this in the past already but somehow it got lost and I forgot about it as
> > > > well. You need something like rmap_walk_file()...
> > > 
> > > The code that write protected mappings and then cleaned the radix tree entries
> > > did get written, and was part of v2:
> > > 
> > > https://lkml.org/lkml/2015/11/13/759
> > > 
> > > I removed all the code that cleaned PTE entries and radix tree entries for v3.
> > > The reason behind this was that there was a race that I couldn't figure out
> > > how to solve between the cleaning of the PTEs and the cleaning of the radix
> > > tree entries.
> > > 
> > > The race goes like this:
> > > 
> > > Thread 1 (write)			Thread 2 (fsync)
> > > ================			================
> > > wp_pfn_shared()
> > > pfn_mkwrite()
> > > dax_radix_entry()
> > > radix_tree_tag_set(DIRTY)
> > > 					dax_writeback_mapping_range()
> > > 					dax_writeback_one()
> > > 					radix_tag_clear(DIRTY)
> > > 					pgoff_mkclean()
> > > ... return up to wp_pfn_shared()
> > > wp_page_reuse()
> > > pte_mkdirty()
> > > 
> > > After this sequence we end up with a dirty PTE that is writeable, but with a
> > > clean radix tree entry.  This means that users can write to the page, but that
> > > a follow-up fsync or msync won't flush this dirty data to media.
> > > 
> > > The overall issue is that in the write path that goes through wp_pfn_shared(),
> > > the DAX code has control over when the radix tree entry is dirtied but not
> > > when the PTE is made dirty and writeable.  This happens up in wp_page_reuse().
> > > This means that we can't easily add locking, etc. to protect ourselves.
> > > 
> > > I spoke a bit about this with Dave Chinner and with Dave Hansen, but no really
> > > easy solutions presented themselves in the absence of a page lock.  I do have
> > > one idea, but I think it's pretty invasive and will need to wait for another
> > > kernel cycle.
> > > 
> > > The current code that leaves the radix tree entry will give us correct
> > > behavior - it'll just be less efficient because we will have an ever-growing
> > > dirty set to flush.
> > 
> > Ahaa! Somehow I imagined tag_pages_for_writeback() clears DIRTY radix tree
> > tags but it does not (I should have known, I wrote that function a few
> > years ago ;). Makes sense. Thanks for the clarification.
> > 
> > > > > @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> > > > >   * dax_pfn_mkwrite - handle first write to DAX page
> > > > >   * @vma: The virtual memory area where the fault occurred
> > > > >   * @vmf: The description of the fault
> > > > > - *
> > > > >   */
> > > > >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > >  {
> > > > > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > > > > +	struct file *file = vma->vm_file;
> > > > >  
> > > > > -	sb_start_pagefault(sb);
> > > > > -	file_update_time(vma->vm_file);
> > > > > -	sb_end_pagefault(sb);
> > > > > +	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
> > > > 
> > > > Why is NO_SECTOR argument correct here?
> > > 
> > > Right - so NO_SECTOR means "I expect there to already be an entry in the radix
> > > tree - just make that entry dirty".  This works because pfn_mkwrite() always
> > > follows a normal __dax_fault() or __dax_pmd_fault() call.  These fault calls
> > > will insert the radix tree entry, regardless of whether the fault was for a
> > > read or a write.  If the fault was for a write, the radix tree entry will also
> > > be made dirty.
> > >
> > > For reads the radix tree entry will be inserted but left clean.  When the
> > > first write happens we will get a pfn_mkwrite() call, which will call
> > > dax_radix_entry() with the NO_SECTOR argument.  This will look up the radix
> > > tree entry & set the dirty tag.
> > 
> > So the explanation of this should be somewhere so that everyone knows that
> > we must have radix tree entries even for clean mapped blocks, because up to
> > now that was not clear to me.  Also __dax_pmd_fault() seems to insert
> > entries only for write faults, so the assumption doesn't seem to hold there?
> 
> Ah, right, sorry, the read fault() -> pfn_mkwrite() sequence only happens for
> 4k pages.  You are right about our handling of 2MiB pages - for a read
> followed by a write we will just call into the normal __dax_pmd_fault() code
> again, which will do the get_block() call and insert a dirty radix tree entry.
> Because we have to go all the way through the fault handler again at write
> time there isn't a benefit to inserting a clean radix tree entry on read, so
> we just skip it.

Ouch, I wasn't aware of this asymmetry between PMD and PTE faults. OK, so
please just document this all somewhere, because I'm pretty sure a casual
reader won't be able to figure this all out just from the code.
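
One possible shape for that documentation (wording is only a suggestion, not
from the patch) would be a comment above dax_radix_entry():

/*
 * Radix tree entries vs. DAX faults (proposed wording, not from the patch):
 *
 * - __dax_fault() inserts a radix tree entry for PTE faults on both reads
 *   and writes, but only writes tag it dirty.  dax_pfn_mkwrite() relies on
 *   that entry already existing and merely sets the dirty tag, which is why
 *   it can pass NO_SECTOR instead of calling get_block() again.
 *
 * - __dax_pmd_fault() inserts an entry only for write faults.  A read
 *   followed by a write re-enters the PMD fault handler, which calls
 *   get_block() and inserts a dirty entry itself, so a clean PMD entry is
 *   never needed.
 */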

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 4/9] dax: support dirty DAX entries in radix tree
  2016-01-13 18:48     ` Ross Zwisler
@ 2016-01-15 13:22       ` Jan Kara
  2016-01-15 19:03         ` Ross Zwisler
  2016-02-03 16:42         ` Ross Zwisler
  0 siblings, 2 replies; 28+ messages in thread
From: Jan Kara @ 2016-01-15 13:22 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Wed 13-01-16 11:48:32, Ross Zwisler wrote:
> On Wed, Jan 13, 2016 at 10:44:11AM +0100, Jan Kara wrote:
> > On Thu 07-01-16 22:27:54, Ross Zwisler wrote:
> > > Add support for tracking dirty DAX entries in the struct address_space
> > > radix tree.  This tree is already used for dirty page writeback, and it
> > > already supports the use of exceptional (non struct page*) entries.
> > > 
> > > In order to properly track dirty DAX pages we will insert new exceptional
> > > entries into the radix tree that represent dirty DAX PTE or PMD pages.
> > > These exceptional entries will also contain the writeback sectors for the
> > > PTE or PMD faults that we can use at fsync/msync time.
> > > 
> > > There are currently two types of exceptional entries (shmem and shadow)
> > > that can be placed into the radix tree, and this adds a third.  We rely on
> > > the fact that only one type of exceptional entry can be found in a given
> > > radix tree based on its usage.  This happens for free with DAX vs shmem but
> > > we explicitly prevent shadow entries from being added to radix trees for
> > > DAX mappings.
> > > 
> > > The only shadow entries that would be generated for DAX radix trees would
> > > be to track zero page mappings that were created for holes.  These pages
> > > would receive minimal benefit from having shadow entries, and the choice
> > > to have only one type of exceptional entry in a given radix tree makes the
> > > logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > I have realized there's one issue with this code. See below:
> > 
> > > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
> > >  		return;
> > >  
> > >  	spin_lock_irq(&mapping->tree_lock);
> > > -	/*
> > > -	 * Regular page slots are stabilized by the page lock even
> > > -	 * without the tree itself locked.  These unlocked entries
> > > -	 * need verification under the tree lock.
> > > -	 */
> > > -	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> > > -		goto unlock;
> > > -	if (*slot != entry)
> > > -		goto unlock;
> > > -	radix_tree_replace_slot(slot, NULL);
> > > -	mapping->nrshadows--;
> > > -	if (!node)
> > > -		goto unlock;
> > > -	workingset_node_shadows_dec(node);
> > > -	/*
> > > -	 * Don't track node without shadow entries.
> > > -	 *
> > > -	 * Avoid acquiring the list_lru lock if already untracked.
> > > -	 * The list_empty() test is safe as node->private_list is
> > > -	 * protected by mapping->tree_lock.
> > > -	 */
> > > -	if (!workingset_node_shadows(node) &&
> > > -	    !list_empty(&node->private_list))
> > > -		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > > -	__radix_tree_delete_node(&mapping->page_tree, node);
> > > +
> > > +	if (dax_mapping(mapping)) {
> > > +		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
> > > +			mapping->nrexceptional--;
> > 
> > So when you punch a hole in a file, you can delete a PMD entry from the radix
> > tree that covers a part of the file which still stays. So in this case you
> > have to split the PMD entry into PTE entries (probably that needs to happen
> > up in truncate_inode_pages_range()) or something similar...
> 
> I think (and will verify) that the DAX code just unmaps the entire PMD range
> when we receive a hole punch request inside of the PMD.  If this is true then
> I think the radix tree code should behave the same way and just remove the PMD
> entry in the radix tree.

But you cannot just remove it if it is dirty... You have to keep the
information somewhere that part of the PMD range is still dirty (or write
that range out before removing the radix tree entry).
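
As an illustration of the second option, a rough sketch of flushing a dirty
entry before it is dropped on hole punch; where this would live (mm/truncate.c
vs. a helper in fs/dax.c) is left open, and since dax_writeback_one() as
posted only acts on entries tagged TOWRITE, the entry is re-tagged first:

        struct block_device *bdev = mapping->host->i_sb->s_bdev;

        /* promote DIRTY to TOWRITE so dax_writeback_one() will act on it */
        spin_lock_irq(&mapping->tree_lock);
        if (radix_tree_tag_get(&mapping->page_tree, index,
                               PAGECACHE_TAG_DIRTY))
                radix_tree_tag_set(&mapping->page_tree, index,
                                   PAGECACHE_TAG_TOWRITE);
        spin_unlock_irq(&mapping->tree_lock);

        dax_writeback_one(bdev, mapping, index, entry);

        /* only now is it safe to drop the entry */
        spin_lock_irq(&mapping->tree_lock);
        if (radix_tree_delete_item(&mapping->page_tree, index, entry))
                mapping->nrexceptional--;
        spin_unlock_irq(&mapping->tree_lock);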

> This will cause new accesses that used to land in the PMD range to get new
> page faults.  These faults will call get_blocks(), where presumably the
> filesystem will tell us that we don't have a contiguous 2MiB range anymore, so
> we will fall back to PTE faults.  These PTEs will fill in both the radix tree
> and the page tables.
> 
> So, I think the work here is to verify the behavior of DAX wrt hole punches
> for PMD ranges, and make the radix tree code match that behavior.  Sound good?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 4/9] dax: support dirty DAX entries in radix tree
  2016-01-15 13:22       ` Jan Kara
@ 2016-01-15 19:03         ` Ross Zwisler
  2016-02-03 16:42         ` Ross Zwisler
  1 sibling, 0 replies; 28+ messages in thread
From: Ross Zwisler @ 2016-01-15 19:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Fri, Jan 15, 2016 at 02:22:49PM +0100, Jan Kara wrote:
> On Wed 13-01-16 11:48:32, Ross Zwisler wrote:
> > On Wed, Jan 13, 2016 at 10:44:11AM +0100, Jan Kara wrote:
> > > On Thu 07-01-16 22:27:54, Ross Zwisler wrote:
> > > > Add support for tracking dirty DAX entries in the struct address_space
> > > > radix tree.  This tree is already used for dirty page writeback, and it
> > > > already supports the use of exceptional (non struct page*) entries.
> > > > 
> > > > In order to properly track dirty DAX pages we will insert new exceptional
> > > > entries into the radix tree that represent dirty DAX PTE or PMD pages.
> > > > These exceptional entries will also contain the writeback sectors for the
> > > > PTE or PMD faults that we can use at fsync/msync time.
> > > > 
> > > > There are currently two types of exceptional entries (shmem and shadow)
> > > > that can be placed into the radix tree, and this adds a third.  We rely on
> > > > the fact that only one type of exceptional entry can be found in a given
> > > > radix tree based on its usage.  This happens for free with DAX vs shmem but
> > > > we explicitly prevent shadow entries from being added to radix trees for
> > > > DAX mappings.
> > > > 
> > > > The only shadow entries that would be generated for DAX radix trees would
> > > > be to track zero page mappings that were created for holes.  These pages
> > > > would receive minimal benefit from having shadow entries, and the choice
> > > > to have only one type of exceptional entry in a given radix tree makes the
> > > > logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> > > > 
> > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > 
> > > I have realized there's one issue with this code. See below:
> > > 
> > > > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
> > > >  		return;
> > > >  
> > > >  	spin_lock_irq(&mapping->tree_lock);
> > > > -	/*
> > > > -	 * Regular page slots are stabilized by the page lock even
> > > > -	 * without the tree itself locked.  These unlocked entries
> > > > -	 * need verification under the tree lock.
> > > > -	 */
> > > > -	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> > > > -		goto unlock;
> > > > -	if (*slot != entry)
> > > > -		goto unlock;
> > > > -	radix_tree_replace_slot(slot, NULL);
> > > > -	mapping->nrshadows--;
> > > > -	if (!node)
> > > > -		goto unlock;
> > > > -	workingset_node_shadows_dec(node);
> > > > -	/*
> > > > -	 * Don't track node without shadow entries.
> > > > -	 *
> > > > -	 * Avoid acquiring the list_lru lock if already untracked.
> > > > -	 * The list_empty() test is safe as node->private_list is
> > > > -	 * protected by mapping->tree_lock.
> > > > -	 */
> > > > -	if (!workingset_node_shadows(node) &&
> > > > -	    !list_empty(&node->private_list))
> > > > -		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > > > -	__radix_tree_delete_node(&mapping->page_tree, node);
> > > > +
> > > > +	if (dax_mapping(mapping)) {
> > > > +		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
> > > > +			mapping->nrexceptional--;
> > > 
> > > So when you punch a hole in a file, you can delete a PMD entry from the radix
> > > tree that covers a part of the file which still stays. So in this case you
> > > have to split the PMD entry into PTE entries (probably that needs to happen
> > > up in truncate_inode_pages_range()) or something similar...
> > 
> > I think (and will verify) that the DAX code just unmaps the entire PMD range
> > when we receive a hole punch request inside of the PMD.  If this is true then
> > I think the radix tree code should behave the same way and just remove the PMD
> > entry in the radix tree.
> 
> But you cannot just remove it if it is dirty... You have to keep the
> information somewhere that part of the PMD range is still dirty (or write
> that range out before removing the radix tree entry).

Yep, agreed.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 4/9] dax: support dirty DAX entries in radix tree
  2016-01-15 13:22       ` Jan Kara
  2016-01-15 19:03         ` Ross Zwisler
@ 2016-02-03 16:42         ` Ross Zwisler
  1 sibling, 0 replies; 28+ messages in thread
From: Ross Zwisler @ 2016-02-03 16:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Fri, Jan 15, 2016 at 02:22:49PM +0100, Jan Kara wrote:
> On Wed 13-01-16 11:48:32, Ross Zwisler wrote:
> > On Wed, Jan 13, 2016 at 10:44:11AM +0100, Jan Kara wrote:
> > > On Thu 07-01-16 22:27:54, Ross Zwisler wrote:
> > > > Add support for tracking dirty DAX entries in the struct address_space
> > > > radix tree.  This tree is already used for dirty page writeback, and it
> > > > already supports the use of exceptional (non struct page*) entries.
> > > > 
> > > > In order to properly track dirty DAX pages we will insert new exceptional
> > > > entries into the radix tree that represent dirty DAX PTE or PMD pages.
> > > > These exceptional entries will also contain the writeback sectors for the
> > > > PTE or PMD faults that we can use at fsync/msync time.
> > > > 
> > > > There are currently two types of exceptional entries (shmem and shadow)
> > > > that can be placed into the radix tree, and this adds a third.  We rely on
> > > > the fact that only one type of exceptional entry can be found in a given
> > > > radix tree based on its usage.  This happens for free with DAX vs shmem but
> > > > we explicitly prevent shadow entries from being added to radix trees for
> > > > DAX mappings.
> > > > 
> > > > The only shadow entries that would be generated for DAX radix trees would
> > > > be to track zero page mappings that were created for holes.  These pages
> > > > would receive minimal benefit from having shadow entries, and the choice
> > > > to have only one type of exceptional entry in a given radix tree makes the
> > > > logic simpler both in clear_exceptional_entry() and in the rest of DAX.
> > > > 
> > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > 
> > > I have realized there's one issue with this code. See below:
> > > 
> > > > @@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
> > > >  		return;
> > > >  
> > > >  	spin_lock_irq(&mapping->tree_lock);
> > > > -	/*
> > > > -	 * Regular page slots are stabilized by the page lock even
> > > > -	 * without the tree itself locked.  These unlocked entries
> > > > -	 * need verification under the tree lock.
> > > > -	 */
> > > > -	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
> > > > -		goto unlock;
> > > > -	if (*slot != entry)
> > > > -		goto unlock;
> > > > -	radix_tree_replace_slot(slot, NULL);
> > > > -	mapping->nrshadows--;
> > > > -	if (!node)
> > > > -		goto unlock;
> > > > -	workingset_node_shadows_dec(node);
> > > > -	/*
> > > > -	 * Don't track node without shadow entries.
> > > > -	 *
> > > > -	 * Avoid acquiring the list_lru lock if already untracked.
> > > > -	 * The list_empty() test is safe as node->private_list is
> > > > -	 * protected by mapping->tree_lock.
> > > > -	 */
> > > > -	if (!workingset_node_shadows(node) &&
> > > > -	    !list_empty(&node->private_list))
> > > > -		list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > > > -	__radix_tree_delete_node(&mapping->page_tree, node);
> > > > +
> > > > +	if (dax_mapping(mapping)) {
> > > > +		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
> > > > +			mapping->nrexceptional--;
> > > 
> > > So when you punch a hole in a file, you can delete a PMD entry from the radix
> > > tree that covers a part of the file which still stays. So in this case you
> > > have to split the PMD entry into PTE entries (probably that needs to happen
> > > up in truncate_inode_pages_range()) or something similar...
> > 
> > I think (and will verify) that the DAX code just unmaps the entire PMD range
> > when we receive a hole punch request inside of the PMD.  If this is true then
> > I think the radix tree code should behave the same way and just remove the PMD
> > entry in the radix tree.
> 
> But you cannot just remove it if it is dirty... You have to keep somewhere
> information that part of the PMD range is still dirty (or write that range
> out before removing the radix tree entry).

It turns out that hole punching a DAX PMD hits a BUG:

[  247.821632] ------------[ cut here ]------------
[  247.822744] kernel BUG at mm/memory.c:1195!
[  247.823742] invalid opcode: 0000 [#1] SMP 
[  247.824768] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
[  247.826299] CPU: 1 PID: 1544 Comm: test Tainted: G        W       4.4.0-rc8+ #9
[  247.828017] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[  247.830298] task: ffff880036756200 ti: ffff8800a077c000 task.ti: ffff8800a077c000
[  247.831935] RIP: 0010:[<ffffffff81203157>]  [<ffffffff81203157>] unmap_page_range+0x907/0x910
[  247.833847] RSP: 0018:ffff8800a077fb88  EFLAGS: 00010282
[  247.835030] RAX: 0000000000000073 RBX: ffffc00000000fff RCX: 0000000000000000
[  247.836595] RDX: 0000000000000000 RSI: ffff88051a3ce168 RDI: ffff88051a3ce168
[  247.838168] RBP: ffff8800a077fc68 R08: 0000000000000001 R09: 0000000000000001
[  247.839728] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000010405000
[  247.841244] R13: 0000000010403000 R14: ffff8800a077fcb0 R15: 0000000010403000
[  247.842715] FS:  00007f533a5bb700(0000) GS:ffff88051a200000(0000) knlGS:0000000000000000
[  247.844395] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  247.845589] CR2: 0000000010403000 CR3: 0000000514337000 CR4: 00000000000006e0
[  247.847076] Stack:
[  247.847502]  ffff8800a077fcb0 ffff8800a077fc38 ffff880036756200 ffff8800a077fbb0
[  247.849126]  0000000010404fff 0000000010404fff 0000000010404fff ffff880514337000
[  247.850714]  ffff880513499000 ffff8800a077fcc0 0000000010405000 0000000000000000
[  247.852246] Call Trace:
[  247.852740]  [<ffffffff810fbb4d>] ? trace_hardirqs_on+0xd/0x10
[  247.853883]  [<ffffffff812031dd>] unmap_single_vma+0x7d/0xe0
[  247.855004]  [<ffffffff812032ed>] zap_page_range_single+0xad/0xf0
[  247.856195]  [<ffffffff81203410>] ? unmap_mapping_range+0xa0/0x190
[  247.857403]  [<ffffffff812034d6>] unmap_mapping_range+0x166/0x190
[  247.858596]  [<ffffffff811e0948>] truncate_pagecache_range+0x48/0x60
[  247.859839]  [<ffffffff8130b7ba>] ext4_punch_hole+0x33a/0x4b0
[  247.860837]  [<ffffffff8133a274>] ext4_fallocate+0x144/0x890
[  247.861784]  [<ffffffff810f6cd7>] ? update_fast_ctr+0x17/0x30
[  247.862751]  [<ffffffff810f6d69>] ? percpu_down_read+0x49/0x90
[  247.863731]  [<ffffffff812640c4>] ? __sb_start_write+0xb4/0xf0
[  247.864709]  [<ffffffff8125d900>] vfs_fallocate+0x140/0x220
[  247.865645]  [<ffffffff8125e7d4>] SyS_fallocate+0x44/0x70
[  247.866553]  [<ffffffff81a68ef2>] entry_SYSCALL_64_fastpath+0x12/0x76
[  247.867632] Code: 12 fe ff ff 0f 0b 48 8b 45 98 48 8b 4d 90 4c 89 fa 48 c7 c6 78 30 c2 81 48 c7 c7 50 6b f0 81 4c 8b 48 08 4c 8b 00 e8 06 4a fc ff <0f> 0b e8 d2 2f ea ff 66 90 66 66 66 66 90 48 8b 06 4c 8b 4e 08 
[  247.871862] RIP  [<ffffffff81203157>] unmap_page_range+0x907/0x910
[  247.872843]  RSP <ffff8800a077fb88>
[  247.873435] ---[ end trace 75145e78670ba43d ]---

This happens with XFS as well.  I'm not sure that this path has ever been run,
so essentially the next step is "add PMD hole punch support to DAX, including
fsync/msync support".
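
A minimal sketch of the sort of test that triggers this (untested as
written; the /mnt/dax path is made up, and actually getting a PMD fault
depends on 2 MiB alignment of both the mapping and the file's block
allocation):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>
	#include <linux/falloc.h>

	#define MiB(x)	((x) * 1024UL * 1024UL)

	int main(void)
	{
		char *p;
		int fd;

		fd = open("/mnt/dax/test", O_CREAT | O_RDWR | O_TRUNC, 0644);
		if (fd < 0)
			return 1;
		if (ftruncate(fd, MiB(4)))
			return 1;

		p = mmap(NULL, MiB(4), PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, 0);
		if (p == MAP_FAILED)
			return 1;

		/* fault the whole range in with writes so DAX can install PMDs */
		memset(p, 0xab, MiB(4));

		/* punch a hole that lands inside the PMD-mapped range */
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      MiB(1), MiB(1)))
			perror("fallocate");

		munmap(p, MiB(4));
		close(fd);
		return 0;
	}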

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 6/9] dax: add support for fsync/msync
  2016-01-08  5:27 ` [PATCH v8 6/9] dax: add support for fsync/msync Ross Zwisler
  2016-01-12 10:57   ` Jan Kara
@ 2016-02-06 14:33   ` Dmitry Monakhov
  2016-02-08  9:44     ` Jan Kara
  2016-02-08 22:06     ` Ross Zwisler
  1 sibling, 2 replies; 28+ messages in thread
From: Dmitry Monakhov @ 2016-02-06 14:33 UTC (permalink / raw)
  To: Ross Zwisler, linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

Ross Zwisler <ross.zwisler@linux.intel.com> writes:

> To properly handle fsync/msync in an efficient way DAX needs to track dirty
> pages so it is able to flush them durably to media on demand.
Please see my comments below.
>
> The tracking of dirty pages is done via the radix tree in struct
> address_space.  This radix tree is already used by the page writeback
> infrastructure for tracking dirty pages associated with an open file, and
> it already has support for exceptional (non struct page*) entries.  We
> build upon these features to add exceptional entries to the radix tree for
> DAX dirty PMD or PTE pages at fault time.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/dax.c            | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  include/linux/dax.h |   2 +
>  mm/filemap.c        |   6 ++
>  3 files changed, 196 insertions(+), 6 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 5b84a46..0db21ea 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -24,6 +24,7 @@
>  #include <linux/memcontrol.h>
>  #include <linux/mm.h>
>  #include <linux/mutex.h>
> +#include <linux/pagevec.h>
>  #include <linux/pmem.h>
>  #include <linux/sched.h>
>  #include <linux/uio.h>
> @@ -324,6 +325,174 @@ static int copy_user_bh(struct page *to, struct inode *inode,
>  	return 0;
>  }
>  
> +#define NO_SECTOR -1
> +
> +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
IMHO it would be saner to call this function dax_radix_entry_insert().
> +		sector_t sector, bool pmd_entry, bool dirty)
> +{
> +	struct radix_tree_root *page_tree = &mapping->page_tree;
> +	int type, error = 0;
> +	void *entry;
> +
> +	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	entry = radix_tree_lookup(page_tree, index);
> +
> +	if (entry) {
> +		type = RADIX_DAX_TYPE(entry);
> +		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> +					type != RADIX_DAX_PMD)) {
> +			error = -EIO;
> +			goto unlock;
> +		}
> +
> +		if (!pmd_entry || type == RADIX_DAX_PMD)
> +			goto dirty;
> +		radix_tree_delete(&mapping->page_tree, index);
> +		mapping->nrexceptional--;
> +	}
> +
> +	if (sector == NO_SECTOR) {
> +		/*
> +		 * This can happen during correct operation if our pfn_mkwrite
> +		 * fault raced against a hole punch operation.  If this
> +		 * happens the pte that was hole punched will have been
> +		 * unmapped and the radix tree entry will have been removed by
> +		 * the time we are called, but the call will still happen.  We
> +		 * will return all the way up to wp_pfn_shared(), where the
> +		 * pte_same() check will fail, eventually causing page fault
> +		 * to be retried by the CPU.
> +		 */
> +		goto unlock;
> +	}
> +
> +	error = radix_tree_insert(page_tree, index,
> +			RADIX_DAX_ENTRY(sector, pmd_entry));
> +	if (error)
> +		goto unlock;
> +
> +	mapping->nrexceptional++;
> + dirty:
> +	if (dirty)
> +		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
> + unlock:
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return error;
> +}
> +
> +static int dax_writeback_one(struct block_device *bdev,
> +		struct address_space *mapping, pgoff_t index, void *entry)
> +{
> +	struct radix_tree_root *page_tree = &mapping->page_tree;
> +	int type = RADIX_DAX_TYPE(entry);
> +	struct radix_tree_node *node;
> +	struct blk_dax_ctl dax;
> +	void **slot;
> +	int ret = 0;
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	/*
> +	 * Regular page slots are stabilized by the page lock even
> +	 * without the tree itself locked.  These unlocked entries
> +	 * need verification under the tree lock.
> +	 */
> +	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
> +		goto unlock;
> +	if (*slot != entry)
> +		goto unlock;
> +
> +	/* another fsync thread may have already written back this entry */
> +	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
> +		goto unlock;
> +
> +	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
> +
> +	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
> +		ret = -EIO;
> +		goto unlock;
> +	}
> +
> +	dax.sector = RADIX_DAX_SECTOR(entry);
> +	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	/*
> +	 * We cannot hold tree_lock while calling dax_map_atomic() because it
> +	 * eventually calls cond_resched().
> +	 */
> +	ret = dax_map_atomic(bdev, &dax);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (WARN_ON_ONCE(ret < dax.size)) {
> +		ret = -EIO;
> +		goto unmap;
> +	}
> +
> +	wb_cache_pmem(dax.addr, dax.size);
> + unmap:
> +	dax_unmap_atomic(bdev, &dax);
> +	return ret;
> +
> + unlock:
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return ret;
> +}
> +
> +/*
> + * Flush the mapping to the persistent domain within the byte range of [start,
> + * end]. This is required by data integrity operations to ensure file data is
> + * on persistent storage prior to completion of the operation.
> + */
> +int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> +		loff_t end)
> +{
> +	struct inode *inode = mapping->host;
> +	struct block_device *bdev = inode->i_sb->s_bdev;
> +	pgoff_t indices[PAGEVEC_SIZE];
> +	pgoff_t start_page, end_page;
> +	struct pagevec pvec;
> +	void *entry;
> +	int i, ret = 0;
> +
> +	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
> +		return -EIO;
> +
> +	rcu_read_lock();
> +	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
> +	rcu_read_unlock();
> +
> +	/* see if the start of our range is covered by a PMD entry */
> +	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> +		start &= PMD_MASK;
> +
> +	start_page = start >> PAGE_CACHE_SHIFT;
> +	end_page = end >> PAGE_CACHE_SHIFT;
> +
> +	tag_pages_for_writeback(mapping, start_page, end_page);
> +
> +	pagevec_init(&pvec, 0);
> +	while (1) {
> +		pvec.nr = find_get_entries_tag(mapping, start_page,
> +				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
> +				pvec.pages, indices);
> +
> +		if (pvec.nr == 0)
> +			break;
> +
> +		for (i = 0; i < pvec.nr; i++) {
> +			ret = dax_writeback_one(bdev, mapping, indices[i],
> +					pvec.pages[i]);
> +			if (ret < 0)
> +				return ret;
> +		}
I think it would be more efficient to use batched locking, as follows:
                spin_lock_irq(&mapping->tree_lock);
		for (i = 0; i < pvec.nr; i++) {
                    struct blk_dax_ctl dax[PAGEVEC_SIZE];                
                    radix_tree_tag_clear(page_tree, indices[i], PAGECACHE_TAG_TOWRITE);
                    /* It is also reasonable to merge adjacent dax
                     * regions in to one */
                    dax[i].sector = RADIX_DAX_SECTOR(entry);
                    dax[i].size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);                    

                }
                spin_unlock_irq(&mapping->tree_lock);
               	if (blk_queue_enter(q, true) != 0)
                    goto error;
                for (i = 0; i < pvec.nr; i++) {
                    rc = bdev_direct_access(bdev, dax[i]);
                    wb_cache_pmem(dax[i].addr, dax[i].size);
                }
                ret = blk_queue_exit(q, true)
> +	}
> +	wmb_pmem();
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
> +
>  static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  			struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> @@ -363,6 +532,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	}
>  	dax_unmap_atomic(bdev, &dax);
>  
> +	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> +			vmf->flags & FAULT_FLAG_WRITE);
> +	if (error)
> +		goto out;
> +
>  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
>  
>   out:
> @@ -487,6 +661,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
>  		delete_from_page_cache(page);
>  		unlock_page(page);
>  		page_cache_release(page);
> +		page = NULL;
>  	}
I've realized that I do not understand why the dax_fault code works at all.
During dax_fault we want to remove the page from the mapping and insert a
DAX entry.  Basically the code looks as follows:
0 page = find_get_page()
1 lock_page(page)
2 delete_from_page_cache(page);
3 unlock_page(page);
4 dax_insert_mapping(inode, &bh, vma, vmf);

BUT what on earth protects us from another process reinserting the page
after step (2) but before step (4)?
Imagine we write to a file hole, which results in dax_fault(write), but
another task also takes a read fault and reinserts the deleted page via
dax_load_hole().  As a result dax_radix_entry() will fail with -EIO.
The testcase looks very trivial, but I cannot reproduce this.
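
Something along these lines is what I have in mind (sketch only, built with
-pthread; the /mnt/dax path is arbitrary, and the window between steps (2)
and (4) is tiny, which probably explains why it never triggers for me):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <pthread.h>
	#include <sys/mman.h>
	#include <unistd.h>
	#include <linux/falloc.h>

	#define SZ	4096UL

	static void *reader(void *arg)
	{
		volatile char *c = arg;

		/* read faults on the hole load a zero page and re-insert
		 * a page cache entry into the radix tree */
		for (;;)
			(void)c[0];
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		char *p;
		int fd;

		fd = open("/mnt/dax/test", O_CREAT | O_RDWR | O_TRUNC, 0644);
		if (fd < 0)
			return 1;
		if (ftruncate(fd, SZ))
			return 1;

		p = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		pthread_create(&t, NULL, reader, p);

		for (;;) {
			/* re-create the hole, then take a write fault on it,
			 * racing against the read faults in the other thread */
			fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				  0, SZ);
			p[0] = 1;
		}
		return 0;
	}
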
>  
>  	/*
> @@ -591,7 +766,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	pgoff_t size, pgoff;
>  	loff_t lstart, lend;
>  	sector_t block;
> -	int result = 0;
> +	int error, result = 0;
>  
>  	/* dax pmd mappings require pfn_t_devmap() */
>  	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
> @@ -733,6 +908,16 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		}
>  		dax_unmap_atomic(bdev, &dax);
>  
> +		if (write) {
> +			error = dax_radix_entry(mapping, pgoff, dax.sector,
> +					true, true);
> +			if (error) {
> +				dax_pmd_dbg(&bh, address,
> +						"PMD radix insertion failed");
> +				goto fallback;
> +			}
> +		}
> +
>  		dev_dbg(part_to_dev(bdev->bd_part),
>  				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
>  				__func__, current->comm, address,
> @@ -791,15 +976,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
>   * dax_pfn_mkwrite - handle first write to DAX page
>   * @vma: The virtual memory area where the fault occurred
>   * @vmf: The description of the fault
> - *
>   */
>  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> +	struct file *file = vma->vm_file;
>  
> -	sb_start_pagefault(sb);
> -	file_update_time(vma->vm_file);
> -	sb_end_pagefault(sb);
> +	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
>  	return VM_FAULT_NOPAGE;
>  }
>  EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index e9d57f68..8204c3d 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -41,4 +41,6 @@ static inline bool dax_mapping(struct address_space *mapping)
>  {
>  	return mapping->host && IS_DAX(mapping->host);
>  }
> +int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> +		loff_t end);
>  #endif
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1e215fc..2e7c8d9 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -482,6 +482,12 @@ int filemap_write_and_wait_range(struct address_space *mapping,
>  {
>  	int err = 0;
>  
> +	if (dax_mapping(mapping) && mapping->nrexceptional) {
> +		err = dax_writeback_mapping_range(mapping, lstart, lend);
> +		if (err)
> +			return err;
> +	}
> +
>  	if (mapping->nrpages) {
>  		err = __filemap_fdatawrite_range(mapping, lstart, lend,
>  						 WB_SYNC_ALL);
> -- 
> 2.5.0
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 6/9] dax: add support for fsync/msync
  2016-02-06 14:33   ` Dmitry Monakhov
@ 2016-02-08  9:44     ` Jan Kara
  2016-02-08 22:06     ` Ross Zwisler
  1 sibling, 0 replies; 28+ messages in thread
From: Jan Kara @ 2016-02-08  9:44 UTC (permalink / raw)
  To: Dmitry Monakhov
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Sat 06-02-16 17:33:07, Dmitry Monakhov wrote:
> > +int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> > +		loff_t end)
> > +{
> > +	struct inode *inode = mapping->host;
> > +	struct block_device *bdev = inode->i_sb->s_bdev;
> > +	pgoff_t indices[PAGEVEC_SIZE];
> > +	pgoff_t start_page, end_page;
> > +	struct pagevec pvec;
> > +	void *entry;
> > +	int i, ret = 0;
> > +
> > +	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
> > +		return -EIO;
> > +
> > +	rcu_read_lock();
> > +	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
> > +	rcu_read_unlock();
> > +
> > +	/* see if the start of our range is covered by a PMD entry */
> > +	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +		start &= PMD_MASK;
> > +
> > +	start_page = start >> PAGE_CACHE_SHIFT;
> > +	end_page = end >> PAGE_CACHE_SHIFT;
> > +
> > +	tag_pages_for_writeback(mapping, start_page, end_page);
> > +
> > +	pagevec_init(&pvec, 0);
> > +	while (1) {
> > +		pvec.nr = find_get_entries_tag(mapping, start_page,
> > +				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
> > +				pvec.pages, indices);
> > +
> > +		if (pvec.nr == 0)
> > +			break;
> > +
> > +		for (i = 0; i < pvec.nr; i++) {
> > +			ret = dax_writeback_one(bdev, mapping, indices[i],
> > +					pvec.pages[i]);
> > +			if (ret < 0)
> > +				return ret;
> > +		}
> I think it would be more efficient to use batched locking like follows:
>                 spin_lock_irq(&mapping->tree_lock);
> 		for (i = 0; i < pvec.nr; i++) {
>                     struct blk_dax_ctl dax[PAGEVEC_SIZE];                
>                     radix_tree_tag_clear(page_tree, indices[i], PAGECACHE_TAG_TOWRITE);
>                     /* It is also reasonable to merge adjacent dax
>                      * regions in to one */
>                     dax[i].sector = RADIX_DAX_SECTOR(entry);
>                     dax[i].size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);                    
> 
>                 }
>                 spin_unlock_irq(&mapping->tree_lock);
>                	if (blk_queue_enter(q, true) != 0)
>                     goto error;
>                 for (i = 0; i < pvec.nr; i++) {
>                     rc = bdev_direct_access(bdev, dax[i]);
>                     wb_cache_pmem(dax[i].addr, dax[i].size);
>                 }
>                 ret = blk_queue_exit(q, true)

We need to clear the radix tree tag only after flushing caches. But in
principle I agree that some batching of the radix tree tag manipulation
should be doable. Frankly, though, so far we have issues with correctness,
so speed is not our main concern.

> > +	}
> > +	wmb_pmem();
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
> > +
> >  static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  			struct vm_area_struct *vma, struct vm_fault *vmf)
> >  {
> > @@ -363,6 +532,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> >  	}
> >  	dax_unmap_atomic(bdev, &dax);
> >  
> > +	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> > +			vmf->flags & FAULT_FLAG_WRITE);
> > +	if (error)
> > +		goto out;
> > +
> >  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> >  
> >   out:
> > @@ -487,6 +661,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> >  		delete_from_page_cache(page);
> >  		unlock_page(page);
> >  		page_cache_release(page);
> > +		page = NULL;
> >  	}
> I've realized that I do not understand why the dax_fault code works at all.
> During dax_fault we want to remove the page from the mapping and insert a
> DAX entry.  Basically the code looks as follows:
> 0 page = find_get_page()
> 1 lock_page(page)
> 2 delete_from_page_cache(page);
> 3 unlock_page(page);
> 4 dax_insert_mapping(inode, &bh, vma, vmf);
> 
> BUT what on earth protects us from another process reinserting the page
> after step (2) but before step (4)?

Nothing, it's a bug and Ross / Matthew are working on fixing it...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v8 6/9] dax: add support for fsync/msync
  2016-02-06 14:33   ` Dmitry Monakhov
  2016-02-08  9:44     ` Jan Kara
@ 2016-02-08 22:06     ` Ross Zwisler
  1 sibling, 0 replies; 28+ messages in thread
From: Ross Zwisler @ 2016-02-08 22:06 UTC (permalink / raw)
  To: Dmitry Monakhov
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Sat, Feb 06, 2016 at 05:33:07PM +0300, Dmitry Monakhov wrote:
> Ross Zwisler <ross.zwisler@linux.intel.com> writes:
<>
> > +static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
> IMHO it would be sane to call that function as dax_radix_entry_insert() 

I think I may have actually had it named that at some point. :)  I changed it
because it doesn't always insert an entry - in the read case for example we
insert a clean entry, and then on the following dax_pfn_mkwrite() we call back
in and mark it as dirty.

<>
> > +/*
> > + * Flush the mapping to the persistent domain within the byte range of [start,
> > + * end]. This is required by data integrity operations to ensure file data is
> > + * on persistent storage prior to completion of the operation.
> > + */
> > +int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
> > +		loff_t end)
> > +{
> > +	struct inode *inode = mapping->host;
> > +	struct block_device *bdev = inode->i_sb->s_bdev;
> > +	pgoff_t indices[PAGEVEC_SIZE];
> > +	pgoff_t start_page, end_page;
> > +	struct pagevec pvec;
> > +	void *entry;
> > +	int i, ret = 0;
> > +
> > +	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
> > +		return -EIO;
> > +
> > +	rcu_read_lock();
> > +	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
> > +	rcu_read_unlock();
> > +
> > +	/* see if the start of our range is covered by a PMD entry */
> > +	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> > +		start &= PMD_MASK;
> > +
> > +	start_page = start >> PAGE_CACHE_SHIFT;
> > +	end_page = end >> PAGE_CACHE_SHIFT;
> > +
> > +	tag_pages_for_writeback(mapping, start_page, end_page);
> > +
> > +	pagevec_init(&pvec, 0);
> > +	while (1) {
> > +		pvec.nr = find_get_entries_tag(mapping, start_page,
> > +				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
> > +				pvec.pages, indices);
> > +
> > +		if (pvec.nr == 0)
> > +			break;
> > +
> > +		for (i = 0; i < pvec.nr; i++) {
> > +			ret = dax_writeback_one(bdev, mapping, indices[i],
> > +					pvec.pages[i]);
> > +			if (ret < 0)
> > +				return ret;
> > +		}
> I think it would be more efficient to use batched locking like follows:
>                 spin_lock_irq(&mapping->tree_lock);
> 		for (i = 0; i < pvec.nr; i++) {
>                     struct blk_dax_ctl dax[PAGEVEC_SIZE];                
>                     radix_tree_tag_clear(page_tree, indices[i], PAGECACHE_TAG_TOWRITE);
>                     /* It is also reasonable to merge adjacent dax
>                      * regions in to one */
>                     dax[i].sector = RADIX_DAX_SECTOR(entry);
>                     dax[i].size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);                    
> 
>                 }
>                 spin_unlock_irq(&mapping->tree_lock);
>                	if (blk_queue_enter(q, true) != 0)
>                     goto error;
>                 for (i = 0; i < pvec.nr; i++) {
>                     rc = bdev_direct_access(bdev, dax[i]);
>                     wb_cache_pmem(dax[i].addr, dax[i].size);
>                 }
>                 ret = blk_queue_exit(q, true)

I guess this could be more efficient, but as Jan said in his response we're
currently focused on correctness.  I also wonder whether it would be
measurably better?

In any case, Jan is right - you have to clear the TOWRITE tag only after
you've flushed, and you also need to include the entry verification code from
dax_writeback_one() after you grab the tree lock.  Basically, I believe all
the code in dax_writeback_one() is needed - this change would essentially
just inline that code into dax_writeback_mapping_range() so that you could
do multiple operations without giving up the tree lock.
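
Roughly, I'd expect the batched version to end up looking something like
this (completely untested sketch that would replace the inner loop over
pvec.nr; flush_index[] and the on-stack dax[] array are made up for the
example, and the type/size sanity checks and error handling from
dax_writeback_one() are omitted for brevity):

		struct blk_dax_ctl dax[PAGEVEC_SIZE];
		pgoff_t flush_index[PAGEVEC_SIZE];
		int nr_flush = 0;

		spin_lock_irq(&mapping->tree_lock);
		for (i = 0; i < pvec.nr; i++) {
			void *entry, **slot;

			/* revalidate the unlocked lookup under tree_lock */
			if (!__radix_tree_lookup(&mapping->page_tree,
						indices[i], NULL, &slot))
				continue;
			entry = *slot;
			if (entry != (void *)pvec.pages[i])
				continue;
			if (!radix_tree_tag_get(&mapping->page_tree, indices[i],
						PAGECACHE_TAG_TOWRITE))
				continue;

			dax[nr_flush].sector = RADIX_DAX_SECTOR(entry);
			dax[nr_flush].size =
				(RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD ?
					PMD_SIZE : PAGE_SIZE);
			flush_index[nr_flush++] = indices[i];
		}
		spin_unlock_irq(&mapping->tree_lock);

		/* flush outside the lock, one entry at a time */
		for (i = 0; i < nr_flush; i++) {
			ret = dax_map_atomic(bdev, &dax[i]);
			if (ret < 0)
				return ret;
			wb_cache_pmem(dax[i].addr, dax[i].size);
			dax_unmap_atomic(bdev, &dax[i]);
		}

		/* only now is it safe to drop the TOWRITE tags */
		spin_lock_irq(&mapping->tree_lock);
		for (i = 0; i < nr_flush; i++)
			radix_tree_tag_clear(&mapping->page_tree,
					flush_index[i], PAGECACHE_TAG_TOWRITE);
		spin_unlock_irq(&mapping->tree_lock);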

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2016-02-08 22:07 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-08  5:27 [PATCH v8 0/9] DAX fsync/msync support Ross Zwisler
2016-01-08  5:27 ` [PATCH v8 1/9] dax: fix NULL pointer dereference in __dax_dbg() Ross Zwisler
2016-01-12  9:34   ` Jan Kara
2016-01-13  7:08     ` Ross Zwisler
2016-01-13  9:07       ` Jan Kara
2016-01-08  5:27 ` [PATCH v8 2/9] dax: fix conversion of holes to PMDs Ross Zwisler
2016-01-12  9:44   ` Jan Kara
2016-01-13  7:37     ` Ross Zwisler
2016-01-08  5:27 ` [PATCH v8 3/9] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
2016-01-08  5:27 ` [PATCH v8 4/9] dax: support dirty DAX entries in radix tree Ross Zwisler
2016-01-13  9:44   ` Jan Kara
2016-01-13 18:48     ` Ross Zwisler
2016-01-15 13:22       ` Jan Kara
2016-01-15 19:03         ` Ross Zwisler
2016-02-03 16:42         ` Ross Zwisler
2016-01-08  5:27 ` [PATCH v8 5/9] mm: add find_get_entries_tag() Ross Zwisler
2016-01-08  5:27 ` [PATCH v8 6/9] dax: add support for fsync/msync Ross Zwisler
2016-01-12 10:57   ` Jan Kara
2016-01-13  7:30     ` Ross Zwisler
2016-01-13  9:35       ` Jan Kara
2016-01-13 18:58         ` Ross Zwisler
2016-01-15 13:10           ` Jan Kara
2016-02-06 14:33   ` Dmitry Monakhov
2016-02-08  9:44     ` Jan Kara
2016-02-08 22:06     ` Ross Zwisler
2016-01-08  5:27 ` [PATCH v8 7/9] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Ross Zwisler
2016-01-08  5:27 ` [PATCH v8 8/9] ext4: " Ross Zwisler
2016-01-08  5:27 ` [PATCH v8 9/9] xfs: " Ross Zwisler
