* [PATCH v7 0/9] DAX fsync/msync support
@ 2016-01-06 18:00 Ross Zwisler
  2016-01-06 18:00 ` [PATCH v7 1/9] dax: fix NULL pointer dereference in __dax_dbg() Ross Zwisler
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Ross Zwisler @ 2016-01-06 18:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

Changes since v6 [1]:

1) Fixed an existing NULL pointer dereference bug in __dax_dbg() in patch 1.

2) Fixed an existing bug with the way holes are converted into DAX PMD
entries in patch 2.  This solves a BUG_ON reported by Dan Williams.

3) Removed second verification of our radix tree entry before cache flush
in dax_writeback_one(). (Jan Kara)

4) Updated to the new argument list types for dax_pmd_dbg(). (Dan Williams)

5) Fixed the text of a random debug message so that it accurately reflects
the error being found.

This series replaces v6 in the MM tree and in the "akpm" branch of the next
tree.  A working tree can be found here:

https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_v7

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-December/003663.html

Ross Zwisler (9):
  dax: fix NULL pointer dereference in __dax_dbg()
  dax: fix conversion of holes to PMDs
  pmem: add wb_cache_pmem() to the PMEM API
  dax: support dirty DAX entries in radix tree
  mm: add find_get_entries_tag()
  dax: add support for fsync/msync
  ext2: call dax_pfn_mkwrite() for DAX fsync/msync
  ext4: call dax_pfn_mkwrite() for DAX fsync/msync
  xfs: call dax_pfn_mkwrite() for DAX fsync/msync

 arch/x86/include/asm/pmem.h |  11 +--
 fs/block_dev.c              |   2 +-
 fs/dax.c                    | 214 ++++++++++++++++++++++++++++++++++++++++----
 fs/ext2/file.c              |   4 +-
 fs/ext4/file.c              |   4 +-
 fs/inode.c                  |   2 +-
 fs/xfs/xfs_file.c           |   7 +-
 include/linux/dax.h         |   7 ++
 include/linux/fs.h          |   3 +-
 include/linux/pagemap.h     |   3 +
 include/linux/pmem.h        |  22 ++++-
 include/linux/radix-tree.h  |   9 ++
 mm/filemap.c                |  91 +++++++++++++++++--
 mm/truncate.c               |  69 +++++++-------
 mm/vmscan.c                 |   9 +-
 mm/workingset.c             |   4 +-
 16 files changed, 391 insertions(+), 70 deletions(-)

-- 
2.5.0



* [PATCH v7 1/9] dax: fix NULL pointer dereference in __dax_dbg()
  2016-01-06 18:00 [PATCH v7 0/9] DAX fsync/msync support Ross Zwisler
@ 2016-01-06 18:00 ` Ross Zwisler
  2016-01-06 19:14   ` Dan Williams
  2016-01-06 18:00 ` [PATCH v7 2/9] dax: fix conversion of holes to PMDs Ross Zwisler
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Ross Zwisler @ 2016-01-06 18:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

__dax_dbg() currently assumes that bh->b_bdev is non-NULL, passing it into
bdevname() where it is dereferenced.  This assumption isn't always true -
when called for reads of holes, ext4_dax_mmap_get_block() returns a buffer
head where bh->b_bdev is never set.  I hit this BUG while testing the DAX
PMD fault path.

Instead, verify that we have a valid bh->b_bdev, else just say "unknown"
for the block device.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7af8797..03cc4a3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -563,7 +563,12 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
 {
 	if (bh) {
 		char bname[BDEVNAME_SIZE];
-		bdevname(bh->b_bdev, bname);
+
+		if (bh->b_bdev)
+			bdevname(bh->b_bdev, bname);
+		else
+			snprintf(bname, BDEVNAME_SIZE, "unknown");
+
 		pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
 			"length %zd fallback: %s\n", fn, current->comm,
 			address, bname, bh->b_state, (u64)bh->b_blocknr,
-- 
2.5.0



* [PATCH v7 2/9] dax: fix conversion of holes to PMDs
  2016-01-06 18:00 [PATCH v7 0/9] DAX fsync/msync support Ross Zwisler
  2016-01-06 18:00 ` [PATCH v7 1/9] dax: fix NULL pointer dereference in __dax_dbg() Ross Zwisler
@ 2016-01-06 18:00 ` Ross Zwisler
  2016-01-06 19:04   ` Dan Williams
  2016-01-07 13:22   ` Jan Kara
  2016-01-06 18:00 ` [PATCH v7 3/9] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 22+ messages in thread
From: Ross Zwisler @ 2016-01-06 18:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

When we get a DAX PMD fault for a write it is possible that there could be
some number of 4k zero pages already present for the same range that were
inserted to service reads from a hole.  These 4k zero pages need to be
unmapped from the VMAs and removed from the struct address_space radix tree
before the real DAX PMD entry can be inserted.

For PTE faults this same use case also exists and is handled by a
combination of unmap_mapping_range() to unmap the VMAs and
delete_from_page_cache() to remove the page from the address_space radix
tree.

For PMD faults we do have a call to unmap_mapping_range() (protected by a
buffer_new() check), but nothing clears out the radix tree entry.  The
buffer_new() check is also incorrect as the current ext4 and XFS filesystem
code will never return a buffer_head with BH_New set, even when allocating
new blocks over a hole.  Instead the filesystem will zero the blocks
manually and return a buffer_head with only BH_Mapped set.

Fix this situation by removing the buffer_new() check and adding a call to
truncate_inode_pages_range() to clear out the radix tree entries before we
insert the DAX PMD.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 03cc4a3..9dc0c97 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -594,6 +594,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	bool write = flags & FAULT_FLAG_WRITE;
 	struct block_device *bdev;
 	pgoff_t size, pgoff;
+	loff_t lstart, lend;
 	sector_t block;
 	int result = 0;
 
@@ -647,15 +648,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		goto fallback;
 	}
 
-	/*
-	 * If we allocated new storage, make sure no process has any
-	 * zero pages covering this hole
-	 */
-	if (buffer_new(&bh)) {
-		i_mmap_unlock_read(mapping);
-		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
-		i_mmap_lock_read(mapping);
-	}
+	/* make sure no process has any zero pages covering this hole */
+	lstart = pgoff << PAGE_SHIFT;
+	lend = lstart + PMD_SIZE - 1; /* inclusive */
+	i_mmap_unlock_read(mapping);
+	unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
+	truncate_inode_pages_range(mapping, lstart, lend);
+	i_mmap_lock_read(mapping);
 
 	/*
 	 * If a truncate happened while we were allocating blocks, we may
@@ -669,7 +668,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		goto out;
 	}
 	if ((pgoff | PG_PMD_COLOUR) >= size) {
-		dax_pmd_dbg(&bh, address, "pgoff unaligned");
+		dax_pmd_dbg(&bh, address,
+				"offset + huge page size > file size");
 		goto fallback;
 	}
 
-- 
2.5.0



* [PATCH v7 3/9] pmem: add wb_cache_pmem() to the PMEM API
  2016-01-06 18:00 [PATCH v7 0/9] DAX fsync/msync support Ross Zwisler
  2016-01-06 18:00 ` [PATCH v7 1/9] dax: fix NULL pointer dereference in __dax_dbg() Ross Zwisler
  2016-01-06 18:00 ` [PATCH v7 2/9] dax: fix conversion of holes to PMDs Ross Zwisler
@ 2016-01-06 18:00 ` Ross Zwisler
  2016-01-06 18:00 ` [PATCH v7 4/9] dax: support dirty DAX entries in radix tree Ross Zwisler
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2016-01-06 18:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

The function __arch_wb_cache_pmem() was already an internal implementation
detail of the x86 PMEM API, but this functionality needs to be exported as
part of the general PMEM API to handle the fsync/msync case for DAX mmaps.

One thing worth noting is that we really do want this to be part of the
PMEM API as opposed to a stand-alone function like clflush_cache_range()
because of ordering restrictions.  By having wb_cache_pmem() as part of the
PMEM API we can leave it unordered, call it multiple times to write back
large amounts of memory, and then order the multiple calls with a single
wmb_pmem().
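
As an aside, the intended calling pattern looks something like this (sketch
only, not code from this series):

#include <linux/pmem.h>

/*
 * Illustration: write back several scattered ranges, then make them
 * durable with a single ordering point.  The individual wb_cache_pmem()
 * calls are unordered with respect to each other; the final wmb_pmem()
 * orders all of them at once.
 */
static void flush_chunks_example(void __pmem *base, size_t chunk, int nr)
{
	int i;

	for (i = 0; i < nr; i++)
		wb_cache_pmem(base + i * chunk, chunk);

	wmb_pmem();
}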

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 arch/x86/include/asm/pmem.h | 11 ++++++-----
 include/linux/pmem.h        | 22 +++++++++++++++++++++-
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index 1544fab..c57fd1e 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void)
 }
 
 /**
- * __arch_wb_cache_pmem - write back a cache range with CLWB
+ * arch_wb_cache_pmem - write back a cache range with CLWB
  * @vaddr:	virtual start address
  * @size:	number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
  * instruction.  This function requires explicit ordering with an
- * arch_wmb_pmem() call.  This API is internal to the x86 PMEM implementation.
+ * arch_wmb_pmem() call.
  */
-static inline void __arch_wb_cache_pmem(void *vaddr, size_t size)
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
 {
 	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
 	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vaddr = (void __force *)addr;
 	void *vend = vaddr + size;
 	void *p;
 
@@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
 	len = copy_from_iter_nocache(vaddr, bytes, i);
 
 	if (__iter_needs_pmem_wb(i))
-		__arch_wb_cache_pmem(vaddr, bytes);
+		arch_wb_cache_pmem(addr, bytes);
 
 	return len;
 }
@@ -133,7 +134,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 	void *vaddr = (void __force *)addr;
 
 	memset(vaddr, 0, size);
-	__arch_wb_cache_pmem(vaddr, size);
+	arch_wb_cache_pmem(addr, size);
 }
 
 static inline bool __arch_has_wmb_pmem(void)
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index acfea8c..7c3d11a 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	BUG();
 }
+
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	BUG();
+}
 #endif
 
 /*
  * Architectures that define ARCH_HAS_PMEM_API must provide
  * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
- * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
+ * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem()
+ * and arch_has_wmb_pmem().
  */
 static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
 {
@@ -178,4 +184,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size)
 	else
 		default_clear_pmem(addr, size);
 }
+
+/**
+ * wb_cache_pmem - write back processor cache for PMEM memory range
+ * @addr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back the processor cache range starting at 'addr' for 'size' bytes.
+ * This function requires explicit ordering with a wmb_pmem() call.
+ */
+static inline void wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	if (arch_has_pmem_api())
+		arch_wb_cache_pmem(addr, size);
+}
 #endif /* __PMEM_H__ */
-- 
2.5.0



* [PATCH v7 4/9] dax: support dirty DAX entries in radix tree
  2016-01-06 18:00 [PATCH v7 0/9] DAX fsync/msync support Ross Zwisler
                   ` (2 preceding siblings ...)
  2016-01-06 18:00 ` [PATCH v7 3/9] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
@ 2016-01-06 18:00 ` Ross Zwisler
  2016-01-06 18:00 ` [PATCH v7 5/9] mm: add find_get_entries_tag() Ross Zwisler
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2016-01-06 18:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new exceptional
entries into the radix tree that represent dirty DAX PTE or PMD pages.
These exceptional entries will also contain the writeback sectors for the
PTE or PMD faults that we can use at fsync/msync time.
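
For illustration, the entry encoding can be exercised stand-alone.  The
macros below are copied verbatim from the include/linux/radix-tree.h hunk
in this patch; the small userspace program around them is just a throwaway
demo of the bit layout:

#include <stdio.h>

#define RADIX_TREE_EXCEPTIONAL_ENTRY	2
#define RADIX_DAX_MASK	0xf
#define RADIX_DAX_SHIFT	4
#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))

int main(void)
{
	/* the low bits mark the entry as exceptional and record PTE vs
	 * PMD; the remaining bits carry the sector */
	void *entry = RADIX_DAX_ENTRY(0x1234, 1);

	printf("entry %p type %s sector %#lx\n", entry,
	       RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD ? "pmd" : "pte",
	       RADIX_DAX_SECTOR(entry));
	return 0;
}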

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  We rely on
the fact that only one type of exceptional entry can be found in a given
radix tree based on its usage.  This happens for free with DAX vs shmem but
we explicitly prevent shadow entries from being added to radix trees for
DAX mappings.

The only shadow entries that would be generated for DAX radix trees would
be to track zero page mappings that were created for holes.  These pages
would receive minimal benefit from having shadow entries, and the choice
to have only one type of exceptional entry in a given radix tree makes the
logic simpler both in clear_exceptional_entry() and in the rest of DAX.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c             |  2 +-
 fs/inode.c                 |  2 +-
 include/linux/dax.h        |  5 ++++
 include/linux/fs.h         |  3 +-
 include/linux/radix-tree.h |  9 ++++++
 mm/filemap.c               | 17 ++++++++----
 mm/truncate.c              | 69 ++++++++++++++++++++++++++--------------------
 mm/vmscan.c                |  9 +++++-
 mm/workingset.c            |  4 +--
 9 files changed, 78 insertions(+), 42 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e123641..303b7cd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -75,7 +75,7 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index 4c8f719..6e3e5d0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -495,7 +495,7 @@ void clear_inode(struct inode *inode)
 	 */
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
-	BUG_ON(inode->i_data.nrshadows);
+	BUG_ON(inode->i_data.nrexceptional);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..e9d57f68 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 }
+
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ddb7bad..fdab768 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -433,7 +433,8 @@ struct address_space {
 	struct rw_semaphore	i_mmap_rwsem;	/* protect tree, count, list */
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
-	unsigned long		nrshadows;	/* number of shadow entries */
+	/* number of shadow or DAX exceptional entries */
+	unsigned long		nrexceptional;
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 57e7d87..7c88ad1 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,6 +51,15 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_SHIFT	4
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+		RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
diff --git a/mm/filemap.c b/mm/filemap.c
index 847ee43..7b8be78 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -11,6 +11,7 @@
  */
 #include <linux/export.h>
 #include <linux/compiler.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
 #include <linux/capability.h>
@@ -123,9 +124,9 @@ static void page_cache_tree_delete(struct address_space *mapping,
 	__radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
 
 	if (shadow) {
-		mapping->nrshadows++;
+		mapping->nrexceptional++;
 		/*
-		 * Make sure the nrshadows update is committed before
+		 * Make sure the nrexceptional update is committed before
 		 * the nrpages update so that final truncate racing
 		 * with reclaim does not see both counters 0 at the
 		 * same time and miss a shadow entry.
@@ -579,9 +580,13 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
+
+		if (WARN_ON(dax_mapping(mapping)))
+			return -EINVAL;
+
 		if (shadowp)
 			*shadowp = p;
-		mapping->nrshadows--;
+		mapping->nrexceptional--;
 		if (node)
 			workingset_node_shadows_dec(node);
 	}
@@ -1245,9 +1250,9 @@ repeat:
 			if (radix_tree_deref_retry(page))
 				goto restart;
 			/*
-			 * A shadow entry of a recently evicted page,
-			 * or a swap entry from shmem/tmpfs.  Return
-			 * it without attempting to raise page count.
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
 			 */
 			goto export;
 		}
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad..e3ee0e2 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -9,6 +9,7 @@
 
 #include <linux/kernel.h>
 #include <linux/backing-dev.h>
+#include <linux/dax.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/swap.h>
@@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
 		return;
 
 	spin_lock_irq(&mapping->tree_lock);
-	/*
-	 * Regular page slots are stabilized by the page lock even
-	 * without the tree itself locked.  These unlocked entries
-	 * need verification under the tree lock.
-	 */
-	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
-		goto unlock;
-	if (*slot != entry)
-		goto unlock;
-	radix_tree_replace_slot(slot, NULL);
-	mapping->nrshadows--;
-	if (!node)
-		goto unlock;
-	workingset_node_shadows_dec(node);
-	/*
-	 * Don't track node without shadow entries.
-	 *
-	 * Avoid acquiring the list_lru lock if already untracked.
-	 * The list_empty() test is safe as node->private_list is
-	 * protected by mapping->tree_lock.
-	 */
-	if (!workingset_node_shadows(node) &&
-	    !list_empty(&node->private_list))
-		list_lru_del(&workingset_shadow_nodes, &node->private_list);
-	__radix_tree_delete_node(&mapping->page_tree, node);
+
+	if (dax_mapping(mapping)) {
+		if (radix_tree_delete_item(&mapping->page_tree, index, entry))
+			mapping->nrexceptional--;
+	} else {
+		/*
+		 * Regular page slots are stabilized by the page lock even
+		 * without the tree itself locked.  These unlocked entries
+		 * need verification under the tree lock.
+		 */
+		if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+					&slot))
+			goto unlock;
+		if (*slot != entry)
+			goto unlock;
+		radix_tree_replace_slot(slot, NULL);
+		mapping->nrexceptional--;
+		if (!node)
+			goto unlock;
+		workingset_node_shadows_dec(node);
+		/*
+		 * Don't track node without shadow entries.
+		 *
+		 * Avoid acquiring the list_lru lock if already untracked.
+		 * The list_empty() test is safe as node->private_list is
+		 * protected by mapping->tree_lock.
+		 */
+		if (!workingset_node_shadows(node) &&
+		    !list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes,
+					&node->private_list);
+		__radix_tree_delete_node(&mapping->page_tree, node);
+	}
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
@@ -228,7 +237,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrexceptional == 0)
 		return;
 
 	/* Offsets within partial pages */
@@ -402,7 +411,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
  */
 void truncate_inode_pages_final(struct address_space *mapping)
 {
-	unsigned long nrshadows;
+	unsigned long nrexceptional;
 	unsigned long nrpages;
 
 	/*
@@ -416,14 +425,14 @@ void truncate_inode_pages_final(struct address_space *mapping)
 
 	/*
 	 * When reclaim installs eviction entries, it increases
-	 * nrshadows first, then decreases nrpages.  Make sure we see
+	 * nrexceptional first, then decreases nrpages.  Make sure we see
 	 * this in the right order or we might miss an entry.
 	 */
 	nrpages = mapping->nrpages;
 	smp_rmb();
-	nrshadows = mapping->nrshadows;
+	nrexceptional = mapping->nrexceptional;
 
-	if (nrpages || nrshadows) {
+	if (nrpages || nrexceptional) {
 		/*
 		 * As truncation uses a lockless tree lookup, cycle
 		 * the tree lock to make sure any ongoing tree
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44ec50f..30e0cd7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -46,6 +46,7 @@
 #include <linux/oom.h>
 #include <linux/prefetch.h>
 #include <linux/printk.h>
+#include <linux/dax.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -671,9 +672,15 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		 * inode reclaim needs to empty out the radix tree or
 		 * the nodes are lost.  Don't plant shadows behind its
 		 * back.
+		 *
+		 * We also don't store shadows for DAX mappings because the
+		 * only page cache pages found in these are zero pages
+		 * covering holes, and because we don't want to mix DAX
+		 * exceptional entries and shadow exceptional entries in the
+		 * same page_tree.
 		 */
 		if (reclaimed && page_is_file_cache(page) &&
-		    !mapping_exiting(mapping))
+		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow, memcg);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
diff --git a/mm/workingset.c b/mm/workingset.c
index aa01713..61ead9e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -351,8 +351,8 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
 			node->slots[i] = NULL;
 			BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
 			node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
-			BUG_ON(!mapping->nrshadows);
-			mapping->nrshadows--;
+			BUG_ON(!mapping->nrexceptional);
+			mapping->nrexceptional--;
 		}
 	}
 	BUG_ON(node->count);
-- 
2.5.0



* [PATCH v7 5/9] mm: add find_get_entries_tag()
  2016-01-06 18:00 [PATCH v7 0/9] DAX fsync/msync support Ross Zwisler
                   ` (3 preceding siblings ...)
  2016-01-06 18:00 ` [PATCH v7 4/9] dax: support dirty DAX entries in radix tree Ross Zwisler
@ 2016-01-06 18:00 ` Ross Zwisler
  2016-01-06 18:01 ` [PATCH v7 6/9] dax: add support for fsync/msync Ross Zwisler
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2016-01-06 18:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

Add find_get_entries_tag() to the family of functions that include
find_get_entries(), find_get_pages() and find_get_pages_tag().  This is
needed for DAX dirty page handling because we need a list of both page
offsets and radix tree entries ('indices' and 'entries' in this function)
that are marked with the PAGECACHE_TAG_TOWRITE tag.
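
A sketch of the intended calling pattern (this mirrors the loop that the
fsync patch later in this series adds to fs/dax.c; handle_entry() is a
made-up callback, and a real caller must clear the TOWRITE tag on each
entry, as dax_writeback_one() does, or advance the start index so that
the loop terminates):

#include <linux/dax.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>

static void walk_towrite_entries(struct address_space *mapping, pgoff_t start,
		void (*handle_entry)(struct address_space *, pgoff_t, void *))
{
	pgoff_t indices[PAGEVEC_SIZE];
	struct pagevec pvec;
	int i;

	pagevec_init(&pvec, 0);
	for (;;) {
		pvec.nr = find_get_entries_tag(mapping, start,
				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
				pvec.pages, indices);
		if (pvec.nr == 0)
			break;

		/* indices[i] is the page offset, pvec.pages[i] the raw
		 * (possibly exceptional) radix tree entry */
		for (i = 0; i < pvec.nr; i++)
			handle_entry(mapping, indices[i], pvec.pages[i]);
	}
}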

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 include/linux/pagemap.h |  3 +++
 mm/filemap.c            | 68 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4d08b6c..92395a0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -361,6 +361,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
 			       unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+			int tag, unsigned int nr_entries,
+			struct page **entries, pgoff_t *indices);
 
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
 			pgoff_t index, unsigned flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b8be78..1e215fc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1499,6 +1499,74 @@ repeat:
 }
 EXPORT_SYMBOL(find_get_pages_tag);
 
+/**
+ * find_get_entries_tag - find and return entries that match @tag
+ * @mapping:	the address_space to search
+ * @start:	the starting page cache index
+ * @tag:	the tag index
+ * @nr_entries:	the maximum number of entries
+ * @entries:	where the resulting entries are placed
+ * @indices:	the cache indices corresponding to the entries in @entries
+ *
+ * Like find_get_entries, except we only return entries which are tagged with
+ * @tag.
+ */
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+			int tag, unsigned int nr_entries,
+			struct page **entries, pgoff_t *indices)
+{
+	void **slot;
+	unsigned int ret = 0;
+	struct radix_tree_iter iter;
+
+	if (!nr_entries)
+		return 0;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_tagged(slot, &mapping->page_tree,
+				   &iter, start, tag) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page)) {
+				/*
+				 * Transient condition which can only trigger
+				 * when entry at index 0 moves out of or back
+				 * to root: none yet gotten, safe to restart.
+				 */
+				goto restart;
+			}
+
+			/*
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
+			 */
+			goto export;
+		}
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+export:
+		indices[ret] = iter.index;
+		entries[ret] = page;
+		if (++ret == nr_entries)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL(find_get_entries_tag);
+
 /*
  * CD/DVDs are error prone. When a medium error occurs, the driver may fail
  * a _large_ part of the i/o request. Imagine the worst scenario:
-- 
2.5.0



* [PATCH v7 6/9] dax: add support for fsync/msync
  2016-01-06 18:00 [PATCH v7 0/9] DAX fsync/msync support Ross Zwisler
                   ` (4 preceding siblings ...)
  2016-01-06 18:00 ` [PATCH v7 5/9] mm: add find_get_entries_tag() Ross Zwisler
@ 2016-01-06 18:01 ` Ross Zwisler
  2016-01-06 18:01 ` [PATCH v7 7/9] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Ross Zwisler
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2016-01-06 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

To properly handle fsync/msync in an efficient way DAX needs to track dirty
pages so it is able to flush them durably to media on demand.

The tracking of dirty pages is done via the radix tree in struct
address_space.  This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file, and
it already has support for exceptional (non struct page*) entries.  We
build upon these features to add exceptional entries to the radix tree for
DAX dirty PMD or PTE pages at fault time.
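
For reference, the userspace pattern this is meant to make safe is just the
usual mmap/msync sequence.  A minimal sketch (the path below is made up and
the filesystem is assumed to be mounted with -o dax):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/pmem/testfile", O_CREAT | O_RDWR, 0644);
	char *data;

	if (fd < 0 || ftruncate(fd, 4096) < 0) {
		perror("open/ftruncate");
		return EXIT_FAILURE;
	}

	data = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (data == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* the store goes straight to the DAX mapping, but may linger in
	 * the CPU cache until it is explicitly written back */
	strcpy(data, "hello, pmem");

	/* with this series, msync()/fsync() write back the dirtied
	 * cache lines and order them before returning */
	if (msync(data, 4096, MS_SYNC) < 0) {
		perror("msync");
		return EXIT_FAILURE;
	}

	munmap(data, 4096);
	close(fd);
	return EXIT_SUCCESS;
}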

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 187 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dax.h |   2 +
 mm/filemap.c        |   6 ++
 3 files changed, 189 insertions(+), 6 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9dc0c97..c107da6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -24,6 +24,7 @@
 #include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/pagevec.h>
 #include <linux/pmem.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
@@ -324,6 +325,167 @@ static int copy_user_bh(struct page *to, struct inode *inode,
 	return 0;
 }
 
+#define NO_SECTOR -1
+
+static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
+		sector_t sector, bool pmd_entry, bool dirty)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int error = 0;
+	void *entry;
+
+	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = radix_tree_lookup(page_tree, index);
+
+	if (entry) {
+		if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+			goto dirty;
+		radix_tree_delete(&mapping->page_tree, index);
+		mapping->nrexceptional--;
+	}
+
+	if (sector == NO_SECTOR) {
+		/*
+		 * This can happen during correct operation if our pfn_mkwrite
+		 * fault raced against a hole punch operation.  If this
+		 * happens the pte that was hole punched will have been
+		 * unmapped and the radix tree entry will have been removed by
+		 * the time we are called, but the call will still happen.  We
+		 * will return all the way up to wp_pfn_shared(), where the
+		 * pte_same() check will fail, eventually causing page fault
+		 * to be retried by the CPU.
+		 */
+		goto unlock;
+	}
+
+	error = radix_tree_insert(page_tree, index,
+			RADIX_DAX_ENTRY(sector, pmd_entry));
+	if (error)
+		goto unlock;
+
+	mapping->nrexceptional++;
+ dirty:
+	if (dirty)
+		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
+ unlock:
+	spin_unlock_irq(&mapping->tree_lock);
+	return error;
+}
+
+static int dax_writeback_one(struct block_device *bdev,
+		struct address_space *mapping, pgoff_t index, void *entry)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int type = RADIX_DAX_TYPE(entry);
+	struct radix_tree_node *node;
+	struct blk_dax_ctl dax;
+	void **slot;
+	int ret = 0;
+
+	spin_lock_irq(&mapping->tree_lock);
+	/*
+	 * Regular page slots are stabilized by the page lock even
+	 * without the tree itself locked.  These unlocked entries
+	 * need verification under the tree lock.
+	 */
+	if (!__radix_tree_lookup(page_tree, index, &node, &slot))
+		goto unlock;
+	if (*slot != entry)
+		goto unlock;
+
+	/* another fsync thread may have already written back this entry */
+	if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+		goto unlock;
+
+	radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+
+	if (WARN_ON_ONCE(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD)) {
+		ret = -EIO;
+		goto unlock;
+	}
+
+	dax.sector = RADIX_DAX_SECTOR(entry);
+	dax.size = (type == RADIX_DAX_PMD ? PMD_SIZE : PAGE_SIZE);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	/*
+	 * We cannot hold tree_lock while calling dax_map_atomic() because it
+	 * eventually calls cond_resched().
+	 */
+	ret = dax_map_atomic(bdev, &dax);
+	if (ret < 0)
+		return ret;
+
+	if (WARN_ON_ONCE(ret < dax.size)) {
+		ret = -EIO;
+		goto unmap;
+	}
+
+	wb_cache_pmem(dax.addr, dax.size);
+ unmap:
+	dax_unmap_atomic(bdev, &dax);
+	return ret;
+
+ unlock:
+	spin_unlock_irq(&mapping->tree_lock);
+	return ret;
+}
+
+/*
+ * Flush the mapping to the persistent domain within the byte range of [start,
+ * end]. This is required by data integrity operations to ensure file data is
+ * on persistent storage prior to completion of the operation.
+ */
+int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+		loff_t end)
+{
+	struct inode *inode = mapping->host;
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	pgoff_t indices[PAGEVEC_SIZE];
+	pgoff_t start_page, end_page;
+	struct pagevec pvec;
+	void *entry;
+	int i, ret = 0;
+
+	if (WARN_ON_ONCE(inode->i_blkbits != PAGE_SHIFT))
+		return -EIO;
+
+	rcu_read_lock();
+	entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
+	rcu_read_unlock();
+
+	/* see if the start of our range is covered by a PMD entry */
+	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+		start &= PMD_MASK;
+
+	start_page = start >> PAGE_CACHE_SHIFT;
+	end_page = end >> PAGE_CACHE_SHIFT;
+
+	tag_pages_for_writeback(mapping, start_page, end_page);
+
+	pagevec_init(&pvec, 0);
+	while (1) {
+		pvec.nr = find_get_entries_tag(mapping, start_page,
+				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
+				pvec.pages, indices);
+
+		if (pvec.nr == 0)
+			break;
+
+		for (i = 0; i < pvec.nr; i++) {
+			ret = dax_writeback_one(bdev, mapping, indices[i],
+					pvec.pages[i]);
+			if (ret < 0)
+				return ret;
+		}
+	}
+	wmb_pmem();
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
@@ -363,6 +525,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 	dax_unmap_atomic(bdev, &dax);
 
+	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
+			vmf->flags & FAULT_FLAG_WRITE);
+	if (error)
+		goto out;
+
 	error = vm_insert_mixed(vma, vaddr, dax.pfn);
 
  out:
@@ -487,6 +654,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		delete_from_page_cache(page);
 		unlock_page(page);
 		page_cache_release(page);
+		page = NULL;
 	}
 
 	/*
@@ -596,7 +764,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	pgoff_t size, pgoff;
 	loff_t lstart, lend;
 	sector_t block;
-	int result = 0;
+	int error, result = 0;
 
 	/* dax pmd mappings require pfn_t_devmap() */
 	if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -737,6 +905,16 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 		dax_unmap_atomic(bdev, &dax);
 
+		if (write) {
+			error = dax_radix_entry(mapping, pgoff, dax.sector,
+					true, true);
+			if (error) {
+				dax_pmd_dbg(&bh, address,
+						"PMD radix insertion failed");
+				goto fallback;
+			}
+		}
+
 		dev_dbg(part_to_dev(bdev->bd_part),
 				"%s: %s addr: %lx pfn: %lx sect: %llx\n",
 				__func__, current->comm, address,
@@ -795,15 +973,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
- *
  */
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+	struct file *file = vma->vm_file;
 
-	sb_start_pagefault(sb);
-	file_update_time(vma->vm_file);
-	sb_end_pagefault(sb);
+	dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false, true);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9d57f68..8204c3d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,4 +41,6 @@ static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
+int dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+		loff_t end);
 #endif
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e215fc..2e7c8d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -482,6 +482,12 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;
 
+	if (dax_mapping(mapping) && mapping->nrexceptional) {
+		err = dax_writeback_mapping_range(mapping, lstart, lend);
+		if (err)
+			return err;
+	}
+
 	if (mapping->nrpages) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
-- 
2.5.0



* [PATCH v7 7/9] ext2: call dax_pfn_mkwrite() for DAX fsync/msync
  2016-01-06 18:00 [PATCH v7 0/9] DAX fsync/msync support Ross Zwisler
                   ` (5 preceding siblings ...)
  2016-01-06 18:01 ` [PATCH v7 6/9] dax: add support for fsync/msync Ross Zwisler
@ 2016-01-06 18:01 ` Ross Zwisler
  2016-01-06 18:01 ` [PATCH v7 8/9] ext4: " Ross Zwisler
  2016-01-06 18:01 ` [PATCH v7 9/9] xfs: " Ross Zwisler
  8 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2016-01-06 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

To properly support the new DAX fsync/msync infrastructure filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext2/file.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 11a42c5..2c88d68 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 {
 	struct inode *inode = file_inode(vma->vm_file);
 	struct ext2_inode_info *ei = EXT2_I(inode);
-	int ret = VM_FAULT_NOPAGE;
 	loff_t size;
+	int ret;
 
 	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
@@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vma, vmf);
 
 	up_read(&ei->dax_sem);
 	sb_end_pagefault(inode->i_sb);
-- 
2.5.0



* [PATCH v7 8/9] ext4: call dax_pfn_mkwrite() for DAX fsync/msync
  2016-01-06 18:00 [PATCH v7 0/9] DAX fsync/msync support Ross Zwisler
                   ` (6 preceding siblings ...)
  2016-01-06 18:01 ` [PATCH v7 7/9] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Ross Zwisler
@ 2016-01-06 18:01 ` Ross Zwisler
  2016-01-06 18:01 ` [PATCH v7 9/9] xfs: " Ross Zwisler
  8 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2016-01-06 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

To properly support the new DAX fsync/msync infrastructure filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 60683ab..fa899c9 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -291,8 +291,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
 {
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
-	int ret = VM_FAULT_NOPAGE;
 	loff_t size;
+	int ret;
 
 	sb_start_pagefault(sb);
 	file_update_time(vma->vm_file);
@@ -300,6 +300,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vma, vmf);
 	up_read(&EXT4_I(inode)->i_mmap_sem);
 	sb_end_pagefault(sb);
 
-- 
2.5.0



* [PATCH v7 9/9] xfs: call dax_pfn_mkwrite() for DAX fsync/msync
  2016-01-06 18:00 [PATCH v7 0/9] DAX fsync/msync support Ross Zwisler
                   ` (7 preceding siblings ...)
  2016-01-06 18:01 ` [PATCH v7 8/9] ext4: " Ross Zwisler
@ 2016-01-06 18:01 ` Ross Zwisler
  8 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2016-01-06 18:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

To properly support the new DAX fsync/msync infrastructure filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_file.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ebe9b82..55e16e2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1610,9 +1610,8 @@ xfs_filemap_pmd_fault(
 /*
  * pfn_mkwrite was originally inteneded to ensure we capture time stamp
  * updates on write faults. In reality, it's need to serialise against
- * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite()
- * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault
- * barrier in place.
+ * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED
+ * to ensure we serialise the fault barrier in place.
  */
 static int
 xfs_filemap_pfn_mkwrite(
@@ -1635,6 +1634,8 @@ xfs_filemap_pfn_mkwrite(
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else if (IS_DAX(inode))
+		ret = dax_pfn_mkwrite(vma, vmf);
 	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
-- 
2.5.0



* Re: [PATCH v7 2/9] dax: fix conversion of holes to PMDs
  2016-01-06 18:00 ` [PATCH v7 2/9] dax: fix conversion of holes to PMDs Ross Zwisler
@ 2016-01-06 19:04   ` Dan Williams
  2016-01-07 22:34     ` Ross Zwisler
  2016-01-07 13:22   ` Jan Kara
  1 sibling, 1 reply; 22+ messages in thread
From: Dan Williams @ 2016-01-06 19:04 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dave Chinner,
	Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm@lists.01.org, X86 ML, XFS Developers

On Wed, Jan 6, 2016 at 10:00 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> When we get a DAX PMD fault for a write it is possible that there could be
> some number of 4k zero pages already present for the same range that were
> inserted to service reads from a hole.  These 4k zero pages need to be
> unmapped from the VMAs and removed from the struct address_space radix tree
> before the real DAX PMD entry can be inserted.
>
> For PTE faults this same use case also exists and is handled by a
> combination of unmap_mapping_range() to unmap the VMAs and
> delete_from_page_cache() to remove the page from the address_space radix
> tree.
>
> For PMD faults we do have a call to unmap_mapping_range() (protected by a
> buffer_new() check), but nothing clears out the radix tree entry.  The
> buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> code will never return a buffer_head with BH_New set, even when allocating
> new blocks over a hole.  Instead the filesystem will zero the blocks
> manually and return a buffer_head with only BH_Mapped set.
>
> Fix this situation by removing the buffer_new() check and adding a call to
> truncate_inode_pages_range() to clear out the radix tree entries before we
> insert the DAX PMD.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Replaced the current contents of v6 in -mm from next-20160106 with
this v7 set and it looks good.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>

One question below...

> ---
>  fs/dax.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 03cc4a3..9dc0c97 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -594,6 +594,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>         bool write = flags & FAULT_FLAG_WRITE;
>         struct block_device *bdev;
>         pgoff_t size, pgoff;
> +       loff_t lstart, lend;
>         sector_t block;
>         int result = 0;
>
> @@ -647,15 +648,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>                 goto fallback;
>         }
>
> -       /*
> -        * If we allocated new storage, make sure no process has any
> -        * zero pages covering this hole
> -        */
> -       if (buffer_new(&bh)) {
> -               i_mmap_unlock_read(mapping);
> -               unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> -               i_mmap_lock_read(mapping);
> -       }
> +       /* make sure no process has any zero pages covering this hole */
> +       lstart = pgoff << PAGE_SHIFT;
> +       lend = lstart + PMD_SIZE - 1; /* inclusive */
> +       i_mmap_unlock_read(mapping);
> +       unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> +       truncate_inode_pages_range(mapping, lstart, lend);

Do we need to do both unmap and truncate given that
truncate_inode_page() optionally does an unmap_mapping_range()
internally?


* Re: [PATCH v7 1/9] dax: fix NULL pointer dereference in __dax_dbg()
  2016-01-06 18:00 ` [PATCH v7 1/9] dax: fix NULL pointer dereference in __dax_dbg() Ross Zwisler
@ 2016-01-06 19:14   ` Dan Williams
  2016-01-07  9:34     ` Jan Kara
  0 siblings, 1 reply; 22+ messages in thread
From: Dan Williams @ 2016-01-06 19:14 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dave Chinner,
	Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm@lists.01.org, X86 ML, XFS Developers

On Wed, Jan 6, 2016 at 10:00 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> __dax_dbg() currently assumes that bh->b_bdev is non-NULL, passing it into
> bdevname() where it is dereferenced.  This assumption isn't always true -
> when called for reads of holes, ext4_dax_mmap_get_block() returns a buffer
> head where bh->b_bdev is never set.  I hit this BUG while testing the DAX
> PMD fault path.
>
> Instead, verify that we have a valid bh->b_bdev, else just say "unknown"
> for the block device.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/dax.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 7af8797..03cc4a3 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -563,7 +563,12 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
>  {
>         if (bh) {
>                 char bname[BDEVNAME_SIZE];
> -               bdevname(bh->b_bdev, bname);
> +
> +               if (bh->b_bdev)
> +                       bdevname(bh->b_bdev, bname);
> +               else
> +                       snprintf(bname, BDEVNAME_SIZE, "unknown");
> +
>                 pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
>                         "length %zd fallback: %s\n", fn, current->comm,
>                         address, bname, bh->b_state, (u64)bh->b_blocknr,

I'm assuming there's no danger of such a buffer_head ever being used
for the bdev parameter to dax_map_atomic()?  Shouldn't we also/instead
go fix ext4 to not send partially filled buffer_heads?


* Re: [PATCH v7 1/9] dax: fix NULL pointer dereference in __dax_dbg()
  2016-01-06 19:14   ` Dan Williams
@ 2016-01-07  9:34     ` Jan Kara
  2016-01-07 15:17       ` Dan Williams
  2016-01-07 23:10       ` Dave Chinner
  0 siblings, 2 replies; 22+ messages in thread
From: Jan Kara @ 2016-01-07  9:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, Linux MM, linux-nvdimm@lists.01.org, X86 ML,
	XFS Developers

On Wed 06-01-16 11:14:09, Dan Williams wrote:
> On Wed, Jan 6, 2016 at 10:00 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > __dax_dbg() currently assumes that bh->b_bdev is non-NULL, passing it into
> > bdevname() where it is dereferenced.  This assumption isn't always true -
> > when called for reads of holes, ext4_dax_mmap_get_block() returns a buffer
> > head where bh->b_bdev is never set.  I hit this BUG while testing the DAX
> > PMD fault path.
> >
> > Instead, verify that we have a valid bh->b_bdev, else just say "unknown"
> > for the block device.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > ---
> >  fs/dax.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 7af8797..03cc4a3 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -563,7 +563,12 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
> >  {
> >         if (bh) {
> >                 char bname[BDEVNAME_SIZE];
> > -               bdevname(bh->b_bdev, bname);
> > +
> > +               if (bh->b_bdev)
> > +                       bdevname(bh->b_bdev, bname);
> > +               else
> > +                       snprintf(bname, BDEVNAME_SIZE, "unknown");
> > +
> >                 pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
> >                         "length %zd fallback: %s\n", fn, current->comm,
> >                         address, bname, bh->b_state, (u64)bh->b_blocknr,
> 
> I'm assuming there's no danger of such a buffer_head ever being used
> for the bdev parameter to dax_map_atomic()?  Shouldn't we also/instead
> go fix ext4 to not send partially filled buffer_heads?

No. The real problem is a long-standing abuse of struct buffer_head being
used for passing block mapping information (it's on my todo list to remove
that at least from the DAX code and use a cleaner block mapping interface,
but first I want basic DAX functionality to settle down to avoid unnecessary
conflicts). The filesystem is not supposed to touch bh->b_bdev. If you need
that filled in, set it yourself before passing the bh to the block mapping
function.
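
I.e. something like this (illustration only, untested; the helper name is
made up):

static int dax_get_mapped_bh(struct inode *inode, sector_t block,
		size_t size, get_block_t get_block, struct buffer_head *bh)
{
	memset(bh, 0, sizeof(*bh));
	/* caller-owned; don't expect the filesystem to fill this in */
	bh->b_bdev = inode->i_sb->s_bdev;
	bh->b_size = size;

	return get_block(inode, block, bh, 0);
}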

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH v7 2/9] dax: fix conversion of holes to PMDs
  2016-01-06 18:00 ` [PATCH v7 2/9] dax: fix conversion of holes to PMDs Ross Zwisler
  2016-01-06 19:04   ` Dan Williams
@ 2016-01-07 13:22   ` Jan Kara
  2016-01-07 22:11     ` Ross Zwisler
  1 sibling, 1 reply; 22+ messages in thread
From: Jan Kara @ 2016-01-07 13:22 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dan Williams,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Wed 06-01-16 11:00:56, Ross Zwisler wrote:
> When we get a DAX PMD fault for a write it is possible that there could be
> some number of 4k zero pages already present for the same range that were
> inserted to service reads from a hole.  These 4k zero pages need to be
> unmapped from the VMAs and removed from the struct address_space radix tree
> before the real DAX PMD entry can be inserted.
> 
> For PTE faults this same use case also exists and is handled by a
> combination of unmap_mapping_range() to unmap the VMAs and
> delete_from_page_cache() to remove the page from the address_space radix
> tree.
> 
> For PMD faults we do have a call to unmap_mapping_range() (protected by a
> buffer_new() check), but nothing clears out the radix tree entry.  The
> buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> code will never return a buffer_head with BH_New set, even when allocating
> new blocks over a hole.  Instead the filesystem will zero the blocks
> manually and return a buffer_head with only BH_Mapped set.
> 
> Fix this situation by removing the buffer_new() check and adding a call to
> truncate_inode_pages_range() to clear out the radix tree entries before we
> insert the DAX PMD.

Ho, hum, let me understand this. So we have a file, different processes are
mapping it. One process maps it with normal page granularity and another
process with huge page granularity. Thus when the first process read-faults
a few normal pages and then the second process write-faults the huge page
in the same range, we have a problem. Do I understand this correctly?
Because otherwise I don't understand how a single page table can have both
a huge page and a normal page in the same range...

And if this is indeed the problem, then what prevents the unmapping and
truncation in the huge page fault from racing with mapping the same range
again using small pages? Sure, now blocks are allocated so the mapping itself
will be consistent, but the radix tree will have the same issues it had before
this patch, won't it?

... thinking some more about this ...

OK, there is some difference - we will only have DAX exceptional entries
for the range covered by huge page and those we replace properly in
dax_radix_entry() code. So things are indeed fine *except* that nothing
seems to serialize dax_load_hole() with the PMD fault. A race like the following
seems possible:

CPU1 - process 1		CPU2 - process 2

__dax_fault() - file f, index 1
  get_block() -> returns hole
				__dax_pmd_fault() - file f, index 0
				  get_block() -> allocates blocks
				  ...
				  truncate_pagecache_range()
  dax_load_hole()

Boom, we have hole page instantiated for allocated range (data corruption)
and corruption of radix tree entries as well. Actually this problem is
there even for two different processes doing normal page faults (one read,
one write) against the same page in the file.

... thinking about possible fixes ...

So we need some exclusion that makes sure pgoff->block mapping information
is uptodate at the moment we insert it into page tables. The simplest
reasonably fast thing I can see is:

When handling a read fault, things stay as is and filesystem protects the
fault with an equivalent of EXT4_I(inode)->i_mmap_sem held for reading. When
handling a write fault we first grab EXT4_I(inode)->i_mmap_sem for reading
and try a read fault. If __dax_fault() sees a hole returned from
get_blocks() during a write fault, it bails out. Filesystem grabs
EXT4_I(inode)->i_mmap_sem for writing and retries with different
get_blocks() callback which will allocate blocks. That way we get proper
exclusion for faults needing to allocate blocks. Thoughts?
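
In ext4 terms the flow might look roughly like this (i_mmap_sem is the real
ext4 lock; the _noalloc/_alloc get_block callbacks and the "need alloc"
return value are invented names, purely to show the shape of the scheme):

	static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		struct inode *inode = file_inode(vma->vm_file);
		int ret;

		/* every fault starts under the shared lock with a
		 * get_block() that never allocates */
		down_read(&EXT4_I(inode)->i_mmap_sem);
		ret = __dax_fault(vma, vmf, ext4_dax_get_block_noalloc, NULL);
		up_read(&EXT4_I(inode)->i_mmap_sem);

		if (ret != DAX_FAULT_NEED_ALLOC)	/* invented signal */
			return ret;

		/* a write fault hit a hole: retry with allocation while
		 * excluding all other faults on this inode */
		down_write(&EXT4_I(inode)->i_mmap_sem);
		ret = __dax_fault(vma, vmf, ext4_dax_get_block_alloc, NULL);
		up_write(&EXT4_I(inode)->i_mmap_sem);
		return ret;
	}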

								Honza

> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/dax.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 03cc4a3..9dc0c97 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -594,6 +594,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  	bool write = flags & FAULT_FLAG_WRITE;
>  	struct block_device *bdev;
>  	pgoff_t size, pgoff;
> +	loff_t lstart, lend;
>  	sector_t block;
>  	int result = 0;
>  
> @@ -647,15 +648,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		goto fallback;
>  	}
>  
> -	/*
> -	 * If we allocated new storage, make sure no process has any
> -	 * zero pages covering this hole
> -	 */
> -	if (buffer_new(&bh)) {
> -		i_mmap_unlock_read(mapping);
> -		unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> -		i_mmap_lock_read(mapping);
> -	}
> +	/* make sure no process has any zero pages covering this hole */
> +	lstart = pgoff << PAGE_SHIFT;
> +	lend = lstart + PMD_SIZE - 1; /* inclusive */
> +	i_mmap_unlock_read(mapping);
> +	unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> +	truncate_inode_pages_range(mapping, lstart, lend);
> +	i_mmap_lock_read(mapping);
>  
>  	/*
>  	 * If a truncate happened while we were allocating blocks, we may
> @@ -669,7 +668,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>  		goto out;
>  	}
>  	if ((pgoff | PG_PMD_COLOUR) >= size) {
> -		dax_pmd_dbg(&bh, address, "pgoff unaligned");
> +		dax_pmd_dbg(&bh, address,
> +				"offset + huge page size > file size");
>  		goto fallback;
>  	}
>  
> -- 
> 2.5.0
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 1/9] dax: fix NULL pointer dereference in __dax_dbg()
  2016-01-07  9:34     ` Jan Kara
@ 2016-01-07 15:17       ` Dan Williams
  2016-01-07 22:16         ` Ross Zwisler
  2016-01-07 23:10       ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: Dan Williams @ 2016-01-07 15:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, Linux MM, linux-nvdimm@lists.01.org, X86 ML,
	XFS Developers

On Thu, Jan 7, 2016 at 1:34 AM, Jan Kara <jack@suse.cz> wrote:
> On Wed 06-01-16 11:14:09, Dan Williams wrote:
>> On Wed, Jan 6, 2016 at 10:00 AM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > __dax_dbg() currently assumes that bh->b_bdev is non-NULL, passing it into
>> > bdevname() where it is dereferenced.  This assumption isn't always true -
>> > when called for reads of holes, ext4_dax_mmap_get_block() returns a buffer
>> > head where bh->b_bdev is never set.  I hit this BUG while testing the DAX
>> > PMD fault path.
>> >
>> > Instead, verify that we have a valid bh->b_bdev, else just say "unknown"
>> > for the block device.
>> >
>> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>> > Cc: Dan Williams <dan.j.williams@intel.com>
>> > ---
>> >  fs/dax.c | 7 ++++++-
>> >  1 file changed, 6 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/fs/dax.c b/fs/dax.c
>> > index 7af8797..03cc4a3 100644
>> > --- a/fs/dax.c
>> > +++ b/fs/dax.c
>> > @@ -563,7 +563,12 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
>> >  {
>> >         if (bh) {
>> >                 char bname[BDEVNAME_SIZE];
>> > -               bdevname(bh->b_bdev, bname);
>> > +
>> > +               if (bh->b_bdev)
>> > +                       bdevname(bh->b_bdev, bname);
>> > +               else
>> > +                       snprintf(bname, BDEVNAME_SIZE, "unknown");
>> > +
>> >                 pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
>> >                         "length %zd fallback: %s\n", fn, current->comm,
>> >                         address, bname, bh->b_state, (u64)bh->b_blocknr,
>>
>> I'm assuming there's no danger of such a buffer_head ever being used
>> for the bdev parameter to dax_map_atomic()?  Shouldn't we also/instead
>> go fix ext4 to not send partially filled buffer_heads?
>
> No. The real problem is a long-standing abuse of struct buffer_head to be
> used for passing block mapping information (it's on my todo list to remove
> that at least from DAX code and use cleaner block mapping interface but
> first I want basic DAX functionality to settle down to avoid unnecessary
> conflicts). Filesystem is not supposed to touch bh->b_bdev. If you need
> that filled in, set it yourself before passing bh to the block mapping
> function.
>

Ok, makes sense.

Ross, can you fix this instead by unconditionally looking up the bdev
rather than saying "unknown"?  The bdev should always be retrievable.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/9] dax: fix conversion of holes to PMDs
  2016-01-07 13:22   ` Jan Kara
@ 2016-01-07 22:11     ` Ross Zwisler
  2016-01-11 12:23       ` Jan Kara
  0 siblings, 1 reply; 22+ messages in thread
From: Ross Zwisler @ 2016-01-07 22:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Thu, Jan 07, 2016 at 02:22:06PM +0100, Jan Kara wrote:
> On Wed 06-01-16 11:00:56, Ross Zwisler wrote:
> > When we get a DAX PMD fault for a write it is possible that there could be
> > some number of 4k zero pages already present for the same range that were
> > inserted to service reads from a hole.  These 4k zero pages need to be
> > unmapped from the VMAs and removed from the struct address_space radix tree
> > before the real DAX PMD entry can be inserted.
> > 
> > For PTE faults this same use case also exists and is handled by a
> > combination of unmap_mapping_range() to unmap the VMAs and
> > delete_from_page_cache() to remove the page from the address_space radix
> > tree.
> > 
> > For PMD faults we do have a call to unmap_mapping_range() (protected by a
> > buffer_new() check), but nothing clears out the radix tree entry.  The
> > buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> > code will never return a buffer_head with BH_New set, even when allocating
> > new blocks over a hole.  Instead the filesystem will zero the blocks
> > manually and return a buffer_head with only BH_Mapped set.
> > 
> > Fix this situation by removing the buffer_new() check and adding a call to
> > truncate_inode_pages_range() to clear out the radix tree entries before we
> > insert the DAX PMD.
> 
> Ho, hum, let me understand this. So we have a file, different processes are
> mapping it. One process maps it with normal page granularity and another
> process with huge page granularity. Thus when the first process read-faults
> a few normal pages and then the second process write-faults the huge page
> in the same range, we have a problem. Do I understand this correctly?
> Because otherwise I don't understand how a single page table can have both
> huge page and normal page in the same range...

I don't think that it necessarily has to do with multiple threads.  The bit to
notice here is we *always* use 4k zero pages to cover holes.  So, a single
thread can hit this condition by doing some reads from a hole (insert 4k
pages), then doing a write.  This write is the first time that we will try and
use real DAX storage to insert into the page tables, and we may end up getting
a PMD.  This means that we need to clear out all the 4k pages that we inserted
while reading holes in this same range, now that we have a 2M segment
allocated by the filesystem and the entire range is no longer a hole.
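
To make that concrete, a minimal user-space illustration of the sequence
(the mount path is made up, and whether the write really takes the PMD path
depends on alignment and the filesystem):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 2UL << 20;		/* one PMD worth of file */
		int fd = open("/mnt/dax/testfile", O_CREAT | O_RDWR, 0644);
		char *p;

		if (fd < 0 || ftruncate(fd, len))
			return 1;
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		/* read faults over the hole insert 4k zero pages */
		volatile char c = p[0];
		c = p[4096];
		(void)c;

		/* the first write fault may take the PMD path; the stale
		 * 4k zero-page entries for this range must go away first */
		p[0] = 1;

		munmap(p, len);
		close(fd);
		return 0;
	}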

> And if this is indeed the problem then what prevents the unmapping and
> truncation in the huge page fault from racing with mapping the same range again
> using small pages? Sure now blocks are allocated so the mapping itself will
> be consistent but radix tree will have the same issues it had before this
> patch, won't it?

Yep, this is a separate issue, but I think that we handle this case
successfully, though we may end up flushing the same address multiple times.
Once the filesystem has established a block mapping (assuming we avoid the
race described below where one thread is mapping in holes and the other sees a
block allocation), I think we are okay.  It's true that one thread can map in
PMDs, and another thread could potentially map in PTEs that cover the same
range if they are working with mmaps that are smaller than a PMD, but the
sectors inserted into the radix tree by each of those threads will be
individually correct - the only issue is that they may overlap.

Say, for example you have the following:

CPU1 - process 1				CPU2 - process 2
mmap for sector 0, size 2M
insert PMD into radix tree for sector 0
  This radix tree covers sectors 0-4096
						mmap for sector 32, size 4k
						insert PTE entry into radix
						tree for sector 32

In this case a fsync of the fd by process 1 will end up flushing sector 32
twice, which is correct but inefficient.  I think we can make this more
efficient by adjusting the insertion code and dirtying code in
dax_radix_entry() to look for PMDs that cover this same range.
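
Roughly something like this in dax_radix_entry(), under mapping->tree_lock
(PG_PMD_COLOUR and the RADIX_DAX_* helpers are assumed to be the ones from
this series, so this is only a sketch):

	/* sketch only: check for a covering PMD entry first */
	pgoff_t pmd_index = index & ~PG_PMD_COLOUR;
	void *entry = radix_tree_lookup(&mapping->page_tree, pmd_index);

	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
		/* a PMD entry already covers this index - just (re)dirty
		 * it rather than inserting an overlapping PTE entry */
		radix_tree_tag_set(&mapping->page_tree, pmd_index,
				   PAGECACHE_TAG_DIRTY);
		return 0;
	}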

> ... thinking some more about this ...
> 
> OK, there is some difference - we will only have DAX exceptional entries
> for the range covered by huge page and those we replace properly in
> dax_radix_entry() code. So things are indeed fine *except* that nothing
> seems to serialize dax_load_hole() with the PMD fault. A race like the following
> seems possible:
> 
> CPU1 - process 1		CPU2 - process 2
> 
> __dax_fault() - file f, index 1
>   get_block() -> returns hole
> 				__dax_pmd_fault() - file f, index 0
> 				  get_block() -> allocates blocks
> 				  ...
> 				  truncate_pagecache_range()
>   dax_load_hole()
> 
> Boom, we have hole page instantiated for allocated range (data corruption)
> and corruption of radix tree entries as well. Actually this problem is
> there even for two different processes doing normal page faults (one read,
> one write) against the same page in the file.

Yea, I agree, this seems like an existing issue that you could hit with just
the PTE path:

CPU1 - process 1		CPU2 - process 2

__dax_fault() - read file f, index 0
  get_block() -> returns hole
				__dax_fault() - write file f, index 0
				  get_block() -> allocates blocks
				  ...
				  skips unmap_mapping_range() and
				    delete_from_page_cache() because it didn't
				    find a page for this pgoff
				  dax_insert_mapping()
  dax_load_hole()
  *data corruption*

> ... thinking about possible fixes ...
> 
> So we need some exclusion that makes sure pgoff->block mapping information
> is uptodate at the moment we insert it into page tables. The simplest
> reasonably fast thing I can see is:
> 
> When handling a read fault, things stay as is and filesystem protects the
> fault with an equivalent of EXT4_I(inode)->i_mmap_sem held for reading. When
> handling a write fault we first grab EXT4_I(inode)->i_mmap_sem for reading
> and try a read fault. If __dax_fault() sees a hole returned from
> get_blocks() during a write fault, it bails out. Filesystem grabs
> EXT4_I(inode)->i_mmap_sem for writing and retries with different
> get_blocks() callback which will allocate blocks. That way we get proper
> exclusion for faults needing to allocate blocks. Thoughts?

I think this would work.  ext4, ext2 and xfs all handle their exclusion with
rw_semaphores, so this should work for each of them, I think.  Thanks for the
problem statement & solution!  :) 

I guess our best course is to make sure that we don't make this existing
problem worse via the fsync/msync patches by handling the error gracefully,
and fix this for v4.6.  I do feel the need to point out that this is a
pre-existing issue with DAX, and that my fsync patches just happened to help
us find it.  They don't make the situation any better or any worse, and I
really hope this issue doesn't end up blocking the fsync/msync patches from
getting merged for v4.5.

Thanks,
- Ross

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 1/9] dax: fix NULL pointer dereference in __dax_dbg()
  2016-01-07 15:17       ` Dan Williams
@ 2016-01-07 22:16         ` Ross Zwisler
  0 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2016-01-07 22:16 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Ross Zwisler, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dave Chinner, Dave Hansen,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm@lists.01.org, X86 ML, XFS Developers

On Thu, Jan 07, 2016 at 07:17:22AM -0800, Dan Williams wrote:
> On Thu, Jan 7, 2016 at 1:34 AM, Jan Kara <jack@suse.cz> wrote:
> > On Wed 06-01-16 11:14:09, Dan Williams wrote:
> >> On Wed, Jan 6, 2016 at 10:00 AM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >> > __dax_dbg() currently assumes that bh->b_bdev is non-NULL, passing it into
> >> > bdevname() where it is dereferenced.  This assumption isn't always true -
> >> > when called for reads of holes, ext4_dax_mmap_get_block() returns a buffer
> >> > head where bh->b_bdev is never set.  I hit this BUG while testing the DAX
> >> > PMD fault path.
> >> >
> >> > Instead, verify that we have a valid bh->b_bdev, else just say "unknown"
> >> > for the block device.
> >> >
> >> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> > Cc: Dan Williams <dan.j.williams@intel.com>
> >> > ---
> >> >  fs/dax.c | 7 ++++++-
> >> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/fs/dax.c b/fs/dax.c
> >> > index 7af8797..03cc4a3 100644
> >> > --- a/fs/dax.c
> >> > +++ b/fs/dax.c
> >> > @@ -563,7 +563,12 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
> >> >  {
> >> >         if (bh) {
> >> >                 char bname[BDEVNAME_SIZE];
> >> > -               bdevname(bh->b_bdev, bname);
> >> > +
> >> > +               if (bh->b_bdev)
> >> > +                       bdevname(bh->b_bdev, bname);
> >> > +               else
> >> > +                       snprintf(bname, BDEVNAME_SIZE, "unknown");
> >> > +
> >> >                 pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
> >> >                         "length %zd fallback: %s\n", fn, current->comm,
> >> >                         address, bname, bh->b_state, (u64)bh->b_blocknr,
> >>
> >> I'm assuming there's no danger of such a buffer_head ever being used
> >> for the bdev parameter to dax_map_atomic()?  Shouldn't we also/instead
> >> go fix ext4 to not send partially filled buffer_heads?
> >
> > No. The real problem is a long-standing abuse of struct buffer_head to be
> > used for passing block mapping information (it's on my todo list to remove
> > that at least from DAX code and use cleaner block mapping interface but
> > first I want basic DAX functionality to settle down to avoid unnecessary
> > conflicts). Filesystem is not supposed to touch bh->b_bdev. If you need
> > that filled in, set it yourself before passing bh to the block mapping
> > function.
> >
> 
> Ok, makes sense.
> 
> Ross, can you fix this instead by unconditionally looking up the bdev
> rather than saying "unknown"?  The bdev should always be retrievable.

Sure, will do.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/9] dax: fix conversion of holes to PMDs
  2016-01-06 19:04   ` Dan Williams
@ 2016-01-07 22:34     ` Ross Zwisler
  2016-01-08  4:18       ` Ross Zwisler
  0 siblings, 1 reply; 22+ messages in thread
From: Ross Zwisler @ 2016-01-07 22:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Matthew Wilcox, Thomas Gleixner, linux-ext4,
	linux-fsdevel, Linux MM, linux-nvdimm@lists.01.org, X86 ML,
	XFS Developers

On Wed, Jan 06, 2016 at 11:04:35AM -0800, Dan Williams wrote:
> On Wed, Jan 6, 2016 at 10:00 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > When we get a DAX PMD fault for a write it is possible that there could be
> > some number of 4k zero pages already present for the same range that were
> > inserted to service reads from a hole.  These 4k zero pages need to be
> > unmapped from the VMAs and removed from the struct address_space radix tree
> > before the real DAX PMD entry can be inserted.
> >
> > For PTE faults this same use case also exists and is handled by a
> > combination of unmap_mapping_range() to unmap the VMAs and
> > delete_from_page_cache() to remove the page from the address_space radix
> > tree.
> >
> > For PMD faults we do have a call to unmap_mapping_range() (protected by a
> > buffer_new() check), but nothing clears out the radix tree entry.  The
> > buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> > code will never return a buffer_head with BH_New set, even when allocating
> > new blocks over a hole.  Instead the filesystem will zero the blocks
> > manually and return a buffer_head with only BH_Mapped set.
> >
> > Fix this situation by removing the buffer_new() check and adding a call to
> > truncate_inode_pages_range() to clear out the radix tree entries before we
> > insert the DAX PMD.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> Replaced the current contents of v6 in -mm from next-20160106 with
> this v7 set and it looks good.
> 
> Reported-by: Dan Williams <dan.j.williams@intel.com>
> Tested-by: Dan Williams <dan.j.williams@intel.com>
> 
> One question below...
> 
> > ---
> >  fs/dax.c | 20 ++++++++++----------
> >  1 file changed, 10 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 03cc4a3..9dc0c97 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -594,6 +594,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >         bool write = flags & FAULT_FLAG_WRITE;
> >         struct block_device *bdev;
> >         pgoff_t size, pgoff;
> > +       loff_t lstart, lend;
> >         sector_t block;
> >         int result = 0;
> >
> > @@ -647,15 +648,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >                 goto fallback;
> >         }
> >
> > -       /*
> > -        * If we allocated new storage, make sure no process has any
> > -        * zero pages covering this hole
> > -        */
> > -       if (buffer_new(&bh)) {
> > -               i_mmap_unlock_read(mapping);
> > -               unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> > -               i_mmap_lock_read(mapping);
> > -       }
> > +       /* make sure no process has any zero pages covering this hole */
> > +       lstart = pgoff << PAGE_SHIFT;
> > +       lend = lstart + PMD_SIZE - 1; /* inclusive */
> > +       i_mmap_unlock_read(mapping);
> > +       unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> > +       truncate_inode_pages_range(mapping, lstart, lend);
> 
> Do we need to do both unmap and truncate given that
> truncate_inode_page() optionally does an unmap_mapping_range()
> internally?

Ah, indeed it does.  Sure, having just the call to truncate_inode_page() seems
cleaner.  I'll re-test and send this out in v8.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 1/9] dax: fix NULL pointer dereference in __dax_dbg()
  2016-01-07  9:34     ` Jan Kara
  2016-01-07 15:17       ` Dan Williams
@ 2016-01-07 23:10       ` Dave Chinner
  2016-01-07 23:39         ` Ross Zwisler
  1 sibling, 1 reply; 22+ messages in thread
From: Dave Chinner @ 2016-01-07 23:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, Ross Zwisler, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dave Hansen, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm@lists.01.org, X86 ML, XFS Developers

On Thu, Jan 07, 2016 at 10:34:02AM +0100, Jan Kara wrote:
> On Wed 06-01-16 11:14:09, Dan Williams wrote:
> > On Wed, Jan 6, 2016 at 10:00 AM, Ross Zwisler
> > <ross.zwisler@linux.intel.com> wrote:
> > > __dax_dbg() currently assumes that bh->b_bdev is non-NULL, passing it into
> > > bdevname() where it is dereferenced.  This assumption isn't always true -
> > > when called for reads of holes, ext4_dax_mmap_get_block() returns a buffer
> > > head where bh->b_bdev is never set.  I hit this BUG while testing the DAX
> > > PMD fault path.
> > >
> > > Instead, verify that we have a valid bh->b_bdev, else just say "unknown"
> > > for the block device.
> > >
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > ---
> > >  fs/dax.c | 7 ++++++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 7af8797..03cc4a3 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -563,7 +563,12 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
> > >  {
> > >         if (bh) {
> > >                 char bname[BDEVNAME_SIZE];
> > > -               bdevname(bh->b_bdev, bname);
> > > +
> > > +               if (bh->b_bdev)
> > > +                       bdevname(bh->b_bdev, bname);
> > > +               else
> > > +                       snprintf(bname, BDEVNAME_SIZE, "unknown");
> > > +
> > >                 pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
> > >                         "length %zd fallback: %s\n", fn, current->comm,
> > >                         address, bname, bh->b_state, (u64)bh->b_blocknr,
> > 
> > I'm assuming there's no danger of such a buffer_head ever being used
> > for the bdev parameter to dax_map_atomic()?  Shouldn't we also/instead
> > go fix ext4 to not send partially filled buffer_heads?
> 
> No. The real problem is a long-standing abuse of struct buffer_head to be
> used for passing block mapping information (it's on my todo list to remove
> that at least from DAX code and use cleaner block mapping interface but
> first I want basic DAX functionality to settle down to avoid unnecessary
> conflicts). Filesystem is not supposed to touch bh->b_bdev.

That has not been true for a long, long time. e.g. XFS always
rewrites bh->b_bdev in get_blocks because the file may not reside on
the primary block device of the filesystem. i.e.:

        /*
         * If this is a realtime file, data may be on a different device.
         * to that pointed to from the buffer_head b_bdev currently.
         */
        bh_result->b_bdev = xfs_find_bdev_for_inode(inode);

> If you need
> that filled in, set it yourself before passing bh to the block mapping
> function.

That may be true, but we cannot assume that the bdev coming back
out of get_block is the same one that was passed in.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 1/9] dax: fix NULL pointer dereference in __dax_dbg()
  2016-01-07 23:10       ` Dave Chinner
@ 2016-01-07 23:39         ` Ross Zwisler
  0 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2016-01-07 23:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Dan Williams, Ross Zwisler, linux-kernel,
	H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Andrew Morton, Dave Hansen,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm@lists.01.org, X86 ML, XFS Developers

On Fri, Jan 08, 2016 at 10:10:00AM +1100, Dave Chinner wrote:
> On Thu, Jan 07, 2016 at 10:34:02AM +0100, Jan Kara wrote:
> > On Wed 06-01-16 11:14:09, Dan Williams wrote:
> > > On Wed, Jan 6, 2016 at 10:00 AM, Ross Zwisler
> > > <ross.zwisler@linux.intel.com> wrote:
> > > > __dax_dbg() currently assumes that bh->b_bdev is non-NULL, passing it into
> > > > bdevname() where it is dereferenced.  This assumption isn't always true -
> > > > when called for reads of holes, ext4_dax_mmap_get_block() returns a buffer
> > > > head where bh->b_bdev is never set.  I hit this BUG while testing the DAX
> > > > PMD fault path.
> > > >
> > > > Instead, verify that we have a valid bh->b_bdev, else just say "unknown"
> > > > for the block device.
> > > >
> > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > > ---
> > > >  fs/dax.c | 7 ++++++-
> > > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > index 7af8797..03cc4a3 100644
> > > > --- a/fs/dax.c
> > > > +++ b/fs/dax.c
> > > > @@ -563,7 +563,12 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
> > > >  {
> > > >         if (bh) {
> > > >                 char bname[BDEVNAME_SIZE];
> > > > -               bdevname(bh->b_bdev, bname);
> > > > +
> > > > +               if (bh->b_bdev)
> > > > +                       bdevname(bh->b_bdev, bname);
> > > > +               else
> > > > +                       snprintf(bname, BDEVNAME_SIZE, "unknown");
> > > > +
> > > >                 pr_debug("%s: %s addr: %lx dev %s state %lx start %lld "
> > > >                         "length %zd fallback: %s\n", fn, current->comm,
> > > >                         address, bname, bh->b_state, (u64)bh->b_blocknr,
> > > 
> > > I'm assuming there's no danger of such a buffer_head ever being used
> > > for the bdev parameter to dax_map_atomic()?  Shouldn't we also/instead
> > > go fix ext4 to not send partially filled buffer_heads?
> > 
> > No. The real problem is a long-standing abuse of struct buffer_head to be
> > used for passing block mapping information (it's on my todo list to remove
> > that at least from DAX code and use cleaner block mapping interface but
> > first I want basic DAX functionality to settle down to avoid unnecessary
> > conflicts). Filesystem is not supposed to touch bh->b_bdev.
> 
> That has not been true for a long, long time. e.g. XFS always
> rewrites bh->b_bdev in get_blocks because the file may not reside on
> the primary block device of the filesystem. i.e.:
> 
>         /*
>          * If this is a realtime file, data may be on a different device.
>          * to that pointed to from the buffer_head b_bdev currently.
>          */
>         bh_result->b_bdev = xfs_find_bdev_for_inode(inode);
> 
> > If you need
> > that filled in, set it yourself before passing bh to the block mapping
> > function.
> 
> That may be true, but we cannot assume that the bdev coming back
> out of get_block is the same one that was passed in.

For our use case I think this is fine - we just need the bdev to be filled in
so that we can print reasonable error messages.  If the filesystem updates
bh->b_bdev during get_blocks(), we are fine with that.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/9] dax: fix conversion of holes to PMDs
  2016-01-07 22:34     ` Ross Zwisler
@ 2016-01-08  4:18       ` Ross Zwisler
  0 siblings, 0 replies; 22+ messages in thread
From: Ross Zwisler @ 2016-01-08  4:18 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Andrew Morton, Dave Chinner, Dave Hansen,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm@lists.01.org, X86 ML, XFS Developers

On Thu, Jan 07, 2016 at 03:34:55PM -0700, Ross Zwisler wrote:
> On Wed, Jan 06, 2016 at 11:04:35AM -0800, Dan Williams wrote:
> > On Wed, Jan 6, 2016 at 10:00 AM, Ross Zwisler
> > <ross.zwisler@linux.intel.com> wrote:
> > > When we get a DAX PMD fault for a write it is possible that there could be
> > > some number of 4k zero pages already present for the same range that were
> > > inserted to service reads from a hole.  These 4k zero pages need to be
> > > unmapped from the VMAs and removed from the struct address_space radix tree
> > > before the real DAX PMD entry can be inserted.
> > >
> > > For PTE faults this same use case also exists and is handled by a
> > > combination of unmap_mapping_range() to unmap the VMAs and
> > > delete_from_page_cache() to remove the page from the address_space radix
> > > tree.
> > >
> > > For PMD faults we do have a call to unmap_mapping_range() (protected by a
> > > buffer_new() check), but nothing clears out the radix tree entry.  The
> > > buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> > > code will never return a buffer_head with BH_New set, even when allocating
> > > new blocks over a hole.  Instead the filesystem will zero the blocks
> > > manually and return a buffer_head with only BH_Mapped set.
> > >
> > > Fix this situation by removing the buffer_new() check and adding a call to
> > > truncate_inode_pages_range() to clear out the radix tree entries before we
> > > insert the DAX PMD.
> > >
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > 
> > Replaced the current contents of v6 in -mm from next-20160106 with
> > this v7 set and it looks good.
> > 
> > Reported-by: Dan Williams <dan.j.williams@intel.com>
> > Tested-by: Dan Williams <dan.j.williams@intel.com>
> > 
> > One question below...
> > 
> > > ---
> > >  fs/dax.c | 20 ++++++++++----------
> > >  1 file changed, 10 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 03cc4a3..9dc0c97 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -594,6 +594,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > >         bool write = flags & FAULT_FLAG_WRITE;
> > >         struct block_device *bdev;
> > >         pgoff_t size, pgoff;
> > > +       loff_t lstart, lend;
> > >         sector_t block;
> > >         int result = 0;
> > >
> > > @@ -647,15 +648,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > >                 goto fallback;
> > >         }
> > >
> > > -       /*
> > > -        * If we allocated new storage, make sure no process has any
> > > -        * zero pages covering this hole
> > > -        */
> > > -       if (buffer_new(&bh)) {
> > > -               i_mmap_unlock_read(mapping);
> > > -               unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> > > -               i_mmap_lock_read(mapping);
> > > -       }
> > > +       /* make sure no process has any zero pages covering this hole */
> > > +       lstart = pgoff << PAGE_SHIFT;
> > > +       lend = lstart + PMD_SIZE - 1; /* inclusive */
> > > +       i_mmap_unlock_read(mapping);
> > > +       unmap_mapping_range(mapping, lstart, PMD_SIZE, 0);
> > > +       truncate_inode_pages_range(mapping, lstart, lend);
> > 
> > Do we need to do both unmap and truncate given that
> > truncate_inode_page() optionally does an unmap_mapping_range()
> > internally?
> 
> Ah, indeed it does.  Sure, having just the call to truncate_inode_page() seems
> cleaner.  I'll re-test and send this out in v8.

Actually, in testing it doesn't look like unmap_mapping_range() in
truncate_inode_page() gets called.  We fail the page_mapped(page) check for
our read-only zero pages.  I think we need to keep the unmap_mapping_range()
call in __dax_pmd_fault().
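
For reference, the check being failed is roughly this (paraphrasing
mm/truncate.c as of this kernel; details may differ):

	int truncate_inode_page(struct address_space *mapping, struct page *page)
	{
		/* the read-only zero pages DAX uses for holes fail this
		 * check in testing, so the internal unmap never runs */
		if (page_mapped(page)) {
			unmap_mapping_range(mapping,
					(loff_t)page->index << PAGE_CACHE_SHIFT,
					PAGE_CACHE_SIZE, 0);
		}
		return truncate_complete_page(mapping, page);
	}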

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v7 2/9] dax: fix conversion of holes to PMDs
  2016-01-07 22:11     ` Ross Zwisler
@ 2016-01-11 12:23       ` Jan Kara
  0 siblings, 0 replies; 22+ messages in thread
From: Jan Kara @ 2016-01-11 12:23 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Andrew Morton,
	Dan Williams, Dave Chinner, Dave Hansen, Ingo Molnar, Jan Kara,
	Jeff Layton, Matthew Wilcox, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs

On Thu 07-01-16 15:11:14, Ross Zwisler wrote:
> On Thu, Jan 07, 2016 at 02:22:06PM +0100, Jan Kara wrote:
> > On Wed 06-01-16 11:00:56, Ross Zwisler wrote:
> > > When we get a DAX PMD fault for a write it is possible that there could be
> > > some number of 4k zero pages already present for the same range that were
> > > inserted to service reads from a hole.  These 4k zero pages need to be
> > > unmapped from the VMAs and removed from the struct address_space radix tree
> > > before the real DAX PMD entry can be inserted.
> > > 
> > > For PTE faults this same use case also exists and is handled by a
> > > combination of unmap_mapping_range() to unmap the VMAs and
> > > delete_from_page_cache() to remove the page from the address_space radix
> > > tree.
> > > 
> > > For PMD faults we do have a call to unmap_mapping_range() (protected by a
> > > buffer_new() check), but nothing clears out the radix tree entry.  The
> > > buffer_new() check is also incorrect as the current ext4 and XFS filesystem
> > > code will never return a buffer_head with BH_New set, even when allocating
> > > new blocks over a hole.  Instead the filesystem will zero the blocks
> > > manually and return a buffer_head with only BH_Mapped set.
> > > 
> > > Fix this situation by removing the buffer_new() check and adding a call to
> > > truncate_inode_pages_range() to clear out the radix tree entries before we
> > > insert the DAX PMD.
> > 
> > Ho, hum, let me understand this. So we have a file, different processes are
> > mapping it. One process maps it with normal page granularity and another
> > process with huge page granularity. Thus when the first process read-faults
> > a few normal pages and then the second process write-faults the huge page
> > in the same range, we have a problem. Do I understand this correctly?
> > Because otherwise I don't understand how a single page table can have both
> > huge page and normal page in the same range...
> 
> I don't think that it necessarily has to do with multiple threads.  The bit to
> notice here is we *always* use 4k zero pages to cover holes.  So, a single
> thread can hit this condition by doing some reads from a hole (insert 4k
> pages), then doing a write.  This write is the first time that we will try and
> use real DAX storage to insert into the page tables, and we may end up getting
> a PMD.  This means that we need to clear out all the 4k pages that we inserted
> while reading holes in this same range, now that we have a 2M segment
> allocated by the filesystem and the entire range is no longer a hole.

OK, I see. Thanks for explanation.

> > And if this is indeed the problem then what prevents the unmapping and
> > truncation in the huge page fault from racing with mapping the same range again
> > using small pages? Sure now blocks are allocated so the mapping itself will
> > be consistent but radix tree will have the same issues it had before this
> > patch, won't it?
> 
> Yep, this is a separate issue, but I think that we handle this case
> successfully, though we may end up flushing the same address multiple times.
> Once the filesystem has established a block mapping (assuming we avoid the
> race described below where one thread is mapping in holes and the other sees a
> block allocation), I think we are okay.  It's true that one thread can map in
> PMDs, and another thread could potentially map in PTEs that cover the same
> range if they are working with mmaps that are smaller than a PMD, but the
> sectors inserted into the radix tree by each of those threads will be
> individually correct - the only issue is that they may overlap.
> 
> Say, for example you have the following:
> 
> CPU1 - process 1				CPU2 - process 2
> mmap for sector 0, size 2M
> insert PMD into radix tree for sector 0
>   This radix tree covers sectors 0-4096
> 						mmap for sector 32, size 4k
> 						insert PTE entry into radix
> 						tree for sector 32
> 
> In this case a fsync of the fd by process 1 will end up flushing sector 32
> twice, which is correct but inefficient.  I think we can make this more
> efficient by adjusting the insertion code and dirtying code in
> dax_radix_entry() to look for PMDs that cover this same range.

Yes, this is what I ended up with as well. So we are in agreement here.

> > ... thinking some more about this ...
> > 
> > OK, there is some difference - we will only have DAX exceptional entries
> > for the range covered by huge page and those we replace properly in
> > dax_radix_entry() code. So things are indeed fine *except* that nothing
> > seems to serialize dax_load_hole() with the PMD fault. A race like the following
> > seems possible:
> > 
> > CPU1 - process 1		CPU2 - process 2
> > 
> > __dax_fault() - file f, index 1
> >   get_block() -> returns hole
> > 				__dax_pmd_fault() - file f, index 0
> > 				  get_block() -> allocates blocks
> > 				  ...
> > 				  truncate_pagecache_range()
> >   dax_load_hole()
> > 
> > Boom, we have hole page instantiated for allocated range (data corruption)
> > and corruption of radix tree entries as well. Actually this problem is
> > there even for two different processes doing normal page faults (one read,
> > one write) against the same page in the file.
> 
> Yea, I agree, this seems like an existing issue that you could hit with just
> the PTE path:
> 
> CPU1 - process 1		CPU2 - process 2
> 
> __dax_fault() - read file f, index 0
>   get_block() -> returns hole
> 				__dax_fault() - write file f, index 0
> 				  get_block() -> allocates blocks
> 				  ...
> 				  skips unmap_mapping_range() and
> 				    delete_from_page_cache() because it didn't
> 				    find a page for this pgoff
> 				  dax_insert_mapping()
>   dax_load_hole()
>   *data corruption*
> 
> > ... thinking about possible fixes ...
> > 
> > So we need some exclusion that makes sure pgoff->block mapping information
> > is uptodate at the moment we insert it into page tables. The simplest
> > reasonably fast thing I can see is:
> > 
> > When handling a read fault, things stay as is and filesystem protects the
> > fault with an equivalent of EXT4_I(inode)->i_mmap_sem held for reading. When
> > handling a write fault we first grab EXT4_I(inode)->i_mmap_sem for reading
> > and try a read fault. If __dax_fault() sees a hole returned from
> > get_blocks() during a write fault, it bails out. Filesystem grabs
> > EXT4_I(inode)->i_mmap_sem for writing and retries with different
> > get_blocks() callback which will allocate blocks. That way we get proper
> > exclusion for faults needing to allocate blocks. Thoughts?
> 
> I think this would work.  ext4, ext2 and xfs all handle their exclusion with
> rw_semaphores, so this should work for each of them, I think.  Thanks for the
> problem statement & solution!  :) 
> 
> I guess our best course is to make sure that we don't make this existing
> problem worse via the fsync/msync patches by handling the error gracefully,
> and fix this for v4.6.  I do feel the need to point out that this is a
> pre-existing issue with DAX, and that my fsync patches just happened to help
> us find it.  They don't make the situation any better or any worse, and I
> really hope this issue doesn't end up blocking the fsync/msync patches from
> getting merged for v4.5.

Yeah, I agree this is a preexisting problem and mostly independent of your
fsync series, so it can be dealt with once the fsync series lands. The only
place where the two meet is the locking - you will have an exclusive hold on
the inode and all of its mappings when doing a write fault that allocates a
block. That may save you some dances with i_mmap_lock (but I didn't think
about this too much).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2016-01-11 12:23 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-06 18:00 [PATCH v7 0/9] DAX fsync/msync support Ross Zwisler
2016-01-06 18:00 ` [PATCH v7 1/9] dax: fix NULL pointer dereference in __dax_dbg() Ross Zwisler
2016-01-06 19:14   ` Dan Williams
2016-01-07  9:34     ` Jan Kara
2016-01-07 15:17       ` Dan Williams
2016-01-07 22:16         ` Ross Zwisler
2016-01-07 23:10       ` Dave Chinner
2016-01-07 23:39         ` Ross Zwisler
2016-01-06 18:00 ` [PATCH v7 2/9] dax: fix conversion of holes to PMDs Ross Zwisler
2016-01-06 19:04   ` Dan Williams
2016-01-07 22:34     ` Ross Zwisler
2016-01-08  4:18       ` Ross Zwisler
2016-01-07 13:22   ` Jan Kara
2016-01-07 22:11     ` Ross Zwisler
2016-01-11 12:23       ` Jan Kara
2016-01-06 18:00 ` [PATCH v7 3/9] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
2016-01-06 18:00 ` [PATCH v7 4/9] dax: support dirty DAX entries in radix tree Ross Zwisler
2016-01-06 18:00 ` [PATCH v7 5/9] mm: add find_get_entries_tag() Ross Zwisler
2016-01-06 18:01 ` [PATCH v7 6/9] dax: add support for fsync/msync Ross Zwisler
2016-01-06 18:01 ` [PATCH v7 7/9] ext2: call dax_pfn_mkwrite() for DAX fsync/msync Ross Zwisler
2016-01-06 18:01 ` [PATCH v7 8/9] ext4: " Ross Zwisler
2016-01-06 18:01 ` [PATCH v7 9/9] xfs: " Ross Zwisler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).