linux-kernel.vger.kernel.org archive mirror
* [RFC 00/11] DAX fsync/msync support
@ 2015-10-29 20:12 Ross Zwisler
  2015-10-29 20:12 ` [RFC 01/11] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
                   ` (13 more replies)
  0 siblings, 14 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 20:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

This patch series adds support for fsync/msync to DAX.

Patches 1 through 8 add various utilities that the DAX code will eventually
need, and the DAX code itself is added by patch 9.  Patches 10 and 11 are
filesystem changes that are needed after the DAX code is added, but these
patches may change slightly as the filesystem fault handling for DAX is
being modified ([1] and [2]).

I've marked this series as RFC because I'm still testing, but I wanted to
get this out there so people would see the direction I was going and
hopefully comment on any big red flags sooner rather than later.

I realize that we are getting pretty dang close to the v4.4 merge window,
but I think that if we can get this reviewed and working it's a much better
solution than the "big hammer" approach that blindly flushes entire PMEM
namespaces [3].

[1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html
[2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2
[3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html
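
For reference, the userspace pattern this series is meant to make safe is
the usual store-then-msync sequence on a DAX mmap.  A minimal sketch
(hypothetical file path, error handling omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical file on a filesystem mounted with -o dax */
	int fd = open("/mnt/pmem/file", O_RDWR);
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	memcpy(p, "data", 4);	/* stores go directly to pmem via DAX */

	/*
	 * Without this series msync()/fsync() do not write back the CPU
	 * cache for DAX mappings, so the stores above may not be durable.
	 */
	msync(p, 4096, MS_SYNC);

	munmap(p, 4096);
	close(fd);
	return 0;
}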

Ross Zwisler (11):
  pmem: add wb_cache_pmem() to the PMEM API
  mm: add pmd_mkclean()
  pmem: enable REQ_FLUSH handling
  dax: support dirty DAX entries in radix tree
  mm: add follow_pte_pmd()
  mm: add pgoff_mkclean()
  mm: add find_get_entries_tag()
  fs: add get_block() to struct inode_operations
  dax: add support for fsync/sync
  xfs, ext2: call dax_pfn_mkwrite() on write fault
  ext4: add ext4_dax_pfn_mkwrite()

 arch/x86/include/asm/pgtable.h |   5 ++
 arch/x86/include/asm/pmem.h    |  11 +--
 drivers/nvdimm/pmem.c          |   3 +-
 fs/dax.c                       | 161 +++++++++++++++++++++++++++++++++++++++--
 fs/ext2/file.c                 |   5 +-
 fs/ext4/file.c                 |  23 +++++-
 fs/inode.c                     |   1 +
 fs/xfs/xfs_file.c              |   9 ++-
 fs/xfs/xfs_iops.c              |   1 +
 include/linux/dax.h            |   6 ++
 include/linux/fs.h             |   5 +-
 include/linux/mm.h             |   2 +
 include/linux/pagemap.h        |   3 +
 include/linux/pmem.h           |  22 +++++-
 include/linux/radix-tree.h     |   3 +
 include/linux/rmap.h           |   5 ++
 mm/filemap.c                   |  73 ++++++++++++++++++-
 mm/huge_memory.c               |  14 ++--
 mm/memory.c                    |  41 +++++++++--
 mm/page-writeback.c            |   9 +++
 mm/rmap.c                      |  53 ++++++++++++++
 mm/truncate.c                  |   5 +-
 22 files changed, 418 insertions(+), 42 deletions(-)

-- 
2.1.0


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC 01/11] pmem: add wb_cache_pmem() to the PMEM API
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
@ 2015-10-29 20:12 ` Ross Zwisler
  2015-10-29 20:12 ` [RFC 02/11] mm: add pmd_mkclean() Ross Zwisler
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 20:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

The function __arch_wb_cache_pmem() was already an internal implementation
detail of the x86 PMEM API, but this functionality needs to be exported as
part of the general PMEM API to handle the fsync/msync case for DAX mmaps.
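
As a rough illustration (not part of this patch), a caller flushing a range
of a DAX mapping would be expected to pair the newly exported call with
wmb_pmem() along these lines:

/*
 * Sketch only: write back a dirty pmem range and make it durable.
 * Assumes 'addr' was obtained from something like bdev_direct_access().
 */
static void example_flush_pmem_range(void __pmem *addr, size_t size)
{
	/* write back any dirty cachelines covering [addr, addr + size) */
	wb_cache_pmem(addr, size);

	/* fence/commit so the flushed data is durable on media */
	wmb_pmem();
}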

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 arch/x86/include/asm/pmem.h | 11 ++++++-----
 include/linux/pmem.h        | 22 +++++++++++++++++++++-
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec..6c7ade0 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void)
 }
 
 /**
- * __arch_wb_cache_pmem - write back a cache range with CLWB
+ * arch_wb_cache_pmem - write back a cache range with CLWB
  * @vaddr:	virtual start address
  * @size:	number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
  * instruction.  This function requires explicit ordering with an
- * arch_wmb_pmem() call.  This API is internal to the x86 PMEM implementation.
+ * arch_wmb_pmem() call.
  */
-static inline void __arch_wb_cache_pmem(void *vaddr, size_t size)
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
 {
 	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
 	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vaddr = (void __force *)addr;
 	void *vend = vaddr + size;
 	void *p;
 
@@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
 	len = copy_from_iter_nocache(vaddr, bytes, i);
 
 	if (__iter_needs_pmem_wb(i))
-		__arch_wb_cache_pmem(vaddr, bytes);
+		arch_wb_cache_pmem(addr, bytes);
 
 	return len;
 }
@@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 	else
 		memset(vaddr, 0, size);
 
-	__arch_wb_cache_pmem(vaddr, size);
+	arch_wb_cache_pmem(addr, size);
 }
 
 static inline bool __arch_has_wmb_pmem(void)
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 85f810b3..2cd5003 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	BUG();
 }
+
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	BUG();
+}
 #endif
 
 /*
  * Architectures that define ARCH_HAS_PMEM_API must provide
  * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
- * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
+ * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem()
+ * and arch_has_wmb_pmem().
  */
 static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
 {
@@ -202,4 +208,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size)
 	else
 		default_clear_pmem(addr, size);
 }
+
+/**
+ * wb_cache_pmem - write back processor cache for PMEM memory range
+ * @addr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back the processor cache range starting at 'addr' for 'size' bytes.
+ * This function requires explicit ordering with a wmb_pmem() call.
+ */
+static inline void wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	if (arch_has_pmem_api())
+		arch_wb_cache_pmem(addr, size);
+}
 #endif /* __PMEM_H__ */
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 02/11] mm: add pmd_mkclean()
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
  2015-10-29 20:12 ` [RFC 01/11] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
@ 2015-10-29 20:12 ` Ross Zwisler
  2015-10-29 20:12 ` [RFC 03/11] pmem: enable REQ_FLUSH handling Ross Zwisler
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 20:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

Currently PMD pages can be dirtied via pmd_mkdirty(), but cannot be
cleaned.  For DAX mmap dirty page tracking we need to be able to clean PMD
pages when we flush them to media so that we get a new write fault the next
time they are written to.
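
For illustration only, the intended use later in this series is to pair it
with write protection when cleaning a flushed huge page mapping, roughly:

/*
 * Sketch of the intended use (see pgoff_mkclean() later in the series):
 * clean and write-protect a huge page PMD so the next store to it takes
 * a fresh write fault.
 */
static void example_clean_huge_pmd(struct vm_area_struct *vma,
				   unsigned long addr, pmd_t *pmdp)
{
	pmd_t pmd = pmdp_huge_clear_flush(vma, addr, pmdp);

	pmd = pmd_wrprotect(pmd);
	pmd = pmd_mkclean(pmd);
	set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
}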

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 867da5b..c548e4c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -277,6 +277,11 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
 }
 
+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+}
+
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_PSE);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 03/11] pmem: enable REQ_FLUSH handling
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
  2015-10-29 20:12 ` [RFC 01/11] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
  2015-10-29 20:12 ` [RFC 02/11] mm: add pmd_mkclean() Ross Zwisler
@ 2015-10-29 20:12 ` Ross Zwisler
  2015-10-29 20:12 ` [RFC 04/11] dax: support dirty DAX entries in radix tree Ross Zwisler
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 20:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

Currently the PMEM driver doesn't accept REQ_FLUSH bios.  These are sent
down via blkdev_issue_flush() in response to an fsync() or msync().

When we get an msync() or fsync() it is the responsibility of the DAX code
to flush all dirty pages to media.  The PMEM driver then just has to issue
a wmb_pmem() in response to the REQ_FLUSH to ensure that all of the flushed
data has been durably stored on the media before we return.
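
For context, the REQ_FLUSH arrives via the normal block flush path; a
simplified sketch of how a filesystem's sync path ends up triggering the
wmb_pmem() added here (illustrative only, not part of this patch):

/*
 * Sketch: the tail of a typical fsync implementation.  The empty
 * REQ_FLUSH bio submitted by blkdev_issue_flush() is what the pmem
 * driver now answers with wmb_pmem().
 */
static int example_commit_to_media(struct block_device *bdev)
{
	return blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
}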

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/nvdimm/pmem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0ba6a97..e1e222e 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -80,7 +80,7 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 	if (do_acct)
 		nd_iostat_end(bio, start);
 
-	if (bio_data_dir(bio))
+	if (bio_data_dir(bio) || (bio->bi_rw & REQ_FLUSH))
 		wmb_pmem();
 
 	bio_endio(bio);
@@ -189,6 +189,7 @@ static int pmem_attach_disk(struct device *dev,
 	blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE);
 	blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
 	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+	blk_queue_flush(pmem->pmem_queue, REQ_FLUSH);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);
 
 	disk = alloc_disk(0);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 04/11] dax: support dirty DAX entries in radix tree
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
                   ` (2 preceding siblings ...)
  2015-10-29 20:12 ` [RFC 03/11] pmem: enable REQ_FLUSH handling Ross Zwisler
@ 2015-10-29 20:12 ` Ross Zwisler
  2015-10-29 20:12 ` [RFC 05/11] mm: add follow_pte_pmd() Ross Zwisler
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 20:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new exceptional
entries into the radix tree that represent dirty DAX PTE or PMD pages.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  There
shouldn't be any collisions between these various exceptional entries
because only one type of exceptional entry should be found in a given radix
tree at a time, depending on how the mapping is being used.
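
To make the encoding concrete, the new values (added below to
radix-tree.h) carry the exceptional-entry bit just like the shmem swap and
shadow entries, so existing lookups can detect them.  A minimal sketch of
distinguishing them (illustrative only):

/*
 * Sketch: classify a value pulled out of mapping->page_tree.  The DAX
 * values are the new exceptional entries defined by this patch.
 */
static bool example_entry_is_dax(void *entry)
{
	if (!radix_tree_exceptional_entry(entry))
		return false;		/* ordinary struct page pointer */

	return entry == RADIX_TREE_DAX_PTE || entry == RADIX_TREE_DAX_PMD;
}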

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/inode.c                 |  1 +
 include/linux/dax.h        |  5 +++++
 include/linux/fs.h         |  1 +
 include/linux/radix-tree.h |  3 +++
 mm/filemap.c               | 12 ++++++++----
 mm/truncate.c              |  5 +++--
 6 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 78a17b8..f7c87a6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -496,6 +496,7 @@ void clear_inode(struct inode *inode)
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
 	BUG_ON(inode->i_data.nrshadows);
+	BUG_ON(inode->i_data.nrdax);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..e9d57f68 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 }
+
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..f791698 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -433,6 +433,7 @@ struct address_space {
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	unsigned long		nrshadows;	/* number of shadow entries */
+	unsigned long		nrdax;	        /* number of DAX entries */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 33170db..fabec66 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,6 +51,9 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
+#define RADIX_TREE_DAX_PTE  ((void *)(0x10 | RADIX_TREE_EXCEPTIONAL_ENTRY))
+#define RADIX_TREE_DAX_PMD  ((void *)(0x20 | RADIX_TREE_EXCEPTIONAL_ENTRY))
+
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
diff --git a/mm/filemap.c b/mm/filemap.c
index 327910c..c3a9e4f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -11,6 +11,7 @@
  */
 #include <linux/export.h>
 #include <linux/compiler.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
 #include <linux/capability.h>
@@ -440,7 +441,7 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 {
 	int err = 0;
 
-	if (mapping->nrpages) {
+	if (mapping->nrpages || mapping->nrdax) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
 		/* See comment of filemap_write_and_wait() */
@@ -538,6 +539,9 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
+
+		BUG_ON(dax_mapping(mapping));
+
 		if (shadowp)
 			*shadowp = p;
 		mapping->nrshadows--;
@@ -1201,9 +1205,9 @@ repeat:
 			if (radix_tree_deref_retry(page))
 				goto restart;
 			/*
-			 * A shadow entry of a recently evicted page,
-			 * or a swap entry from shmem/tmpfs.  Return
-			 * it without attempting to raise page count.
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
 			 */
 			goto export;
 		}
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad..cdf44a0 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -9,6 +9,7 @@
 
 #include <linux/kernel.h>
 #include <linux/backing-dev.h>
+#include <linux/dax.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/swap.h>
@@ -29,8 +30,8 @@ static void clear_exceptional_entry(struct address_space *mapping,
 	struct radix_tree_node *node;
 	void **slot;
 
-	/* Handled by shmem itself */
-	if (shmem_mapping(mapping))
+	/* Handled by shmem or DAX directly */
+	if (shmem_mapping(mapping) || dax_mapping(mapping))
 		return;
 
 	spin_lock_irq(&mapping->tree_lock);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 05/11] mm: add follow_pte_pmd()
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
                   ` (3 preceding siblings ...)
  2015-10-29 20:12 ` [RFC 04/11] dax: support dirty DAX entries in radix tree Ross Zwisler
@ 2015-10-29 20:12 ` Ross Zwisler
  2015-10-29 20:12 ` [RFC 06/11] mm: add pgoff_mkclean() Ross Zwisler
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 20:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a
huge page PMD leaf to be found and returned.
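
The calling convention mirrors follow_pte(): on success exactly one of the
PTE/PMD pointers is filled in and the matching lock is held.  A usage
sketch (illustrative only):

/*
 * Sketch: look up the leaf entry that maps 'address'.  On success either
 * ptep or pmdp is set, and *ptl is held and must be dropped by us.
 */
static void example_lookup(struct mm_struct *mm, unsigned long address)
{
	pte_t *ptep = NULL;
	pmd_t *pmdp = NULL;
	spinlock_t *ptl;

	if (follow_pte_pmd(mm, address, &ptep, &pmdp, &ptl))
		return;			/* nothing mapped at this address */

	if (pmdp)
		spin_unlock(ptl);		/* huge page PMD leaf */
	else
		pte_unmap_unlock(ptep, ptl);	/* normal PTE leaf */
}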

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 41 +++++++++++++++++++++++++++++++++--------
 2 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80001de..393441c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1166,6 +1166,8 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
+int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+			     pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index deb679c..c2b8c0a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3512,8 +3512,8 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-static int __follow_pte(struct mm_struct *mm, unsigned long address,
-		pte_t **ptepp, spinlock_t **ptlp)
+static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+		pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -3529,12 +3529,23 @@ static int __follow_pte(struct mm_struct *mm, unsigned long address,
 		goto out;
 
 	pmd = pmd_offset(pud, address);
-	VM_BUG_ON(pmd_trans_huge(*pmd));
-	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
-		goto out;
 
-	/* We cannot handle huge page PFN maps. Luckily they don't exist. */
-	if (pmd_huge(*pmd))
+	if (pmd_huge(*pmd)) {
+		if (!pmdpp)
+			goto out;
+
+		*ptlp = pmd_lock(mm, pmd);
+		if (pmd_huge(*pmd)) {
+			/* Success, we found a large PTE */
+			*pmdpp = pmd;
+			return 0;
+		}
+		/* Somebody removed the PMD entry, try it as a pte */
+		spin_unlock(*ptlp);
+	}
+
+	/* FIXME: pmd_bad() is sometimes set for DAX pmds? */
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		goto out;
 
 	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
@@ -3557,9 +3568,23 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address,
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte(mm, address, ptepp, ptlp)));
+			   !(res = __follow_pte_pmd(mm, address, ptepp, NULL,
+					   ptlp)));
+	return res;
+}
+
+int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+			     pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
+{
+	int res;
+
+	/* (void) is needed to make gcc happy */
+	(void) __cond_lock(*ptlp,
+			   !(res = __follow_pte_pmd(mm, address, ptepp, pmdpp,
+					   ptlp)));
 	return res;
 }
+EXPORT_SYMBOL(follow_pte_pmd);
 
 /**
  * follow_pfn - look up PFN at a user virtual address
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 06/11] mm: add pgoff_mkclean()
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
                   ` (4 preceding siblings ...)
  2015-10-29 20:12 ` [RFC 05/11] mm: add follow_pte_pmd() Ross Zwisler
@ 2015-10-29 20:12 ` Ross Zwisler
  2015-10-29 20:12 ` [RFC 07/11] mm: add find_get_entries_tag() Ross Zwisler
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 20:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

Introduce pgoff_mkclean(), which is conceptually similar to page_mkclean()
except that it works in the absence of struct page and can also be used to
clean PMDs.  This is needed for DAX's dirty page handling.
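
For orientation, the intended ordering in the caller (added later in this
series) is to flush the data first and only then clean the mappings,
roughly:

/*
 * Sketch of the intended ordering (see the fsync patch later in the
 * series): flush the range to media, then write-protect/clean every
 * mapping of the offset so a later store re-dirties the radix tree.
 */
static int example_flush_then_clean(struct address_space *mapping,
				    pgoff_t pgoff, void __pmem *addr,
				    size_t size)
{
	wb_cache_pmem(addr, size);		/* from patch 1 */
	return pgoff_mkclean(pgoff, mapping);	/* added here */
}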

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/rmap.h |  5 +++++
 mm/rmap.c            | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 29446ae..627875f9 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -223,6 +223,11 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 int page_mkclean(struct page *);
 
 /*
+ * Cleans and write protects the PTEs of shared mappings.
+ */
+int pgoff_mkclean(pgoff_t, struct address_space *);
+
+/*
  * called in munlock()/munmap() path to check for other vmas holding
  * the page mlocked.
  */
diff --git a/mm/rmap.c b/mm/rmap.c
index f5b5c1f..0ce16ab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -586,6 +586,16 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	return address;
 }
 
+static inline unsigned long
+pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
+{
+	unsigned long address;
+
+	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	return address;
+}
+
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 static void percpu_flush_tlb_batch_pages(void *data)
 {
@@ -1040,6 +1050,49 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
+int pgoff_mkclean(pgoff_t pgoff, struct address_space *mapping)
+{
+	struct vm_area_struct *vma;
+	int ret = 0;
+
+	i_mmap_lock_read(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		struct mm_struct *mm = vma->vm_mm;
+		pmd_t pmd, *pmdp = NULL;
+		pte_t pte, *ptep = NULL;
+		unsigned long address;
+		spinlock_t *ptl;
+
+		address = pgoff_address(pgoff, vma);
+
+		ret = follow_pte_pmd(mm, address, &ptep, &pmdp, &ptl);
+		if (ret)
+			goto out;
+
+		if (pmdp) {
+			flush_cache_page(vma, address, pmd_pfn(*pmdp));
+			pmd = pmdp_huge_clear_flush(vma, address, pmdp);
+			pmd = pmd_wrprotect(pmd);
+			pmd = pmd_mkclean(pmd);
+			set_pmd_at(mm, address, pmdp, pmd);
+			spin_unlock(ptl);
+		} else {
+			BUG_ON(!ptep);
+			flush_cache_page(vma, address, pte_pfn(*ptep));
+			pte = ptep_clear_flush(vma, address, ptep);
+			pte = pte_wrprotect(pte);
+			pte = pte_mkclean(pte);
+			set_pte_at(mm, address, ptep, pte);
+			pte_unmap_unlock(ptep, ptl);
+		}
+	}
+
+ out:
+	i_mmap_unlock_read(mapping);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pgoff_mkclean);
+
 /**
  * page_move_anon_rmap - move a page to our anon_vma
  * @page:	the page to move to our anon_vma
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 07/11] mm: add find_get_entries_tag()
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
                   ` (5 preceding siblings ...)
  2015-10-29 20:12 ` [RFC 06/11] mm: add pgoff_mkclean() Ross Zwisler
@ 2015-10-29 20:12 ` Ross Zwisler
  2015-10-29 20:12 ` [RFC 08/11] fs: add get_block() to struct inode_operations Ross Zwisler
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 20:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

Add find_get_entries_tag() to the family of functions that include
find_get_entries(), find_get_pages() and find_get_pages_tag().  This is
needed for DAX dirty page handling because we need a list of both page
offsets and radix tree entries ('indices' and 'entries' in this function)
that are marked with the PAGECACHE_TAG_TOWRITE tag.
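
The expected calling pattern follows the find_get_entries()/pagevec style;
a rough sketch of the loop the DAX flush code wants (illustrative only):

/*
 * Sketch: walk every entry tagged PAGECACHE_TAG_TOWRITE, receiving both
 * the radix tree entry and its file offset for each hit.
 */
static void example_walk_towrite(struct address_space *mapping)
{
	pgoff_t indices[PAGEVEC_SIZE];
	struct pagevec pvec;
	pgoff_t index = 0;
	int i;

	pagevec_init(&pvec, 0);
	do {
		pvec.nr = find_get_entries_tag(mapping, index,
				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
				pvec.pages, indices);

		for (i = 0; i < pvec.nr; i++) {
			/* indices[i]: file offset, pvec.pages[i]: entry */
		}
		if (pvec.nr)
			index = indices[pvec.nr - 1] + 1;
	} while (pvec.nr);
}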

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/pagemap.h |  3 +++
 mm/filemap.c            | 61 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a6c78e0..6fea3be 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -354,6 +354,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
 			       unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+			int tag, unsigned int nr_entries,
+			struct page **entries, pgoff_t *indices);
 
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
 			pgoff_t index, unsigned flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index c3a9e4f..992cf84 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1454,6 +1454,67 @@ repeat:
 }
 EXPORT_SYMBOL(find_get_pages_tag);
 
+/**
+ * find_get_entries_tag - find and return entries that match @tag
+ * @mapping:	the address_space to search
+ * @start:	the starting page cache index
+ * @tag:	the tag index
+ * @nr_entries:	the maximum number of entries
+ * @entries:	where the resulting entries are placed
+ * @indices:	the cache indices corresponding to the entries in @entries
+ *
+ * Like find_get_entries, except we only return entries which are tagged with
+ * @tag.
+ */
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+			int tag, unsigned int nr_entries,
+			struct page **entries, pgoff_t *indices)
+{
+	void **slot;
+	unsigned int ret = 0;
+	struct radix_tree_iter iter;
+
+	if (!nr_entries)
+		return 0;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_tagged(slot, &mapping->page_tree,
+				   &iter, start, tag) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page))
+				goto restart;
+			/*
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
+			 */
+			goto export;
+		}
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+export:
+		indices[ret] = iter.index;
+		entries[ret] = page;
+		if (++ret == nr_entries)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL(find_get_entries_tag);
+
 /*
  * CD/DVDs are error prone. When a medium error occurs, the driver may fail
  * a _large_ part of the i/o request. Imagine the worst scenario:
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 08/11] fs: add get_block() to struct inode_operations
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
                   ` (6 preceding siblings ...)
  2015-10-29 20:12 ` [RFC 07/11] mm: add find_get_entries_tag() Ross Zwisler
@ 2015-10-29 20:12 ` Ross Zwisler
  2015-10-29 20:12 ` [RFC 09/11] dax: add support for fsync/sync Ross Zwisler
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 20:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

To be able to flush dirty pages to media as part of the fsync/msync path
DAX needs to be able to map file offsets to kernel addresses via a
combination of the filesystem's get_block() routine and
bdev_direct_access().  This currently happens in the DAX fault handlers
which receive a get_block() callback directly from the filesystem via a
function parameter.

For the fsync/msync path this doesn't work, though, because DAX is called
not by the filesystem but by the writeback infrastructure, which doesn't
know about the filesystem-specific get_block() routine.

To handle this we make get_block() an entry in the struct inode_operations
table so that we can access the correct get_block() routine in the
context of the writeback infrastructure.
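
For illustration, with this in place the writeback-driven DAX code can map
a file block to a pmem kernel address without a filesystem-supplied
callback, along the lines of (simplified, error handling trimmed,
illustrative only):

/*
 * Sketch: resolve one block of 'inode' through the new inode operation,
 * then get its kernel mapping via the block device's direct_access path.
 */
static long example_map_block(struct inode *inode, sector_t block,
			      void __pmem **kaddr)
{
	struct buffer_head bh = { .b_size = PAGE_SIZE };
	unsigned long pfn;
	int err;

	err = inode->i_op->get_block(inode, block, &bh, 0);
	if (err)
		return err;

	return bdev_direct_access(bh.b_bdev,
			bh.b_blocknr << (inode->i_blkbits - 9),
			kaddr, &pfn, bh.b_size);
}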

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext2/file.c     | 1 +
 fs/ext4/file.c     | 1 +
 fs/xfs/xfs_iops.c  | 1 +
 include/linux/fs.h | 4 ++--
 4 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 11a42c5..fc1418c 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -202,4 +202,5 @@ const struct inode_operations ext2_file_inode_operations = {
 	.get_acl	= ext2_get_acl,
 	.set_acl	= ext2_set_acl,
 	.fiemap		= ext2_fiemap,
+	.get_block	= ext2_get_block,
 };
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 113837e..54d7729 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -720,5 +720,6 @@ const struct inode_operations ext4_file_inode_operations = {
 	.get_acl	= ext4_get_acl,
 	.set_acl	= ext4_set_acl,
 	.fiemap		= ext4_fiemap,
+	.get_block	= ext4_get_block_dax,
 };
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8294132..c58c270 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1112,6 +1112,7 @@ static const struct inode_operations xfs_inode_operations = {
 	.listxattr		= xfs_vn_listxattr,
 	.fiemap			= xfs_vn_fiemap,
 	.update_time		= xfs_vn_update_time,
+	.get_block		= xfs_get_blocks_direct,
 };
 
 static const struct inode_operations xfs_dir_inode_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f791698..1dca85b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1679,8 +1679,8 @@ struct inode_operations {
 			   umode_t create_mode, int *opened);
 	int (*tmpfile) (struct inode *, struct dentry *, umode_t);
 	int (*set_acl)(struct inode *, struct posix_acl *, int);
-
-	/* WARNING: probably going away soon, do not use! */
+	int (*get_block)(struct inode *inode, sector_t iblock,
+			struct buffer_head *bh_result, int create);
 } ____cacheline_aligned;
 
 ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 09/11] dax: add support for fsync/sync
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
                   ` (7 preceding siblings ...)
  2015-10-29 20:12 ` [RFC 08/11] fs: add get_block() to struct inode_operations Ross Zwisler
@ 2015-10-29 20:12 ` Ross Zwisler
  2015-10-29 20:12 ` [RFC 10/11] xfs, ext2: call dax_pfn_mkwrite() on write fault Ross Zwisler
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 20:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

To properly handle fsync/msync in an efficient way DAX needs to track dirty
pages so it is able to flush them durably to media on demand.

The tracking of dirty pages is done via the radix tree in struct
address_space.  This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file, and
it already has support for exceptional (non struct page*) entries.  We
build upon these features to add exceptional entries to the radix tree for
DAX dirty PMD or PTE pages at fault time.

When called as part of the msync/fsync flush path DAX queries the radix
tree for dirty entries, flushing them and then marking the corresponding
PTE or PMD page table entries as clean.  Cleaning the PTE or PMD entries is
necessary so that a subsequent write to the same page triggers a new write
fault, allowing us to re-insert the dirty DAX entry into the radix tree.
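
With this in place a filesystem's fsync path needs no DAX special-casing:
filemap_write_and_wait_range() now reaches dax_flush_mapping() via
do_writepages() for DAX inodes.  A rough sketch of what that looks like
from the filesystem side (illustrative only):

/*
 * Sketch: an ordinary ->fsync implementation.  For a DAX inode the call
 * below walks the radix tree, wb_cache_pmem()s each dirty entry and
 * cleans the page table entries; the fs then commits metadata as usual.
 */
static int example_fsync(struct file *file, loff_t start, loff_t end,
			 int datasync)
{
	int err = filemap_write_and_wait_range(file->f_mapping, start, end);

	if (err)
		return err;

	/* journal commit / REQ_FLUSH handled by the filesystem as usual */
	return 0;
}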

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 161 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/dax.h |   1 +
 mm/huge_memory.c    |  14 ++---
 mm/page-writeback.c |   9 +++
 4 files changed, 172 insertions(+), 13 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 131fd35a..3b38aff 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -24,10 +24,13 @@
 #include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/pagevec.h>
 #include <linux/pmem.h>
+#include <linux/rmap.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
+#include <linux/dax.h>
 
 /*
  * dax_clear_blocks() is called from within transaction context from XFS,
@@ -287,6 +290,42 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
 	return 0;
 }
 
+static int dax_dirty_pgoff(struct vm_area_struct *vma,
+		struct address_space *mapping, unsigned long pgoff,
+		bool pmd_entry)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	void *tag;
+	int error = 0;
+
+	__mark_inode_dirty(file_inode(vma->vm_file), I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+
+	tag = radix_tree_lookup(page_tree, pgoff);
+	if (tag) {
+		if (pmd_entry && tag == RADIX_TREE_DAX_PTE) {
+			radix_tree_delete(&mapping->page_tree, pgoff);
+			mapping->nrdax--;
+		} else
+			goto out;
+	}
+
+	if (pmd_entry)
+		error = radix_tree_insert(page_tree, pgoff, RADIX_TREE_DAX_PMD);
+	else
+		error = radix_tree_insert(page_tree, pgoff, RADIX_TREE_DAX_PTE);
+
+	if (error)
+		goto out;
+
+	mapping->nrdax++;
+	radix_tree_tag_set(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
+ out:
+	spin_unlock_irq(&mapping->tree_lock);
+	return error;
+}
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
@@ -450,6 +489,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		delete_from_page_cache(page);
 		unlock_page(page);
 		page_cache_release(page);
+		page = NULL;
 	}
 
 	/*
@@ -463,6 +503,13 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	 * as for normal BH based IO completions.
 	 */
 	error = dax_insert_mapping(inode, &bh, vma, vmf);
+	if (error)
+		goto out;
+
+	error = dax_dirty_pgoff(vma, inode->i_mapping, vmf->pgoff, false);
+	if (error)
+		goto out;
+
 	if (buffer_unwritten(&bh)) {
 		if (complete_unwritten)
 			complete_unwritten(&bh, !error);
@@ -537,7 +584,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	pgoff_t size, pgoff;
 	sector_t block, sector;
 	unsigned long pfn;
-	int result = 0;
+	int error, result = 0;
 
 	/* Fall back to PTEs if we're going to COW */
 	if (write && !(vma->vm_flags & VM_SHARED))
@@ -638,6 +685,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 
 		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+
+		error = dax_dirty_pgoff(vma, inode->i_mapping, pgoff, true);
+		if (error)
+			goto fallback;
 	}
 
  out:
@@ -693,11 +744,11 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
  */
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+	struct file *file = vma->vm_file;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
 
-	sb_start_pagefault(sb);
-	file_update_time(vma->vm_file);
-	sb_end_pagefault(sb);
+	dax_dirty_pgoff(vma, inode->i_mapping, vmf->pgoff, false);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
@@ -772,3 +823,103 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 	return dax_zero_page_range(inode, from, length, get_block);
 }
 EXPORT_SYMBOL_GPL(dax_truncate_page);
+
+static int dax_flush_one_mapping(struct address_space *mapping,
+		struct inode *inode, sector_t block, void *tag)
+{
+	get_block_t *get_block = inode->i_op->get_block;
+	struct buffer_head bh;
+	void __pmem *addr;
+	int ret;
+
+	BUG_ON(tag != RADIX_TREE_DAX_PMD && tag != RADIX_TREE_DAX_PTE);
+
+	memset(&bh, 0, sizeof(bh));
+
+	if (tag == RADIX_TREE_DAX_PMD)
+		bh.b_size = PMD_SIZE;
+	else
+		bh.b_size = PAGE_SIZE;
+
+	ret = get_block(inode, block, &bh, false);
+	BUG_ON(!buffer_written(&bh));
+	if (ret < 0)
+		return ret;
+
+	ret = dax_get_addr(&bh, &addr, inode->i_blkbits);
+	if (ret < 0)
+		return ret;
+
+	if (tag == RADIX_TREE_DAX_PMD)
+		WARN_ON(ret != PMD_SIZE);
+	else
+		WARN_ON(ret != PAGE_SIZE);
+
+	wb_cache_pmem(addr, ret);
+
+	spin_lock_irq(&mapping->tree_lock);
+	radix_tree_delete(&mapping->page_tree, block);
+	spin_unlock_irq(&mapping->tree_lock);
+	mapping->nrdax--;
+
+	return pgoff_mkclean(block, mapping);
+}
+
+/*
+ * flush the mapping to the persistent domain within the byte range of (start,
+ * end). This is required by data integrity operations to ensure file data is on
+ * persistent storage prior to completion of the operation. It also requires us
+ * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when
+ * the file is written to again so we have an indication that we need to flush
+ * the mapping if a data integrity operation takes place.
+ *
+ * We don't need commits to storage here - the filesystems will issue flushes
+ * appropriately at the conclusion of the data integrity operation via REQ_FUA
+ * writes or blkdev_issue_flush() commands.  This requires the DAX block device
+ * to implement persistent storage domain fencing/commits on receiving a
+ * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher
+ * layers.
+ */
+int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end)
+{
+	struct inode *inode = mapping->host;
+	pgoff_t indices[PAGEVEC_SIZE];
+	struct pagevec pvec;
+	int i, error;
+
+	pgoff_t start_page = start >> PAGE_CACHE_SHIFT;
+	pgoff_t end_page = end >> PAGE_CACHE_SHIFT;
+
+
+	if (mapping->nrdax == 0)
+		return 0;
+
+	if (!inode->i_op->get_block) {
+		WARN_ONCE(1, "Flushing DAX mapping without get_block()!");
+		mapping->nrdax = 0;
+		return 0;
+	}
+
+	BUG_ON(inode->i_blkbits != PAGE_SHIFT);
+
+	tag_pages_for_writeback(mapping, start_page, end_page);
+
+	pagevec_init(&pvec, 0);
+	while (1) {
+		pvec.nr = find_get_entries_tag(mapping, start_page,
+				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
+				pvec.pages, indices);
+
+		if (pvec.nr == 0)
+			break;
+
+		for (i = 0; i < pvec.nr; i++) {
+			error = dax_flush_one_mapping(mapping, inode,
+					indices[i], pvec.pages[i]);
+			if (error)
+				return error;
+		}
+	}
+
+	return 0;
+}
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9d57f68..5eff476 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,4 +41,5 @@ static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
+int dax_flush_mapping(struct address_space *mapping, loff_t start, loff_t end);
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bbac913..1b3df56 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	spinlock_t *ptl;
 
 	ptl = pmd_lock(mm, pmd);
-	if (pmd_none(*pmd)) {
-		entry = pmd_mkhuge(pfn_pmd(pfn, prot));
-		if (write) {
-			entry = pmd_mkyoung(pmd_mkdirty(entry));
-			entry = maybe_pmd_mkwrite(entry, vma);
-		}
-		set_pmd_at(mm, addr, pmd, entry);
-		update_mmu_cache_pmd(vma, addr, pmd);
+	entry = pmd_mkhuge(pfn_pmd(pfn, prot));
+	if (write) {
+		entry = pmd_mkyoung(pmd_mkdirty(entry));
+		entry = maybe_pmd_mkwrite(entry, vma);
 	}
+	set_pmd_at(mm, addr, pmd, entry);
+	update_mmu_cache_pmd(vma, addr, pmd);
 	spin_unlock(ptl);
 }
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2c90357..1801df8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -37,6 +37,7 @@
 #include <linux/timer.h>
 #include <linux/sched/rt.h>
 #include <linux/mm_inline.h>
+#include <linux/dax.h>
 #include <trace/events/writeback.h>
 
 #include "internal.h"
@@ -2338,6 +2339,14 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
 
 	if (wbc->nr_to_write <= 0)
 		return 0;
+
+	if (wbc->sync_mode == WB_SYNC_ALL && dax_mapping(mapping)) {
+		ret = dax_flush_mapping(mapping, wbc->range_start,
+				wbc->range_end);
+		if (ret)
+			return ret;
+	}
+
 	if (mapping->a_ops->writepages)
 		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 10/11] xfs, ext2: call dax_pfn_mkwrite() on write fault
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
                   ` (8 preceding siblings ...)
  2015-10-29 20:12 ` [RFC 09/11] dax: add support for fsync/sync Ross Zwisler
@ 2015-10-29 20:12 ` Ross Zwisler
  2015-10-29 20:12 ` [RFC 11/11] ext4: add ext4_dax_pfn_mkwrite() Ross Zwisler
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 20:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

Previously the functionality offered by dax_pfn_mkwrite() was open-coded in
XFS and ext2.  With the addition of DAX msync/fsync support we need to call
dax_pfn_mkwrite() so that the newly writeable page can be added to the
radix tree and marked as dirty.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext2/file.c    | 4 +++-
 fs/xfs/xfs_file.c | 9 +++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index fc1418c..5741e29 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 {
 	struct inode *inode = file_inode(vma->vm_file);
 	struct ext2_inode_info *ei = EXT2_I(inode);
-	int ret = VM_FAULT_NOPAGE;
 	loff_t size;
+	int ret;
 
 	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
@@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vma, vmf);
 
 	up_read(&ei->dax_sem);
 	sb_end_pagefault(inode->i_sb);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f429662..584e4eb 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1575,9 +1575,8 @@ xfs_filemap_pmd_fault(
 /*
  * pfn_mkwrite was originally inteneded to ensure we capture time stamp
  * updates on write faults. In reality, it's need to serialise against
- * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite()
- * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault
- * barrier in place.
+ * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED
+ * to ensure we serialise the fault barrier in place.
  */
 static int
 xfs_filemap_pfn_mkwrite(
@@ -1587,7 +1586,7 @@ xfs_filemap_pfn_mkwrite(
 
 	struct inode		*inode = file_inode(vma->vm_file);
 	struct xfs_inode	*ip = XFS_I(inode);
-	int			ret = VM_FAULT_NOPAGE;
+	int			ret;
 	loff_t			size;
 
 	trace_xfs_filemap_pfn_mkwrite(ip);
@@ -1600,6 +1599,8 @@ xfs_filemap_pfn_mkwrite(
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vma, vmf);
 	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC 11/11] ext4: add ext4_dax_pfn_mkwrite()
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
                   ` (9 preceding siblings ...)
  2015-10-29 20:12 ` [RFC 10/11] xfs, ext2: call dax_pfn_mkwrite() on write fault Ross Zwisler
@ 2015-10-29 20:12 ` Ross Zwisler
  2015-10-29 22:49 ` [RFC 00/11] DAX fsync/msync support Ross Zwisler
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 20:12 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

Add ext4_dax_pfn_mkwrite() that properly protects against freeze, sets
the file update time and calls into dax_pfn_mkwrite() to add the newly
dirtied page to the radix tree which tracks dirty DAX pages.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext4/file.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 54d7729..7f62cad 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -272,11 +272,31 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 				ext4_end_io_unwritten);
 }
 
+static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	loff_t size;
+	int ret;
+
+	sb_start_pagefault(inode->i_sb);
+	file_update_time(vma->vm_file);
+
+	/* check that the faulting page hasn't raced with truncate */
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vma, vmf);
+
+	sb_end_pagefault(inode->i_sb);
+	return ret;
+}
+
 static const struct vm_operations_struct ext4_dax_vm_ops = {
 	.fault		= ext4_dax_fault,
 	.pmd_fault	= ext4_dax_pmd_fault,
 	.page_mkwrite	= ext4_dax_mkwrite,
-	.pfn_mkwrite	= dax_pfn_mkwrite,
+	.pfn_mkwrite	= ext4_dax_pfn_mkwrite,
 };
 #else
 #define ext4_dax_vm_ops	ext4_file_vm_ops
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] DAX fsync/msync support
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
                   ` (10 preceding siblings ...)
  2015-10-29 20:12 ` [RFC 11/11] ext4: add ext4_dax_pfn_mkwrite() Ross Zwisler
@ 2015-10-29 22:49 ` Ross Zwisler
  2015-10-30  3:55 ` Dave Chinner
  2015-10-30 18:34 ` Dan Williams
  13 siblings, 0 replies; 27+ messages in thread
From: Ross Zwisler @ 2015-10-29 22:49 UTC (permalink / raw)
  To: Ross Zwisler, Dave Chinner
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote:
> This patch series adds support for fsync/msync to DAX.
> 
> Patches 1 through 8 add various utilities that the DAX code will eventually
> need, and the DAX code itself is added by patch 9.  Patches 10 and 11 are
> filesystem changes that are needed after the DAX code is added, but these
> patches may change slightly as the filesystem fault handling for DAX is
> being modified ([1] and [2]).
> 
> I've marked this series as RFC because I'm still testing, but I wanted to
> get this out there so people would see the direction I was going and
> hopefully comment on any big red flags sooner rather than later.
> 
> I realize that we are getting pretty dang close to the v4.4 merge window,
> but I think that if we can get this reviewed and working it's a much better
> solution than the "big hammer" approach that blindly flushes entire PMEM
> namespaces [3].
> 
> [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html
> [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2
> [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html

Hmm...I think I may need to isolate the fsync/msync flushing against races
with truncate since we are calling into the filesystem directly with
get_block().  Dave (Chinner), does this sound right?

Also, one thing I forgot to mention is that these patches are built upon the
first version of Dave Chinner's XFS patches and my ext2 patches that deal with
the truncate races with DAX.  A snapshot of my development tree with these
patches applied can be found here:

https://github.com/01org/prd/tree/fsync_rfc

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] DAX fsync/msync support
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
                   ` (11 preceding siblings ...)
  2015-10-29 22:49 ` [RFC 00/11] DAX fsync/msync support Ross Zwisler
@ 2015-10-30  3:55 ` Dave Chinner
  2015-10-30 18:39   ` Ross Zwisler
  2015-10-30 18:34 ` Dan Williams
  13 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2015-10-30  3:55 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Matthew Wilcox

On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote:
> This patch series adds support for fsync/msync to DAX.
> 
> Patches 1 through 8 add various utilities that the DAX code will eventually
> need, and the DAX code itself is added by patch 9.  Patches 10 and 11 are
> filesystem changes that are needed after the DAX code is added, but these
> patches may change slightly as the filesystem fault handling for DAX is
> being modified ([1] and [2]).
> 
> I've marked this series as RFC because I'm still testing, but I wanted to
> get this out there so people would see the direction I was going and
> hopefully comment on any big red flags sooner rather than later.
> 
> I realize that we are getting pretty dang close to the v4.4 merge window,
> but I think that if we can get this reviewed and working it's a much better
> solution than the "big hammer" approach that blindly flushes entire PMEM
> namespaces [3].

We need the "big hammer" regardless of fsync. If REQ_FLUSH and
REQ_FUA don't do the right thing when it comes to ordering journal
writes against other IO operations, then the filesystems are not
crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding
changes back to stable storage, just like they do for existing
storage....

> [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html
> [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2
> [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html
> 
> Ross Zwisler (11):
>   pmem: add wb_cache_pmem() to the PMEM API
>   mm: add pmd_mkclean()
>   pmem: enable REQ_FLUSH handling
>   dax: support dirty DAX entries in radix tree
>   mm: add follow_pte_pmd()
>   mm: add pgoff_mkclean()
>   mm: add find_get_entries_tag()
>   fs: add get_block() to struct inode_operations

I don't think this is the right thing to do - it propagates the use
of bufferheads as a mapping structure into places where we do not
want bufferheads. We've recently added a similar block mapping
interface to the export operations structure for PNFS and that uses
a "struct iomap" which is far more suited to being an inode
operation than this.

We have plans to move this to the inode operations for various
reasons. e.g: multipage write, adding interfaces that support proper
mapping of holes, etc:

https://www.redhat.com/archives/cluster-devel/2014-October/msg00167.html

So after many years of saying no to moving getblocks to the inode
operations it seems like the wrong thing to do now considering I
want to convert all the DAX code to use iomaps while only 2/3
filesystems are supported...

>   dax: add support for fsync/sync

Why put the dax_flush_mapping() in do_writepages()? Why not call it
directly from the filesystem ->fsync() implementations where a
getblocks callback could also be provided?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC 00/11] DAX fsync/msync support
  2015-10-29 20:12 [RFC 00/11] DAX fsync/msync support Ross Zwisler
                   ` (12 preceding siblings ...)
  2015-10-30  3:55 ` Dave Chinner
@ 2015-10-30 18:34 ` Dan Williams
  2015-10-30 19:43   ` Ross Zwisler
  13 siblings, 1 reply; 27+ messages in thread
From: Dan Williams @ 2015-10-30 18:34 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, Linux MM, linux-nvdimm@lists.01.org,
	X86 ML, xfs, Andrew Morton, Matthew Wilcox

On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> This patch series adds support for fsync/msync to DAX.
>
> Patches 1 through 8 add various utilities that the DAX code will eventually
> need, and the DAX code itself is added by patch 9.  Patches 10 and 11 are
> filesystem changes that are needed after the DAX code is added, but these
> patches may change slightly as the filesystem fault handling for DAX is
> being modified ([1] and [2]).
>
> I've marked this series as RFC because I'm still testing, but I wanted to
> get this out there so people would see the direction I was going and
> hopefully comment on any big red flags sooner rather than later.
>
> I realize that we are getting pretty dang close to the v4.4 merge window,
> but I think that if we can get this reviewed and working it's a much better
> solution than the "big hammer" approach that blindly flushes entire PMEM
> namespaces [3].
>
> [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html
> [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2
> [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html
>
> Ross Zwisler (11):
>   pmem: add wb_cache_pmem() to the PMEM API
>   mm: add pmd_mkclean()
>   pmem: enable REQ_FLUSH handling
>   dax: support dirty DAX entries in radix tree
>   mm: add follow_pte_pmd()
>   mm: add pgoff_mkclean()
>   mm: add find_get_entries_tag()
>   fs: add get_block() to struct inode_operations
>   dax: add support for fsync/sync
>   xfs, ext2: call dax_pfn_mkwrite() on write fault
>   ext4: add ext4_dax_pfn_mkwrite()

This is great to have when the flush-the-world solution ends up
killing performance.  However, there are a couple mitigating options
for workloads that dirty small amounts and flush often that we need to
collect data on:

1/ Using cache management and pcommit from userspace to skip calls to
msync / fsync.  Although, this does not eliminate all calls to
blkdev_issue_flush as the fs may invoke it for other reasons.  I
suspect turning on REQ_FUA support eliminates a number of those
invocations, and pmem already satisfies REQ_FUA semantics by default.

2/ Turn off DAX and use the page cache.  As Dave mentions [1] we
should enable this control on a per-inode basis.  I'm folding in this
capability as a blkdev_ioctl for the next version of the raw block DAX
support patch.
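
A minimal userspace sketch of the cache-management idea in 1/ above,
assuming x86 and a 64-byte cacheline.  clwb plus the (later
withdrawn) pcommit instruction were the intended production path at
the time; clflush is used here only because it is universally
available:

#include <emmintrin.h>  /* _mm_clflush(), _mm_sfence() */
#include <stdint.h>
#include <stddef.h>

#define CACHELINE 64

/* Flush the cachelines covering a range just written through a DAX
 * mmap so that msync()/fsync() can be skipped for the data itself.
 * Metadata changes (file size, block allocations) still need fsync(). */
static void flush_written_range(const void *addr, size_t len)
{
        uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
        uintptr_t end = (uintptr_t)addr + len;

        for (; p < end; p += CACHELINE)
                _mm_clflush((const void *)p);
        _mm_sfence();   /* order the flushes before any later durability point */
}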

It's entirely possible these mitigations won't eliminate the need for
a mechanism like this, but I think we have a bit more work to do to
find out how bad this is in practice as well as the crossover point
where walking the radix becomes prohibitive.

We also have the option of tracking open DAX extents in the driver.
Even at coarse granularities I'd be surprised if we can't mitigate
most of the overhead.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-October/002598.html

* Re: [RFC 00/11] DAX fsynx/msync support
  2015-10-30  3:55 ` Dave Chinner
@ 2015-10-30 18:39   ` Ross Zwisler
  2015-11-01 23:29     ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: Ross Zwisler @ 2015-10-30 18:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

On Fri, Oct 30, 2015 at 02:55:33PM +1100, Dave Chinner wrote:
> On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote:
> > This patch series adds support for fsync/msync to DAX.
> > 
> > Patches 1 through 8 add various utilities that the DAX code will eventually
> > need, and the DAX code itself is added by patch 9.  Patches 10 and 11 are
> > filesystem changes that are needed after the DAX code is added, but these
> > patches may change slightly as the filesystem fault handling for DAX is
> > being modified ([1] and [2]).
> > 
> > I've marked this series as RFC because I'm still testing, but I wanted to
> > get this out there so people would see the direction I was going and
> > hopefully comment on any big red flags sooner rather than later.
> > 
> > I realize that we are getting pretty dang close to the v4.4 merge window,
> > but I think that if we can get this reviewed and working it's a much better
> > solution than the "big hammer" approach that blindly flushes entire PMEM
> > namespaces [3].
> 
> We need the "big hammer" regardless of fsync. If REQ_FLUSH and
> REQ_FUA don't do the right thing when it comes to ordering journal
> writes against other IO operations, then the filesystems are not
> crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding
> changes back to stable storage, just like they do for existing
> storage....

I think that what I've got here (when it's fully working) will protect all the
cases that we need.

AFAIK there are three ways that data can be written to a PMEM namespace:

1) Through the PMEM driver via either pmem_make_request(), pmem_rw_page() or
pmem_rw_bytes().  All of these paths sync the newly written data durably to
media before the I/O completes so they shouldn't have any reliance on
REQ_FUA/REQ_FLUSH.

2) Through the DAX I/O path, dax_io().  As with PMEM we flush the newly
written data durably to media before the I/O operation completes, so this path
shouldn't have any reliance on REQ_FUA/REQ_FLUSH.

3) Through mmaps set up by DAX.  This is the path we are trying to protect
with the dirty page tracking and flushing in this patch set, and I think that
this is the only path that has reliance on REQ_FLUSH.

The goal of this set is to have the cache writeback all happen as part of the
fsync/msync handling, and then have the REQ_FLUSH just provide the trailing
wmb_pmem().

My guess is that XFS metadata writes happen via path 1), down through the PMEM
driver.  Am I missing anything, or should we be good to go?
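
As a sketch of what 3) then asks of the driver once fsync/msync has
done the cacheline writeback with wb_cache_pmem() (names are
illustrative, not the exact code in patch 3):

/* Sketch, not the actual pmem driver: REQ_FLUSH only has to provide
 * the trailing wmb_pmem() before completing the bio. */
static void pmem_flush_sketch(struct bio *bio)
{
        if (bio->bi_rw & REQ_FLUSH)
                wmb_pmem();     /* drain stores to the persistence domain */
        bio_endio(bio);
}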

> > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html
> > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2
> > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html
> > 
> > Ross Zwisler (11):
> >   pmem: add wb_cache_pmem() to the PMEM API
> >   mm: add pmd_mkclean()
> >   pmem: enable REQ_FLUSH handling
> >   dax: support dirty DAX entries in radix tree
> >   mm: add follow_pte_pmd()
> >   mm: add pgoff_mkclean()
> >   mm: add find_get_entries_tag()
> >   fs: add get_block() to struct inode_operations
> 
> I don't think this is the right thing to do - it propagates the use
> of bufferheads as a mapping structure into places where we do not
> want bufferheads. We've recently added a similar block mapping
> interface to the export operations structure for PNFS, and that uses
> a "struct iomap" which is far better suited to being an inode
> operation than this.
> 
> We have plans to move this to the inode operations for various
> reasons, e.g. multipage write, adding interfaces that support proper
> mapping of holes, etc.:
> 
> https://www.redhat.com/archives/cluster-devel/2014-October/msg00167.html
> 
> So after many years of saying no to moving getblocks to the inode
> operations, it seems like the wrong thing to do now, considering I
> want to convert all the DAX code to use iomaps while only 2/3
> filesystems are supported...

Okay, I'll take a look at this interface.  I also think that we may need to
flow through the filesystem before going into the DAX code so that we can
serialize our flushing with respect to extent manipulation, as we had to do
with our DAX fault paths.  

> >   dax: add support for fsync/sync
> 
> Why put the dax_flush_mapping() in do_writepages()? Why not call it
> directly from the filesystem ->fsync() implementations where a
> getblocks callback could also be provided?

Because that's where you put it in your example. :)

https://lists.01.org/pipermail/linux-nvdimm/2015-March/000341.html

Moving it into the filesystem where we know about get_block() is probably the
right thing to do - I'll check it out.  Thanks!

* Re: [RFC 00/11] DAX fsynx/msync support
  2015-10-30 18:34 ` Dan Williams
@ 2015-10-30 19:43   ` Ross Zwisler
  2015-10-30 19:51     ` Dan Williams
  0 siblings, 1 reply; 27+ messages in thread
From: Ross Zwisler @ 2015-10-30 19:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm@lists.01.org, X86 ML, xfs, Andrew Morton,
	Matthew Wilcox

On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote:
> On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > This patch series adds support for fsync/msync to DAX.
> >
> > Patches 1 through 8 add various utilities that the DAX code will eventually
> > need, and the DAX code itself is added by patch 9.  Patches 10 and 11 are
> > filesystem changes that are needed after the DAX code is added, but these
> > patches may change slightly as the filesystem fault handling for DAX is
> > being modified ([1] and [2]).
> >
> > I've marked this series as RFC because I'm still testing, but I wanted to
> > get this out there so people would see the direction I was going and
> > hopefully comment on any big red flags sooner rather than later.
> >
> > I realize that we are getting pretty dang close to the v4.4 merge window,
> > but I think that if we can get this reviewed and working it's a much better
> > solution than the "big hammer" approach that blindly flushes entire PMEM
> > namespaces [3].
> >
> > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html
> > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2
> > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html
> >
> > Ross Zwisler (11):
> >   pmem: add wb_cache_pmem() to the PMEM API
> >   mm: add pmd_mkclean()
> >   pmem: enable REQ_FLUSH handling
> >   dax: support dirty DAX entries in radix tree
> >   mm: add follow_pte_pmd()
> >   mm: add pgoff_mkclean()
> >   mm: add find_get_entries_tag()
> >   fs: add get_block() to struct inode_operations
> >   dax: add support for fsync/sync
> >   xfs, ext2: call dax_pfn_mkwrite() on write fault
> >   ext4: add ext4_dax_pfn_mkwrite()
> 
> This is great to have when the flush-the-world solution ends up
> killing performance.  However, there are a couple mitigating options
> for workloads that dirty small amounts and flush often that we need to
> collect data on:
> 
> 1/ Using cache management and pcommit from userspace to skip calls to
> msync / fsync.  Although, this does not eliminate all calls to
> blkdev_issue_flush as the fs may invoke it for other reasons.  I
> suspect turning on REQ_FUA support eliminates a number of those
> invocations, and pmem already satisfies REQ_FUA semantics by default.

Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM
already handles the requirements of REQ_FUA, but I didn't realize that it
might reduce the number of REQ_FLUSH bios we receive.

> 2/ Turn off DAX and use the page cache.  As Dave mentions [1] we
> should enable this control on a per-inode basis.  I'm folding in this
> capability as a blkdev_ioctl for the next version of the raw block DAX
> support patch.

Umm...I think you just said "the way to avoid this delay is to just not use
DAX".  :)  I don't think this is where we want to go - we are trying to make
DAX better, not abandon it.

> It's entirely possible these mitigations won't eliminate the need for
> a mechanism like this, but I think we have a bit more work to do to
> find out how bad this is in practice as well as the crossover point
> where walking the radix becomes prohibitive.

I'm guessing a single run through xfstests will be enough to convince you that
the "big hammer" approach is untenable.  Tests that used to take a second now
take several minutes, at least in my VM testing environment...  And that's
only using a tiny 4GiB namespace.

Yes, we can distribute the cost over multiple CPUs, but that just distributes
the problem and doesn't reduce the overall work that needs to be done.
Ultimately I think that looping through multiple GiB or even TiB of cache
lines and blindly writing them back individually on every REQ_FLUSH is going
to be a deal breaker.

* Re: [RFC 00/11] DAX fsynx/msync support
  2015-10-30 19:43   ` Ross Zwisler
@ 2015-10-30 19:51     ` Dan Williams
  2015-11-01 23:36       ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: Dan Williams @ 2015-10-30 19:51 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Dave Chinner, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm@lists.01.org, X86 ML, xfs, Andrew Morton,
	Matthew Wilcox

On Fri, Oct 30, 2015 at 12:43 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote:
>> On Thu, Oct 29, 2015 at 1:12 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>> > This patch series adds support for fsync/msync to DAX.
>> >
>> > Patches 1 through 8 add various utilities that the DAX code will eventually
>> > need, and the DAX code itself is added by patch 9.  Patches 10 and 11 are
>> > filesystem changes that are needed after the DAX code is added, but these
>> > patches may change slightly as the filesystem fault handling for DAX is
>> > being modified ([1] and [2]).
>> >
>> > I've marked this series as RFC because I'm still testing, but I wanted to
>> > get this out there so people would see the direction I was going and
>> > hopefully comment on any big red flags sooner rather than later.
>> >
>> > I realize that we are getting pretty dang close to the v4.4 merge window,
>> > but I think that if we can get this reviewed and working it's a much better
>> > solution than the "big hammer" approach that blindly flushes entire PMEM
>> > namespaces [3].
>> >
>> > [1] http://oss.sgi.com/archives/xfs/2015-10/msg00523.html
>> > [2] http://marc.info/?l=linux-ext4&m=144550211312472&w=2
>> > [3] https://lists.01.org/pipermail/linux-nvdimm/2015-October/002614.html
>> >
>> > Ross Zwisler (11):
>> >   pmem: add wb_cache_pmem() to the PMEM API
>> >   mm: add pmd_mkclean()
>> >   pmem: enable REQ_FLUSH handling
>> >   dax: support dirty DAX entries in radix tree
>> >   mm: add follow_pte_pmd()
>> >   mm: add pgoff_mkclean()
>> >   mm: add find_get_entries_tag()
>> >   fs: add get_block() to struct inode_operations
>> >   dax: add support for fsync/sync
>> >   xfs, ext2: call dax_pfn_mkwrite() on write fault
>> >   ext4: add ext4_dax_pfn_mkwrite()
>>
>> This is great to have when the flush-the-world solution ends up
>> killing performance.  However, there are a couple mitigating options
>> for workloads that dirty small amounts and flush often that we need to
>> collect data on:
>>
>> 1/ Using cache management and pcommit from userspace to skip calls to
>> msync / fsync.  Although, this does not eliminate all calls to
>> blkdev_issue_flush as the fs may invoke it for other reasons.  I
>> suspect turning on REQ_FUA support eliminates a number of those
>> invocations, and pmem already satisfies REQ_FUA semantics by default.
>
> Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM
> already handles the requirements of REQ_FUA, but I didn't realize that it
> might reduce the number of REQ_FLUSH bios we receive.

I'll let Dave chime in, but a lot of the flush requirements come from
guaranteeing the state of the metadata; if metadata updates can be
done with REQ_FUA then there is no subsequent need to flush.
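
For reference, advertising that support is a single call against the
4.3-era block API (a sketch; where it sits in the driver's queue
setup is an assumption):

/* Sketch: tell the block layer this queue handles REQ_FLUSH and
 * REQ_FUA itself, so those bios are passed down rather than having
 * the flags stripped or FUA emulated with an extra post-flush. */
static void pmem_advertise_flush_fua(struct request_queue *q)
{
        blk_queue_flush(q, REQ_FLUSH | REQ_FUA);
}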

>> 2/ Turn off DAX and use the page cache.  As Dave mentions [1] we
>> should enable this control on a per-inode basis.  I'm folding in this
>> capability as a blkdev_ioctl for the next version of the raw block DAX
>> support patch.
>
> Umm...I think you just said "the way to avoid this delay is to just not use
> DAX".  :)  I don't think this is where we want to go - we are trying to make
> DAX better, not abandon it.

That's a bit of an exaggeration.  Avoiding DAX where it is not
necessary is not "abandoning DAX", it's using the right tool for the
job.  Page cache is fine for many cases.

>> It's entirely possible these mitigations won't eliminate the need for
>> a mechanism like this, but I think we have a bit more work to do to
>> find out how bad this is in practice as well as the crossover point
>> where walking the radix becomes prohibitive.
>
> I'm guessing a single run through xfstests will be enough to convince you that
> the "big hammer" approach is untenable.  Tests that used to take a second now
> take several minutes, at least in my VM testing environment...  And that's
> only using a tiny 4GiB namespace.
>
> Yes, we can distribute the cost over multiple CPUs, but that just distributes
> the problem and doesn't reduce the overall work that needs to be done.
> Ultimately I think that looping through multiple GiB or even TiB of cache
> lines and blindly writing them back individually on every REQ_FLUSH is going
> to be a deal breaker.

Right, part of the problem is that the driver doesn't know which
blocks are actively DAX mapped.  I think we can incrementally fix that
without requiring DAX specific fsync/msync handling code for each fs
that supports DAX.

* Re: [RFC 00/11] DAX fsynx/msync support
  2015-10-30 18:39   ` Ross Zwisler
@ 2015-11-01 23:29     ` Dave Chinner
  2015-11-02 14:22       ` Jeff Moyer
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2015-11-01 23:29 UTC (permalink / raw)
  To: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

On Fri, Oct 30, 2015 at 12:39:38PM -0600, Ross Zwisler wrote:
> On Fri, Oct 30, 2015 at 02:55:33PM +1100, Dave Chinner wrote:
> > On Thu, Oct 29, 2015 at 02:12:04PM -0600, Ross Zwisler wrote:
> > > This patch series adds support for fsync/msync to DAX.
> > > 
> > > Patches 1 through 8 add various utilities that the DAX code will eventually
> > > need, and the DAX code itself is added by patch 9.  Patches 10 and 11 are
> > > filesystem changes that are needed after the DAX code is added, but these
> > > patches may change slightly as the filesystem fault handling for DAX is
> > > being modified ([1] and [2]).
> > > 
> > > I've marked this series as RFC because I'm still testing, but I wanted to
> > > get this out there so people would see the direction I was going and
> > > hopefully comment on any big red flags sooner rather than later.
> > > 
> > > I realize that we are getting pretty dang close to the v4.4 merge window,
> > > but I think that if we can get this reviewed and working it's a much better
> > > solution than the "big hammer" approach that blindly flushes entire PMEM
> > > namespaces [3].
> > 
> > We need the "big hammer" regardless of fsync. If REQ_FLUSH and
> > REQ_FUA don't do the right thing when it comes to ordering journal
> > writes against other IO operations, then the filesystems are not
> > crash safe. i.e. we need REQ_FLUSH/REQ_FUA to commit all outstanding
> > changes back to stable storage, just like they do for existing
> > storage....
> 
> I think that what I've got here (when it's fully working) will protect all the
> cases that we need.
> 
> AFAIK there are three ways that data can be written to a PMEM namespace:
> 
> 1) Through the PMEM driver via either pmem_make_request(), pmem_rw_page() or
> pmem_rw_bytes().  All of these paths sync the newly written data durably to
> media before the I/O completes so they shouldn't have any reliance on
> REQ_FUA/REQ_FLUSH.

I suspect that not all future pmem devices will use this
driver/interface/semantics.

Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable
storage" commands. They are also IO barriers that affect scheduling
of IOs in progress and in the request queues.  A REQ_FLUSH/REQ_FUA
IO cannot be dispatched before all prior IO has been dispatched and
drained from the request queue, and IO submitted after a queued
REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued
REQ_FLUSH/REQ_FUA operation.

IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable
storage, they also guarantee the order of IO dispatch and
completion when concurrent IO is in progress.


> 2) Through the DAX I/O path, dax_io().  As with PMEM we flush the newly
> written data durably to media before the I/O operation completes, so this path
> shouldn't have any reliance on REQ_FUA/REQ_FLUSH.

That's fine, but that's not the problem we need solved ;)

> 3) Through mmaps set up by DAX.  This is the path we are trying to protect
> with the dirty page tracking and flushing in this patch set, and I think that
> this is the only path that has reliance on REQ_FLUSH.

Quite possibly this is the case for the current intel pmem driver,
but I don't look at the functionality from that perspective.

Dirty page tracking is needed to enable "data writeback", whether it
be CPU cachelines via pcommit() or dirty pages via submit_bio(). How
the pages get dirty is irrelevant - the fact is they are dirty and
we need to do /something/ to ensure they are correctly written back
to the storage layer.

REQ_FLUSH is needed to guarantee all data that has been written back
to the storage layer is persistent in that layer.  How a /driver/
manages that is up to the driver - the actual implementation is
irrelevant to the higher layers. i.e. what we are concerned about at
the filesystem level is that:

	a) "data writeback" is started correctly;
	b) the "data writeback" is completed; and
	c) volatile caches are completely flushed before we write
	   the metadata changes that reference that data to the
	   journal via FUA

e.g. we could have pmem, but we are using buffered IO (i.e. non-DAX)
and a hardware driver that doesn't flush CPU cachelines in the
physical IO path. This requires that driver to flush CPU cachelines
and place memory barriers in REQ_FLUSH operations, as well as after
writing the data in REQ_FUA operations.  Yes, this is different to
the way the intel pmem drivers work (i.e.  as noted in 1) above),
but it is /not wrong/ as long as REQ_FLUSH/REQ_FUA also flush dirty
cpu cachelines.

IOWs, the high level code we write that implements fsync
for DAX needs to be generic enough so that when something slightly
different comes along we don't have to throw everything away and
start again. I think your code will end up being generic enough to
handle this, but let's make sure we don't implement something that
can only work with pmem hardware/drivers that do all IO as fully
synchronous to the stable domain...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC 00/11] DAX fsynx/msync support
  2015-10-30 19:51     ` Dan Williams
@ 2015-11-01 23:36       ` Dave Chinner
  0 siblings, 0 replies; 27+ messages in thread
From: Dave Chinner @ 2015-11-01 23:36 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, Linux MM, linux-nvdimm@lists.01.org,
	X86 ML, xfs, Andrew Morton, Matthew Wilcox

On Fri, Oct 30, 2015 at 12:51:40PM -0700, Dan Williams wrote:
> On Fri, Oct 30, 2015 at 12:43 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Fri, Oct 30, 2015 at 11:34:07AM -0700, Dan Williams wrote:
> >> This is great to have when the flush-the-world solution ends up
> >> killing performance.  However, there are a couple mitigating options
> >> for workloads that dirty small amounts and flush often that we need to
> >> collect data on:
> >>
> >> 1/ Using cache management and pcommit from userspace to skip calls to
> >> msync / fsync.  Although, this does not eliminate all calls to
> >> blkdev_issue_flush as the fs may invoke it for other reasons.  I
> >> suspect turning on REQ_FUA support eliminates a number of those
> >> invocations, and pmem already satisfies REQ_FUA semantics by default.
> >
> > Sure, I'll turn on REQ_FUA in addition to REQ_FLUSH - I agree that PMEM
> > already handles the requirements of REQ_FUA, but I didn't realize that it
> > might reduce the number of REQ_FLUSH bios we receive.
> 
> I'll let Dave chime in, but a lot of the flush requirements come from
> guaranteeing the state of the metadata; if metadata updates can be
> done with REQ_FUA then there is no subsequent need to flush.

No need for cache flushes in this case, but we still need the IO
scheduler to order such operations correctly.

> >> 2/ Turn off DAX and use the page cache.  As Dave mentions [1] we
> >> should enable this control on a per-inode basis.  I'm folding in this
> >> capability as a blkdev_ioctl for the next version of the raw block DAX
> >> support patch.
> >
> > Umm...I think you just said "the way to avoid this delay is to just not use
> > DAX".  :)  I don't think this is where we want to go - we are trying to make
> > DAX better, not abandon it.
> 
> That's a bit of an exaggeration.  Avoiding DAX where it is not
> necessary is not "abandoning DAX", it's using the right tool for the
> job.  Page cache is fine for many cases.

Think btrfs - any file that uses COW can't use DAX for write.
Everything has to be buffered, unless the nodatacow flag is set, and
then DAX can be used. Indeed, on ext4 if you are using file
encryption you can't use DAX.

IOWs, we already know that we have to support mixed DAX/non-DAX
access within the same filesystem, so I'm with Dan here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC 00/11] DAX fsynx/msync support
  2015-11-01 23:29     ` Dave Chinner
@ 2015-11-02 14:22       ` Jeff Moyer
  2015-11-02 20:10         ` Dave Chinner
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff Moyer @ 2015-11-02 14:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

Dave Chinner <david@fromorbit.com> writes:

> Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable
> storage" commands. They are also IO barriers that affect scheduling
> of IOs in progress and in the request queues.  A REQ_FLUSH/REQ_FUA
> IO cannot be dispatched before all prior IO has been dispatched and
> drained from the request queue, and IO submitted after a queued
> REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued
> REQ_FLUSH/REQ_FUA operation.
>
> IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable
> storage, they also guarantee the order of IO dispatch and
> completion when concurrent IO is in progress.

This hasn't been the case for several years, now.  It used to work that
way, and that was deemed a big performance problem.  Since file systems
already issued and waited for all I/O before sending down a barrier, we
decided to get rid of the I/O ordering pieces of barriers (and stop
calling them barriers).

See commit 28e7d184521 (block: drop barrier ordering by queue draining).

Cheers,
Jeff

* Re: [RFC 00/11] DAX fsynx/msync support
  2015-11-02 14:22       ` Jeff Moyer
@ 2015-11-02 20:10         ` Dave Chinner
  2015-11-02 21:02           ` Jeff Moyer
  0 siblings, 1 reply; 27+ messages in thread
From: Dave Chinner @ 2015-11-02 20:10 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable
> > storage" commands. They are also IO barriers that affect scheduling
> > of IOs in progress and in the request queues.  A REQ_FLUSH/REQ_FUA
> > IO cannot be dispatched before all prior IO has been dispatched and
> > drained from the request queue, and IO submitted after a queued
> > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued
> > REQ_FLUSH/REQ_FUA operation.
> >
> > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable
> > storage, they also guarantee the order of IO dispatch and
> > completion when concurrent IO is in progress.
> 
> This hasn't been the case for several years, now.  It used to work that
> way, and that was deemed a big performance problem.  Since file systems
> already issued and waited for all I/O before sending down a barrier, we
> decided to get rid of the I/O ordering pieces of barriers (and stop
> calling them barriers).
> 
> See commit 28e7d184521 (block: drop barrier ordering by queue draining).

Yes, I realise that, even if I wasn't very clear about how I wrote
it. ;)

Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO
barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH
-> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO
scheduler calls to elv_dispatch_sort() that don't pass
REQ_SOFTBARRIER in the queue.

IOWs, if we queue a bunch of REQ_WRITE IOs followed by a
REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be
dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured
by the cache flush.

Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO
to complete, we know that all the earlier REQ_WRITE IOs are on
stable storage, too. Hence there's no need for the elevator to drain
the queue to guarantee completion ordering - the dispatch ordering
and flush/fua write semantics guarantee that when the flush/fua
completes, all the IOs dispatch prior to that flush/fua write are
also on stable storage...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC 00/11] DAX fsynx/msync support
  2015-11-02 20:10         ` Dave Chinner
@ 2015-11-02 21:02           ` Jeff Moyer
  2015-11-04 18:34             ` Jeff Moyer
  2015-11-05  8:33             ` Dave Chinner
  0 siblings, 2 replies; 27+ messages in thread
From: Jeff Moyer @ 2015-11-02 21:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox, axboe

Dave Chinner <david@fromorbit.com> writes:

> On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote:
>> Dave Chinner <david@fromorbit.com> writes:
>> 
>> > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable
>> > storage" commands. They are also IO barriers that affect scheduling
>> > of IOs in progress and in the request queues.  A REQ_FLUSH/REQ_FUA
>> > IO cannot be dispatched before all prior IO has been dispatched and
>> > drained from the request queue, and IO submitted after a queued
>> > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued
>> > REQ_FLUSH/REQ_FUA operation.
>> >
>> > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable
>> > storage, they also guarantee the order of IO dispatch and
>> > completion when concurrent IO is in progress.
>> 
>> This hasn't been the case for several years, now.  It used to work that
>> way, and that was deemed a big performance problem.  Since file systems
>> already issued and waited for all I/O before sending down a barrier, we
>> decided to get rid of the I/O ordering pieces of barriers (and stop
>> calling them barriers).
>> 
>> See commit 28e7d184521 (block: drop barrier ordering by queue draining).
>
> Yes, I realise that, even if I wasn't very clear about how I wrote
> it. ;)
>
> Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO
> barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH
> -> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO
> scheduler calls to elv_dispatch_sort() that don't pass
> REQ_SOFTBARRIER in the queue.

This part is right.

> IOWs, if we queue a bunch of REQ_WRITE IOs followed by a
> REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be
> dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured
> by the cache flush.

But this part is not.  It is up to the I/O scheduler to decide when to
dispatch requests.  It can hold on to them for a variety of reasons.
Flush requests, however, do not go through the I/O scheduler.  At the
very moment that the flush request is inserted, it goes directly to the
dispatch queue (assuming no other flush is in progress).  The prior
requests may still be waiting in the I/O scheduler's internal lists.

So, any newly dispatched I/Os will certainly not get past the REQ_FLUSH.
However, the REQ_FLUSH is very likely to jump ahead of prior I/Os in the
queue.
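
A simplified model of the routing being described (this is the shape
of the 4.3-era behaviour, not the real block layer code):

/* Flush/FUA writes are handed to the flush machinery and inserted at
 * the dispatch queue; ordinary writes go onto the elevator's
 * sort/merge lists, where the I/O scheduler may hold them. */
enum insert_point {
        INSERT_SORT,    /* elevator lists; may be held or reordered */
        INSERT_FLUSH,   /* flush machinery -> straight to dispatch */
};

static enum insert_point classify_write(unsigned long bi_rw)
{
        if (bi_rw & (REQ_FLUSH | REQ_FUA))
                return INSERT_FLUSH;
        return INSERT_SORT;
}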

> Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO
> to complete, we know that all the earlier REQ_WRITE IOs are on
> stable storage, too. Hence there's no need for the elevator to drain
> the queue to guarantee completion ordering - the dispatch ordering
> and flush/fua write semantics guarantee that when the flush/fua
> completes, all the IOs dispatch prior to that flush/fua write are
> also on stable storage...

Does xfs rely on this model for correctness?  If so, I'd say we've got a
problem.

Cheers,
Jeff

* Re: [RFC 00/11] DAX fsynx/msync support
  2015-11-02 21:02           ` Jeff Moyer
@ 2015-11-04 18:34             ` Jeff Moyer
  2015-11-05  8:33             ` Dave Chinner
  1 sibling, 0 replies; 27+ messages in thread
From: Jeff Moyer @ 2015-11-04 18:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-nvdimm, J. Bruce Fields, linux-mm, Andreas Dilger,
	H. Peter Anvin, Jeff Layton, x86, Ingo Molnar, linux-ext4, xfs,
	Alexander Viro, Thomas Gleixner, axboe, Theodore Ts'o,
	linux-kernel, Jan Kara, linux-fsdevel, Andrew Morton,
	Matthew Wilcox

Jeff Moyer <jmoyer@redhat.com> writes:

>> Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO
>> to complete, we know that all the earlier REQ_WRITE IOs are on
>> stable storage, too. Hence there's no need for the elevator to drain
>> the queue to guarantee completion ordering - the dispatch ordering
>> and flush/fua write semantics guarantee that when the flush/fua
>> completes, all the IOs dispatch prior to that flush/fua write are
>> also on stable storage...
>
> Does xfs rely on this model for correctness?  If so, I'd say we've got a
> problem.

Dave?

* Re: [RFC 00/11] DAX fsynx/msync support
  2015-11-02 21:02           ` Jeff Moyer
  2015-11-04 18:34             ` Jeff Moyer
@ 2015-11-05  8:33             ` Dave Chinner
  2015-11-05 19:49               ` Jeff Moyer
  2015-11-05 20:54               ` Jens Axboe
  1 sibling, 2 replies; 27+ messages in thread
From: Dave Chinner @ 2015-11-05  8:33 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox, axboe

[ sorry for slow response, been without an internet connection for
~36 hours ]

On Mon, Nov 02, 2015 at 04:02:48PM -0500, Jeff Moyer wrote:
> Dave Chinner <david@fromorbit.com> writes:
> 
> > On Mon, Nov 02, 2015 at 09:22:15AM -0500, Jeff Moyer wrote:
> >> Dave Chinner <david@fromorbit.com> writes:
> >> 
> >> > Further, REQ_FLUSH/REQ_FUA are more than just "put the data on stable
> >> > storage" commands. They are also IO barriers that affect scheduling
> >> > of IOs in progress and in the request queues.  A REQ_FLUSH/REQ_FUA
> >> > IO cannot be dispatched before all prior IO has been dispatched and
> >> > drained from the request queue, and IO submitted after a queued
> >> > REQ_FLUSH/REQ_FUA cannot be scheduled ahead of the queued
> >> > REQ_FLUSH/REQ_FUA operation.
> >> >
> >> > IOWs, REQ_FUA/REQ_FLUSH not only guarantee data is on stable
> >> > storage, they also guarantee the order of IO dispatch and
> >> > completion when concurrent IO is in progress.
> >> 
> >> This hasn't been the case for several years, now.  It used to work that
> >> way, and that was deemed a big performance problem.  Since file systems
> >> already issued and waited for all I/O before sending down a barrier, we
> >> decided to get rid of the I/O ordering pieces of barriers (and stop
> >> calling them barriers).
> >> 
> >> See commit 28e7d184521 (block: drop barrier ordering by queue draining).
> >
> > Yes, I realise that, even if I wasn't very clear about how I wrote
> > it. ;)
> >
> > Correct me if I'm wrong: AFAIA, dispatch ordering (i.e. the "IO
> > barrier") is still enforced by the scheduler via REQ_FUA|REQ_FLUSH
> > -> ELEVATOR_INSERT_FLUSH -> REQ_SOFTBARRIER and subsequent IO
> > scheduler calls to elv_dispatch_sort() that don't pass
> > REQ_SOFTBARRIER in the queue.
> 
> This part is right.
> 
> > IOWs, if we queue a bunch of REQ_WRITE IOs followed by a
> > REQ_WRITE|REQ_FLUSH IO, all of the prior REQ_WRITE IOs will be
> > dispatched before the REQ_WRITE|REQ_FLUSH IO and hence be captured
> > by the cache flush.
> 
> But this part is not.  It is up to the I/O scheduler to decide when to
> dispatch requests.  It can hold on to them for a variety of reasons.
> Flush requests, however, do not go through the I/O scheduler.  At the

That's pure REQ_FLUSH bios, right? Aren't data IOs with
REQ_FLUSH|REQ_FUA sorted like any other IO?

> very moment that the flush request is inserted, it goes directly to the
> dispatch queue (assuming no other flush is in progress).  The prior
> requests may still be waiting in the I/O scheduler's internal lists.
> 
> So, any newly dispatched I/Os will certainly not get past the REQ_FLUSH.
> However, the REQ_FLUSH is very likely to jump ahead of prior I/Os in the
> queue.

Uh, ok, that's different, and most definitely not the "IO barrier" I
was under the impression REQ_FLUSH|REQ_FUA gave us.

> > Hence once the filesystem has waited on the REQ_WRITE|REQ_FLUSH IO
> > to complete, we know that all the earlier REQ_WRITE IOs are on
> > stable storage, too. Hence there's no need for the elevator to drain
> > the queue to guarantee completion ordering - the dispatch ordering
> > and flush/fua write semantics guarantee that when the flush/fua
> > completes, all the IOs dispatch prior to that flush/fua write are
> > also on stable storage...
> 
> Does xfs rely on this model for correctness?  If so, I'd say we've got a
> problem

No, it doesn't. The XFS integrity model doesn't trust the IO layers
to tell the truth about IO ordering and completion, or its
developers to fully understand how IO layer ordering works. :P

i.e. we wait for full completions of all dependent IO before issuing
flushes or log writes that use REQ_FLUSH|REQ_FUA semantics to ensure
the dependent IOs are fully caught by the cache flushes...
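
Reduced to a sketch, that ordering model looks like the following;
example_commit_log_fua() is a placeholder for the journal write, not
a real XFS function:

/* Wait for all dependent data I/O to complete before issuing the
 * REQ_FLUSH|REQ_FUA log write that references it, so the cache flush
 * is guaranteed to cover that data regardless of how the elevator
 * ordered the individual writes. */
static int fsync_ordering_sketch(struct file *file, loff_t start, loff_t end)
{
        struct inode *inode = file_inode(file);
        int error;

        /* Start data writeback and wait for every bio to complete. */
        error = filemap_write_and_wait_range(inode->i_mapping, start, end);
        if (error)
                return error;

        /* Only now issue the journal write, submitted with
         * REQ_FLUSH|REQ_FUA (e.g. as WRITE_FLUSH_FUA). */
        return example_commit_log_fua(inode);
}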

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC 00/11] DAX fsynx/msync support
  2015-11-05  8:33             ` Dave Chinner
@ 2015-11-05 19:49               ` Jeff Moyer
  2015-11-05 20:54               ` Jens Axboe
  1 sibling, 0 replies; 27+ messages in thread
From: Jeff Moyer @ 2015-11-05 19:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox, axboe

Dave Chinner <david@fromorbit.com> writes:

>> But this part is not.  It is up to the I/O scheduler to decide when to
>> dispatch requests.  It can hold on to them for a variety of reasons.
>> Flush requests, however, do not go through the I/O scheduler.  At the
>
> That's pure REQ_FLUSH bios, right? Aren't data IOs with
> REQ_FLUSH|REQ_FUA sorted like any other IO?

No, they also go through the flush machinery, and so short-circuit the
I/O scheduler.

>> Does xfs rely on this model for correctness?  If so, I'd say we've got a
>> problem
>
> No, it doesn't. The XFS integrity model doesn't trust the IO layers
> to tell the truth about IO ordering and completion, or its
> developers to fully understand how IO layer ordering works. :P
>
> i.e. we wait for full completions of all dependent IO before issuing
> flushes or log writes that use REQ_FLUSH|REQ_FUA semantics to ensure
> the dependent IOs are fully caught by the cache flushes...

OK, phew!  ;-)

-Jeff

* Re: [RFC 00/11] DAX fsynx/msync support
  2015-11-05  8:33             ` Dave Chinner
  2015-11-05 19:49               ` Jeff Moyer
@ 2015-11-05 20:54               ` Jens Axboe
  1 sibling, 0 replies; 27+ messages in thread
From: Jens Axboe @ 2015-11-05 20:54 UTC (permalink / raw)
  To: Dave Chinner, Jeff Moyer
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

On 11/05/2015 01:33 AM, Dave Chinner wrote:
>> Does xfs rely on this model for correctness?  If so, I'd say we've got a
>> problem
>
> No, it doesn't. The XFS integrity model doesn't trust the IO layers
> to tell the truth about IO ordering and completion, or its
> developers to fully understand how IO layer ordering works. :P

That's good, because the storage developers simplified the model so that 
fs developers would be able to understand and use it.

> i.e. we wait for full completions of all dependent IO before issuing
> flushes or log writes that use REQ_FLUSH|REQ_FUA semantics to ensure
> the dependent IOs are fully caught by the cache flushes...

... which is what you are supposed to do - that's how it works.

-- 
Jens Axboe


end of thread

Thread overview: 27+ messages
2015-10-29 20:12 [RFC 00/11] DAX fsynx/msync support Ross Zwisler
2015-10-29 20:12 ` [RFC 01/11] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
2015-10-29 20:12 ` [RFC 02/11] mm: add pmd_mkclean() Ross Zwisler
2015-10-29 20:12 ` [RFC 03/11] pmem: enable REQ_FLUSH handling Ross Zwisler
2015-10-29 20:12 ` [RFC 04/11] dax: support dirty DAX entries in radix tree Ross Zwisler
2015-10-29 20:12 ` [RFC 05/11] mm: add follow_pte_pmd() Ross Zwisler
2015-10-29 20:12 ` [RFC 06/11] mm: add pgoff_mkclean() Ross Zwisler
2015-10-29 20:12 ` [RFC 07/11] mm: add find_get_entries_tag() Ross Zwisler
2015-10-29 20:12 ` [RFC 08/11] fs: add get_block() to struct inode_operations Ross Zwisler
2015-10-29 20:12 ` [RFC 09/11] dax: add support for fsync/sync Ross Zwisler
2015-10-29 20:12 ` [RFC 10/11] xfs, ext2: call dax_pfn_mkwrite() on write fault Ross Zwisler
2015-10-29 20:12 ` [RFC 11/11] ext4: add ext4_dax_pfn_mkwrite() Ross Zwisler
2015-10-29 22:49 ` [RFC 00/11] DAX fsynx/msync support Ross Zwisler
2015-10-30  3:55 ` Dave Chinner
2015-10-30 18:39   ` Ross Zwisler
2015-11-01 23:29     ` Dave Chinner
2015-11-02 14:22       ` Jeff Moyer
2015-11-02 20:10         ` Dave Chinner
2015-11-02 21:02           ` Jeff Moyer
2015-11-04 18:34             ` Jeff Moyer
2015-11-05  8:33             ` Dave Chinner
2015-11-05 19:49               ` Jeff Moyer
2015-11-05 20:54               ` Jens Axboe
2015-10-30 18:34 ` Dan Williams
2015-10-30 19:43   ` Ross Zwisler
2015-10-30 19:51     ` Dan Williams
2015-11-01 23:36       ` Dave Chinner
