* [PATCH v2 00/11] DAX fsync/msync support
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

This patch series adds support for fsync/msync to DAX.

Patches 1 through 7 add various utilities that the DAX code will eventually
need, patch 8 adds the DAX code itself, and patches 9-11 update the three
filesystems that currently support DAX (ext2, ext4 and XFS) to use the new
DAX fsync/msync code.

These patches build on the recent DAX locking changes from Dave Chinner,
Jan Kara and myself.  Dave's changes for XFS and my changes for ext2 have
been merged in the v4.4 window, but Jan's are still unmerged.  You can grab
them here:

http://www.spinics.net/lists/linux-ext4/msg49951.html

Ross Zwisler (11):
  pmem: add wb_cache_pmem() to the PMEM API
  mm: add pmd_mkclean()
  pmem: enable REQ_FUA/REQ_FLUSH handling
  dax: support dirty DAX entries in radix tree
  mm: add follow_pte_pmd()
  mm: add pgoff_mkclean()
  mm: add find_get_entries_tag()
  dax: add support for fsync/sync
  ext2: add support for DAX fsync/msync
  ext4: add support for DAX fsync/msync
  xfs: add support for DAX fsync/msync

 arch/x86/include/asm/pgtable.h |   5 ++
 arch/x86/include/asm/pmem.h    |  11 ++--
 drivers/nvdimm/pmem.c          |   3 +-
 fs/block_dev.c                 |   3 +-
 fs/dax.c                       | 140 +++++++++++++++++++++++++++++++++++++++--
 fs/ext2/file.c                 |  14 ++++-
 fs/ext4/file.c                 |   4 +-
 fs/ext4/fsync.c                |  12 +++-
 fs/inode.c                     |   1 +
 fs/xfs/xfs_file.c              |  18 ++++--
 include/linux/dax.h            |   6 ++
 include/linux/fs.h             |   1 +
 include/linux/mm.h             |   2 +
 include/linux/pagemap.h        |   3 +
 include/linux/pmem.h           |  22 ++++++-
 include/linux/radix-tree.h     |   8 +++
 include/linux/rmap.h           |   5 ++
 mm/filemap.c                   |  71 ++++++++++++++++++++-
 mm/huge_memory.c               |  14 ++---
 mm/memory.c                    |  38 ++++++++---
 mm/rmap.c                      |  51 +++++++++++++++
 mm/truncate.c                  |  62 ++++++++++--------
 22 files changed, 425 insertions(+), 69 deletions(-)

-- 
2.1.0


* [PATCH v2 01/11] pmem: add wb_cache_pmem() to the PMEM API
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

The function __arch_wb_cache_pmem() was already an internal implementation
detail of the x86 PMEM API, but this functionality needs to be exported as
part of the general PMEM API to handle the fsync/msync case for DAX mmaps.

One thing worth noting is that we really do want this to be part of the
PMEM API as opposed to a stand-alone function like clflush_cache_range()
because of ordering restrictions.  By having wb_cache_pmem() as part of the
PMEM API we can leave it unordered, call it multiple times to write back
large amounts of memory, and then order the multiple calls with a single
wmb_pmem().
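
As a rough usage sketch (not part of this patch; the function and variable
names below are purely illustrative), the intended calling pattern looks
something like:

#include <linux/pmem.h>

/*
 * Illustrative only: write back two dirty ranges with the new API, then
 * order both write-backs with a single wmb_pmem().
 */
static void flush_two_ranges(void __pmem *kaddr, size_t off1, size_t off2,
                             size_t len)
{
        wb_cache_pmem(kaddr + off1, len);       /* unordered write-back */
        wb_cache_pmem(kaddr + off2, len);       /* unordered write-back */
        wmb_pmem();                             /* one fence orders both */
}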

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 arch/x86/include/asm/pmem.h | 11 ++++++-----
 include/linux/pmem.h        | 22 +++++++++++++++++++++-
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec..6c7ade0 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void)
 }
 
 /**
- * __arch_wb_cache_pmem - write back a cache range with CLWB
+ * arch_wb_cache_pmem - write back a cache range with CLWB
  * @vaddr:	virtual start address
  * @size:	number of bytes to write back
  *
  * Write back a cache range using the CLWB (cache line write back)
  * instruction.  This function requires explicit ordering with an
- * arch_wmb_pmem() call.  This API is internal to the x86 PMEM implementation.
+ * arch_wmb_pmem() call.
  */
-static inline void __arch_wb_cache_pmem(void *vaddr, size_t size)
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
 {
 	u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
 	unsigned long clflush_mask = x86_clflush_size - 1;
+	void *vaddr = (void __force *)addr;
 	void *vend = vaddr + size;
 	void *p;
 
@@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
 	len = copy_from_iter_nocache(vaddr, bytes, i);
 
 	if (__iter_needs_pmem_wb(i))
-		__arch_wb_cache_pmem(vaddr, bytes);
+		arch_wb_cache_pmem(addr, bytes);
 
 	return len;
 }
@@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 	else
 		memset(vaddr, 0, size);
 
-	__arch_wb_cache_pmem(vaddr, size);
+	arch_wb_cache_pmem(addr, size);
 }
 
 static inline bool __arch_has_wmb_pmem(void)
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index 85f810b3..2cd5003 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
 {
 	BUG();
 }
+
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	BUG();
+}
 #endif
 
 /*
  * Architectures that define ARCH_HAS_PMEM_API must provide
  * implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
- * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
+ * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem()
+ * and arch_has_wmb_pmem().
  */
 static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
 {
@@ -202,4 +208,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size)
 	else
 		default_clear_pmem(addr, size);
 }
+
+/**
+ * wb_cache_pmem - write back processor cache for PMEM memory range
+ * @addr:	virtual start address
+ * @size:	number of bytes to write back
+ *
+ * Write back the processor cache range starting at 'addr' for 'size' bytes.
+ * This function requires explicit ordering with a wmb_pmem() call.
+ */
+static inline void wb_cache_pmem(void __pmem *addr, size_t size)
+{
+	if (arch_has_pmem_api())
+		arch_wb_cache_pmem(addr, size);
+}
 #endif /* __PMEM_H__ */
-- 
2.1.0


* [PATCH v2 02/11] mm: add pmd_mkclean()
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

Currently PMD pages can be dirtied via pmd_mkdirty(), but cannot be
cleaned.  For DAX mmap dirty page tracking we need to be able to clean PMD
pages when we flush them to media so that we get a new write fault the next
time they are written to.
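
A hedged sketch of how this might be used by the dirty tracking code later
in the series (not from this patch; locking and the exact call site are
simplified, and the helper name is made up):

#include <linux/mm.h>
#include <asm/pgtable.h>

/*
 * Illustrative only: write protect and clean a huge page PMD after its
 * data has been flushed, so the next store takes a new write fault.
 */
static void clean_one_pmd(struct vm_area_struct *vma, unsigned long addr,
                          pmd_t *pmdp)
{
        pmd_t pmd = pmdp_huge_clear_flush(vma, addr, pmdp);

        pmd = pmd_wrprotect(pmd);
        pmd = pmd_mkclean(pmd);                 /* new helper from this patch */
        set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
}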

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 arch/x86/include/asm/pgtable.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 867da5b..c548e4c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -277,6 +277,11 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
 }
 
+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+	return pmd_clear_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
+}
+
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_PSE);
-- 
2.1.0


* [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
are sent down via blkdev_issue_flush() in response to an fsync() or msync()
and are used by filesystems to order their metadata, among other things.

When we get an msync() or fsync() it is the responsibility of the DAX code
to flush all dirty pages to media.  The PMEM driver then just has to issue
a wmb_pmem() in response to the REQ_FLUSH to ensure that all of the flushed
data has been durably stored on the media before we return.
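
As a rough sketch of the filesystem-facing side (not part of this patch;
the helper name is made up), a DAX fsync/msync implementation would finish
by sending a flush request that reaches pmem_make_request() as a REQ_FLUSH
bio:

#include <linux/blkdev.h>
#include <linux/gfp.h>

/*
 * Illustrative only: the final step of a DAX fsync/msync after all dirty
 * cache lines have been written back.
 */
static int dax_issue_final_flush(struct block_device *bdev)
{
        return blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
}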

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 drivers/nvdimm/pmem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 0ba6a97..b914d66 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -80,7 +80,7 @@ static void pmem_make_request(struct request_queue *q, struct bio *bio)
 	if (do_acct)
 		nd_iostat_end(bio, start);
 
-	if (bio_data_dir(bio))
+	if (bio_data_dir(bio) || (bio->bi_rw & (REQ_FLUSH|REQ_FUA)))
 		wmb_pmem();
 
 	bio_endio(bio);
@@ -189,6 +189,7 @@ static int pmem_attach_disk(struct device *dev,
 	blk_queue_physical_block_size(pmem->pmem_queue, PAGE_SIZE);
 	blk_queue_max_hw_sectors(pmem->pmem_queue, UINT_MAX);
 	blk_queue_bounce_limit(pmem->pmem_queue, BLK_BOUNCE_ANY);
+	blk_queue_flush(pmem->pmem_queue, REQ_FLUSH|REQ_FUA);
 	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, pmem->pmem_queue);
 
 	disk = alloc_disk(0);
-- 
2.1.0


* [PATCH v2 04/11] dax: support dirty DAX entries in radix tree
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new exceptional
entries into the radix tree that represent dirty DAX PTE or PMD pages.
These exceptional entries will also contain the writeback addresses for the
PTE or PMD faults that we can use at fsync/msync time.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  There
shouldn't be any collisions between these entry types because a given
mapping's radix tree should only ever contain one type of exceptional entry
at a time, depending on how the mapping is being used.
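
As a rough illustration (not part of this patch; the helper names are made
up), the new RADIX_DAX_* macros added below encode a pmem address plus an
entry type into an exceptional entry, and decode it again at writeback
time:

#include <linux/radix-tree.h>
#include <linux/bug.h>

/*
 * Illustrative only: build a PTE-sized dirty DAX entry, and extract the
 * pmem writeback address from an entry found in the tree.
 */
static void *dax_encode_pte_entry(void __pmem *addr)
{
        return RADIX_DAX_PTE_ENTRY(addr);
}

static void __pmem *dax_entry_addr(void *entry)
{
        BUG_ON(RADIX_DAX_TYPE(entry) != RADIX_DAX_PTE &&
               RADIX_DAX_TYPE(entry) != RADIX_DAX_PMD);
        return RADIX_DAX_ADDR(entry);
}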

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/block_dev.c             |  3 ++-
 fs/inode.c                 |  1 +
 include/linux/dax.h        |  5 ++++
 include/linux/fs.h         |  1 +
 include/linux/radix-tree.h |  8 ++++++
 mm/filemap.c               | 10 +++++---
 mm/truncate.c              | 62 ++++++++++++++++++++++++++--------------------
 7 files changed, 59 insertions(+), 31 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 073bb57..afaaf44 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -66,7 +66,8 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrshadows == 0 &&
+			mapping->nrdax == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index 78a17b8..f7c87a6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -496,6 +496,7 @@ void clear_inode(struct inode *inode)
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
 	BUG_ON(inode->i_data.nrshadows);
+	BUG_ON(inode->i_data.nrdax);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..e9d57f68 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 }
+
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..f791698 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -433,6 +433,7 @@ struct address_space {
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	unsigned long		nrshadows;	/* number of shadow entries */
+	unsigned long		nrdax;	        /* number of DAX entries */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 33170db..19a533a 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,6 +51,14 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((__force u64)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_ADDR(entry) ((void __pmem *)((u64)entry & ~RADIX_DAX_MASK))
+#define RADIX_DAX_PTE_ENTRY(addr) ((void *)((__force u64)addr | RADIX_DAX_PTE))
+#define RADIX_DAX_PMD_ENTRY(addr) ((void *)((__force u64)addr | RADIX_DAX_PMD))
+
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
diff --git a/mm/filemap.c b/mm/filemap.c
index 327910c..d5e94fd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -11,6 +11,7 @@
  */
 #include <linux/export.h>
 #include <linux/compiler.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
 #include <linux/capability.h>
@@ -538,6 +539,9 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
+
+		BUG_ON(dax_mapping(mapping));
+
 		if (shadowp)
 			*shadowp = p;
 		mapping->nrshadows--;
@@ -1201,9 +1205,9 @@ repeat:
 			if (radix_tree_deref_retry(page))
 				goto restart;
 			/*
-			 * A shadow entry of a recently evicted page,
-			 * or a swap entry from shmem/tmpfs.  Return
-			 * it without attempting to raise page count.
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
 			 */
 			goto export;
 		}
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad..32d2964 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -9,6 +9,7 @@
 
 #include <linux/kernel.h>
 #include <linux/backing-dev.h>
+#include <linux/dax.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/swap.h>
@@ -34,31 +35,37 @@ static void clear_exceptional_entry(struct address_space *mapping,
 		return;
 
 	spin_lock_irq(&mapping->tree_lock);
-	/*
-	 * Regular page slots are stabilized by the page lock even
-	 * without the tree itself locked.  These unlocked entries
-	 * need verification under the tree lock.
-	 */
-	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
-		goto unlock;
-	if (*slot != entry)
-		goto unlock;
-	radix_tree_replace_slot(slot, NULL);
-	mapping->nrshadows--;
-	if (!node)
-		goto unlock;
-	workingset_node_shadows_dec(node);
-	/*
-	 * Don't track node without shadow entries.
-	 *
-	 * Avoid acquiring the list_lru lock if already untracked.
-	 * The list_empty() test is safe as node->private_list is
-	 * protected by mapping->tree_lock.
-	 */
-	if (!workingset_node_shadows(node) &&
-	    !list_empty(&node->private_list))
-		list_lru_del(&workingset_shadow_nodes, &node->private_list);
-	__radix_tree_delete_node(&mapping->page_tree, node);
+
+	if (dax_mapping(mapping)) {
+		radix_tree_delete(&mapping->page_tree, index);
+		mapping->nrdax--;
+	} else {
+		/*
+		 * Regular page slots are stabilized by the page lock even
+		 * without the tree itself locked.  These unlocked entries
+		 * need verification under the tree lock.
+		 */
+		if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
+			goto unlock;
+		if (*slot != entry)
+			goto unlock;
+		radix_tree_replace_slot(slot, NULL);
+		mapping->nrshadows--;
+		if (!node)
+			goto unlock;
+		workingset_node_shadows_dec(node);
+		/*
+		 * Don't track node without shadow entries.
+		 *
+		 * Avoid acquiring the list_lru lock if already untracked.
+		 * The list_empty() test is safe as node->private_list is
+		 * protected by mapping->tree_lock.
+		 */
+		if (!workingset_node_shadows(node) &&
+		    !list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes, &node->private_list);
+		__radix_tree_delete_node(&mapping->page_tree, node);
+	}
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
@@ -228,7 +235,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrshadows == 0 &&
+			mapping->nrdax == 0)
 		return;
 
 	/* Offsets within partial pages */
@@ -423,7 +431,7 @@ void truncate_inode_pages_final(struct address_space *mapping)
 	smp_rmb();
 	nrshadows = mapping->nrshadows;
 
-	if (nrpages || nrshadows) {
+	if (nrpages || nrshadows || mapping->nrdax) {
 		/*
 		 * As truncation uses a lockless tree lookup, cycle
 		 * the tree lock to make sure any ongoing tree
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v2 04/11] dax: support dirty DAX entries in radix tree
@ 2015-11-14  0:06   ` Ross Zwisler
  0 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new exceptional
entries into the radix tree that represent dirty DAX PTE or PMD pages.
These exceptional entries will also contain the writeback addresses for the
PTE or PMD faults that we can use at fsync/msync time.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  There
shouldn't be any collisions between these various exceptional entries
because only one type of exceptional entry should be able to be found in a
radix tree at a time depending on how it is being used.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/block_dev.c             |  3 ++-
 fs/inode.c                 |  1 +
 include/linux/dax.h        |  5 ++++
 include/linux/fs.h         |  1 +
 include/linux/radix-tree.h |  8 ++++++
 mm/filemap.c               | 10 +++++---
 mm/truncate.c              | 62 ++++++++++++++++++++++++++--------------------
 7 files changed, 59 insertions(+), 31 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 073bb57..afaaf44 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -66,7 +66,8 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrshadows == 0 &&
+			mapping->nrdax == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index 78a17b8..f7c87a6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -496,6 +496,7 @@ void clear_inode(struct inode *inode)
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
 	BUG_ON(inode->i_data.nrshadows);
+	BUG_ON(inode->i_data.nrdax);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..e9d57f68 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 }
+
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..f791698 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -433,6 +433,7 @@ struct address_space {
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	unsigned long		nrshadows;	/* number of shadow entries */
+	unsigned long		nrdax;	        /* number of DAX entries */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 33170db..19a533a 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,6 +51,14 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((__force u64)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_ADDR(entry) ((void __pmem *)((u64)entry & ~RADIX_DAX_MASK))
+#define RADIX_DAX_PTE_ENTRY(addr) ((void *)((__force u64)addr | RADIX_DAX_PTE))
+#define RADIX_DAX_PMD_ENTRY(addr) ((void *)((__force u64)addr | RADIX_DAX_PMD))
+
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
diff --git a/mm/filemap.c b/mm/filemap.c
index 327910c..d5e94fd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -11,6 +11,7 @@
  */
 #include <linux/export.h>
 #include <linux/compiler.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
 #include <linux/capability.h>
@@ -538,6 +539,9 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
+
+		BUG_ON(dax_mapping(mapping));
+
 		if (shadowp)
 			*shadowp = p;
 		mapping->nrshadows--;
@@ -1201,9 +1205,9 @@ repeat:
 			if (radix_tree_deref_retry(page))
 				goto restart;
 			/*
-			 * A shadow entry of a recently evicted page,
-			 * or a swap entry from shmem/tmpfs.  Return
-			 * it without attempting to raise page count.
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
 			 */
 			goto export;
 		}
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad..32d2964 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -9,6 +9,7 @@
 
 #include <linux/kernel.h>
 #include <linux/backing-dev.h>
+#include <linux/dax.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/swap.h>
@@ -34,31 +35,37 @@ static void clear_exceptional_entry(struct address_space *mapping,
 		return;
 
 	spin_lock_irq(&mapping->tree_lock);
-	/*
-	 * Regular page slots are stabilized by the page lock even
-	 * without the tree itself locked.  These unlocked entries
-	 * need verification under the tree lock.
-	 */
-	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
-		goto unlock;
-	if (*slot != entry)
-		goto unlock;
-	radix_tree_replace_slot(slot, NULL);
-	mapping->nrshadows--;
-	if (!node)
-		goto unlock;
-	workingset_node_shadows_dec(node);
-	/*
-	 * Don't track node without shadow entries.
-	 *
-	 * Avoid acquiring the list_lru lock if already untracked.
-	 * The list_empty() test is safe as node->private_list is
-	 * protected by mapping->tree_lock.
-	 */
-	if (!workingset_node_shadows(node) &&
-	    !list_empty(&node->private_list))
-		list_lru_del(&workingset_shadow_nodes, &node->private_list);
-	__radix_tree_delete_node(&mapping->page_tree, node);
+
+	if (dax_mapping(mapping)) {
+		radix_tree_delete(&mapping->page_tree, index);
+		mapping->nrdax--;
+	} else {
+		/*
+		 * Regular page slots are stabilized by the page lock even
+		 * without the tree itself locked.  These unlocked entries
+		 * need verification under the tree lock.
+		 */
+		if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
+			goto unlock;
+		if (*slot != entry)
+			goto unlock;
+		radix_tree_replace_slot(slot, NULL);
+		mapping->nrshadows--;
+		if (!node)
+			goto unlock;
+		workingset_node_shadows_dec(node);
+		/*
+		 * Don't track node without shadow entries.
+		 *
+		 * Avoid acquiring the list_lru lock if already untracked.
+		 * The list_empty() test is safe as node->private_list is
+		 * protected by mapping->tree_lock.
+		 */
+		if (!workingset_node_shadows(node) &&
+		    !list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes, &node->private_list);
+		__radix_tree_delete_node(&mapping->page_tree, node);
+	}
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
@@ -228,7 +235,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrshadows == 0 &&
+			mapping->nrdax == 0)
 		return;
 
 	/* Offsets within partial pages */
@@ -423,7 +431,7 @@ void truncate_inode_pages_final(struct address_space *mapping)
 	smp_rmb();
 	nrshadows = mapping->nrshadows;
 
-	if (nrpages || nrshadows) {
+	if (nrpages || nrshadows || mapping->nrdax) {
 		/*
 		 * As truncation uses a lockless tree lookup, cycle
 		 * the tree lock to make sure any ongoing tree
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v2 04/11] dax: support dirty DAX entries in radix tree
@ 2015-11-14  0:06   ` Ross Zwisler
  0 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, J. Bruce Fields, linux-mm, Andreas Dilger,
	H. Peter Anvin, Jeff Layton, Dan Williams, linux-nvdimm, x86,
	Ingo Molnar, Matthew Wilcox, Ross Zwisler, linux-ext4, xfs,
	Alexander Viro, Thomas Gleixner, Theodore Ts'o, Jan Kara,
	linux-fsdevel, Andrew Morton, Matthew Wilcox

Add support for tracking dirty DAX entries in the struct address_space
radix tree.  This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.

In order to properly track dirty DAX pages we will insert new exceptional
entries into the radix tree that represent dirty DAX PTE or PMD pages.
These exceptional entries will also contain the writeback addresses for the
PTE or PMD faults that we can use at fsync/msync time.

There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third.  There
shouldn't be any collisions between these various exceptional entries
because only one type of exceptional entry should be able to be found in a
radix tree at a time depending on how it is being used.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/block_dev.c             |  3 ++-
 fs/inode.c                 |  1 +
 include/linux/dax.h        |  5 ++++
 include/linux/fs.h         |  1 +
 include/linux/radix-tree.h |  8 ++++++
 mm/filemap.c               | 10 +++++---
 mm/truncate.c              | 62 ++++++++++++++++++++++++++--------------------
 7 files changed, 59 insertions(+), 31 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 073bb57..afaaf44 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -66,7 +66,8 @@ void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrshadows == 0 &&
+			mapping->nrdax == 0)
 		return;
 
 	invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index 78a17b8..f7c87a6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -496,6 +496,7 @@ void clear_inode(struct inode *inode)
 	spin_lock_irq(&inode->i_data.tree_lock);
 	BUG_ON(inode->i_data.nrpages);
 	BUG_ON(inode->i_data.nrshadows);
+	BUG_ON(inode->i_data.nrdax);
 	spin_unlock_irq(&inode->i_data.tree_lock);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..e9d57f68 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma)
 {
 	return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
 }
+
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..f791698 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -433,6 +433,7 @@ struct address_space {
 	/* Protected by tree_lock together with the radix tree */
 	unsigned long		nrpages;	/* number of total pages */
 	unsigned long		nrshadows;	/* number of shadow entries */
+	unsigned long		nrdax;	        /* number of DAX entries */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 33170db..19a533a 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,6 +51,14 @@
 #define RADIX_TREE_EXCEPTIONAL_ENTRY	2
 #define RADIX_TREE_EXCEPTIONAL_SHIFT	2
 
+#define RADIX_DAX_MASK	0xf
+#define RADIX_DAX_PTE  (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD  (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((__force u64)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_ADDR(entry) ((void __pmem *)((u64)entry & ~RADIX_DAX_MASK))
+#define RADIX_DAX_PTE_ENTRY(addr) ((void *)((__force u64)addr | RADIX_DAX_PTE))
+#define RADIX_DAX_PMD_ENTRY(addr) ((void *)((__force u64)addr | RADIX_DAX_PMD))
+
 static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
 	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
diff --git a/mm/filemap.c b/mm/filemap.c
index 327910c..d5e94fd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -11,6 +11,7 @@
  */
 #include <linux/export.h>
 #include <linux/compiler.h>
+#include <linux/dax.h>
 #include <linux/fs.h>
 #include <linux/uaccess.h>
 #include <linux/capability.h>
@@ -538,6 +539,9 @@ static int page_cache_tree_insert(struct address_space *mapping,
 		p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
 		if (!radix_tree_exceptional_entry(p))
 			return -EEXIST;
+
+		BUG_ON(dax_mapping(mapping));
+
 		if (shadowp)
 			*shadowp = p;
 		mapping->nrshadows--;
@@ -1201,9 +1205,9 @@ repeat:
 			if (radix_tree_deref_retry(page))
 				goto restart;
 			/*
-			 * A shadow entry of a recently evicted page,
-			 * or a swap entry from shmem/tmpfs.  Return
-			 * it without attempting to raise page count.
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
 			 */
 			goto export;
 		}
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad..32d2964 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -9,6 +9,7 @@
 
 #include <linux/kernel.h>
 #include <linux/backing-dev.h>
+#include <linux/dax.h>
 #include <linux/gfp.h>
 #include <linux/mm.h>
 #include <linux/swap.h>
@@ -34,31 +35,37 @@ static void clear_exceptional_entry(struct address_space *mapping,
 		return;
 
 	spin_lock_irq(&mapping->tree_lock);
-	/*
-	 * Regular page slots are stabilized by the page lock even
-	 * without the tree itself locked.  These unlocked entries
-	 * need verification under the tree lock.
-	 */
-	if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
-		goto unlock;
-	if (*slot != entry)
-		goto unlock;
-	radix_tree_replace_slot(slot, NULL);
-	mapping->nrshadows--;
-	if (!node)
-		goto unlock;
-	workingset_node_shadows_dec(node);
-	/*
-	 * Don't track node without shadow entries.
-	 *
-	 * Avoid acquiring the list_lru lock if already untracked.
-	 * The list_empty() test is safe as node->private_list is
-	 * protected by mapping->tree_lock.
-	 */
-	if (!workingset_node_shadows(node) &&
-	    !list_empty(&node->private_list))
-		list_lru_del(&workingset_shadow_nodes, &node->private_list);
-	__radix_tree_delete_node(&mapping->page_tree, node);
+
+	if (dax_mapping(mapping)) {
+		radix_tree_delete(&mapping->page_tree, index);
+		mapping->nrdax--;
+	} else {
+		/*
+		 * Regular page slots are stabilized by the page lock even
+		 * without the tree itself locked.  These unlocked entries
+		 * need verification under the tree lock.
+		 */
+		if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
+			goto unlock;
+		if (*slot != entry)
+			goto unlock;
+		radix_tree_replace_slot(slot, NULL);
+		mapping->nrshadows--;
+		if (!node)
+			goto unlock;
+		workingset_node_shadows_dec(node);
+		/*
+		 * Don't track node without shadow entries.
+		 *
+		 * Avoid acquiring the list_lru lock if already untracked.
+		 * The list_empty() test is safe as node->private_list is
+		 * protected by mapping->tree_lock.
+		 */
+		if (!workingset_node_shadows(node) &&
+		    !list_empty(&node->private_list))
+			list_lru_del(&workingset_shadow_nodes, &node->private_list);
+		__radix_tree_delete_node(&mapping->page_tree, node);
+	}
 unlock:
 	spin_unlock_irq(&mapping->tree_lock);
 }
@@ -228,7 +235,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	int		i;
 
 	cleancache_invalidate_inode(mapping);
-	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+	if (mapping->nrpages == 0 && mapping->nrshadows == 0 &&
+			mapping->nrdax == 0)
 		return;
 
 	/* Offsets within partial pages */
@@ -423,7 +431,7 @@ void truncate_inode_pages_final(struct address_space *mapping)
 	smp_rmb();
 	nrshadows = mapping->nrshadows;
 
-	if (nrpages || nrshadows) {
+	if (nrpages || nrshadows || mapping->nrdax) {
 		/*
 		 * As truncation uses a lockless tree lookup, cycle
 		 * the tree lock to make sure any ongoing tree
-- 
2.1.0
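
A quick illustration (editor's sketch, not part of the patch) of how the
new RADIX_DAX_* macros fit together: the entry type is packed into the low
bits of a 16-byte-aligned pmem address and recovered again later.  The
function name below is made up; wb_cache_pmem() is the pmem write-back
helper used by the DAX fsync code later in this series.

	static void radix_dax_entry_example(void __pmem *kaddr)
	{
		/* pack the type into the low bits of the aligned address */
		void *entry = RADIX_DAX_PTE_ENTRY(kaddr);

		/* ...and recover both type and address when flushing */
		if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
			wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE);
		else
			wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE);
	}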

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v2 05/11] mm: add follow_pte_pmd()
  2015-11-14  0:06 ` Ross Zwisler
@ 2015-11-14  0:06   ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a
huge page PMD leaf to be found and returned.
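
As a usage sketch (editor's illustration, not part of the patch): on
success the relevant page table lock is returned locked through *ptlp and
the caller must drop it, with pte_unmap_unlock() for the PTE case and
spin_unlock() for the PMD case.  This mirrors the pgoff_mkclean() caller
added later in this series; the function name below is made up.

	static void follow_pte_pmd_example(struct mm_struct *mm,
			unsigned long address)
	{
		pte_t *ptep = NULL;
		pmd_t *pmdp = NULL;
		spinlock_t *ptl;

		if (follow_pte_pmd(mm, address, &ptep, &pmdp, &ptl))
			return;		/* no PTE or PMD leaf mapped here */

		if (pmdp) {
			/* huge page PMD leaf: *pmdp is valid, PMD lock held */
			spin_unlock(ptl);
		} else {
			/* PTE leaf: *ptep is valid, PTE lock held */
			pte_unmap_unlock(ptep, ptl);
		}
	}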

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 38 ++++++++++++++++++++++++++++++--------
 2 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80001de..393441c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1166,6 +1166,8 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
+int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+			     pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index deb679c..7f4090e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3512,8 +3512,8 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-static int __follow_pte(struct mm_struct *mm, unsigned long address,
-		pte_t **ptepp, spinlock_t **ptlp)
+static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+		pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
 {
 	pgd_t *pgd;
 	pud_t *pud;
@@ -3529,12 +3529,20 @@ static int __follow_pte(struct mm_struct *mm, unsigned long address,
 		goto out;
 
 	pmd = pmd_offset(pud, address);
-	VM_BUG_ON(pmd_trans_huge(*pmd));
-	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
-		goto out;
 
-	/* We cannot handle huge page PFN maps. Luckily they don't exist. */
-	if (pmd_huge(*pmd))
+	if (pmd_huge(*pmd)) {
+		if (!pmdpp)
+			goto out;
+
+		*ptlp = pmd_lock(mm, pmd);
+		if (pmd_huge(*pmd)) {
+			*pmdpp = pmd;
+			return 0;
+		}
+		spin_unlock(*ptlp);
+	}
+
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		goto out;
 
 	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
@@ -3557,9 +3565,23 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address,
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte(mm, address, ptepp, ptlp)));
+			   !(res = __follow_pte_pmd(mm, address, ptepp, NULL,
+					   ptlp)));
+	return res;
+}
+
+int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+			     pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
+{
+	int res;
+
+	/* (void) is needed to make gcc happy */
+	(void) __cond_lock(*ptlp,
+			   !(res = __follow_pte_pmd(mm, address, ptepp, pmdpp,
+					   ptlp)));
 	return res;
 }
+EXPORT_SYMBOL(follow_pte_pmd);
 
 /**
  * follow_pfn - look up PFN at a user virtual address
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v2 06/11] mm: add pgoff_mkclean()
  2015-11-14  0:06 ` Ross Zwisler
@ 2015-11-14  0:06   ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

Introduce pgoff_mkclean(), which is conceptually similar to page_mkclean()
except that it works in the absence of a struct page and can also be used
to clean PMDs.  This is needed for DAX's dirty page handling.

pgoff_mkclean() doesn't return an error for a missing PTE/PMD when looping
through the VMAs because it is not a requirement that each of the
potentially many VMAs associated with a given struct address_space has a
mapping set up for our pgoff.
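
A minimal sketch of the intended caller (editor's illustration mirroring
the DAX fsync path added later in this series; flush_and_clean() is a
made-up name): flush the data for the offset first, then let
pgoff_mkclean() write protect and clean every mapping of that offset so
the next store takes a fresh write fault.

	static void flush_and_clean(struct address_space *mapping,
			pgoff_t pgoff, void __pmem *addr, size_t size)
	{
		/* write the data back to persistent media first... */
		wb_cache_pmem(addr, size);

		/*
		 * ...then clean and write protect all PTEs/PMDs mapping
		 * this offset, so a later write triggers a new write
		 * fault and the page can be tracked as dirty again.
		 */
		pgoff_mkclean(pgoff, mapping);
	}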

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/rmap.h |  5 +++++
 mm/rmap.c            | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 29446ae..171a4ac 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -223,6 +223,11 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 int page_mkclean(struct page *);
 
 /*
+ * Cleans and write protects the PTEs of shared mappings.
+ */
+void pgoff_mkclean(pgoff_t, struct address_space *);
+
+/*
  * called in munlock()/munmap() path to check for other vmas holding
  * the page mlocked.
  */
diff --git a/mm/rmap.c b/mm/rmap.c
index f5b5c1f..8114862 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -586,6 +586,16 @@ vma_address(struct page *page, struct vm_area_struct *vma)
 	return address;
 }
 
+static inline unsigned long
+pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
+{
+	unsigned long address;
+
+	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
+	return address;
+}
+
 #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
 static void percpu_flush_tlb_batch_pages(void *data)
 {
@@ -1040,6 +1050,47 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
+void pgoff_mkclean(pgoff_t pgoff, struct address_space *mapping)
+{
+	struct vm_area_struct *vma;
+	int ret = 0;
+
+	i_mmap_lock_read(mapping);
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+		struct mm_struct *mm = vma->vm_mm;
+		pmd_t pmd, *pmdp = NULL;
+		pte_t pte, *ptep = NULL;
+		unsigned long address;
+		spinlock_t *ptl;
+
+		address = pgoff_address(pgoff, vma);
+
+		/* when this returns successfully ptl is locked */
+		ret = follow_pte_pmd(mm, address, &ptep, &pmdp, &ptl);
+		if (ret)
+			continue;
+
+		if (pmdp) {
+			flush_cache_page(vma, address, pmd_pfn(*pmdp));
+			pmd = pmdp_huge_clear_flush(vma, address, pmdp);
+			pmd = pmd_wrprotect(pmd);
+			pmd = pmd_mkclean(pmd);
+			set_pmd_at(mm, address, pmdp, pmd);
+			spin_unlock(ptl);
+		} else {
+			BUG_ON(!ptep);
+			flush_cache_page(vma, address, pte_pfn(*ptep));
+			pte = ptep_clear_flush(vma, address, ptep);
+			pte = pte_wrprotect(pte);
+			pte = pte_mkclean(pte);
+			set_pte_at(mm, address, ptep, pte);
+			pte_unmap_unlock(ptep, ptl);
+		}
+	}
+	i_mmap_unlock_read(mapping);
+}
+EXPORT_SYMBOL_GPL(pgoff_mkclean);
+
 /**
  * page_move_anon_rmap - move a page to our anon_vma
  * @page:	the page to move to our anon_vma
-- 
2.1.0
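
For concreteness, the address calculation in pgoff_address() above works
out as follows (editor's example with made-up numbers, assuming 4 KiB
pages):

	/*
	 * A VMA with vm_start = 0x7f0000000000 maps the file starting at
	 * page offset vm_pgoff = 16.  For pgoff = 18:
	 *
	 *	address = vm_start + ((18 - 16) << PAGE_SHIFT)
	 *		= 0x7f0000000000 + 2 * 4096
	 *		= 0x7f0000002000
	 *
	 * The VM_BUG_ON_VMA() only fires if the result falls outside
	 * [vm_start, vm_end), i.e. if this VMA does not actually cover
	 * the requested pgoff.
	 */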

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v2 07/11] mm: add find_get_entries_tag()
  2015-11-14  0:06 ` Ross Zwisler
@ 2015-11-14  0:06   ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

Add find_get_entries_tag() to the family of functions that includes
find_get_entries(), find_get_pages() and find_get_pages_tag().  This is
needed for DAX dirty page handling because we need a list of both page
offsets and radix tree entries ('indices' and 'entries' in this function)
that are marked with the PAGECACHE_TAG_TOWRITE tag.
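
A short sketch of the intended usage (editor's illustration modelled on
the DAX fsync loop added later in this series; the function name is made
up).  For DAX mappings the returned entries are exceptional, so no page
reference is taken; the caller must clear the tag on each entry or advance
the start index so the loop makes progress.

	static void walk_towrite_entries(struct address_space *mapping,
			pgoff_t start)
	{
		pgoff_t indices[PAGEVEC_SIZE];
		struct pagevec pvec;
		int i;

		pagevec_init(&pvec, 0);
		for (;;) {
			pvec.nr = find_get_entries_tag(mapping, start,
					PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
					pvec.pages, indices);
			if (pvec.nr == 0)
				break;

			for (i = 0; i < pvec.nr; i++) {
				/* indices[i]: file offset; pvec.pages[i]: entry */
			}

			/* move past the last entry handled */
			start = indices[pvec.nr - 1] + 1;
		}
	}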

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 include/linux/pagemap.h |  3 +++
 mm/filemap.c            | 61 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a6c78e0..6fea3be 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -354,6 +354,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
 			       unsigned int nr_pages, struct page **pages);
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+			int tag, unsigned int nr_entries,
+			struct page **entries, pgoff_t *indices);
 
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
 			pgoff_t index, unsigned flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index d5e94fd..89ab448 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1454,6 +1454,67 @@ repeat:
 }
 EXPORT_SYMBOL(find_get_pages_tag);
 
+/**
+ * find_get_entries_tag - find and return entries that match @tag
+ * @mapping:	the address_space to search
+ * @start:	the starting page cache index
+ * @tag:	the tag index
+ * @nr_entries:	the maximum number of entries
+ * @entries:	where the resulting entries are placed
+ * @indices:	the cache indices corresponding to the entries in @entries
+ *
+ * Like find_get_entries, except we only return entries which are tagged with
+ * @tag.
+ */
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+			int tag, unsigned int nr_entries,
+			struct page **entries, pgoff_t *indices)
+{
+	void **slot;
+	unsigned int ret = 0;
+	struct radix_tree_iter iter;
+
+	if (!nr_entries)
+		return 0;
+
+	rcu_read_lock();
+restart:
+	radix_tree_for_each_tagged(slot, &mapping->page_tree,
+				   &iter, start, tag) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot(slot);
+		if (unlikely(!page))
+			continue;
+		if (radix_tree_exception(page)) {
+			if (radix_tree_deref_retry(page))
+				goto restart;
+			/*
+			 * A shadow entry of a recently evicted page, a swap
+			 * entry from shmem/tmpfs or a DAX entry.  Return it
+			 * without attempting to raise page count.
+			 */
+			goto export;
+		}
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *slot)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+export:
+		indices[ret] = iter.index;
+		entries[ret] = page;
+		if (++ret == nr_entries)
+			break;
+	}
+	rcu_read_unlock();
+	return ret;
+}
+EXPORT_SYMBOL(find_get_entries_tag);
+
 /*
  * CD/DVDs are error prone. When a medium error occurs, the driver may fail
  * a _large_ part of the i/o request. Imagine the worst scenario:
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v2 08/11] dax: add support for fsync/sync
  2015-11-14  0:06 ` Ross Zwisler
@ 2015-11-14  0:06   ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

To handle fsync/msync properly and efficiently, DAX needs to track dirty
pages so that it can flush them durably to media on demand.

The tracking of dirty pages is done via the radix tree in struct
address_space.  This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file, and
it already has support for exceptional (non struct page*) entries.  We
build upon these features to add exceptional entries to the radix tree for
DAX dirty PMD or PTE pages at fault time.

When called as part of the msync/fsync flush path, DAX queries the radix
tree for dirty entries, flushes them, and then marks the corresponding PTE
or PMD page table entries as clean.  Cleaning the PTE or PMD entries is
necessary so that a subsequent write to the same page takes a new write
fault, which lets us dirty the DAX tag in the radix tree once again.
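
As a rough sketch of how a filesystem is expected to hook this up
(editor's illustration; the per-filesystem wiring lands later in the
series and example_fsync() is a made-up name), the fsync path calls
dax_fsync() on the affected byte range before doing its usual metadata
sync:

	static int example_fsync(struct file *file, loff_t start, loff_t end,
			int datasync)
	{
		struct inode *inode = file_inode(file);

		if (IS_DAX(inode))
			dax_fsync(inode->i_mapping, start, end);

		/* the filesystem's normal journal/metadata sync follows */
		return 0;
	}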

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 140 +++++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/dax.h |   1 +
 mm/huge_memory.c    |  14 +++---
 3 files changed, 141 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 131fd35a..9ce6d1b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -24,7 +24,9 @@
 #include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/pagevec.h>
 #include <linux/pmem.h>
+#include <linux/rmap.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
@@ -287,6 +289,53 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
 	return 0;
 }
 
+static int dax_dirty_pgoff(struct address_space *mapping, unsigned long pgoff,
+		void __pmem *addr, bool pmd_entry)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int error = 0;
+	void *entry;
+
+	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = radix_tree_lookup(page_tree, pgoff);
+	if (addr == NULL) {
+		if (entry)
+			goto dirty;
+		else {
+			WARN(1, "DAX pfn_mkwrite failed to find an entry");
+			goto out;
+		}
+	}
+
+	if (entry) {
+		if (pmd_entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
+			radix_tree_delete(&mapping->page_tree, pgoff);
+			mapping->nrdax--;
+		} else
+			goto dirty;
+	}
+
+	BUG_ON(RADIX_DAX_TYPE(addr));
+	if (pmd_entry)
+		error = radix_tree_insert(page_tree, pgoff,
+				RADIX_DAX_PMD_ENTRY(addr));
+	else
+		error = radix_tree_insert(page_tree, pgoff,
+				RADIX_DAX_PTE_ENTRY(addr));
+
+	if (error)
+		goto out;
+
+	mapping->nrdax++;
+ dirty:
+	radix_tree_tag_set(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
+ out:
+	spin_unlock_irq(&mapping->tree_lock);
+	return error;
+}
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
@@ -327,7 +376,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 
 	error = vm_insert_mixed(vma, vaddr, pfn);
+	if (error)
+		goto out;
 
+	error = dax_dirty_pgoff(mapping, vmf->pgoff, addr, false);
  out:
 	i_mmap_unlock_read(mapping);
 
@@ -450,6 +502,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		delete_from_page_cache(page);
 		unlock_page(page);
 		page_cache_release(page);
+		page = NULL;
 	}
 
 	/*
@@ -537,7 +590,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	pgoff_t size, pgoff;
 	sector_t block, sector;
 	unsigned long pfn;
-	int result = 0;
+	int error, result = 0;
 
 	/* Fall back to PTEs if we're going to COW */
 	if (write && !(vma->vm_flags & VM_SHARED))
@@ -638,6 +691,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 
 		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+
+		error = dax_dirty_pgoff(mapping, pgoff, kaddr, true);
+		if (error)
+			goto fallback;
 	}
 
  out:
@@ -689,15 +746,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
- *
  */
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+	struct file *file = vma->vm_file;
 
-	sb_start_pagefault(sb);
-	file_update_time(vma->vm_file);
-	sb_end_pagefault(sb);
+	dax_dirty_pgoff(file->f_mapping, vmf->pgoff, NULL, false);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
@@ -772,3 +826,77 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 	return dax_zero_page_range(inode, from, length, get_block);
 }
 EXPORT_SYMBOL_GPL(dax_truncate_page);
+
+static void dax_sync_entry(struct address_space *mapping, pgoff_t pgoff,
+		void *entry)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int type = RADIX_DAX_TYPE(entry);
+	size_t size;
+
+	BUG_ON(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD);
+
+	spin_lock_irq(&mapping->tree_lock);
+	if (!radix_tree_tag_get(page_tree, pgoff, PAGECACHE_TAG_TOWRITE)) {
+		/* another fsync thread already wrote back this entry */
+		spin_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+	radix_tree_tag_clear(page_tree, pgoff, PAGECACHE_TAG_TOWRITE);
+	radix_tree_tag_clear(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	if (type == RADIX_DAX_PMD)
+		size = PMD_SIZE;
+	else
+		size = PAGE_SIZE;
+
+	wb_cache_pmem(RADIX_DAX_ADDR(entry), size);
+	pgoff_mkclean(pgoff, mapping);
+}
+
+/*
+ * Flush the mapping to the persistent domain within the byte range of (start,
+ * end). This is required by data integrity operations to ensure file data is on
+ * persistent storage prior to completion of the operation. It also requires us
+ * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when
+ * the file is written to again so we have an indication that we need to flush
+ * the mapping if a data integrity operation takes place.
+ *
+ * We don't need commits to storage here - the filesystems will issue flushes
+ * appropriately at the conclusion of the data integrity operation via REQ_FUA
+ * writes or blkdev_issue_flush() commands.  This requires the DAX block device
+ * to implement persistent storage domain fencing/commits on receiving a
+ * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher
+ * layers.
+ */
+void dax_fsync(struct address_space *mapping, loff_t start, loff_t end)
+{
+	struct inode *inode = mapping->host;
+	pgoff_t indices[PAGEVEC_SIZE];
+	struct pagevec pvec;
+	int i;
+
+	pgoff_t start_page = start >> PAGE_CACHE_SHIFT;
+	pgoff_t end_page = end >> PAGE_CACHE_SHIFT;
+
+	if (mapping->nrdax == 0)
+		return;
+
+	BUG_ON(inode->i_blkbits != PAGE_SHIFT);
+
+	tag_pages_for_writeback(mapping, start_page, end_page);
+
+	pagevec_init(&pvec, 0);
+	while (1) {
+		pvec.nr = find_get_entries_tag(mapping, start_page,
+				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
+				pvec.pages, indices);
+
+		if (pvec.nr == 0)
+			break;
+
+		for (i = 0; i < pvec.nr; i++)
+			dax_sync_entry(mapping, indices[i], pvec.pages[i]);
+	}
+}
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9d57f68..2b3ce6f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,4 +41,5 @@ static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
+void dax_fsync(struct address_space *mapping, loff_t start, loff_t end);
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bbac913..1b3df56 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	spinlock_t *ptl;
 
 	ptl = pmd_lock(mm, pmd);
-	if (pmd_none(*pmd)) {
-		entry = pmd_mkhuge(pfn_pmd(pfn, prot));
-		if (write) {
-			entry = pmd_mkyoung(pmd_mkdirty(entry));
-			entry = maybe_pmd_mkwrite(entry, vma);
-		}
-		set_pmd_at(mm, addr, pmd, entry);
-		update_mmu_cache_pmd(vma, addr, pmd);
+	entry = pmd_mkhuge(pfn_pmd(pfn, prot));
+	if (write) {
+		entry = pmd_mkyoung(pmd_mkdirty(entry));
+		entry = maybe_pmd_mkwrite(entry, vma);
 	}
+	set_pmd_at(mm, addr, pmd, entry);
+	update_mmu_cache_pmd(vma, addr, pmd);
 	spin_unlock(ptl);
 }
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v2 08/11] dax: add support for fsync/sync
@ 2015-11-14  0:06   ` Ross Zwisler
  0 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

To properly handle fsync/msync in an efficient way DAX needs to track dirty
pages so it is able to flush them durably to media on demand.

The tracking of dirty pages is done via the radix tree in struct
address_space.  This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file, and
it already has support for exceptional (non struct page*) entries.  We
build upon these features to add exceptional entries to the radix tree for
DAX dirty PMD or PTE pages at fault time.

When called as part of the msync/fsync flush path DAX queries the radix
tree for dirty entries, flushing them and then marking the PTE or PMD page
table entries as clean.  The step of cleaning the PTE or PMD entries is
necessary so that on subsequent writes to the same page we get a new write
fault allowing us to once again dirty the DAX tag in the radix tree.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c            | 140 +++++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/dax.h |   1 +
 mm/huge_memory.c    |  14 +++---
 3 files changed, 141 insertions(+), 14 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 131fd35a..9ce6d1b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -24,7 +24,9 @@
 #include <linux/memcontrol.h>
 #include <linux/mm.h>
 #include <linux/mutex.h>
+#include <linux/pagevec.h>
 #include <linux/pmem.h>
+#include <linux/rmap.h>
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
@@ -287,6 +289,53 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
 	return 0;
 }
 
+static int dax_dirty_pgoff(struct address_space *mapping, unsigned long pgoff,
+		void __pmem *addr, bool pmd_entry)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int error = 0;
+	void *entry;
+
+	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	spin_lock_irq(&mapping->tree_lock);
+	entry = radix_tree_lookup(page_tree, pgoff);
+	if (addr == NULL) {
+		if (entry)
+			goto dirty;
+		else {
+			WARN(1, "DAX pfn_mkwrite failed to find an entry");
+			goto out;
+		}
+	}
+
+	if (entry) {
+		if (pmd_entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
+			radix_tree_delete(&mapping->page_tree, pgoff);
+			mapping->nrdax--;
+		} else
+			goto dirty;
+	}
+
+	BUG_ON(RADIX_DAX_TYPE(addr));
+	if (pmd_entry)
+		error = radix_tree_insert(page_tree, pgoff,
+				RADIX_DAX_PMD_ENTRY(addr));
+	else
+		error = radix_tree_insert(page_tree, pgoff,
+				RADIX_DAX_PTE_ENTRY(addr));
+
+	if (error)
+		goto out;
+
+	mapping->nrdax++;
+ dirty:
+	radix_tree_tag_set(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
+ out:
+	spin_unlock_irq(&mapping->tree_lock);
+	return error;
+}
+
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
@@ -327,7 +376,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 	}
 
 	error = vm_insert_mixed(vma, vaddr, pfn);
+	if (error)
+		goto out;
 
+	error = dax_dirty_pgoff(mapping, vmf->pgoff, addr, false);
  out:
 	i_mmap_unlock_read(mapping);
 
@@ -450,6 +502,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 		delete_from_page_cache(page);
 		unlock_page(page);
 		page_cache_release(page);
+		page = NULL;
 	}
 
 	/*
@@ -537,7 +590,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	pgoff_t size, pgoff;
 	sector_t block, sector;
 	unsigned long pfn;
-	int result = 0;
+	int error, result = 0;
 
 	/* Fall back to PTEs if we're going to COW */
 	if (write && !(vma->vm_flags & VM_SHARED))
@@ -638,6 +691,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		}
 
 		result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+
+		error = dax_dirty_pgoff(mapping, pgoff, kaddr, true);
+		if (error)
+			goto fallback;
 	}
 
  out:
@@ -689,15 +746,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
  * dax_pfn_mkwrite - handle first write to DAX page
  * @vma: The virtual memory area where the fault occurred
  * @vmf: The description of the fault
- *
  */
 int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+	struct file *file = vma->vm_file;
 
-	sb_start_pagefault(sb);
-	file_update_time(vma->vm_file);
-	sb_end_pagefault(sb);
+	dax_dirty_pgoff(file->f_mapping, vmf->pgoff, NULL, false);
 	return VM_FAULT_NOPAGE;
 }
 EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
@@ -772,3 +826,77 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 	return dax_zero_page_range(inode, from, length, get_block);
 }
 EXPORT_SYMBOL_GPL(dax_truncate_page);
+
+static void dax_sync_entry(struct address_space *mapping, pgoff_t pgoff,
+		void *entry)
+{
+	struct radix_tree_root *page_tree = &mapping->page_tree;
+	int type = RADIX_DAX_TYPE(entry);
+	size_t size;
+
+	BUG_ON(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD);
+
+	spin_lock_irq(&mapping->tree_lock);
+	if (!radix_tree_tag_get(page_tree, pgoff, PAGECACHE_TAG_TOWRITE)) {
+		/* another fsync thread already wrote back this entry */
+		spin_unlock_irq(&mapping->tree_lock);
+		return;
+	}
+	radix_tree_tag_clear(page_tree, pgoff, PAGECACHE_TAG_TOWRITE);
+	radix_tree_tag_clear(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
+	spin_unlock_irq(&mapping->tree_lock);
+
+	if (type == RADIX_DAX_PMD)
+		size = PMD_SIZE;
+	else
+		size = PAGE_SIZE;
+
+	wb_cache_pmem(RADIX_DAX_ADDR(entry), size);
+	pgoff_mkclean(pgoff, mapping);
+}
+
+/*
+ * Flush the mapping to the persistent domain within the byte range of (start,
+ * end). This is required by data integrity operations to ensure file data is on
+ * persistent storage prior to completion of the operation. It also requires us
+ * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when
+ * the file is written to again so we have an indication that we need to flush
+ * the mapping if a data integrity operation takes place.
+ *
+ * We don't need commits to storage here - the filesystems will issue flushes
+ * appropriately at the conclusion of the data integrity operation via REQ_FUA
+ * writes or blkdev_issue_flush() commands.  This requires the DAX block device
+ * to implement persistent storage domain fencing/commits on receiving a
+ * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher
+ * layers.
+ */
+void dax_fsync(struct address_space *mapping, loff_t start, loff_t end)
+{
+	struct inode *inode = mapping->host;
+	pgoff_t indices[PAGEVEC_SIZE];
+	struct pagevec pvec;
+	int i;
+
+	pgoff_t start_page = start >> PAGE_CACHE_SHIFT;
+	pgoff_t end_page = end >> PAGE_CACHE_SHIFT;
+
+	if (mapping->nrdax == 0)
+		return;
+
+	BUG_ON(inode->i_blkbits != PAGE_SHIFT);
+
+	tag_pages_for_writeback(mapping, start_page, end_page);
+
+	pagevec_init(&pvec, 0);
+	while (1) {
+		pvec.nr = find_get_entries_tag(mapping, start_page,
+				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
+				pvec.pages, indices);
+
+		if (pvec.nr == 0)
+			break;
+
+		for (i = 0; i < pvec.nr; i++)
+			dax_sync_entry(mapping, indices[i], pvec.pages[i]);
+	}
+}
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9d57f68..2b3ce6f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,4 +41,5 @@ static inline bool dax_mapping(struct address_space *mapping)
 {
 	return mapping->host && IS_DAX(mapping->host);
 }
+void dax_fsync(struct address_space *mapping, loff_t start, loff_t end);
 #endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bbac913..1b3df56 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	spinlock_t *ptl;
 
 	ptl = pmd_lock(mm, pmd);
-	if (pmd_none(*pmd)) {
-		entry = pmd_mkhuge(pfn_pmd(pfn, prot));
-		if (write) {
-			entry = pmd_mkyoung(pmd_mkdirty(entry));
-			entry = maybe_pmd_mkwrite(entry, vma);
-		}
-		set_pmd_at(mm, addr, pmd, entry);
-		update_mmu_cache_pmd(vma, addr, pmd);
+	entry = pmd_mkhuge(pfn_pmd(pfn, prot));
+	if (write) {
+		entry = pmd_mkyoung(pmd_mkdirty(entry));
+		entry = maybe_pmd_mkwrite(entry, vma);
 	}
+	set_pmd_at(mm, addr, pmd, entry);
+	update_mmu_cache_pmd(vma, addr, pmd);
 	spin_unlock(ptl);
 }
 
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 132+ messages in thread
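For context, a minimal userspace sketch of the operation this series makes
safe: a store through a DAX mmap() followed by msync()/fsync().  The mount
point and file name are hypothetical and error handling is abbreviated.

	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/mnt/pmem0/file", O_RDWR);  /* hypothetical DAX file */
		char *p;

		if (fd < 0)
			return 1;

		p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		memcpy(p, "hello", 5);     /* store lands directly in pmem via DAX */
		msync(p, 4096, MS_SYNC);   /* relies on the new dax_fsync() machinery */
		fsync(fd);                 /* same requirement via the fsync() path */

		munmap(p, 4096);
		close(fd);
		return 0;
	}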

* [PATCH v2 09/11] ext2: add support for DAX fsync/msync
  2015-11-14  0:06 ` Ross Zwisler
  (?)
@ 2015-11-14  0:06   ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

To properly support the new DAX fsync/msync infrastructure, filesystems
need to call dax_pfn_mkwrite() so that DAX can track when a user write
faults on a previously cleaned address.  They also need to call
dax_fsync() in the filesystem fsync() path.  This dax_fsync() call uses
addresses retrieved from get_block(), so it needs to be ordered with
respect to truncate.  This is accomplished by using the same locking
that was set up for DAX page faults.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext2/file.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 11a42c5..6c30ea2 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 {
 	struct inode *inode = file_inode(vma->vm_file);
 	struct ext2_inode_info *ei = EXT2_I(inode);
-	int ret = VM_FAULT_NOPAGE;
 	loff_t size;
+	int ret;
 
 	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
@@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vma, vmf);
 
 	up_read(&ei->dax_sem);
 	sb_end_pagefault(inode->i_sb);
@@ -161,6 +163,16 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	struct super_block *sb = file->f_mapping->host->i_sb;
 	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
 
+#ifdef CONFIG_FS_DAX
+	if (dax_mapping(mapping)) {
+		struct ext2_inode_info *ei = EXT2_I(file_inode(file));
+
+		down_read(&ei->dax_sem);
+		dax_fsync(mapping, start, end);
+		up_read(&ei->dax_sem);
+	}
+#endif
+
 	ret = generic_file_fsync(file, start, end, datasync);
 	if (ret == -EIO || test_and_clear_bit(AS_EIO, &mapping->flags)) {
 		/* We don't really know where the IO error happened... */
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 132+ messages in thread
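The ext4 and XFS patches later in this series wire up the same two hooks.
Condensed into a hypothetical filesystem (the foofs names and the dax_sem
field are invented for illustration, not taken from the patches), the fsync
side of the pattern looks roughly like:

	int foofs_fsync(struct file *file, loff_t start, loff_t end, int datasync)
	{
		struct address_space *mapping = file->f_mapping;
		struct foofs_inode_info *fi = FOOFS_I(file_inode(file));
		int ret;

		if (dax_mapping(mapping)) {
			/* same lock that orders DAX faults against truncate */
			down_read(&fi->dax_sem);
			dax_fsync(mapping, start, end);
			up_read(&fi->dax_sem);
		}

		/* remaining page-cache and metadata writeback (datasync
		 * handling elided from this sketch) */
		ret = filemap_write_and_wait_range(mapping, start, end);
		return ret;
	}

The pfn_mkwrite side, as in the ext2 diff above, simply calls
dax_pfn_mkwrite() under the same lock so that the radix tree entry is
re-dirtied on the first write after a flush.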

* [PATCH v2 10/11] ext4: add support for DAX fsync/msync
  2015-11-14  0:06 ` Ross Zwisler
  (?)
  (?)
@ 2015-11-14  0:06   ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

To properly support the new DAX fsync/msync infrastructure, filesystems
need to call dax_pfn_mkwrite() so that DAX can track when a user write
faults on a previously cleaned address.  They also need to call
dax_fsync() in the filesystem fsync() path.  This dax_fsync() call uses
addresses retrieved from get_block(), so it needs to be ordered with
respect to truncate.  This is accomplished by using the same locking
that was set up for DAX page faults.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/ext4/file.c  |  4 +++-
 fs/ext4/fsync.c | 12 ++++++++++--
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 749b222..8c8965c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -291,8 +291,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
 {
 	struct inode *inode = file_inode(vma->vm_file);
 	struct super_block *sb = inode->i_sb;
-	int ret = VM_FAULT_NOPAGE;
 	loff_t size;
+	int ret;
 
 	sb_start_pagefault(sb);
 	file_update_time(vma->vm_file);
@@ -300,6 +300,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else
+		ret = dax_pfn_mkwrite(vma, vmf);
 	up_read(&EXT4_I(inode)->i_mmap_sem);
 	sb_end_pagefault(sb);
 
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 8850254..e87c29b 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -27,6 +27,7 @@
 #include <linux/sched.h>
 #include <linux/writeback.h>
 #include <linux/blkdev.h>
+#include <linux/dax.h>
 
 #include "ext4.h"
 #include "ext4_jbd2.h"
@@ -86,7 +87,8 @@ static int ext4_sync_parent(struct inode *inode)
 
 int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 {
-	struct inode *inode = file->f_mapping->host;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 	int ret = 0, err;
@@ -112,7 +114,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		goto out;
 	}
 
-	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
+	if (dax_mapping(mapping)) {
+		down_read(&ei->i_mmap_sem);
+		dax_fsync(mapping, start, end);
+		up_read(&ei->i_mmap_sem);
+	}
+
+	ret = filemap_write_and_wait_range(mapping, start, end);
 	if (ret)
 		return ret;
 	/*
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v2 11/11] xfs: add support for DAX fsync/msync
  2015-11-14  0:06 ` Ross Zwisler
  (?)
@ 2015-11-14  0:06   ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-14  0:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

To properly support the new DAX fsync/msync infrastructure, filesystems
need to call dax_pfn_mkwrite() so that DAX can track when a user write
faults on a previously cleaned address.  They also need to call
dax_fsync() in the filesystem fsync() path.  This dax_fsync() call uses
addresses retrieved from get_block(), so it needs to be ordered with
respect to truncate.  This is accomplished by using the same locking
that was set up for DAX page faults.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/xfs/xfs_file.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 39743ef..2b490a1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -209,7 +209,8 @@ xfs_file_fsync(
 	loff_t			end,
 	int			datasync)
 {
-	struct inode		*inode = file->f_mapping->host;
+	struct address_space	*mapping = file->f_mapping;
+	struct inode		*inode = mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
 	struct xfs_mount	*mp = ip->i_mount;
 	int			error = 0;
@@ -218,7 +219,13 @@ xfs_file_fsync(
 
 	trace_xfs_file_fsync(ip);
 
-	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
+	if (dax_mapping(mapping)) {
+		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+		dax_fsync(mapping, start, end);
+		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+	}
+
+	error = filemap_write_and_wait_range(mapping, start, end);
 	if (error)
 		return error;
 
@@ -1603,9 +1610,8 @@ xfs_filemap_pmd_fault(
 /*
  * pfn_mkwrite was originally inteneded to ensure we capture time stamp
  * updates on write faults. In reality, it's need to serialise against
- * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite()
- * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault
- * barrier in place.
+ * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED
+ * to ensure we serialise the fault barrier in place.
  */
 static int
 xfs_filemap_pfn_mkwrite(
@@ -1628,6 +1634,8 @@ xfs_filemap_pfn_mkwrite(
 	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	if (vmf->pgoff >= size)
 		ret = VM_FAULT_SIGBUS;
+	else if (IS_DAX(inode))
+		ret = dax_pfn_mkwrite(vma, vmf);
 	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-14  0:06   ` Ross Zwisler
  (?)
@ 2015-11-14  0:20     ` Dan Williams
  -1 siblings, 0 replies; 132+ messages in thread
From: Dan Williams @ 2015-11-14  0:20 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, Linux MM, linux-nvdimm, X86 ML, xfs,
	Andrew Morton, Matthew Wilcox, Dave Hansen

On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> and are used by filesystems to order their metadata, among other things.
>
> When we get an msync() or fsync() it is the responsibility of the DAX code
> to flush all dirty pages to media.  The PMEM driver then just has issue a
> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> the flushed data has been durably stored on the media.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>

Hmm, I'm not seeing why we need this patch.  If the actual flushing of
the cache is done by the core, why does the driver need to support
REQ_FLUSH?  Especially since it's just a couple of instructions.
REQ_FUA only makes sense if individual writes can bypass the "drive"
cache, but no I/O submitted to the driver proper is ever cached; we
always flush it through to media.


^ permalink raw reply	[flat|nested] 132+ messages in thread
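For readers following the thread without patch 3 in front of them, the
driver-side handling at issue is small.  As an illustrative sketch only,
not the literal patch (bio flag and pmem helper names as used elsewhere
in this series), it amounts to something like:

	/* inside the pmem driver's bio submission path (sketch) */
	if (bio->bi_rw & REQ_FLUSH)
		wmb_pmem();	/* sfence + pcommit: fence earlier stores to media */

	/* ...copy each bio segment to pmem; the driver never caches writes... */

	if (bio->bi_rw & REQ_FUA)
		wmb_pmem();	/* this bio's data is durable before completion */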

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-14  0:20     ` Dan Williams
  (?)
@ 2015-11-14  0:43       ` Andreas Dilger
  -1 siblings, 0 replies; 132+ messages in thread
From: Andreas Dilger @ 2015-11-14  0:43 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, Linux MM, linux-nvdimm, X86 ML,
	XFS Developers, Andrew Morton, Matthew Wilcox, Dave Hansen

On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> 
> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
>> and are used by filesystems to order their metadata, among other things.
>> 
>> When we get an msync() or fsync() it is the responsibility of the DAX code
>> to flush all dirty pages to media.  The PMEM driver then just has issue a
>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
>> the flushed data has been durably stored on the media.
>> 
>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> the cache is done by the core why does the driver need support
> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> only makes sense if individual writes can bypass the "drive" cache,
> but no I/O submitted to the driver proper is ever cached we always
> flush it through to media.

If the upper level filesystem gets an error when submitting a flush
request, then it assumes the underlying hardware is broken and cannot
be as aggressive in IO submission, but instead has to wait for in-flight
IO to complete.  Since FUA/FLUSH is basically a no-op for pmem devices,
it doesn't make sense _not_ to support this functionality.

Cheers, Andreas

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 02/11] mm: add pmd_mkclean()
  2015-11-14  0:06   ` Ross Zwisler
  (?)
@ 2015-11-14  1:02     ` Dave Hansen
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Hansen @ 2015-11-14  1:02 UTC (permalink / raw)
  To: Ross Zwisler, linux-kernel
  Cc: H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

On 11/13/2015 04:06 PM, Ross Zwisler wrote:
> +static inline pmd_t pmd_mkclean(pmd_t pmd)
> +{
> +	return pmd_clear_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
> +}

pte_mkclean() doesn't clear _PAGE_SOFT_DIRTY.  What's the thought
behind doing it here?
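(For comparison, the existing x86 pte_mkclean() only clears the hardware
dirty bit, roughly:

	static inline pte_t pte_mkclean(pte_t pte)
	{
		return pte_clear_flags(pte, _PAGE_DIRTY);
	}

so the question is whether the new PMD variant should additionally touch
the soft-dirty bit.)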


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-14  0:43       ` Andreas Dilger
  (?)
@ 2015-11-14  2:32         ` Dan Williams
  -1 siblings, 0 replies; 132+ messages in thread
From: Dan Williams @ 2015-11-14  2:32 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Dave Chinner, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, Linux MM, linux-nvdimm, X86 ML,
	XFS Developers, Andrew Morton, Matthew Wilcox, Dave Hansen

On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>
>> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
>> <ross.zwisler@linux.intel.com> wrote:
>>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
>>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
>>> and are used by filesystems to order their metadata, among other things.
>>>
>>> When we get an msync() or fsync() it is the responsibility of the DAX code
>>> to flush all dirty pages to media.  The PMEM driver then just has issue a
>>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
>>> the flushed data has been durably stored on the media.
>>>
>>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
>>
>> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
>> the cache is done by the core why does the driver need support
>> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
>> only makes sense if individual writes can bypass the "drive" cache,
>> but no I/O submitted to the driver proper is ever cached we always
>> flush it through to media.
>
> If the upper level filesystem gets an error when submitting a flush
> request, then it assumes the underlying hardware is broken and cannot
> be as aggressive in IO submission, but instead has to wait for in-flight
> IO to complete.

Upper-level filesystems won't get errors when the driver does not
support flush.  Those requests are ended cleanly in
generic_make_request_checks().  Yes, the fs still needs to wait for
outstanding I/O to complete, but in the case of pmem all I/O is
synchronous.  There is never anything to wait for when flushing at the
pmem driver level.

> Since FUA/FLUSH is basically a no-op for pmem devices,
> it doesn't make sense _not_ to support this functionality.

It seems to be a no-op either way.  Given that DAX may leave dirty data
pending to the device in the CPU cache that a REQ_FLUSH request will
not touch, it's better to leave it all to the mm core to handle.  I.e.
it doesn't make sense to call the driver just for two instructions
(sfence + pcommit) when the mm core is taking on the cache flushing.
Either handle it all in the mm or in the driver, not a mixture.
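
For readers following along, the "couple instructions" in question are
what the posted patch wires into the pmem driver: advertise flush support
(blk_queue_flush() at queue setup) and run wmb_pmem() when a flush bio
arrives.  A rough sketch of the idea, not the literal patch; the helper
name and hook point here are made up for illustration:

static void pmem_handle_flush(struct bio *bio)
{
	/* nothing is ever queued inside the pmem driver, so answering
	 * REQ_FLUSH is just draining prior stores to media */
	if (bio->bi_rw & REQ_FLUSH)
		wmb_pmem();	/* sfence + pcommit */
}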

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-14  2:32         ` Dan Williams
@ 2015-11-16 13:37           ` Jan Kara
  -1 siblings, 0 replies; 132+ messages in thread
From: Jan Kara @ 2015-11-16 13:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andreas Dilger, Ross Zwisler, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Fri 13-11-15 18:32:40, Dan Williams wrote:
> On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >>
> >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> >>> and are used by filesystems to order their metadata, among other things.
> >>>
> >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> >>> to flush all dirty pages to media.  The PMEM driver then just has issue a
> >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> >>> the flushed data has been durably stored on the media.
> >>>
> >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >>
> >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> >> the cache is done by the core why does the driver need support
> >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> >> only makes sense if individual writes can bypass the "drive" cache,
> >> but no I/O submitted to the driver proper is ever cached we always
> >> flush it through to media.
> >
> > If the upper level filesystem gets an error when submitting a flush
> > request, then it assumes the underlying hardware is broken and cannot
> > be as aggressive in IO submission, but instead has to wait for in-flight
> > IO to complete.
> 
> Upper level filesystems won't get errors when the driver does not
> support flush.  Those requests are ended cleanly in
> generic_make_request_checks().  Yes, the fs still needs to wait for
> outstanding I/O to complete but in the case of pmem all I/O is
> synchronous.  There's never anything to await when flushing at the
> pmem driver level.
> 
> > Since FUA/FLUSH is basically a no-op for pmem devices,
> > it doesn't make sense _not_ to support this functionality.
> 
> Seems to be a nop either way.  Given that DAX may lead to dirty data
> pending to the device in the cpu cache that a REQ_FLUSH request will
> not touch, its better to leave it all to the mm core to handle.  I.e.
> it doesn't make sense to call the driver just for two instructions
> (sfence + pcommit) when the mm core is taking on the cache flushing.
> Either handle it all in the mm or the driver, not a mixture.

So I think REQ_FLUSH requests *must* end up doing sfence + pcommit, because
e.g. journal writes going through the block layer, or writes done through
dax_do_io(), must be on permanent storage once the REQ_FLUSH request
finishes, and the way the driver does IO doesn't guarantee this, does it?
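
The dependency being described, roughly: a filesystem that has just
written metadata (e.g. a journal commit block) through the block layer
issues a flush and assumes the data is durable once it completes.  An
illustrative helper, not code from the series; fs_commit_barrier() is a
made-up name:

static int fs_commit_barrier(struct block_device *bdev)
{
	/* if the pmem driver completed REQ_FLUSH without reaching
	 * persistence, this durability assumption would silently break */
	return blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
}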

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-16 13:37           ` Jan Kara
@ 2015-11-16 14:05             ` Jan Kara
  -1 siblings, 0 replies; 132+ messages in thread
From: Jan Kara @ 2015-11-16 14:05 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andreas Dilger, Ross Zwisler, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Mon 16-11-15 14:37:14, Jan Kara wrote:
> On Fri 13-11-15 18:32:40, Dan Williams wrote:
> > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > >>
> > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > >> <ross.zwisler@linux.intel.com> wrote:
> > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > >>> and are used by filesystems to order their metadata, among other things.
> > >>>
> > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > >>> to flush all dirty pages to media.  The PMEM driver then just has issue a
> > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> > >>> the flushed data has been durably stored on the media.
> > >>>
> > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > >>
> > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > >> the cache is done by the core why does the driver need support
> > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > >> only makes sense if individual writes can bypass the "drive" cache,
> > >> but no I/O submitted to the driver proper is ever cached we always
> > >> flush it through to media.
> > >
> > > If the upper level filesystem gets an error when submitting a flush
> > > request, then it assumes the underlying hardware is broken and cannot
> > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > IO to complete.
> > 
> > Upper level filesystems won't get errors when the driver does not
> > support flush.  Those requests are ended cleanly in
> > generic_make_request_checks().  Yes, the fs still needs to wait for
> > outstanding I/O to complete but in the case of pmem all I/O is
> > synchronous.  There's never anything to await when flushing at the
> > pmem driver level.
> > 
> > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > it doesn't make sense _not_ to support this functionality.
> > 
> > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > pending to the device in the cpu cache that a REQ_FLUSH request will
> > not touch, its better to leave it all to the mm core to handle.  I.e.
> > it doesn't make sense to call the driver just for two instructions
> > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > Either handle it all in the mm or the driver, not a mixture.
> 
> So I think REQ_FLUSH requests *must* end up doing sfence + pcommit because
> e.g. journal writes going through block layer or writes done through
> dax_do_io() must be on permanent storage once REQ_FLUSH request finishes
> and the way driver does IO doesn't guarantee this, does it?

Hum, and looking into how dax_do_io() works and what drivers/nvdimm/pmem.c
does, I'm indeed wrong, because they both do wmb_pmem() after each write,
which seems to include sfence + pcommit.  Sorry for the confusion.

But a question: wouldn't it be better to do sfence + pcommit only in response
to a REQ_FLUSH request and not do it after each write? I'm not sure how
expensive these instructions are, but in theory it could be a performance
win, couldn't it? For filesystems this is enough wrt persistency
guarantees...
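
To make the two options concrete, a simplified sketch using the pmem API
names from this series; the two wrapper functions are invented for the
illustration:

static void pmem_write_eager(void __pmem *dst, const void *src, size_t n)
{
	memcpy_to_pmem(dst, src, n);	/* non-temporal copy */
	wmb_pmem();			/* sfence + pcommit on every write */
}

static void pmem_write_lazy(void __pmem *dst, const void *src, size_t n)
{
	memcpy_to_pmem(dst, src, n);	/* still non-temporal, not yet committed */
	/* wmb_pmem() deferred until a REQ_FLUSH request arrives */
}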

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 00/11] DAX fsynx/msync support
  2015-11-14  0:06 ` Ross Zwisler
@ 2015-11-16 14:41   ` Jan Kara
  -1 siblings, 0 replies; 132+ messages in thread
From: Jan Kara @ 2015-11-16 14:41 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

On Fri 13-11-15 17:06:39, Ross Zwisler wrote:
> This patch series adds support for fsync/msync to DAX.
> 
> Patches 1 through 7 add various utilities that the DAX code will eventually
> need, and the DAX code itself is added by patch 8.  Patches 9-11 update the
> three filesystems that currently support DAX, ext2, ext4 and XFS, to use
> the new DAX fsync/msync code.
> 
> These patches build on the recent DAX locking changes from Dave Chinner,
> Jan Kara and myself.  Dave's changes for XFS and my changes for ext2 have
> been merged in the v4.4 window, but Jan's are still unmerged.  You can grab
> them here:
> 
> http://www.spinics.net/lists/linux-ext4/msg49951.html

I had a quick look and the patches look sane to me. I'll try to give them
a more detailed look later this week. When thinking about the general design
I was wondering: when we have this infrastructure to track data potentially
lingering in CPU caches, wouldn't it be a performance win to use standard
cached stores in dax_io() and mark the corresponding pages as dirty in the
page cache, the same way this patch set does for mmapped writes? I have no
idea how costly non-temporal stores are compared to cached ones, or how that
would compare to the cost of dirty tracking, so this may be completely
bogus...
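
A sketch of the alternative being floated; the function and the tagging
details below are assumptions for illustration, not code from the series:

static void dax_write_cached(struct address_space *mapping, pgoff_t index,
			     void __pmem *dst, const void *src, size_t n)
{
	memcpy((void __force *)dst, src, n);	/* ordinary cached stores */

	/* remember that this page may be dirty in the CPU cache so that a
	 * later fsync()/msync() can write it back with wb_cache_pmem() */
	spin_lock_irq(&mapping->tree_lock);
	radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
	spin_unlock_irq(&mapping->tree_lock);
}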

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 00/11] DAX fsynx/msync support
  2015-11-16 14:41   ` Jan Kara
@ 2015-11-16 16:58     ` Dan Williams
  -1 siblings, 0 replies; 132+ messages in thread
From: Dan Williams @ 2015-11-16 16:58 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Mon, Nov 16, 2015 at 6:41 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 13-11-15 17:06:39, Ross Zwisler wrote:
>> This patch series adds support for fsync/msync to DAX.
>>
>> Patches 1 through 7 add various utilities that the DAX code will eventually
>> need, and the DAX code itself is added by patch 8.  Patches 9-11 update the
>> three filesystems that currently support DAX, ext2, ext4 and XFS, to use
>> the new DAX fsync/msync code.
>>
>> These patches build on the recent DAX locking changes from Dave Chinner,
>> Jan Kara and myself.  Dave's changes for XFS and my changes for ext2 have
>> been merged in the v4.4 window, but Jan's are still unmerged.  You can grab
>> them here:
>>
>> http://www.spinics.net/lists/linux-ext4/msg49951.html
>
> I had a quick look and the patches look sane to me. I'll try to give them
> more detailed look later this week. When thinking about the general design
> I was wondering: When we have this infrastructure to track data potentially
> lingering in CPU caches, would not it be a performance win to use standard
> cached stores in dax_io() and mark corresponding pages as dirty in page
> cache the same way as this patch set does it for mmaped writes? I have no
> idea how costly are non-temporal stores compared to cached ones and how
> would this compare to the cost of dirty tracking so this may be just
> completely bogus...

Keep in mind that this approach will flush every virtual address that
may be dirty.  For example, if you touch 1 byte in a 2MB page we'll end
up looping through the entire 2MB range.  At some point the dirty size
becomes large enough that it is cheaper to flush the entire cache; we
have not measured where that crossover point is.
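
What "looping through the entire 2MB range" means in practice, as an
illustrative flush loop; it assumes 64-byte cache lines and calls clwb()
directly, whereas the series does this via wb_cache_pmem():

static void flush_virt_range(void *addr, size_t size)
{
	unsigned long cl = 64;			/* assumed cache line size */
	void *p = (void *)((unsigned long)addr & ~(cl - 1));
	void *end = addr + size;

	for (; p < end; p += cl)
		clwb(p);	/* ~32768 iterations for a dirty 2MB page */
}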

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-16 14:05             ` Jan Kara
@ 2015-11-16 17:28               ` Dan Williams
  -1 siblings, 0 replies; 132+ messages in thread
From: Dan Williams @ 2015-11-16 17:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andreas Dilger, Ross Zwisler, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 16-11-15 14:37:14, Jan Kara wrote:
[..]
> But a question: Won't it be better to do sfence + pcommit only in response
> to REQ_FLUSH request and don't do it after each write? I'm not sure how
> expensive these instructions are but in theory it could be a performance
> win, couldn't it? For filesystems this is enough wrt persistency
> guarantees...

We would need to gather the performance data...  The expectation is
that the cache flushing is more expensive than the sfence + pcommit.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-16 17:28               ` Dan Williams
@ 2015-11-16 19:48                 ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-16 19:48 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Andreas Dilger, Ross Zwisler, linux-kernel,
	H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Dave Chinner, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Mon, Nov 16, 2015 at 09:28:59AM -0800, Dan Williams wrote:
> On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara <jack@suse.cz> wrote:
> > On Mon 16-11-15 14:37:14, Jan Kara wrote:
> [..]
> > But a question: Won't it be better to do sfence + pcommit only in response
> > to REQ_FLUSH request and don't do it after each write? I'm not sure how
> > expensive these instructions are but in theory it could be a performance
> > win, couldn't it? For filesystems this is enough wrt persistency
> > guarantees...
> 
> We would need to gather the performance data...  The expectation is
> that the cache flushing is more expensive than the sfence + pcommit.

I think we should revisit the idea of removing wmb_pmem() from the I/O path in
both the PMEM driver and in DAX, and just relying on the REQ_FUA/REQ_FLUSH
path to do wmb_pmem() for all cases.  This was brought up in the thread
dealing with the "big hammer" fsync/msync patches as well.

https://lkml.org/lkml/2015/11/3/730

I think we can all agree from the start that wmb_pmem() will have a nonzero
cost, both because of the PCOMMIT and because of the ordering caused by the
sfence.  If it's possible to avoid doing it on each I/O, I think that would be
a win.
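
For reference, this is roughly what wmb_pmem() boils down to on x86 at
this point in time; paraphrased from memory of arch/x86/include/asm/pmem.h,
so treat the exact helper names as approximate:

static inline void arch_wmb_pmem(void)
{
	wmb();			/* sfence: order prior non-temporal stores */
	pcommit_sfence();	/* pcommit, fenced, to make queued writes durable */
}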

So, here would be our new flows:

PMEM I/O:
	write I/O(s) to the driver
		PMEM I/O writes the data using non-temporal stores

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs

DAX I/O:
	write I/O(s) to the DAX layer
		write the data using regular stores (eventually to be replaced
		with non-temporal stores)

		flush the data with wb_cache_pmem() (removed when we use
		non-temporal stores)

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs

DAX msync/fsync:
	writes happen to DAX mmaps from userspace

	DAX fsync/msync
		all dirty pages are written back using wb_cache_pmem()

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs

DAX/PMEM zeroing (suggested by Dave: https://lkml.org/lkml/2015/11/2/772):
	PMEM driver receives zeroing request
		writes a bunch of zeroes using non-temporal stores

	REQ_FUA/REQ_FLUSH to the PMEM driver
		wmb_pmem() to order all previous writes and flushes, and to
		PCOMMIT the dirty data durably to the DIMMs

Having all these flows wait to do wmb_pmem() in the PMEM driver in response to
REQ_FUA/REQ_FLUSH has several advantages:

1) The work done and guarantees provided after each step closely match the
normal block I/O to disk case.  This means that the existing algorithms used
by filesystems to make sure that their metadata is ordered properly and synced
at a known time should all work the same.

2) By delaying wmb_pmem() until REQ_FUA/REQ_FLUSH time we can potentially do
many I/Os at different levels, and order them all with a single wmb_pmem().
This should result in a performance win.
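
To make the driver side of this concrete, here is a rough sketch of what the
request handling could look like under this scheme.  The helper names
(pmem_do_bvec(), wmb_pmem()) follow the current pmem driver, but the exact
plumbing below is illustrative rather than the posted patch:

	static void pmem_make_request(struct request_queue *q, struct bio *bio)
	{
		struct pmem_device *pmem = q->queuedata;
		struct bio_vec bvec;
		struct bvec_iter iter;

		/* data writes go to media via non-temporal stores, no PCOMMIT here */
		bio_for_each_segment(bvec, bio, iter)
			pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
					bvec.bv_offset, bio_data_dir(bio),
					iter.bi_sector);

		/*
		 * Only a flush request pays for the ordering: a single
		 * wmb_pmem() (sfence + PCOMMIT) covers all writes and cache
		 * flushes issued by any of the flows above.
		 */
		if (bio->bi_rw & (REQ_FLUSH | REQ_FUA))
			wmb_pmem();

		bio_endio(bio);
	}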

Is there any reason why this wouldn't work or wouldn't be a good idea?


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 00/11] DAX fsynx/msync support
  2015-11-16 16:58     ` Dan Williams
@ 2015-11-16 20:01       ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-16 20:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jan Kara, Ross Zwisler, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Dave Chinner, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Mon, Nov 16, 2015 at 08:58:11AM -0800, Dan Williams wrote:
> On Mon, Nov 16, 2015 at 6:41 AM, Jan Kara <jack@suse.cz> wrote:
> > On Fri 13-11-15 17:06:39, Ross Zwisler wrote:
> >> This patch series adds support for fsync/msync to DAX.
> >>
> >> Patches 1 through 7 add various utilities that the DAX code will eventually
> >> need, and the DAX code itself is added by patch 8.  Patches 9-11 update the
> >> three filesystems that currently support DAX, ext2, ext4 and XFS, to use
> >> the new DAX fsync/msync code.
> >>
> >> These patches build on the recent DAX locking changes from Dave Chinner,
> >> Jan Kara and myself.  Dave's changes for XFS and my changes for ext2 have
> >> been merged in the v4.4 window, but Jan's are still unmerged.  You can grab
> >> them here:
> >>
> >> http://www.spinics.net/lists/linux-ext4/msg49951.html
> >
> > I had a quick look and the patches look sane to me. I'll try to give them
> > more detailed look later this week. When thinking about the general design
> > I was wondering: When we have this infrastructure to track data potentially
> > lingering in CPU caches, would not it be a performance win to use standard
> > cached stores in dax_io() and mark corresponding pages as dirty in page
> > cache the same way as this patch set does it for mmaped writes? I have no
> > idea how costly are non-temporal stores compared to cached ones and how
> > would this compare to the cost of dirty tracking so this may be just
> > completely bogus...
> 
> Keep in mind that this approach will flush every virtual address that
> may be dirty.  For example, if you touch 1byte in a 2MB page we'll end
> up looping through the entire 2MB range.  At some point the dirty size
> becomes large enough that it is cheaper to flush the entire cache; we
> have not measured where that crossover point is.

Yep, I expect there will be a crossover point where flushing the entire
processor cache will be beneficial.  I agree with Dan that we'll need to
figure this out via measurement, and that we'd similarly need measurements to
justify the decision to write dirty data at the DAX level without flushing and
mark entries as dirty for fsync/msync to clean up later.  It could turn out to
be great, but we'll have to see. :)
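
To make the cost model concrete, the per-cacheline path is essentially the
loop in the hypothetical helper below (mirroring what wb_cache_pmem() does);
the open question is at what total dirty size this loop loses to flushing the
entire cache:

	/* hypothetical helper: flush [vaddr, vaddr + size) a cacheline at a time */
	static void flush_dirty_range(void *vaddr, size_t size)
	{
		unsigned long clsize = boot_cpu_data.x86_clflush_size;
		void *end = vaddr + size;
		void *p = (void *)((unsigned long)vaddr & ~(clsize - 1));

		/* touching 1 byte of a 2MB page still costs 2MB / 64 = 32768 flushes */
		for (; p < end; p += clsize)
			clwb(p);
	}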


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-14  2:32         ` Dan Williams
@ 2015-11-16 20:09           ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-16 20:09 UTC (permalink / raw)
  To: Dan Williams
  Cc: Andreas Dilger, Ross Zwisler, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Fri, Nov 13, 2015 at 06:32:40PM -0800, Dan Williams wrote:
> On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> >>
> >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> >>> and are used by filesystems to order their metadata, among other things.
> >>>
> >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> >>> to flush all dirty pages to media.  The PMEM driver then just has to issue a
> >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> >>> the flushed data has been durably stored on the media.
> >>>
> >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >>
> >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> >> the cache is done by the core why does the driver need support
> >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> >> only makes sense if individual writes can bypass the "drive" cache,
> >> but no I/O submitted to the driver proper is ever cached we always
> >> flush it through to media.
> >
> > If the upper level filesystem gets an error when submitting a flush
> > request, then it assumes the underlying hardware is broken and cannot
> > be as aggressive in IO submission, but instead has to wait for in-flight
> > IO to complete.
> 
> Upper level filesystems won't get errors when the driver does not
> support flush.  Those requests are ended cleanly in
> generic_make_request_checks().  Yes, the fs still needs to wait for
> outstanding I/O to complete but in the case of pmem all I/O is
> synchronous.  There's never anything to await when flushing at the
> pmem driver level.
> 
> > Since FUA/FLUSH is basically a no-op for pmem devices,
> > it doesn't make sense _not_ to support this functionality.
> 
> Seems to be a nop either way.  Given that DAX may lead to dirty data
> pending to the device in the cpu cache that a REQ_FLUSH request will
> not touch, its better to leave it all to the mm core to handle.  I.e.
> it doesn't make sense to call the driver just for two instructions
> (sfence + pcommit) when the mm core is taking on the cache flushing.
> Either handle it all in the mm or the driver, not a mixture.

Does anyone know if ext4 and/or XFS alter their algorithms based on whether
the driver supports REQ_FLUSH/REQ_FUA?  Will the filesystems behave more
efficiently with respect to their internal I/O ordering, etc., if PMEM
advertises REQ_FLUSH/REQ_FUA support, even though we could do the same thing
at the DAX layer?
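
For context, the driver side of this question is just whether pmem advertises
write-cache flush support at all, presumably a single call at queue setup time
along the lines of the sketch below (the exact spot in pmem's probe path is an
assumption):

	/* tell the block layer we honor flush/FUA so filesystems will send them */
	blk_queue_flush(pmem->pmem_queue, REQ_FLUSH | REQ_FUA);

Without that call, generic_make_request_checks() strips REQ_FLUSH/REQ_FUA from
incoming bios and completes empty flush bios successfully, which is why the
filesystem never sees an error either way.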


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-16 19:48                 ` Ross Zwisler
@ 2015-11-16 20:34                   ` Dan Williams
  -1 siblings, 0 replies; 132+ messages in thread
From: Dan Williams @ 2015-11-16 20:34 UTC (permalink / raw)
  To: Ross Zwisler, Dan Williams, Jan Kara, Andreas Dilger,
	linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Dave Chinner, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Mon, Nov 16, 2015 at 11:48 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Mon, Nov 16, 2015 at 09:28:59AM -0800, Dan Williams wrote:
>> On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Mon 16-11-15 14:37:14, Jan Kara wrote:
[..]
> Is there any reason why this wouldn't work or wouldn't be a good idea?

We don't have numbers to support the claim that pcommit is so
expensive as to need to be deferred, especially if the upper layers are
already taking the hit on doing the flushes.

REQ_FLUSH means flush your volatile write cache.  Currently no I/O
through the driver ever hits a volatile cache, so there's no need to
tell the block layer that we have a volatile write cache, especially
when you have the core mm taking responsibility for doing cache
maintenance for dax-mmap ranges.

We also don't have numbers on if/when wbinvd is a more performant solution.

tl;dr Now that we have a baseline implementation can we please use
data to make future arch decisions?
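
For reference, the sfence + pcommit pair under discussion is roughly what the
x86 arch_wmb_pmem() implementation reduces to (paraphrased sketch; the feature
detection and fallback paths are omitted):

	static inline void arch_wmb_pmem(void)
	{
		wmb();			/* sfence: order prior stores and flushes */
		pcommit_sfence();	/* pcommit + sfence: make them durable */
	}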


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-16 14:05             ` Jan Kara
@ 2015-11-16 22:14               ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2015-11-16 22:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, Andreas Dilger, Ross Zwisler, linux-kernel,
	H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Mon, Nov 16, 2015 at 03:05:26PM +0100, Jan Kara wrote:
> On Mon 16-11-15 14:37:14, Jan Kara wrote:
> > On Fri 13-11-15 18:32:40, Dan Williams wrote:
> > > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > > >>
> > > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > > >> <ross.zwisler@linux.intel.com> wrote:
> > > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > > >>> and are used by filesystems to order their metadata, among other things.
> > > >>>
> > > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > > >>> to flush all dirty pages to media.  The PMEM driver then just has to issue a
> > > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> > > >>> the flushed data has been durably stored on the media.
> > > >>>
> > > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > >>
> > > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > > >> the cache is done by the core why does the driver need support
> > > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > > >> only makes sense if individual writes can bypass the "drive" cache,
> > > >> but no I/O submitted to the driver proper is ever cached we always
> > > >> flush it through to media.
> > > >
> > > > If the upper level filesystem gets an error when submitting a flush
> > > > request, then it assumes the underlying hardware is broken and cannot
> > > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > > IO to complete.
> > > 
> > > Upper level filesystems won't get errors when the driver does not
> > > support flush.  Those requests are ended cleanly in
> > > generic_make_request_checks().  Yes, the fs still needs to wait for
> > > outstanding I/O to complete but in the case of pmem all I/O is
> > > synchronous.  There's never anything to await when flushing at the
> > > pmem driver level.
> > > 
> > > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > > it doesn't make sense _not_ to support this functionality.
> > > 
> > > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > > pending to the device in the cpu cache that a REQ_FLUSH request will
> > > not touch, its better to leave it all to the mm core to handle.  I.e.
> > > it doesn't make sense to call the driver just for two instructions
> > > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > > Either handle it all in the mm or the driver, not a mixture.
> > 
> > So I think REQ_FLUSH requests *must* end up doing sfence + pcommit because
> > e.g. journal writes going through block layer or writes done through
> > dax_do_io() must be on permanent storage once REQ_FLUSH request finishes
> > and the way driver does IO doesn't guarantee this, does it?
> 
> Hum, and looking into how dax_do_io() works and what drivers/nvdimm/pmem.c
> does, I'm indeed wrong because they both do wmb_pmem() after each write
> which seems to include sfence + pcommit. Sorry for confusion.

Which I want to remove, because it makes DAX IO 3x slower than
buffered IO in ramdisk-based testing.

> But a question: Won't it be better to do sfence + pcommit only in response
> to REQ_FLUSH request and don't do it after each write? I'm not sure how
> expensive these instructions are but in theory it could be a performance
> win, couldn't it? For filesystems this is enough wrt persistency
> guarantees...

I'm pretty sure it would be, because all of the overhead (and
therefore latency) I measured is in the cache flushing instructions.
But before we can remove the wmb_pmem() from  dax_do_io(), we need
the underlying device to support REQ_FLUSH correctly...
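
To spell out the dependency: once the per-write wmb_pmem() is gone, the only
durability point left is the flush the filesystem already issues at
fsync/journal-commit time, which reaches the driver as a REQ_FLUSH bio via the
generic helper (sketch, not DAX-specific code):

	/* fsync path, after the data and metadata writes have been submitted */
	error = blkdev_issue_flush(bdev, GFP_KERNEL, NULL);

so pmem has to honor that bio with sfence + pcommit for the scheme to be safe.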

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 07/11] mm: add find_get_entries_tag()
  2015-11-14  0:06   ` Ross Zwisler
@ 2015-11-16 22:42     ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2015-11-16 22:42 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Matthew Wilcox, Dave Hansen

On Fri, Nov 13, 2015 at 05:06:46PM -0700, Ross Zwisler wrote:
> Add find_get_entries_tag() to the family of functions that include
> find_get_entries(), find_get_pages() and find_get_pages_tag().  This is
> needed for DAX dirty page handling because we need a list of both page
> offsets and radix tree entries ('indices' and 'entries' in this function)
> that are marked with the PAGECACHE_TAG_TOWRITE tag.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  include/linux/pagemap.h |  3 +++
>  mm/filemap.c            | 61 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 64 insertions(+)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index a6c78e0..6fea3be 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -354,6 +354,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
>  			       unsigned int nr_pages, struct page **pages);
>  unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
>  			int tag, unsigned int nr_pages, struct page **pages);
> +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
> +			int tag, unsigned int nr_entries,
> +			struct page **entries, pgoff_t *indices);
>  
>  struct page *grab_cache_page_write_begin(struct address_space *mapping,
>  			pgoff_t index, unsigned flags);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d5e94fd..89ab448 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1454,6 +1454,67 @@ repeat:
>  }
>  EXPORT_SYMBOL(find_get_pages_tag);
>  
> +/**
> + * find_get_entries_tag - find and return entries that match @tag
> + * @mapping:	the address_space to search
> + * @start:	the starting page cache index
> + * @tag:	the tag index
> + * @nr_entries:	the maximum number of entries
> + * @entries:	where the resulting entries are placed
> + * @indices:	the cache indices corresponding to the entries in @entries
> + *
> + * Like find_get_entries, except we only return entries which are tagged with
> + * @tag.
> + */
> +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
> +			int tag, unsigned int nr_entries,
> +			struct page **entries, pgoff_t *indices)
> +{
> +	void **slot;
> +	unsigned int ret = 0;
> +	struct radix_tree_iter iter;
> +
> +	if (!nr_entries)
> +		return 0;
> +
> +	rcu_read_lock();
> +restart:
> +	radix_tree_for_each_tagged(slot, &mapping->page_tree,
> +				   &iter, start, tag) {
> +		struct page *page;
> +repeat:
> +		page = radix_tree_deref_slot(slot);
> +		if (unlikely(!page))
> +			continue;
> +		if (radix_tree_exception(page)) {
> +			if (radix_tree_deref_retry(page))
> +				goto restart;

That restart condition looks wrong. ret can be non-zero, but we
start looking from the original start index again, resulting in
duplicates being added to the return arrays...
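
For example, something like this - a rough, untested sketch; resuming from
iter.index is just my guess at how to avoid the duplicates, not something
that was posted:

		if (radix_tree_exception(page)) {
			if (radix_tree_deref_retry(page)) {
				/*
				 * Resume the tagged walk from the current
				 * index rather than from @start so entries
				 * already copied into the return arrays
				 * aren't added a second time.
				 */
				start = iter.index;
				goto restart;
			}
		}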

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 08/11] dax: add support for fsync/sync
  2015-11-14  0:06   ` Ross Zwisler
  (?)
@ 2015-11-16 22:58     ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2015-11-16 22:58 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Matthew Wilcox, Dave Hansen

On Fri, Nov 13, 2015 at 05:06:47PM -0700, Ross Zwisler wrote:
> To properly handle fsync/msync in an efficient way DAX needs to track dirty
> pages so it is able to flush them durably to media on demand.
> 
> The tracking of dirty pages is done via the radix tree in struct
> address_space.  This radix tree is already used by the page writeback
> infrastructure for tracking dirty pages associated with an open file, and
> it already has support for exceptional (non struct page*) entries.  We
> build upon these features to add exceptional entries to the radix tree for
> DAX dirty PMD or PTE pages at fault time.
> 
> When called as part of the msync/fsync flush path DAX queries the radix
> tree for dirty entries, flushing them and then marking the PTE or PMD page
> table entries as clean.  The step of cleaning the PTE or PMD entries is
> necessary so that on subsequent writes to the same page we get a new write
> fault allowing us to once again dirty the DAX tag in the radix tree.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/dax.c            | 140 +++++++++++++++++++++++++++++++++++++++++++++++++---
>  include/linux/dax.h |   1 +
>  mm/huge_memory.c    |  14 +++---
>  3 files changed, 141 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 131fd35a..9ce6d1b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -24,7 +24,9 @@
>  #include <linux/memcontrol.h>
>  #include <linux/mm.h>
>  #include <linux/mutex.h>
> +#include <linux/pagevec.h>
>  #include <linux/pmem.h>
> +#include <linux/rmap.h>
>  #include <linux/sched.h>
>  #include <linux/uio.h>
>  #include <linux/vmstat.h>
> @@ -287,6 +289,53 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
>  	return 0;
>  }
>  
> +static int dax_dirty_pgoff(struct address_space *mapping, unsigned long pgoff,
> +		void __pmem *addr, bool pmd_entry)
> +{
> +	struct radix_tree_root *page_tree = &mapping->page_tree;
> +	int error = 0;
> +	void *entry;
> +
> +	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	entry = radix_tree_lookup(page_tree, pgoff);
> +	if (addr == NULL) {
> +		if (entry)
> +			goto dirty;
> +		else {
> +			WARN(1, "DAX pfn_mkwrite failed to find an entry");
> +			goto out;
> +		}
> +	}
> +
> +	if (entry) {
> +		if (pmd_entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
> +			radix_tree_delete(&mapping->page_tree, pgoff);
> +			mapping->nrdax--;
> +		} else
> +			goto dirty;
> +	}

Logic is pretty spaghettied here. Perhaps:

	entry = radix_tree_lookup(page_tree, pgoff);
	if (entry) {
		if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
			goto dirty;
		radix_tree_delete(&mapping->page_tree, pgoff);
		mapping->nrdax--;
	} else {
		WARN_ON(!addr);
		goto out_unlock;
	}
....

> +
> +	BUG_ON(RADIX_DAX_TYPE(addr));
> +	if (pmd_entry)
> +		error = radix_tree_insert(page_tree, pgoff,
> +				RADIX_DAX_PMD_ENTRY(addr));
> +	else
> +		error = radix_tree_insert(page_tree, pgoff,
> +				RADIX_DAX_PTE_ENTRY(addr));
> +
> +	if (error)
> +		goto out;
> +
> +	mapping->nrdax++;
> + dirty:
> +	radix_tree_tag_set(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
> + out:
> +	spin_unlock_irq(&mapping->tree_lock);

label should be "out_unlock" rather than "out" to indicate in the code
that we are jumping to the correct spot in the error stack...

> +			goto fallback;
>  	}
>  
>   out:
> @@ -689,15 +746,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
>   * dax_pfn_mkwrite - handle first write to DAX page
>   * @vma: The virtual memory area where the fault occurred
>   * @vmf: The description of the fault
> - *
>   */
>  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
> -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> +	struct file *file = vma->vm_file;
>  
> -	sb_start_pagefault(sb);
> -	file_update_time(vma->vm_file);
> -	sb_end_pagefault(sb);
> +	dax_dirty_pgoff(file->f_mapping, vmf->pgoff, NULL, false);
>  	return VM_FAULT_NOPAGE;

This seems wrong - it's dropping the freeze protection on fault, and
now the inode timestamp won't get updated, either.
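
i.e. I'd have expected something more like this untested sketch, which keeps
the existing protections around the new dax_dirty_pgoff() call (all the
other helpers here are existing kernel APIs):

	int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
	{
		struct file *file = vma->vm_file;
		struct super_block *sb = file_inode(file)->i_sb;

		sb_start_pagefault(sb);		/* keep freeze protection */
		file_update_time(file);		/* keep the timestamp update */
		dax_dirty_pgoff(file->f_mapping, vmf->pgoff, NULL, false);
		sb_end_pagefault(sb);
		return VM_FAULT_NOPAGE;
	}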

>  }
>  EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
> @@ -772,3 +826,77 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
>  	return dax_zero_page_range(inode, from, length, get_block);
>  }
>  EXPORT_SYMBOL_GPL(dax_truncate_page);
> +
> +static void dax_sync_entry(struct address_space *mapping, pgoff_t pgoff,
> +		void *entry)
> +{

dax_writeback_pgoff() seems like a more consistent name (consider
dax_dirty_pgoff), given that we are actually doing a writeback
operation, not a "sync" operation.

> +	struct radix_tree_root *page_tree = &mapping->page_tree;
> +	int type = RADIX_DAX_TYPE(entry);
> +	size_t size;
> +
> +	BUG_ON(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD);
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	if (!radix_tree_tag_get(page_tree, pgoff, PAGECACHE_TAG_TOWRITE)) {
> +		/* another fsync thread already wrote back this entry */
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return;
> +	}
> +	radix_tree_tag_clear(page_tree, pgoff, PAGECACHE_TAG_TOWRITE);
> +	radix_tree_tag_clear(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
> +	spin_unlock_irq(&mapping->tree_lock);
> +
> +	if (type == RADIX_DAX_PMD)
> +		size = PMD_SIZE;
> +	else
> +		size = PAGE_SIZE;
> +
> +	wb_cache_pmem(RADIX_DAX_ADDR(entry), size);
> +	pgoff_mkclean(pgoff, mapping);

This looks racy w.r.t. another operation setting the radix tree
dirty tags. i.e. there is no locking to serialise marking the
vma/pte clean and another operation marking the radix tree dirty.

> +}
> +
> +/*
> + * Flush the mapping to the persistent domain within the byte range of (start,
> + * end). This is required by data integrity operations to ensure file data is on
> + * persistent storage prior to completion of the operation. It also requires us
> + * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when
> + * the file is written to again so we have an indication that we need to flush
> + * the mapping if a data integrity operation takes place.
> + *
> + * We don't need commits to storage here - the filesystems will issue flushes
> + * appropriately at the conclusion of the data integrity operation via REQ_FUA
> + * writes or blkdev_issue_flush() commands.  This requires the DAX block device
> + * to implement persistent storage domain fencing/commits on receiving a
> + * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher
> + * layers.
> + */
> +void dax_fsync(struct address_space *mapping, loff_t start, loff_t end)
> +{

dax_writeback_mapping_range()

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 11/11] xfs: add support for DAX fsync/msync
  2015-11-14  0:06   ` Ross Zwisler
  (?)
  (?)
@ 2015-11-16 23:12     ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2015-11-16 23:12 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Andreas Dilger, Dan Williams, Ingo Molnar,
	Jan Kara, Jeff Layton, Matthew Wilcox, Thomas Gleixner,
	linux-ext4, linux-fsdevel, linux-mm, linux-nvdimm, x86, xfs,
	Andrew Morton, Matthew Wilcox, Dave Hansen

On Fri, Nov 13, 2015 at 05:06:50PM -0700, Ross Zwisler wrote:
> To properly support the new DAX fsync/msync infrastructure filesystems
> need to call dax_pfn_mkwrite() so that DAX can properly track when a user
> write faults on a previously cleaned address.  They also need to call
> dax_fsync() in the filesystem fsync() path.  This dax_fsync() call uses
> addresses retrieved from get_block() so it needs to be ordered with
> respect to truncate.  This is accomplished by using the same locking that
> was set up for DAX page faults.
> 
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/xfs/xfs_file.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 39743ef..2b490a1 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -209,7 +209,8 @@ xfs_file_fsync(
>  	loff_t			end,
>  	int			datasync)
>  {
> -	struct inode		*inode = file->f_mapping->host;
> +	struct address_space	*mapping = file->f_mapping;
> +	struct inode		*inode = mapping->host;
>  	struct xfs_inode	*ip = XFS_I(inode);
>  	struct xfs_mount	*mp = ip->i_mount;
>  	int			error = 0;
> @@ -218,7 +219,13 @@ xfs_file_fsync(
>  
>  	trace_xfs_file_fsync(ip);
>  
> -	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> +	if (dax_mapping(mapping)) {
> +		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +		dax_fsync(mapping, start, end);
> +		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> +	}
> +
> +	error = filemap_write_and_wait_range(mapping, start, end);

Ok, I don't understand a couple of things here.

Firstly, if it's a DAX mapping, why are we still calling
filemap_write_and_wait_range() after the dax_fsync() call that has
already written back all the dirty cachelines?

Secondly, exactly what is the XFS_MMAPLOCK_SHARED lock supposed to
be doing here? I don't see where dax_fsync() has any callouts to
get_block(), so the comment "needs to be ordered with respect to
truncate" doesn't make any obvious sense. If we have a racing
truncate removing entries from the radix tree, then thanks to the
mapping tree lock we'll either find an entry we need to write back,
or we won't find any entry at all, right?

Lastly, this flushing really needs to be inside
filemap_write_and_wait_range(), because we call the writeback code
from many more places than just fsync to ensure ordering of various
operations such that files are in a known state before proceeding
(e.g. hole punch).
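
Something along these lines is what I mean - a very rough, untested sketch
that mirrors the existing filemap_write_and_wait_range() and just adds the
dax_mapping()/dax_fsync() hook from this series (placement and error
handling would need more thought):

	int filemap_write_and_wait_range(struct address_space *mapping,
					 loff_t lstart, loff_t lend)
	{
		int err = 0;

		/*
		 * Flush DAX-dirty cachelines here so every caller of the
		 * writeback path (fsync, hole punch, truncate, ...) gets
		 * the same ordering, not just fsync.
		 */
		if (dax_mapping(mapping))
			dax_fsync(mapping, lstart, lend);

		if (mapping->nrpages) {
			err = __filemap_fdatawrite_range(mapping, lstart, lend,
							 WB_SYNC_ALL);
			if (err != -EIO) {
				int err2 = filemap_fdatawait_range(mapping,
								   lstart, lend);
				if (!err)
					err = err2;
			}
		}
		return err;
	}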

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-16 22:14               ` Dave Chinner
  (?)
  (?)
@ 2015-11-16 23:29                 ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-16 23:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Dan Williams, Andreas Dilger, Ross Zwisler,
	linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Tue, Nov 17, 2015 at 09:14:12AM +1100, Dave Chinner wrote:
> On Mon, Nov 16, 2015 at 03:05:26PM +0100, Jan Kara wrote:
> > On Mon 16-11-15 14:37:14, Jan Kara wrote:
> > > On Fri 13-11-15 18:32:40, Dan Williams wrote:
> > > > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > > > >>
> > > > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > > > >> <ross.zwisler@linux.intel.com> wrote:
> > > > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > > > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > > > >>> and are used by filesystems to order their metadata, among other things.
> > > > >>>
> > > > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > > > >>> to flush all dirty pages to media.  The PMEM driver then just has issue a
> > > > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> > > > >>> the flushed data has been durably stored on the media.
> > > > >>>
> > > > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > >>
> > > > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > > > >> the cache is done by the core why does the driver need support
> > > > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > > > >> only makes sense if individual writes can bypass the "drive" cache,
> > > > >> but no I/O submitted to the driver proper is ever cached we always
> > > > >> flush it through to media.
> > > > >
> > > > > If the upper level filesystem gets an error when submitting a flush
> > > > > request, then it assumes the underlying hardware is broken and cannot
> > > > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > > > IO to complete.
> > > > 
> > > > Upper level filesystems won't get errors when the driver does not
> > > > support flush.  Those requests are ended cleanly in
> > > > generic_make_request_checks().  Yes, the fs still needs to wait for
> > > > outstanding I/O to complete but in the case of pmem all I/O is
> > > > synchronous.  There's never anything to await when flushing at the
> > > > pmem driver level.
> > > > 
> > > > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > > > it doesn't make sense _not_ to support this functionality.
> > > > 
> > > > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > > > pending to the device in the cpu cache that a REQ_FLUSH request will
> > > > not touch, its better to leave it all to the mm core to handle.  I.e.
> > > > it doesn't make sense to call the driver just for two instructions
> > > > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > > > Either handle it all in the mm or the driver, not a mixture.
> > > 
> > > So I think REQ_FLUSH requests *must* end up doing sfence + pcommit because
> > > e.g. journal writes going through block layer or writes done through
> > > dax_do_io() must be on permanent storage once REQ_FLUSH request finishes
> > > and the way driver does IO doesn't guarantee this, does it?
> > 
> > Hum, and looking into how dax_do_io() works and what drivers/nvdimm/pmem.c
> > does, I'm indeed wrong because they both do wmb_pmem() after each write
> > which seems to include sfence + pcommit. Sorry for confusion.
> 
> Which I want to remove, because it makes DAX IO 3x slower than
> buffered IO on ramdisk based testing.
> 
> > But a question: Won't it be better to do sfence + pcommit only in response
> > to REQ_FLUSH request and don't do it after each write? I'm not sure how
> > expensive these instructions are but in theory it could be a performance
> > win, couldn't it? For filesystems this is enough wrt persistency
> > guarantees...
> 
> I'm pretty sure it would be, because all of the overhead (and
> therefore latency) I measured is in the cache flushing instructions.
> But before we can remove the wmb_pmem() from  dax_do_io(), we need
> the underlying device to support REQ_FLUSH correctly...

By "support REQ_FLUSH correctly" do you mean call wmb_pmem() as I do in my
set?  Or do you mean something that also involves cache flushing such as the
"big hammer" that flushes everything or something like WBINVD?

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
@ 2015-11-16 23:29                 ` Ross Zwisler
  0 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-16 23:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Dave Hansen, J. Bruce Fields, Linux MM, H. Peter Anvin,
	Jeff Layton, Dan Williams, linux-nvdimm, X86 ML, Ingo Molnar,
	Matthew Wilcox, Ross Zwisler, linux-ext4, XFS Developers,
	Alexander Viro, Thomas Gleixner, Andreas Dilger,
	Theodore Ts'o, linux-kernel, Jan Kara, linux-fsdevel,
	Andrew Morton, Matthew Wilcox

On Tue, Nov 17, 2015 at 09:14:12AM +1100, Dave Chinner wrote:
> On Mon, Nov 16, 2015 at 03:05:26PM +0100, Jan Kara wrote:
> > On Mon 16-11-15 14:37:14, Jan Kara wrote:
> > > On Fri 13-11-15 18:32:40, Dan Williams wrote:
> > > > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > > > >>
> > > > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > > > >> <ross.zwisler@linux.intel.com> wrote:
> > > > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > > > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > > > >>> and are used by filesystems to order their metadata, among other things.
> > > > >>>
> > > > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > > > >>> to flush all dirty pages to media.  The PMEM driver then just has issue a
> > > > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> > > > >>> the flushed data has been durably stored on the media.
> > > > >>>
> > > > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > >>
> > > > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > > > >> the cache is done by the core why does the driver need support
> > > > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > > > >> only makes sense if individual writes can bypass the "drive" cache,
> > > > >> but no I/O submitted to the driver proper is ever cached we always
> > > > >> flush it through to media.
> > > > >
> > > > > If the upper level filesystem gets an error when submitting a flush
> > > > > request, then it assumes the underlying hardware is broken and cannot
> > > > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > > > IO to complete.
> > > > 
> > > > Upper level filesystems won't get errors when the driver does not
> > > > support flush.  Those requests are ended cleanly in
> > > > generic_make_request_checks().  Yes, the fs still needs to wait for
> > > > outstanding I/O to complete but in the case of pmem all I/O is
> > > > synchronous.  There's never anything to await when flushing at the
> > > > pmem driver level.
> > > > 
> > > > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > > > it doesn't make sense _not_ to support this functionality.
> > > > 
> > > > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > > > pending to the device in the cpu cache that a REQ_FLUSH request will
> > > > not touch, its better to leave it all to the mm core to handle.  I.e.
> > > > it doesn't make sense to call the driver just for two instructions
> > > > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > > > Either handle it all in the mm or the driver, not a mixture.
> > > 
> > > So I think REQ_FLUSH requests *must* end up doing sfence + pcommit because
> > > e.g. journal writes going through block layer or writes done through
> > > dax_do_io() must be on permanent storage once REQ_FLUSH request finishes
> > > and the way driver does IO doesn't guarantee this, does it?
> > 
> > Hum, and looking into how dax_do_io() works and what drivers/nvdimm/pmem.c
> > does, I'm indeed wrong because they both do wmb_pmem() after each write
> > which seems to include sfence + pcommit. Sorry for confusion.
> 
> Which I want to remove, because it makes DAX IO 3x slower than
> buffered IO on ramdisk based testing.
> 
> > But a question: Won't it be better to do sfence + pcommit only in response
> > to REQ_FLUSH request and don't do it after each write? I'm not sure how
> > expensive these instructions are but in theory it could be a performance
> > win, couldn't it? For filesystems this is enough wrt persistency
> > guarantees...
> 
> I'm pretty sure it would be, because all of the overhead (and
> therefore latency) I measured is in the cache flushing instructions.
> But before we can remove the wmb_pmem() from  dax_do_io(), we need
> the underlying device to support REQ_FLUSH correctly...

By "support REQ_FLUSH correctly" do you mean call wmb_pmem() as I do in my
set?  Or do you mean something that also involves cache flushing such as the
"big hammer" that flushes everything or something like WBINVD?

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-16 23:29                 ` Ross Zwisler
@ 2015-11-16 23:42                   ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2015-11-16 23:42 UTC (permalink / raw)
  To: Ross Zwisler, Jan Kara, Dan Williams, Andreas Dilger,
	linux-kernel, H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Mon, Nov 16, 2015 at 04:29:27PM -0700, Ross Zwisler wrote:
> On Tue, Nov 17, 2015 at 09:14:12AM +1100, Dave Chinner wrote:
> > On Mon, Nov 16, 2015 at 03:05:26PM +0100, Jan Kara wrote:
> > > On Mon 16-11-15 14:37:14, Jan Kara wrote:
> > > > On Fri 13-11-15 18:32:40, Dan Williams wrote:
> > > > > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > > > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > > > > >>
> > > > > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > > > > >> <ross.zwisler@linux.intel.com> wrote:
> > > > > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > > > > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > > > > >>> and are used by filesystems to order their metadata, among other things.
> > > > > >>>
> > > > > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > > > > >>> to flush all dirty pages to media.  The PMEM driver then just has to issue a
> > > > > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that before we return all
> > > > > >>> the flushed data has been durably stored on the media.
> > > > > >>>
> > > > > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > > >>
> > > > > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > > > > >> the cache is done by the core why does the driver need to support
> > > > > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > > > > >> only makes sense if individual writes can bypass the "drive" cache,
> > > > > >> but no I/O submitted to the driver proper is ever cached; we always
> > > > > >> flush it through to media.
> > > > > >
> > > > > > If the upper level filesystem gets an error when submitting a flush
> > > > > > request, then it assumes the underlying hardware is broken and cannot
> > > > > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > > > > IO to complete.
> > > > > 
> > > > > Upper level filesystems won't get errors when the driver does not
> > > > > support flush.  Those requests are ended cleanly in
> > > > > generic_make_request_checks().  Yes, the fs still needs to wait for
> > > > > outstanding I/O to complete but in the case of pmem all I/O is
> > > > > synchronous.  There's never anything to await when flushing at the
> > > > > pmem driver level.
> > > > > 
> > > > > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > > > > it doesn't make sense _not_ to support this functionality.
> > > > > 
> > > > > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > > > > pending to the device in the cpu cache that a REQ_FLUSH request will
> > > > > > not touch, it's better to leave it all to the mm core to handle.  I.e.
> > > > > it doesn't make sense to call the driver just for two instructions
> > > > > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > > > > Either handle it all in the mm or the driver, not a mixture.
> > > > 
> > > > So I think REQ_FLUSH requests *must* end up doing sfence + pcommit because
> > > > e.g. journal writes going through block layer or writes done through
> > > > dax_do_io() must be on permanent storage once REQ_FLUSH request finishes
> > > > and the way driver does IO doesn't guarantee this, does it?
> > > 
> > > Hum, and looking into how dax_do_io() works and what drivers/nvdimm/pmem.c
> > > does, I'm indeed wrong because they both do wmb_pmem() after each write
> > > which seems to include sfence + pcommit. Sorry for confusion.
> > 
> > Which I want to remove, because it makes DAX IO 3x slower than
> > buffered IO on ramdisk based testing.
> > 
> > > But a question: Won't it be better to do sfence + pcommit only in response
> > > to REQ_FLUSH request and don't do it after each write? I'm not sure how
> > > expensive these instructions are but in theory it could be a performance
> > > win, couldn't it? For filesystems this is enough wrt persistency
> > > guarantees...
> > 
> > I'm pretty sure it would be, because all of the overhead (and
> > therefore latency) I measured is in the cache flushing instructions.
> > But before we can remove the wmb_pmem() from  dax_do_io(), we need
> > the underlying device to support REQ_FLUSH correctly...
> 
> By "support REQ_FLUSH correctly" do you mean call wmb_pmem() as I do in my
> set?  Or do you mean something that also involves cache flushing such as the
> "big hammer" that flushes everything or something like WBINVD?

Either. Both solve the problem of deferring the cache flush penalty
to the context that needs it.
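
Concretely, deferring that penalty to the context that needs it would look
something like the following in the pmem bio path.  This is only a sketch of
the idea, not the actual patch; pmem_handle_bio() and pmem_do_bvec() are
placeholders for whatever the real driver uses:

	static void pmem_handle_bio(struct pmem_device *pmem, struct bio *bio)
	{
		struct bio_vec bvec;
		struct bvec_iter iter;

		/* no per-write wmb_pmem() in the data path any more ... */
		bio_for_each_segment(bvec, bio, iter)
			pmem_do_bvec(pmem, bvec.bv_page, bvec.bv_len,
				     bvec.bv_offset, bio_data_dir(bio),
				     iter.bi_sector);

		/*
		 * ... the sfence + pcommit cost is paid once, and only when
		 * the filesystem actually asks for durability (fsync/msync,
		 * journal commits) via REQ_FLUSH/REQ_FUA.
		 */
		if (bio->bi_rw & (REQ_FLUSH | REQ_FUA))
			wmb_pmem();

		bio_endio(bio);
	}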

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-16 20:34                   ` Dan Williams
@ 2015-11-16 23:57                     ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-16 23:57 UTC (permalink / raw)
  To: Dan Williams
  Cc: Ross Zwisler, Jan Kara, Andreas Dilger, linux-kernel,
	H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Dave Chinner, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Mon, Nov 16, 2015 at 12:34:55PM -0800, Dan Williams wrote:
> On Mon, Nov 16, 2015 at 11:48 AM, Ross Zwisler
> <ross.zwisler@linux.intel.com> wrote:
> > On Mon, Nov 16, 2015 at 09:28:59AM -0800, Dan Williams wrote:
> >> On Mon, Nov 16, 2015 at 6:05 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Mon 16-11-15 14:37:14, Jan Kara wrote:
> [..]
> > Is there any reason why this wouldn't work or wouldn't be a good idea?
> 
> We don't have numbers to support the claim that pcommit is so
> expensive as to need to be deferred, especially if the upper layers are
> already taking the hit on doing the flushes.
> 
> REQ_FLUSH means flush your volatile write cache.  Currently all I/O
> through the driver never hits a volatile cache so there's no need to
> tell the block layer that we have a volatile write cache, especially
> when you have the core mm taking responsibility for doing cache
> maintenance for dax-mmap ranges.
> 
> We also don't have numbers on if/when wbinvd is a more performant solution.
> 
> tl;dr Now that we have a baseline implementation can we please use
> data to make future arch decisions?

Sure, fair enough.
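
For context, "tell the block layer that we have a volatile write cache" is a
one-liner in the driver's queue setup.  A sketch, assuming the
blk_queue_flush() interface and the pmem_queue field name of this era:

	/* advertise flush/FUA so filesystems send REQ_FLUSH/REQ_FUA down */
	blk_queue_flush(pmem->pmem_queue, REQ_FLUSH | REQ_FUA);

Dan's point is that this advertisement is unnecessary as long as the driver
itself has no volatile write cache to flush.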

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 02/11] mm: add pmd_mkclean()
  2015-11-14  1:02     ` Dave Hansen
@ 2015-11-17 17:52       ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-17 17:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Dave Chinner, Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox

On Fri, Nov 13, 2015 at 05:02:48PM -0800, Dave Hansen wrote:
> On 11/13/2015 04:06 PM, Ross Zwisler wrote:
> > +static inline pmd_t pmd_mkclean(pmd_t pmd)
> > +{
> > +	return pmd_clear_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
> > +}
> 
> pte_mkclean() doesn't clear _PAGE_SOFT_DIRTY.  What's the thought behind
> doing it here?

I just wrote it to undo the work done by pmd_mkdirty() - you're right, it
should mirror the work done by pte_mkclean() and not clear _PAGE_SOFT_DIRTY.
I'll fix this for v3, thanks!
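
Concretely, the v3 version would then presumably just mirror pte_mkclean()
(a sketch, not the actual v3 patch):

	static inline pmd_t pmd_mkclean(pmd_t pmd)
	{
		/*
		 * Clear only the hardware dirty bit; leave _PAGE_SOFT_DIRTY
		 * alone so soft-dirty tracking still sees the write.
		 */
		return pmd_clear_flags(pmd, _PAGE_DIRTY);
	}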

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 07/11] mm: add find_get_entries_tag()
  2015-11-16 22:42     ` Dave Chinner
@ 2015-11-17 18:08       ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-17 18:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

On Tue, Nov 17, 2015 at 09:42:22AM +1100, Dave Chinner wrote:
> On Fri, Nov 13, 2015 at 05:06:46PM -0700, Ross Zwisler wrote:
> > Add find_get_entries_tag() to the family of functions that include
> > find_get_entries(), find_get_pages() and find_get_pages_tag().  This is
> > needed for DAX dirty page handling because we need a list of both page
> > offsets and radix tree entries ('indices' and 'entries' in this function)
> > that are marked with the PAGECACHE_TAG_TOWRITE tag.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  include/linux/pagemap.h |  3 +++
> >  mm/filemap.c            | 61 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 64 insertions(+)
> > 
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index a6c78e0..6fea3be 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -354,6 +354,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
> >  			       unsigned int nr_pages, struct page **pages);
> >  unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
> >  			int tag, unsigned int nr_pages, struct page **pages);
> > +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
> > +			int tag, unsigned int nr_entries,
> > +			struct page **entries, pgoff_t *indices);
> >  
> >  struct page *grab_cache_page_write_begin(struct address_space *mapping,
> >  			pgoff_t index, unsigned flags);
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index d5e94fd..89ab448 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1454,6 +1454,67 @@ repeat:
> >  }
> >  EXPORT_SYMBOL(find_get_pages_tag);
> >  
> > +/**
> > + * find_get_entries_tag - find and return entries that match @tag
> > + * @mapping:	the address_space to search
> > + * @start:	the starting page cache index
> > + * @tag:	the tag index
> > + * @nr_entries:	the maximum number of entries
> > + * @entries:	where the resulting entries are placed
> > + * @indices:	the cache indices corresponding to the entries in @entries
> > + *
> > + * Like find_get_entries, except we only return entries which are tagged with
> > + * @tag.
> > + */
> > +unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
> > +			int tag, unsigned int nr_entries,
> > +			struct page **entries, pgoff_t *indices)
> > +{
> > +	void **slot;
> > +	unsigned int ret = 0;
> > +	struct radix_tree_iter iter;
> > +
> > +	if (!nr_entries)
> > +		return 0;
> > +
> > +	rcu_read_lock();
> > +restart:
> > +	radix_tree_for_each_tagged(slot, &mapping->page_tree,
> > +				   &iter, start, tag) {
> > +		struct page *page;
> > +repeat:
> > +		page = radix_tree_deref_slot(slot);
> > +		if (unlikely(!page))
> > +			continue;
> > +		if (radix_tree_exception(page)) {
> > +			if (radix_tree_deref_retry(page))
> > +				goto restart;
> 
> That restart condition looks wrong. ret can be non-zero, but we
> start looking from the original start index again, resulting in
> duplicates being added to the return arrays...

This same restart logic is used in all the functions in this family:
find_get_entry() (though the label is "repeat"), find_get_entries(),
find_get_pages(), find_get_pages_contig() and find_get_pages_tag().

Most don't have it well commented, but there is a good comment in
find_get_pages():

	if (radix_tree_exception(page)) {                               
		if (radix_tree_deref_retry(page)) {                     
			/*                                              
			 * Transient condition which can only trigger   
			 * when entry at index 0 moves out of or back   
			 * to root: none yet gotten, safe to restart.   
			 */                                             
			WARN_ON(iter.index);                            
			goto restart;                                   
		}   

I think the logic is correct, but I'm happy to add this comment in
find_get_entries_tag() if it would make things clearer.
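
For what it's worth, the reason this returns both the entries and their
indices is easiest to see from the caller's side.  A rough usage sketch of
the kind of fsync/msync flush loop this is intended for (the function name
and the body of the inner loop are purely illustrative):

	static void dax_flush_dirty_entries(struct address_space *mapping)
	{
		struct page *entries[PAGEVEC_SIZE];	/* exceptional entries */
		pgoff_t indices[PAGEVEC_SIZE];
		pgoff_t start = 0;
		unsigned int i, n;

		do {
			n = find_get_entries_tag(mapping, start,
					PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
					entries, indices);
			for (i = 0; i < n; i++) {
				/*
				 * Decode the radix tree entry to get the
				 * pmem address, wb_cache_pmem() that range,
				 * then clean the PTE/PMD and clear the tag.
				 */
				start = indices[i] + 1;
			}
		} while (n);
	}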

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 08/11] dax: add support for fsync/sync
  2015-11-16 22:58     ` Dave Chinner
@ 2015-11-17 18:30       ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-17 18:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

On Tue, Nov 17, 2015 at 09:58:07AM +1100, Dave Chinner wrote:
> On Fri, Nov 13, 2015 at 05:06:47PM -0700, Ross Zwisler wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> > 
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space.  This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries.  We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> > 
> > When called as part of the msync/fsync flush path DAX queries the radix
> > tree for dirty entries, flushing them and then marking the PTE or PMD page
> > table entries as clean.  The step of cleaning the PTE or PMD entries is
> > necessary so that on subsequent writes to the same page we get a new write
> > fault allowing us to once again dirty the DAX tag in the radix tree.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/dax.c            | 140 +++++++++++++++++++++++++++++++++++++++++++++++++---
> >  include/linux/dax.h |   1 +
> >  mm/huge_memory.c    |  14 +++---
> >  3 files changed, 141 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 131fd35a..9ce6d1b 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -24,7 +24,9 @@
> >  #include <linux/memcontrol.h>
> >  #include <linux/mm.h>
> >  #include <linux/mutex.h>
> > +#include <linux/pagevec.h>
> >  #include <linux/pmem.h>
> > +#include <linux/rmap.h>
> >  #include <linux/sched.h>
> >  #include <linux/uio.h>
> >  #include <linux/vmstat.h>
> > @@ -287,6 +289,53 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
> >  	return 0;
> >  }
> >  
> > +static int dax_dirty_pgoff(struct address_space *mapping, unsigned long pgoff,
> > +		void __pmem *addr, bool pmd_entry)
> > +{
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int error = 0;
> > +	void *entry;
> > +
> > +	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	entry = radix_tree_lookup(page_tree, pgoff);
> > +	if (addr == NULL) {
> > +		if (entry)
> > +			goto dirty;
> > +		else {
> > +			WARN(1, "DAX pfn_mkwrite failed to find an entry");
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	if (entry) {
> > +		if (pmd_entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
> > +			radix_tree_delete(&mapping->page_tree, pgoff);
> > +			mapping->nrdax--;
> > +		} else
> > +			goto dirty;
> > +	}
> 
> Logic is pretty spaghettied here. Perhaps:
> 
> 	entry = radix_tree_lookup(page_tree, pgoff);
> 	if (entry) {
> 		if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> 			goto dirty;
> 		radix_tree_delete(&mapping->page_tree, pgoff);
> 		mapping->nrdax--;
> 	} else {
> 		WARN_ON(!addr);
> 		goto out_unlock;
> 	}
> ....

I don't think that this works because now if !entry we unconditionally goto
out_unlock without inserting a new entry.  I'll try and simplify the logic and
add some comments.
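
As a strawman, something along these lines would keep the insert path while
flattening the branches and switching to an out_unlock label (an untested
sketch built only from the names already in this patch):

	spin_lock_irq(&mapping->tree_lock);
	entry = radix_tree_lookup(page_tree, pgoff);

	if (!addr) {
		/* pfn_mkwrite: the fault handler must already have inserted an entry */
		WARN(!entry, "DAX pfn_mkwrite failed to find an entry");
		if (entry)
			goto dirty;
		goto out_unlock;
	}

	if (entry) {
		/* keep the existing entry unless we are upgrading a PTE to a PMD */
		if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
			goto dirty;
		radix_tree_delete(page_tree, pgoff);
		mapping->nrdax--;
	}

	BUG_ON(RADIX_DAX_TYPE(addr));
	error = radix_tree_insert(page_tree, pgoff, pmd_entry ?
			RADIX_DAX_PMD_ENTRY(addr) : RADIX_DAX_PTE_ENTRY(addr));
	if (error)
		goto out_unlock;
	mapping->nrdax++;
 dirty:
	radix_tree_tag_set(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
 out_unlock:
	spin_unlock_irq(&mapping->tree_lock);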

> > +
> > +	BUG_ON(RADIX_DAX_TYPE(addr));
> > +	if (pmd_entry)
> > +		error = radix_tree_insert(page_tree, pgoff,
> > +				RADIX_DAX_PMD_ENTRY(addr));
> > +	else
> > +		error = radix_tree_insert(page_tree, pgoff,
> > +				RADIX_DAX_PTE_ENTRY(addr));
> > +
> > +	if (error)
> > +		goto out;
> > +
> > +	mapping->nrdax++;
> > + dirty:
> > +	radix_tree_tag_set(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
> > + out:
> > +	spin_unlock_irq(&mapping->tree_lock);
> 
> label should be "out_unlock" rather "out" to indicate in the code
> that we are jumping to the correct spot in the error stack...

Sure, will do.

> > +			goto fallback;
> >  	}
> >  
> >   out:
> > @@ -689,15 +746,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> >   * dax_pfn_mkwrite - handle first write to DAX page
> >   * @vma: The virtual memory area where the fault occurred
> >   * @vmf: The description of the fault
> > - *
> >   */
> >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >  {
> > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > +	struct file *file = vma->vm_file;
> >  
> > -	sb_start_pagefault(sb);
> > -	file_update_time(vma->vm_file);
> > -	sb_end_pagefault(sb);
> > +	dax_dirty_pgoff(file->f_mapping, vmf->pgoff, NULL, false);
> >  	return VM_FAULT_NOPAGE;
> 
> This seems wrong - it's dropping the freeze protection on fault, and
> now the inode timestamp won't get updated, either.

Oh, that all still happens in the filesystem pfn_mkwrite code
(xfs_filemap_pfn_mkwrite() for XFS).  It needs to happen there, I think,
because we wanted to order it so that the filesystem freeze happens outside of
the XFS_MMAPLOCK_SHARED locking, as it does with the regular PMD and PTE fault
paths.

Prior to this patch set dax_pfn_mkwrite() was completely unused and was ready
to be removed as dead code - it's now being used by all filesystems just to
make sure we re-add the newly dirtied page to the radix tree dirty list.
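
In other words, the ordering in the filesystem handler ends up looking roughly
like this (a simplified sketch of the XFS-side pfn_mkwrite path described
above, not the exact code):

	sb_start_pagefault(inode->i_sb);	/* freeze protection */
	file_update_time(vma->vm_file);		/* inode timestamp update */

	xfs_ilock(ip, XFS_MMAPLOCK_SHARED);	/* order against truncate/punch */
	ret = dax_pfn_mkwrite(vma, vmf);	/* just re-dirty the radix tree entry */
	xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);

	sb_end_pagefault(inode->i_sb);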

> >  }
> >  EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
> > @@ -772,3 +826,77 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
> >  	return dax_zero_page_range(inode, from, length, get_block);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_truncate_page);
> > +
> > +static void dax_sync_entry(struct address_space *mapping, pgoff_t pgoff,
> > +		void *entry)
> > +{
> 
> dax_writeback_pgoff() seems like a more consistent name (consider
> dax_dirty_pgoff), and that we are actually doing a writeback
> operation, not a "sync" operation.

Sure, I'm fine with that change.

> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int type = RADIX_DAX_TYPE(entry);
> > +	size_t size;
> > +
> > +	BUG_ON(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD);
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	if (!radix_tree_tag_get(page_tree, pgoff, PAGECACHE_TAG_TOWRITE)) {
> > +		/* another fsync thread already wrote back this entry */
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		return;
> > +	}
> > +	radix_tree_tag_clear(page_tree, pgoff, PAGECACHE_TAG_TOWRITE);
> > +	radix_tree_tag_clear(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +
> > +	if (type == RADIX_DAX_PMD)
> > +		size = PMD_SIZE;
> > +	else
> > +		size = PAGE_SIZE;
> > +
> > +	wb_cache_pmem(RADIX_DAX_ADDR(entry), size);
> > +	pgoff_mkclean(pgoff, mapping);
> 
> This looks racy w.r.t. another operation setting the radix tree
> dirty tags. i.e. there is no locking to serialise marking the
> vma/pte clean and another operation marking the radix tree dirty.

I think you're right - I'll look into how to protect us from this race.  Thank
you for catching this.

> > +}
> > +
> > +/*
> > + * Flush the mapping to the persistent domain within the byte range of (start,
> > + * end). This is required by data integrity operations to ensure file data is on
> > + * persistent storage prior to completion of the operation. It also requires us
> > + * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when
> > + * the file is written to again so we have an indication that we need to flush
> > + * the mapping if a data integrity operation takes place.
> > + *
> > + * We don't need commits to storage here - the filesystems will issue flushes
> > + * appropriately at the conclusion of the data integrity operation via REQ_FUA
> > + * writes or blkdev_issue_flush() commands.  This requires the DAX block device
> > + * to implement persistent storage domain fencing/commits on receiving a
> > + * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher
> > + * layers.
> > + */
> > +void dax_fsync(struct address_space *mapping, loff_t start, loff_t end)
> > +{
> 
> dax_writeback_mapping_range()

Sure, I'm fine with that change.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 08/11] dax: add support for fsync/sync
@ 2015-11-17 18:30       ` Ross Zwisler
  0 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-17 18:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

On Tue, Nov 17, 2015 at 09:58:07AM +1100, Dave Chinner wrote:
> On Fri, Nov 13, 2015 at 05:06:47PM -0700, Ross Zwisler wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> > 
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space.  This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries.  We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> > 
> > When called as part of the msync/fsync flush path DAX queries the radix
> > tree for dirty entries, flushing them and then marking the PTE or PMD page
> > table entries as clean.  The step of cleaning the PTE or PMD entries is
> > necessary so that on subsequent writes to the same page we get a new write
> > fault allowing us to once again dirty the DAX tag in the radix tree.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/dax.c            | 140 +++++++++++++++++++++++++++++++++++++++++++++++++---
> >  include/linux/dax.h |   1 +
> >  mm/huge_memory.c    |  14 +++---
> >  3 files changed, 141 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 131fd35a..9ce6d1b 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -24,7 +24,9 @@
> >  #include <linux/memcontrol.h>
> >  #include <linux/mm.h>
> >  #include <linux/mutex.h>
> > +#include <linux/pagevec.h>
> >  #include <linux/pmem.h>
> > +#include <linux/rmap.h>
> >  #include <linux/sched.h>
> >  #include <linux/uio.h>
> >  #include <linux/vmstat.h>
> > @@ -287,6 +289,53 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
> >  	return 0;
> >  }
> >  
> > +static int dax_dirty_pgoff(struct address_space *mapping, unsigned long pgoff,
> > +		void __pmem *addr, bool pmd_entry)
> > +{
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int error = 0;
> > +	void *entry;
> > +
> > +	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	entry = radix_tree_lookup(page_tree, pgoff);
> > +	if (addr == NULL) {
> > +		if (entry)
> > +			goto dirty;
> > +		else {
> > +			WARN(1, "DAX pfn_mkwrite failed to find an entry");
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	if (entry) {
> > +		if (pmd_entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
> > +			radix_tree_delete(&mapping->page_tree, pgoff);
> > +			mapping->nrdax--;
> > +		} else
> > +			goto dirty;
> > +	}
> 
> Logic is pretty spaghettied here. Perhaps:
> 
> 	entry = radix_tree_lookup(page_tree, pgoff);
> 	if (entry) {
> 		if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> 			goto dirty;
> 		radix_tree_delete(&mapping->page_tree, pgoff);
> 		mapping->nrdax--;
> 	} else {
> 		WARN_ON(!addr);
> 		goto out_unlock;
> 	}
> ....

I don't think that this works because now if !entry we unconditionally goto
out_unlock without inserting a new entry.  I'll try and simplify the logic and
add some comments.

> > +
> > +	BUG_ON(RADIX_DAX_TYPE(addr));
> > +	if (pmd_entry)
> > +		error = radix_tree_insert(page_tree, pgoff,
> > +				RADIX_DAX_PMD_ENTRY(addr));
> > +	else
> > +		error = radix_tree_insert(page_tree, pgoff,
> > +				RADIX_DAX_PTE_ENTRY(addr));
> > +
> > +	if (error)
> > +		goto out;
> > +
> > +	mapping->nrdax++;
> > + dirty:
> > +	radix_tree_tag_set(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
> > + out:
> > +	spin_unlock_irq(&mapping->tree_lock);
> 
> label should be "out_unlock" rather "out" to indicate in the code
> that we are jumping to the correct spot in the error stack...

Sure, will do.

> > +			goto fallback;
> >  	}
> >  
> >   out:
> > @@ -689,15 +746,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> >   * dax_pfn_mkwrite - handle first write to DAX page
> >   * @vma: The virtual memory area where the fault occurred
> >   * @vmf: The description of the fault
> > - *
> >   */
> >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >  {
> > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > +	struct file *file = vma->vm_file;
> >  
> > -	sb_start_pagefault(sb);
> > -	file_update_time(vma->vm_file);
> > -	sb_end_pagefault(sb);
> > +	dax_dirty_pgoff(file->f_mapping, vmf->pgoff, NULL, false);
> >  	return VM_FAULT_NOPAGE;
> 
> This seems wrong - it's dropping the freeze protection on fault, and
> now the inode timestamp won't get updated, either.

Oh, that all still happens in the filesystem pfn_mkwrite code
(xfs_filemap_pfn_mkwrite() for XFS).  It needs to happen there, I think,
because we wanted to order it so that the filesystem freeze happens outside of
the XFS_MMAPLOCK_SHARED locking, as it does with the regular PMD and PTE fault
paths.

Prior to this patch set dax_pfn_mkwrite() was completely unused and was ready
to be removed as dead code - it's now being used by all filesystems just to
make sure we re-add the newly dirtied page to the radix tree dirty list.

> >  }
> >  EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
> > @@ -772,3 +826,77 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
> >  	return dax_zero_page_range(inode, from, length, get_block);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_truncate_page);
> > +
> > +static void dax_sync_entry(struct address_space *mapping, pgoff_t pgoff,
> > +		void *entry)
> > +{
> 
> dax_writeback_pgoff() seems like a more consistent name (consider
> dax_dirty_pgoff), and that we are actually doing a writeback
> operation, not a "sync" operation.

Sure, I'm fine with that change.

> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int type = RADIX_DAX_TYPE(entry);
> > +	size_t size;
> > +
> > +	BUG_ON(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD);
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	if (!radix_tree_tag_get(page_tree, pgoff, PAGECACHE_TAG_TOWRITE)) {
> > +		/* another fsync thread already wrote back this entry */
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		return;
> > +	}
> > +	radix_tree_tag_clear(page_tree, pgoff, PAGECACHE_TAG_TOWRITE);
> > +	radix_tree_tag_clear(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +
> > +	if (type == RADIX_DAX_PMD)
> > +		size = PMD_SIZE;
> > +	else
> > +		size = PAGE_SIZE;
> > +
> > +	wb_cache_pmem(RADIX_DAX_ADDR(entry), size);
> > +	pgoff_mkclean(pgoff, mapping);
> 
> This looks racy w.r.t. another operation setting the radix tree
> dirty tags. i.e. there is no locking to serialise marking the
> vma/pte clean and another operation marking the radix tree dirty.

I think you're right - I'll look into how to protect us from this race.  Thank
you for catching this.

> > +}
> > +
> > +/*
> > + * Flush the mapping to the persistent domain within the byte range of (start,
> > + * end). This is required by data integrity operations to ensure file data is on
> > + * persistent storage prior to completion of the operation. It also requires us
> > + * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when
> > + * the file is written to again so we have an indication that we need to flush
> > + * the mapping if a data integrity operation takes place.
> > + *
> > + * We don't need commits to storage here - the filesystems will issue flushes
> > + * appropriately at the conclusion of the data integrity operation via REQ_FUA
> > + * writes or blkdev_issue_flush() commands.  This requires the DAX block device
> > + * to implement persistent storage domain fencing/commits on receiving a
> > + * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher
> > + * layers.
> > + */
> > +void dax_fsync(struct address_space *mapping, loff_t start, loff_t end)
> > +{
> 
> dax_writeback_mapping_range()

Sure, I'm fine with that change.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 08/11] dax: add support for fsync/sync
@ 2015-11-17 18:30       ` Ross Zwisler
  0 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-17 18:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dave Hansen, J. Bruce Fields, linux-mm, Andreas Dilger,
	H. Peter Anvin, Jeff Layton, Dan Williams, linux-nvdimm, x86,
	Ingo Molnar, Matthew Wilcox, Ross Zwisler, linux-ext4, xfs,
	Alexander Viro, Thomas Gleixner, Theodore Ts'o, linux-kernel,
	Jan Kara, linux-fsdevel, Andrew Morton, Matthew Wilcox

On Tue, Nov 17, 2015 at 09:58:07AM +1100, Dave Chinner wrote:
> On Fri, Nov 13, 2015 at 05:06:47PM -0700, Ross Zwisler wrote:
> > To properly handle fsync/msync in an efficient way DAX needs to track dirty
> > pages so it is able to flush them durably to media on demand.
> > 
> > The tracking of dirty pages is done via the radix tree in struct
> > address_space.  This radix tree is already used by the page writeback
> > infrastructure for tracking dirty pages associated with an open file, and
> > it already has support for exceptional (non struct page*) entries.  We
> > build upon these features to add exceptional entries to the radix tree for
> > DAX dirty PMD or PTE pages at fault time.
> > 
> > When called as part of the msync/fsync flush path DAX queries the radix
> > tree for dirty entries, flushing them and then marking the PTE or PMD page
> > table entries as clean.  The step of cleaning the PTE or PMD entries is
> > necessary so that on subsequent writes to the same page we get a new write
> > fault allowing us to once again dirty the DAX tag in the radix tree.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/dax.c            | 140 +++++++++++++++++++++++++++++++++++++++++++++++++---
> >  include/linux/dax.h |   1 +
> >  mm/huge_memory.c    |  14 +++---
> >  3 files changed, 141 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 131fd35a..9ce6d1b 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -24,7 +24,9 @@
> >  #include <linux/memcontrol.h>
> >  #include <linux/mm.h>
> >  #include <linux/mutex.h>
> > +#include <linux/pagevec.h>
> >  #include <linux/pmem.h>
> > +#include <linux/rmap.h>
> >  #include <linux/sched.h>
> >  #include <linux/uio.h>
> >  #include <linux/vmstat.h>
> > @@ -287,6 +289,53 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
> >  	return 0;
> >  }
> >  
> > +static int dax_dirty_pgoff(struct address_space *mapping, unsigned long pgoff,
> > +		void __pmem *addr, bool pmd_entry)
> > +{
> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int error = 0;
> > +	void *entry;
> > +
> > +	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	entry = radix_tree_lookup(page_tree, pgoff);
> > +	if (addr == NULL) {
> > +		if (entry)
> > +			goto dirty;
> > +		else {
> > +			WARN(1, "DAX pfn_mkwrite failed to find an entry");
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	if (entry) {
> > +		if (pmd_entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE) {
> > +			radix_tree_delete(&mapping->page_tree, pgoff);
> > +			mapping->nrdax--;
> > +		} else
> > +			goto dirty;
> > +	}
> 
> Logic is pretty spaghettied here. Perhaps:
> 
> 	entry = radix_tree_lookup(page_tree, pgoff);
> 	if (entry) {
> 		if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
> 			goto dirty;
> 		radix_tree_delete(&mapping->page_tree, pgoff);
> 		mapping->nrdax--;
> 	} else {
> 		WARN_ON(!addr);
> 		goto out_unlock;
> 	}
> ....

I don't think that this works because now if !entry we unconditionally goto
out_unlock without inserting a new entry.  I'll try and simplify the logic and
add some comments.

> > +
> > +	BUG_ON(RADIX_DAX_TYPE(addr));
> > +	if (pmd_entry)
> > +		error = radix_tree_insert(page_tree, pgoff,
> > +				RADIX_DAX_PMD_ENTRY(addr));
> > +	else
> > +		error = radix_tree_insert(page_tree, pgoff,
> > +				RADIX_DAX_PTE_ENTRY(addr));
> > +
> > +	if (error)
> > +		goto out;
> > +
> > +	mapping->nrdax++;
> > + dirty:
> > +	radix_tree_tag_set(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
> > + out:
> > +	spin_unlock_irq(&mapping->tree_lock);
> 
> label should be "out_unlock" rather "out" to indicate in the code
> that we are jumping to the correct spot in the error stack...

Sure, will do.

> > +			goto fallback;
> >  	}
> >  
> >   out:
> > @@ -689,15 +746,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> >   * dax_pfn_mkwrite - handle first write to DAX page
> >   * @vma: The virtual memory area where the fault occurred
> >   * @vmf: The description of the fault
> > - *
> >   */
> >  int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >  {
> > -	struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > +	struct file *file = vma->vm_file;
> >  
> > -	sb_start_pagefault(sb);
> > -	file_update_time(vma->vm_file);
> > -	sb_end_pagefault(sb);
> > +	dax_dirty_pgoff(file->f_mapping, vmf->pgoff, NULL, false);
> >  	return VM_FAULT_NOPAGE;
> 
> This seems wrong - it's dropping the freeze protection on fault, and
> now the inode timestamp won't get updated, either.

Oh, that all still happens in the filesystem pfn_mkwrite code
(xfs_filemap_pfn_mkwrite() for XFS).  It needs to happen there, I think,
because we wanted to order it so that the filesystem freeze happens outside of
the XFS_MMAPLOCK_SHARED locking, as it does with the regular PMD and PTE fault
paths.

Prior to this patch set dax_pfn_mkwrite() was completely unused and was ready
to be removed as dead code - it's now being used by all filesystems just to
make sure we re-add the newly dirtied page to the radix tree dirty list.

> >  }
> >  EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
> > @@ -772,3 +826,77 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
> >  	return dax_zero_page_range(inode, from, length, get_block);
> >  }
> >  EXPORT_SYMBOL_GPL(dax_truncate_page);
> > +
> > +static void dax_sync_entry(struct address_space *mapping, pgoff_t pgoff,
> > +		void *entry)
> > +{
> 
> dax_writeback_pgoff() seems like a more consistent name (consider
> dax_dirty_pgoff), and that we are actually doing a writeback
> operation, not a "sync" operation.

Sure, I'm fine with that change.

> > +	struct radix_tree_root *page_tree = &mapping->page_tree;
> > +	int type = RADIX_DAX_TYPE(entry);
> > +	size_t size;
> > +
> > +	BUG_ON(type != RADIX_DAX_PTE && type != RADIX_DAX_PMD);
> > +
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	if (!radix_tree_tag_get(page_tree, pgoff, PAGECACHE_TAG_TOWRITE)) {
> > +		/* another fsync thread already wrote back this entry */
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		return;
> > +	}
> > +	radix_tree_tag_clear(page_tree, pgoff, PAGECACHE_TAG_TOWRITE);
> > +	radix_tree_tag_clear(page_tree, pgoff, PAGECACHE_TAG_DIRTY);
> > +	spin_unlock_irq(&mapping->tree_lock);
> > +
> > +	if (type == RADIX_DAX_PMD)
> > +		size = PMD_SIZE;
> > +	else
> > +		size = PAGE_SIZE;
> > +
> > +	wb_cache_pmem(RADIX_DAX_ADDR(entry), size);
> > +	pgoff_mkclean(pgoff, mapping);
> 
> This looks racy w.r.t. another operation setting the radix tree
> dirty tags. i.e. there is no locking to serialise marking the
> vma/pte clean and another operation marking the radix tree dirty.

I think you're right - I'll look into how to protect us from this race.  Thank
you for catching this.

> > +}
> > +
> > +/*
> > + * Flush the mapping to the persistent domain within the byte range of (start,
> > + * end). This is required by data integrity operations to ensure file data is on
> > + * persistent storage prior to completion of the operation. It also requires us
> > + * to clean the mappings (i.e. write -> RO) so that we'll get a new fault when
> > + * the file is written to again so we have an indication that we need to flush
> > + * the mapping if a data integrity operation takes place.
> > + *
> > + * We don't need commits to storage here - the filesystems will issue flushes
> > + * appropriately at the conclusion of the data integrity operation via REQ_FUA
> > + * writes or blkdev_issue_flush() commands.  This requires the DAX block device
> > + * to implement persistent storage domain fencing/commits on receiving a
> > + * REQ_FLUSH or REQ_FUA request so that this works as expected by the higher
> > + * layers.
> > + */
> > +void dax_fsync(struct address_space *mapping, loff_t start, loff_t end)
> > +{
> 
> dax_writeback_mapping_range()

Sure, I'm fine with that change.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 11/11] xfs: add support for DAX fsync/msync
  2015-11-16 23:12     ` Dave Chinner
  (?)
  (?)
@ 2015-11-17 19:03       ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-17 19:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

On Tue, Nov 17, 2015 at 10:12:22AM +1100, Dave Chinner wrote:
> On Fri, Nov 13, 2015 at 05:06:50PM -0700, Ross Zwisler wrote:
> > To properly support the new DAX fsync/msync infrastructure filesystems
> > need to call dax_pfn_mkwrite() so that DAX can properly track when a user
> > write faults on a previously cleaned address.  They also need to call
> > dax_fsync() in the filesystem fsync() path.  This dax_fsync() call uses
> > addresses retrieved from get_block() so it needs to be ordered with
> > respect to truncate.  This is accomplished by using the same locking that
> > was set up for DAX page faults.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/xfs/xfs_file.c | 18 +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 39743ef..2b490a1 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -209,7 +209,8 @@ xfs_file_fsync(
> >  	loff_t			end,
> >  	int			datasync)
> >  {
> > -	struct inode		*inode = file->f_mapping->host;
> > +	struct address_space	*mapping = file->f_mapping;
> > +	struct inode		*inode = mapping->host;
> >  	struct xfs_inode	*ip = XFS_I(inode);
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	int			error = 0;
> > @@ -218,7 +219,13 @@ xfs_file_fsync(
> >  
> >  	trace_xfs_file_fsync(ip);
> >  
> > -	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> > +	if (dax_mapping(mapping)) {
> > +		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > +		dax_fsync(mapping, start, end);
> > +		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > +	}
> > +
> > +	error = filemap_write_and_wait_range(mapping, start, end);
> 
> Ok, I don't understand a couple of things here.
> 
> Firstly, if it's a DAX mapping, why are we still calling
> filemap_write_and_wait_range() after the dax_fsync() call that has
> already written back all the dirty cachelines?
> 
> Secondly, exactly what is the XFS_MMAPLOCK_SHARED lock supposed to
> be doing here? I don't see where dax_fsync() has any callouts to
> get_block(), so the comment "needs to be ordered with respect to
> truncate" doesn't make any obvious sense. If we have a racing
> truncate removing entries from the radix tree, then thanks to the
> mapping tree lock we'll either find an entry we need to write back,
> or we won't find any entry at all, right?

You're right, dax_fsync() doesn't call out to get_block() any more.  It does
save the results of get_block() calls from the page faults, though, and I was
concerned about the following race:

fsync thread				truncate thread
------------				---------------
dax_fsync()
save tagged entries in pvec

					change block mapping for inode so that
					entries saved in pvec are no longer
					owned by this inode

loop through pvec using stale results
from get_block(), flushing and cleaning
entries we no longer own

In looking at the xfs_file_fsync() code, though, it seems like if this race
existed it would also exist for page cache entries that were being put into a
pvec in write_cache_pages(), and that we would similarly be writing back
cached pages that no longer belong to this inode.

Is this race non-existent?

> Lastly, this flushing really needs to be inside
> filemap_write_and_wait_range(), because we call the writeback code
> from many more places than just fsync to ensure ordering of various
> operations such that files are in known state before proceeding
> (e.g. hole punch).

The call to dax_fsync() (soon to be dax_writeback_mapping_range()) first lived
in do_writepages() in the RFC version, but was moved into the filesystem so we
could have access to get_block(), which is no longer needed, and so we could
use the FS level locking.  If the race described above isn't an issue then I
agree moving this call out of the filesystems and down into the generic page
writeback code is probably the right thing to do.
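
If it does move down, the hookup could look something like this (a purely
hypothetical placement sketched against the generic writeback entry point;
the final location and signature are obviously up for discussion):

	int do_writepages(struct address_space *mapping,
			  struct writeback_control *wbc)
	{
		if (dax_mapping(mapping))
			return dax_writeback_mapping_range(mapping,
					wbc->range_start, wbc->range_end);

		/* non-DAX mappings continue through the existing paths */
		if (mapping->a_ops->writepages)
			return mapping->a_ops->writepages(mapping, wbc);
		return generic_writepages(mapping, wbc);
	}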

Thanks for the feedback.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 11/11] xfs: add support for DAX fsync/msync
@ 2015-11-17 19:03       ` Ross Zwisler
  0 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-17 19:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

On Tue, Nov 17, 2015 at 10:12:22AM +1100, Dave Chinner wrote:
> On Fri, Nov 13, 2015 at 05:06:50PM -0700, Ross Zwisler wrote:
> > To properly support the new DAX fsync/msync infrastructure filesystems
> > need to call dax_pfn_mkwrite() so that DAX can properly track when a user
> > write faults on a previously cleaned address.  They also need to call
> > dax_fsync() in the filesystem fsync() path.  This dax_fsync() call uses
> > addresses retrieved from get_block() so it needs to be ordered with
> > respect to truncate.  This is accomplished by using the same locking that
> > was set up for DAX page faults.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/xfs/xfs_file.c | 18 +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 39743ef..2b490a1 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -209,7 +209,8 @@ xfs_file_fsync(
> >  	loff_t			end,
> >  	int			datasync)
> >  {
> > -	struct inode		*inode = file->f_mapping->host;
> > +	struct address_space	*mapping = file->f_mapping;
> > +	struct inode		*inode = mapping->host;
> >  	struct xfs_inode	*ip = XFS_I(inode);
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	int			error = 0;
> > @@ -218,7 +219,13 @@ xfs_file_fsync(
> >  
> >  	trace_xfs_file_fsync(ip);
> >  
> > -	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> > +	if (dax_mapping(mapping)) {
> > +		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > +		dax_fsync(mapping, start, end);
> > +		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > +	}
> > +
> > +	error = filemap_write_and_wait_range(mapping, start, end);
> 
> Ok, I don't understand a couple of things here.
> 
> Firstly, if it's a DAX mapping, why are we still calling
> filemap_write_and_wait_range() after the dax_fsync() call that has
> already written back all the dirty cachelines?
> 
> Secondly, exactly what is the XFS_MMAPLOCK_SHARED lock supposed to
> be doing here? I don't see where dax_fsync() has any callouts to
> get_block(), so the comment "needs to be ordered with respect to
> truncate" doesn't make any obvious sense. If we have a racing
> truncate removing entries from the radix tree, then thanks to the
> mapping tree lock we'll either find an entry we need to write back,
> or we won't find any entry at all, right?

You're right, dax_fsync() doesn't call out to get_block() any more.  It does
save the results of get_block() calls from the page faults, though, and I was
concerned about the following race:

fsync thread				truncate thread
------------				---------------
dax_fsync()
save tagged entries in pvec

					change block mapping for inode so that
					entries saved in pvec are no longer
					owned by this inode

loop through pvec using stale results
from get_block(), flushing and cleaning
entries we no longer own

In looking at the xfs_file_fsync() code, though, it seems like if this race
existed it would also exist for page cache entries that were being put into a
pvec in write_cache_pages(), and that we would similarly be writing back
cached pages that no longer belong to this inode.

Is this race non-existent?

> Lastly, this flushing really needs to be inside
> filemap_write_and_wait_range(), because we call the writeback code
> from many more places than just fsync to ensure ordering of various
> operations such that files are in known state before proceeding
> (e.g. hole punch).

The call to dax_fsync() (soon to be dax_writeback_mapping_range()) first lived
in do_writepages() in the RFC version, but was moved into the filesystem so we
could have access to get_block(), which is no longer needed, and so we could
use the FS level locking.  If the race described above isn't an issue then I
agree moving this call out of the filesystems and down into the generic page
writeback code is probably the right thing to do.

Thanks for the feedback.

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 11/11] xfs: add support for DAX fsync/msync
@ 2015-11-17 19:03       ` Ross Zwisler
  0 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-17 19:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dave Hansen, J. Bruce Fields, linux-mm, Andreas Dilger,
	H. Peter Anvin, Jeff Layton, Dan Williams, linux-nvdimm, x86,
	Ingo Molnar, Matthew Wilcox, Ross Zwisler, linux-ext4, xfs,
	Alexander Viro, Thomas Gleixner, Theodore Ts'o, linux-kernel,
	Jan Kara, linux-fsdevel, Andrew Morton, Matthew Wilcox

On Tue, Nov 17, 2015 at 10:12:22AM +1100, Dave Chinner wrote:
> On Fri, Nov 13, 2015 at 05:06:50PM -0700, Ross Zwisler wrote:
> > To properly support the new DAX fsync/msync infrastructure filesystems
> > need to call dax_pfn_mkwrite() so that DAX can properly track when a user
> > write faults on a previously cleaned address.  They also need to call
> > dax_fsync() in the filesystem fsync() path.  This dax_fsync() call uses
> > addresses retrieved from get_block() so it needs to be ordered with
> > respect to truncate.  This is accomplished by using the same locking that
> > was set up for DAX page faults.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/xfs/xfs_file.c | 18 +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 39743ef..2b490a1 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -209,7 +209,8 @@ xfs_file_fsync(
> >  	loff_t			end,
> >  	int			datasync)
> >  {
> > -	struct inode		*inode = file->f_mapping->host;
> > +	struct address_space	*mapping = file->f_mapping;
> > +	struct inode		*inode = mapping->host;
> >  	struct xfs_inode	*ip = XFS_I(inode);
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	int			error = 0;
> > @@ -218,7 +219,13 @@ xfs_file_fsync(
> >  
> >  	trace_xfs_file_fsync(ip);
> >  
> > -	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> > +	if (dax_mapping(mapping)) {
> > +		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > +		dax_fsync(mapping, start, end);
> > +		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > +	}
> > +
> > +	error = filemap_write_and_wait_range(mapping, start, end);
> 
> Ok, I don't understand a couple of things here.
> 
> Firstly, if it's a DAX mapping, why are we still calling
> filemap_write_and_wait_range() after the dax_fsync() call that has
> already written back all the dirty cachelines?
> 
> Secondly, exactly what is the XFS_MMAPLOCK_SHARED lock supposed to
> be doing here? I don't see where dax_fsync() has any callouts to
> get_block(), so the comment "needs to be ordered with respect to
> truncate" doesn't make any obvious sense. If we have a racing
> truncate removing entries from the radix tree, then thanks to the
> mapping tree lock we'll either find an entry we need to write back,
> or we won't find any entry at all, right?

You're right, dax_fsync() doesn't call out to get_block() any more.  It does
save the results of get_block() calls from the page faults, though, and I was
concerned about the following race:

fsync thread				truncate thread
------------				---------------
dax_fsync()
save tagged entries in pvec

					change block mapping for inode so that
					entries saved in pvec are no longer
					owned by this inode

loop through pvec using stale results
from get_block(), flushing and cleaning
entries we no longer own

In looking at the xfs_file_fsync() code, though, it seems like if this race
existed it would also exist for page cache entries that were being put into a
pvec in write_cache_pages(), and that we would similarly be writing back
cached pages that no longer belong to this inode.

Is this race non-existent?

> Lastly, this flushing really needs to be inside
> filemap_write_and_wait_range(), because we call the writeback code
> from many more places than just fsync to ensure ordering of various
> operations such that files are in known state before proceeding
> (e.g. hole punch).

The call to dax_fsync() (soon to be dax_writeback_mapping_range()) first lived
in do_writepages() in the RFC version, but was moved into the filesystem so we
could have access to get_block(), which is no longer needed, and so we could
use the FS level locking.  If the race described above isn't an issue then I
agree moving this call out of the filesystems and down into the generic page
writeback code is probably the right thing to do.

Thanks for the feedback.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 11/11] xfs: add support for DAX fsync/msync
@ 2015-11-17 19:03       ` Ross Zwisler
  0 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-17 19:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

On Tue, Nov 17, 2015 at 10:12:22AM +1100, Dave Chinner wrote:
> On Fri, Nov 13, 2015 at 05:06:50PM -0700, Ross Zwisler wrote:
> > To properly support the new DAX fsync/msync infrastructure filesystems
> > need to call dax_pfn_mkwrite() so that DAX can properly track when a user
> > write faults on a previously cleaned address.  They also need to call
> > dax_fsync() in the filesystem fsync() path.  This dax_fsync() call uses
> > addresses retrieved from get_block() so it needs to be ordered with
> > respect to truncate.  This is accomplished by using the same locking that
> > was set up for DAX page faults.
> > 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  fs/xfs/xfs_file.c | 18 +++++++++++++-----
> >  1 file changed, 13 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 39743ef..2b490a1 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -209,7 +209,8 @@ xfs_file_fsync(
> >  	loff_t			end,
> >  	int			datasync)
> >  {
> > -	struct inode		*inode = file->f_mapping->host;
> > +	struct address_space	*mapping = file->f_mapping;
> > +	struct inode		*inode = mapping->host;
> >  	struct xfs_inode	*ip = XFS_I(inode);
> >  	struct xfs_mount	*mp = ip->i_mount;
> >  	int			error = 0;
> > @@ -218,7 +219,13 @@ xfs_file_fsync(
> >  
> >  	trace_xfs_file_fsync(ip);
> >  
> > -	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> > +	if (dax_mapping(mapping)) {
> > +		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > +		dax_fsync(mapping, start, end);
> > +		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > +	}
> > +
> > +	error = filemap_write_and_wait_range(mapping, start, end);
> 
> Ok, I don't understand a couple of things here.
> 
> Firstly, if it's a DAX mapping, why are we still calling
> filemap_write_and_wait_range() after the dax_fsync() call that has
> already written back all the dirty cachelines?
> 
> Secondly, exactly what is the XFS_MMAPLOCK_SHARED lock supposed to
> be doing here? I don't see where dax_fsync() has any callouts to
> get_block(), so the comment "needs to be ordered with respect to
> truncate" doesn't make any obvious sense. If we have a racing
> truncate removing entries from the radix tree, then thanks to the
> mapping tree lock we'll either find an entry we need to write back,
> or we won't find any entry at all, right?

You're right, dax_fsync() doesn't call out to get_block() any more.  It does
save the results of get_block() calls from the page faults, though, and I was
concerned about the following race:

fsync thread				truncate thread
------------				---------------
dax_fsync()
save tagged entries in pvec

					change block mapping for inode so that
					entries saved in pvec are no longer
					owned by this inode

loop through pvec using stale results
from get_block(), flushing and cleaning
entries we no longer own

In looking at the xfs_file_fsync() code, though, it seems like if this race
existed it would also exist for page cache entries that were being put into a
pvec in write_cache_pages(), and that we would similarly be writing back
cached pages that no longer belong to this inode.

Is this race non-existent?

> Lastly, this flushing really needs to be inside
> filemap_write_and_wait_range(), because we call the writeback code
> from many more places than just fsync to ensure ordering of various
> operations such that files are in known state before proceeding
> (e.g. hole punch).

The call to dax_fsync() (soon to be dax_writeback_mapping_range()) first lived
in do_writepages() in the RFC version, but was moved into the filesystem so we
could have access to get_block(), which is no longer needed, and so we could
use the FS level locking.  If the race described above isn't an issue then I
agree moving this call out of the filesystems and down into the generic page
writeback code is probably the right thing to do.

Thanks for the feedback.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-16 20:09           ` Ross Zwisler
  (?)
  (?)
@ 2015-11-18 10:40             ` Jan Kara
  -1 siblings, 0 replies; 132+ messages in thread
From: Jan Kara @ 2015-11-18 10:40 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Dan Williams, Andreas Dilger, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Mon 16-11-15 13:09:50, Ross Zwisler wrote:
> On Fri, Nov 13, 2015 at 06:32:40PM -0800, Dan Williams wrote:
> > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > >>
> > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > >> <ross.zwisler@linux.intel.com> wrote:
> > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > >>> and are used by filesystems to order their metadata, among other things.
> > >>>
> > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > >>> to flush all dirty pages to media.  The PMEM driver then just has to issue a
> > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that, before we return, all
> > >>> the flushed data has been durably stored on the media.
> > >>>
> > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > >>
> > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > >> the cache is done by the core why does the driver need support
> > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > >> only makes sense if individual writes can bypass the "drive" cache,
> > >> but no I/O submitted to the driver proper is ever cached; we always
> > >> flush it through to media.
> > >
> > > If the upper level filesystem gets an error when submitting a flush
> > > request, then it assumes the underlying hardware is broken and cannot
> > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > IO to complete.
> > 
> > Upper level filesystems won't get errors when the driver does not
> > support flush.  Those requests are ended cleanly in
> > generic_make_request_checks().  Yes, the fs still needs to wait for
> > outstanding I/O to complete but in the case of pmem all I/O is
> > synchronous.  There's never anything to await when flushing at the
> > pmem driver level.
> > 
> > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > it doesn't make sense _not_ to support this functionality.
> > 
> > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > pending to the device in the cpu cache that a REQ_FLUSH request will
> > not touch, it's better to leave it all to the mm core to handle.  I.e.
> > it doesn't make sense to call the driver just for two instructions
> > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > Either handle it all in the mm or the driver, not a mixture.
> 
> Does anyone know if ext4 and/or XFS alter their algorithms based on whether
> the driver supports REQ_FLUSH/REQ_FUA?  Will the filesystem behave more
> efficiently with respect to their internal I/O ordering, etc., if PMEM
> advertises REQ_FLUSH/REQ_FUA support, even though we could do the same thing
> at the DAX layer?

So the information about whether the driver supports FLUSH / FUA is generally
ignored by filesystems. We issue REQ_FLUSH / REQ_FUA requests to achieve the
required ordering for fs consistency and expect that the block layer does the
right thing - i.e., if the device has a volatile write cache, it will be
flushed; if it doesn't have one, the request will be ignored. So the
difference between supporting and not supporting REQ_FLUSH / REQ_FUA is
only in how the block layer handles such requests.
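
For reference, the driver-side change under discussion amounts to roughly the
following (a sketch against the ~4.4-era pmem driver and block API;
pmem->pmem_queue is the name used in that driver, and the exact placement of
the wmb_pmem() call is an assumption):

	/* at queue setup: accept REQ_FLUSH/REQ_FUA bios rather than have the
	 * block layer complete them early */
	blk_queue_flush(pmem->pmem_queue, REQ_FLUSH | REQ_FUA);

	/* in the bio handler: fence and commit prior stores to the media */
	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA))
		wmb_pmem();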

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
@ 2015-11-18 10:40             ` Jan Kara
  0 siblings, 0 replies; 132+ messages in thread
From: Jan Kara @ 2015-11-18 10:40 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Dan Williams, Andreas Dilger, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm@lists.01.org, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Mon 16-11-15 13:09:50, Ross Zwisler wrote:
> On Fri, Nov 13, 2015 at 06:32:40PM -0800, Dan Williams wrote:
> > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > >>
> > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > >> <ross.zwisler@linux.intel.com> wrote:
> > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > >>> and are used by filesystems to order their metadata, among other things.
> > >>>
> > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > >>> to flush all dirty pages to media.  The PMEM driver then just has to issue a
> > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that, before we return, all
> > >>> the flushed data has been durably stored on the media.
> > >>>
> > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > >>
> > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > >> the cache is done by the core why does the driver need support
> > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > >> only makes sense if individual writes can bypass the "drive" cache,
> > >> but no I/O submitted to the driver proper is ever cached; we always
> > >> flush it through to media.
> > >
> > > If the upper level filesystem gets an error when submitting a flush
> > > request, then it assumes the underlying hardware is broken and cannot
> > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > IO to complete.
> > 
> > Upper level filesystems won't get errors when the driver does not
> > support flush.  Those requests are ended cleanly in
> > generic_make_request_checks().  Yes, the fs still needs to wait for
> > outstanding I/O to complete but in the case of pmem all I/O is
> > synchronous.  There's never anything to await when flushing at the
> > pmem driver level.
> > 
> > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > it doesn't make sense _not_ to support this functionality.
> > 
> > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > pending to the device in the cpu cache that a REQ_FLUSH request will
> > not touch, it's better to leave it all to the mm core to handle.  I.e.
> > it doesn't make sense to call the driver just for two instructions
> > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > Either handle it all in the mm or the driver, not a mixture.
> 
> Does anyone know if ext4 and/or XFS alter their algorithms based on whether
> the driver supports REQ_FLUSH/REQ_FUA?  Will the filesystem behave more
> efficiently with respect to their internal I/O ordering, etc., if PMEM
> advertises REQ_FLUSH/REQ_FUA support, even though we could do the same thing
> at the DAX layer?

So the information about whether the driver supports FLUSH / FUA is generally
ignored by filesystems. We issue REQ_FLUSH / REQ_FUA requests to achieve the
required ordering for fs consistency and expect that the block layer does the
right thing - i.e., if the device has a volatile write cache, it will be
flushed; if it doesn't have one, the request will be ignored. So the
difference between supporting and not supporting REQ_FLUSH / REQ_FUA is
only in how the block layer handles such requests.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
@ 2015-11-18 10:40             ` Jan Kara
  0 siblings, 0 replies; 132+ messages in thread
From: Jan Kara @ 2015-11-18 10:40 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Dan Williams, Andreas Dilger, linux-kernel, H. Peter Anvin,
	J. Bruce Fields, Theodore Ts'o, Alexander Viro, Dave Chinner,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, Linux MM,
	linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Mon 16-11-15 13:09:50, Ross Zwisler wrote:
> On Fri, Nov 13, 2015 at 06:32:40PM -0800, Dan Williams wrote:
> > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > >>
> > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > >> <ross.zwisler@linux.intel.com> wrote:
> > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > >>> and are used by filesystems to order their metadata, among other things.
> > >>>
> > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > >>> to flush all dirty pages to media.  The PMEM driver then just has to issue a
> > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that, before we return, all
> > >>> the flushed data has been durably stored on the media.
> > >>>
> > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > >>
> > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > >> the cache is done by the core why does the driver need support
> > >> REQ_FLUSH?  Especially since it's just a couple instructions.  REQ_FUA
> > >> only makes sense if individual writes can bypass the "drive" cache,
> > >> but no I/O submitted to the driver proper is ever cached; we always
> > >> flush it through to media.
> > >
> > > If the upper level filesystem gets an error when submitting a flush
> > > request, then it assumes the underlying hardware is broken and cannot
> > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > IO to complete.
> > 
> > Upper level filesystems won't get errors when the driver does not
> > support flush.  Those requests are ended cleanly in
> > generic_make_request_checks().  Yes, the fs still needs to wait for
> > outstanding I/O to complete but in the case of pmem all I/O is
> > synchronous.  There's never anything to await when flushing at the
> > pmem driver level.
> > 
> > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > it doesn't make sense _not_ to support this functionality.
> > 
> > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > pending to the device in the cpu cache that a REQ_FLUSH request will
> > not touch, it's better to leave it all to the mm core to handle.  I.e.
> > it doesn't make sense to call the driver just for two instructions
> > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > Either handle it all in the mm or the driver, not a mixture.
> 
> Does anyone know if ext4 and/or XFS alter their algorithms based on whether
> the driver supports REQ_FLUSH/REQ_FUA?  Will the filesystem behave more
> efficiently with respect to their internal I/O ordering, etc., if PMEM
> advertises REQ_FLUSH/REQ_FUA support, even though we could do the same thing
> at the DAX layer?

So the information about whether the driver supports FLUSH / FUA is generally
ignored by filesystems. We issue REQ_FLUSH / REQ_FUA requests to achieve the
required ordering for fs consistency and expect that the block layer does the
right thing - i.e., if the device has a volatile write cache, it will be
flushed; if it doesn't have one, the request will be ignored. So the
difference between supporting and not supporting REQ_FLUSH / REQ_FUA is
only in how the block layer handles such requests.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling
  2015-11-18 10:40             ` Jan Kara
@ 2015-11-18 16:16               ` Ross Zwisler
  -1 siblings, 0 replies; 132+ messages in thread
From: Ross Zwisler @ 2015-11-18 16:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, Dan Williams, Andreas Dilger, linux-kernel,
	H. Peter Anvin, J. Bruce Fields, Theodore Ts'o,
	Alexander Viro, Dave Chinner, Ingo Molnar, Jan Kara, Jeff Layton,
	Matthew Wilcox, Thomas Gleixner, linux-ext4, linux-fsdevel,
	Linux MM, linux-nvdimm, X86 ML, XFS Developers, Andrew Morton,
	Matthew Wilcox, Dave Hansen

On Wed, Nov 18, 2015 at 11:40:55AM +0100, Jan Kara wrote:
> On Mon 16-11-15 13:09:50, Ross Zwisler wrote:
> > On Fri, Nov 13, 2015 at 06:32:40PM -0800, Dan Williams wrote:
> > > On Fri, Nov 13, 2015 at 4:43 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> > > > On Nov 13, 2015, at 5:20 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > > >>
> > > >> On Fri, Nov 13, 2015 at 4:06 PM, Ross Zwisler
> > > >> <ross.zwisler@linux.intel.com> wrote:
> > > >>> Currently the PMEM driver doesn't accept REQ_FLUSH or REQ_FUA bios.  These
> > > >>> are sent down via blkdev_issue_flush() in response to a fsync() or msync()
> > > >>> and are used by filesystems to order their metadata, among other things.
> > > >>>
> > > >>> When we get an msync() or fsync() it is the responsibility of the DAX code
> > > >>> to flush all dirty pages to media.  The PMEM driver then just has to issue a
> > > >>> wmb_pmem() in response to the REQ_FLUSH to ensure that, before we return, all
> > > >>> the flushed data has been durably stored on the media.
> > > >>>
> > > >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > >>
> > > >> Hmm, I'm not seeing why we need this patch.  If the actual flushing of
> > > >> the cache is done by the core, why does the driver need to support
> > > >> REQ_FLUSH?  Especially since it's just a couple of instructions.  REQ_FUA
> > > >> only makes sense if individual writes can bypass the "drive" cache,
> > > >> but no I/O submitted to the driver proper is ever cached; we always
> > > >> flush it through to media.
> > > >
> > > > If the upper level filesystem gets an error when submitting a flush
> > > > request, then it assumes the underlying hardware is broken and cannot
> > > > be as aggressive in IO submission, but instead has to wait for in-flight
> > > > IO to complete.
> > > 
> > > Upper level filesystems won't get errors when the driver does not
> > > support flush.  Those requests are ended cleanly in
> > > generic_make_request_checks().  Yes, the fs still needs to wait for
> > > outstanding I/O to complete but in the case of pmem all I/O is
> > > synchronous.  There's never anything to await when flushing at the
> > > pmem driver level.
> > > 
> > > > Since FUA/FLUSH is basically a no-op for pmem devices,
> > > > it doesn't make sense _not_ to support this functionality.
> > > 
> > > Seems to be a nop either way.  Given that DAX may lead to dirty data
> > > pending to the device in the cpu cache that a REQ_FLUSH request will
> > > not touch, it's better to leave it all to the mm core to handle.  I.e.
> > > it doesn't make sense to call the driver just for two instructions
> > > (sfence + pcommit) when the mm core is taking on the cache flushing.
> > > Either handle it all in the mm or the driver, not a mixture.
> > 
> > Does anyone know if ext4 and/or XFS alter their algorithms based on whether
> > the driver supports REQ_FLUSH/REQ_FUA?  Will the filesystem behave more
> > efficiently with respect to its internal I/O ordering, etc., if PMEM
> > advertises REQ_FLUSH/REQ_FUA support, even though we could do the same thing
> > at the DAX layer?
> 
> So the information about whether the driver supports FLUSH / FUA is generally
> ignored by filesystems. We issue REQ_FLUSH / REQ_FUA requests to achieve the
> ordering required for fs consistency and expect the block layer to do the
> right thing - i.e., if the device has a volatile write cache, it will be
> flushed; if it doesn't, the request will be ignored. So the difference
> between supporting and not supporting REQ_FLUSH / REQ_FUA is only in how
> the block layer handles such requests.

Cool, thank you for the info.  Based on this I'll pull the REQ_FLUSH/REQ_FUA
patch out of v3 of this series and move the wmb_pmem() call up to DAX, as Dan
suggests.  If performance data shows that we can get a benefit from
centralizing wmb_pmem() behind REQ_FUA/REQ_FLUSH, I'll add it back in later
as part of that series.
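
Roughly what that would look like in the DAX writeback path - a sketch only,
with the per-entry loop and locking elided; wb_cache_pmem() is the helper
added earlier in this series, and kaddr/size stand for the entry's mapped
address and length:

        /* for each dirty entry found in the mapping's radix tree: */
        wb_cache_pmem(kaddr, size);     /* write the dirty cachelines back */

        /* once the whole range has been walked: */
        wmb_pmem();     /* the ordering/commit step REQ_FLUSH would have done */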

^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v2 11/11] xfs: add support for DAX fsync/msync
  2015-11-17 19:03       ` Ross Zwisler
@ 2015-11-20  0:37         ` Dave Chinner
  -1 siblings, 0 replies; 132+ messages in thread
From: Dave Chinner @ 2015-11-20  0:37 UTC (permalink / raw)
  To: Ross Zwisler, linux-kernel, H. Peter Anvin, J. Bruce Fields,
	Theodore Ts'o, Alexander Viro, Andreas Dilger, Dan Williams,
	Ingo Molnar, Jan Kara, Jeff Layton, Matthew Wilcox,
	Thomas Gleixner, linux-ext4, linux-fsdevel, linux-mm,
	linux-nvdimm, x86, xfs, Andrew Morton, Matthew Wilcox,
	Dave Hansen

On Tue, Nov 17, 2015 at 12:03:41PM -0700, Ross Zwisler wrote:
> On Tue, Nov 17, 2015 at 10:12:22AM +1100, Dave Chinner wrote:
> > On Fri, Nov 13, 2015 at 05:06:50PM -0700, Ross Zwisler wrote:
> > > To properly support the new DAX fsync/msync infrastructure, filesystems
> > > need to call dax_pfn_mkwrite() so that DAX can properly track when a user
> > > write faults on a previously cleaned address.  They also need to call
> > > dax_fsync() in the filesystem fsync() path.  This dax_fsync() call uses
> > > addresses retrieved from get_block(), so it needs to be ordered with
> > > respect to truncate.  This is accomplished by using the same locking that
> > > was set up for DAX page faults.
> > > 
> > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > ---
> > >  fs/xfs/xfs_file.c | 18 +++++++++++++-----
> > >  1 file changed, 13 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 39743ef..2b490a1 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -209,7 +209,8 @@ xfs_file_fsync(
> > >  	loff_t			end,
> > >  	int			datasync)
> > >  {
> > > -	struct inode		*inode = file->f_mapping->host;
> > > +	struct address_space	*mapping = file->f_mapping;
> > > +	struct inode		*inode = mapping->host;
> > >  	struct xfs_inode	*ip = XFS_I(inode);
> > >  	struct xfs_mount	*mp = ip->i_mount;
> > >  	int			error = 0;
> > > @@ -218,7 +219,13 @@ xfs_file_fsync(
> > >  
> > >  	trace_xfs_file_fsync(ip);
> > >  
> > > -	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> > > +	if (dax_mapping(mapping)) {
> > > +		xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > > +		dax_fsync(mapping, start, end);
> > > +		xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
> > > +	}
> > > +
> > > +	error = filemap_write_and_wait_range(mapping, start, end);
> > 
> > Ok, I don't understand a couple of things here.
> > 
> > Firstly, if it's a DAX mapping, why are we still calling
> > filemap_write_and_wait_range() after the dax_fsync() call that has
> > already written back all the dirty cachelines?
> > 
> > Secondly, exactly what is the XFS_MMAPLOCK_SHARED lock supposed to
> > be doing here? I don't see where dax_fsync() has any callouts to
> > get_block(), so the comment "needs to be ordered with respect to
> > truncate" doesn't make any obvious sense. If we have a racing
> > truncate removing entries from the radix tree, then thanks to the
> > mapping tree lock we'll either find an entry we need to write back,
> > or we won't find any entry at all, right?
> 
> You're right, dax_fsync() doesn't call out to get_block() any more.  It does
> save the results of get_block() calls from the page faults, though, and I was
> concerned about the following race:
> 
> fsync thread				truncate thread
> ------------				---------------
> dax_fsync()
> save tagged entries in pvec
> 
> 					change block mapping for inode so that
> 					entries saved in pvec are no longer
> 					owned by this inode
> 
> loop through pvec using stale results
> from get_block(), flushing and cleaning
> entries we no longer own

dax_fsync() is trying to do lockless lookups on an object that has no
internal reference count or synchronisation mechanism. That simply
doesn't work. In contrast, the struct page has the page lock; with
that held we can do the page->mapping checks to serialise against,
and detect races with, invalidation.

If you note the code in clear_exceptional_entry() in the
invalidation code:

        spin_lock_irq(&mapping->tree_lock);
        /*
         * Regular page slots are stabilized by the page lock even
         * without the tree itself locked.  These unlocked entries
         * need verification under the tree lock.
         */
        if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
                goto unlock;
        if (*slot != entry)
                goto unlock;
        radix_tree_replace_slot(slot, NULL);


it basically says exactly this: exceptional entries are only valid
when the lookup is done under the mapping tree lock. IOWs, while you
can find exceptional entries via lockless radix tree lookups, you
*can't use them* safely.

Hence dax_fsync() needs to validate the exceptional entries it finds
via the pvec lookup under the mapping tree lock, and then flush the
cache while still holding the mapping tree lock. At that point, it
is safe against invalidation races....
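
i.e. something along these lines - a sketch only, modelled on the snippet
above; dax_flush_one() is a made-up name for the per-entry flush:

        spin_lock_irq(&mapping->tree_lock);
        if (!__radix_tree_lookup(&mapping->page_tree, index, NULL, &slot))
                goto unlock;
        if (*slot != entry)
                goto unlock;            /* raced with invalidation, skip it */
        dax_flush_one(mapping, index, entry);   /* flush under the lock */
unlock:
        spin_unlock_irq(&mapping->tree_lock);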

> In looking at the xfs_file_fsync() code, though, it seems like if this race
> existed it would also exist for page cache entries that were being put into a
> pvec in write_cache_pages(), and that we would similarly be writing back
> cached pages that no longer belong to this inode.

That's what the page->mapping checks in write_cache_pages() protect
against. Everywhere you see a "lock_page(); if (page->mapping !=
mapping)" style of operation, it is checking against a racing
page invalidation.
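
i.e. roughly this pattern, sketched from memory of write_cache_pages() rather
than copied from it:

        lock_page(page);
        if (unlikely(page->mapping != mapping)) {
                /* truncated or invalidated while sitting in the pvec */
                unlock_page(page);
                continue;
        }
        /* still attached to this inode, safe to write it back */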

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 132+ messages in thread

end of thread, other threads:[~2015-11-20  0:40 UTC | newest]

Thread overview: 132+ messages
-- links below jump to the message on this page --
2015-11-14  0:06 [PATCH v2 00/11] DAX fsynx/msync support Ross Zwisler
2015-11-14  0:06 ` [PATCH v2 01/11] pmem: add wb_cache_pmem() to the PMEM API Ross Zwisler
2015-11-14  0:06 ` [PATCH v2 02/11] mm: add pmd_mkclean() Ross Zwisler
2015-11-14  1:02   ` Dave Hansen
2015-11-17 17:52     ` Ross Zwisler
2015-11-14  0:06 ` [PATCH v2 03/11] pmem: enable REQ_FUA/REQ_FLUSH handling Ross Zwisler
2015-11-14  0:20   ` Dan Williams
2015-11-14  0:43     ` Andreas Dilger
2015-11-14  2:32       ` Dan Williams
2015-11-16 13:37         ` Jan Kara
2015-11-16 14:05           ` Jan Kara
2015-11-16 17:28             ` Dan Williams
2015-11-16 19:48               ` Ross Zwisler
2015-11-16 20:34                 ` Dan Williams
2015-11-16 23:57                   ` Ross Zwisler
2015-11-16 22:14             ` Dave Chinner
2015-11-16 23:29               ` Ross Zwisler
2015-11-16 23:42                 ` Dave Chinner
2015-11-16 20:09         ` Ross Zwisler
2015-11-18 10:40           ` Jan Kara
2015-11-18 16:16             ` Ross Zwisler
2015-11-14  0:06 ` [PATCH v2 04/11] dax: support dirty DAX entries in radix tree Ross Zwisler
2015-11-14  0:06 ` [PATCH v2 05/11] mm: add follow_pte_pmd() Ross Zwisler
2015-11-14  0:06 ` [PATCH v2 06/11] mm: add pgoff_mkclean() Ross Zwisler
2015-11-14  0:06 ` [PATCH v2 07/11] mm: add find_get_entries_tag() Ross Zwisler
2015-11-16 22:42   ` Dave Chinner
2015-11-17 18:08     ` Ross Zwisler
2015-11-14  0:06 ` [PATCH v2 08/11] dax: add support for fsync/sync Ross Zwisler
2015-11-16 22:58   ` Dave Chinner
2015-11-17 18:30     ` Ross Zwisler
2015-11-14  0:06 ` [PATCH v2 09/11] ext2: add support for DAX fsync/msync Ross Zwisler
2015-11-14  0:06 ` [PATCH v2 10/11] ext4: " Ross Zwisler
2015-11-14  0:06 ` [PATCH v2 11/11] xfs: " Ross Zwisler
2015-11-16 23:12   ` Dave Chinner
2015-11-17 19:03     ` Ross Zwisler
2015-11-20  0:37       ` Dave Chinner
2015-11-16 14:41 ` [PATCH v2 00/11] DAX fsynx/msync support Jan Kara
2015-11-16 16:58   ` Dan Williams
2015-11-16 20:01     ` Ross Zwisler
